Model Zoo

DNALLM supports most publicly available DNA large language models, along with several other DNA-based deep learning models. These models have been adapted to work with the DNALLM package for fine-tuning and inference.

Model Collection

The following table shows all currently supported models and their fine-tuning/inference capabilities:

| Model Name | Model Type | Architecture | Fine-tuning Support | Author | Model Size | Count | Source |
|---|---|---|---|---|---|---|---|
| Plant DNABERT | MaskedLM | BERT | | zhangtaolab | 100M | 1 | Molecular Plant |
| Plant DNAGPT | CausalLM | GPT2 | | zhangtaolab | 100M | 1 | Molecular Plant |
| Plant Nucleotide Transformer | MaskedLM | ESM | | zhangtaolab | 100M | 1 | Molecular Plant |
| Plant DNAGemma | CausalLM | Gemma | | zhangtaolab | 150M | 1 | Molecular Plant |
| Plant DNAMamba | CausalLM | Mamba | | zhangtaolab | 100M | 1 | Molecular Plant |
| Plant DNAModernBert | MaskedLM | ModernBert | | zhangtaolab | 100M | 1 | Molecular Plant |
| Nucleotide Transformer | MaskedLM | ESM | | InstaDeepAI | 50M / 100M / 250M / 500M / 2.5B | 8 | Nature Methods |
| AgroNT | MaskedLM | ESM | | InstaDeepAI | 1B | 1 | Current Biology |
| Caduceus-Ph | MaskedLM | Caduceus | | Kuleshov-Group | 0.5M / 2M / 8M | 3 | arXiv |
| Caduceus-Ps | MaskedLM | Caduceus | | Kuleshov-Group | 0.5M / 2M / 8M | 3 | arXiv |
| PlantCaduceus | MaskedLM | Caduceus | | Kuleshov-Group | 20M / 40M / 112M / 225M | 4 | PNAS |
| PlantCAD2 | MaskedLM | Caduceus | | Kuleshov-Group | 88M / 311M / 694M | 3 | bioRxiv |
| DNABERT | MaskedLM | BERT | | Zhihan1996 | 100M | 4 | Bioinformatics |
| DNABERT-2 | MaskedLM | BERT | | Zhihan1996 | 117M | 1 | arXiv |
| DNABERT-S | MaskedLM | BERT | | Zhihan1996 | 117M | 1 | arXiv |
| EVO-1 | CausalLM | StripedHyena | | togethercomputer | 6.5B | 2 | Science |
| EVO-2 | CausalLM | StripedHyena2 | | arcinstitute | 1B / 1.5B / 7B / 40B | 4 | bioRxiv |
| GENA-LM | MaskedLM | BERT | | AIRI-Institute | 150M / 500M | 7 | Nucleic Acids Research |
| GENA-LM-BigBird | MaskedLM | BigBird | | AIRI-Institute | 150M | 3 | Nucleic Acids Research |
| GENERator | CausalLM | Llama | | GenerTeam | 1.2B / 3B | 2 | arXiv |
| GENERanno | CausalLM | Generanno | | GenerTeam | 0.5B | 2 | bioRxiv |
| GenomeOcean | CausalLM | Mistral | | DOEJGI | 100M / 500M / 4B | 3 | bioRxiv |
| GPN | MaskedLM | ConvNet | | songlab | 60M | 1 | PNAS |
| GROVER | MaskedLM | BERT | | PoetschLab | 100M | 1 | Nature Machine Intelligence |
| HyenaDNA | CausalLM | HyenaDNA | | LongSafari | 0.5M / 0.7M / 2M / 4M / 15M / 30M / 55M | 7 | arXiv |
| LucaOne | MaskedLM | LucaGPLM | | LucaGroup | 5.6M / 17.6M / 36M | 3 | Nature Machine Intelligence |
| JanusDNA | MaskedLM | JanusDNA | | Qihao-Duan | unknown | 6 | arXiv |
| Jamba-DNA | CausalLM | Jamba | | RaphaelMourad | 114M | 1 | GitHub |
| Mistral-DNA | CausalLM | Mistral | | RaphaelMourad | 1M / 17M / 138M / 417M / 422M | 10 | [GitHub](https://github.com/raphaelmourad/ |
| ModernBert-DNA | MaskedLM | ModernBert | | RaphaelMourad | 37M | 3 | GitHub |
| megaDNA | CausalLM | MEGADNA | | lingxusb | 78M / 145M / 277M | 3 | arXiv |
| MutBERT | MaskedLM | RoPEBert | | JadenLong | 86M | 3 | bioRxiv |
| OmniNA | CausalLM | Llama | | XLS | 66M / 220M | 2 | bioRxiv |
| Omni-DNA | CausalLM | OLMoModel | | zehui127 | 20M / 60M / 116M / 300M / 700M / 1B | 6 | arXiv |
| plant-genomic-jamba | CausalLM | StripedMamba | | suzuki-2001 | 50M | 1 | GitHub |
| ProkBERT | MaskedLM | MegatronBert | | neuralbioinfo | 21M / 25M / 27M | 3 | Frontiers in Microbiology |

Model Categories

By Architecture Type

Masked Language Models (MLM)

  • BERT-based: DNABERT, DNABERT-2, DNABERT-S, Plant DNABERT, GENA-LM, GROVER, MutBERT, ProkBERT, Plant DNAModernBert
  • ESM-based: Nucleotide Transformer, AgroNT, Plant Nucleotide Transformer
  • Caduceus-based: Caduceus-Ph, Caduceus-Ps, PlantCaduceus, PlantCAD2
  • Other: GENA-LM-BigBird, GPN, JanusDNA, LucaOne

Causal Language Models (CLM)

  • Llama-based: GENERator, OmniNA
  • Mistral-based: GenomeOcean, Mistral-DNA
  • Hyena-based: HyenaDNA, EVO-1, EVO-2
  • Other: Jamba-DNA, plant-genomic-jamba, Plant DNAGPT, Plant DNAGemma, Plant DNAMamba, Omni-DNA, megaDNA

By Model Size

| Size Category | Model Count | Examples |
|---|---|---|
| Small (<100M) | 15 | Caduceus-Ph, HyenaDNA variants, ModernBert-DNA |
| Medium (100M-1B) | 18 | DNABERT series, Plant models, GENA-LM |
| Large (1B-10B) | 8 | Nucleotide Transformer, EVO-1, GENERator |
| Extra Large (>10B) | 3 | EVO-2 (40B) |
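
The size categories above can be applied to the size labels used in the model table. The following is a minimal sketch, not part of the DNALLM API: `parse_size` and `size_category` are hypothetical helper names, and the handling of the boundaries (exactly 100M counts as Medium, exactly 1B as Large, exactly 10B as Large) is an assumption, since the table does not say which side of each boundary is inclusive.

```python
import re

# Multipliers for the suffixes used in the model table (e.g. "117M", "2.5B").
_UNITS = {"K": 1e3, "M": 1e6, "B": 1e9}


def parse_size(label: str) -> float:
    """Convert a size label such as '117M' or '2.5B' into a parameter count."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([KMB])\s*", label.upper())
    if match is None:
        raise ValueError(f"unrecognized size label: {label!r}")
    value, unit = match.groups()
    return float(value) * _UNITS[unit]


def size_category(label: str) -> str:
    """Map a size label to one of the four categories from the table.

    Boundary handling is an assumption: 100M -> Medium, 1B -> Large,
    10B -> Large.
    """
    n = parse_size(label)
    if n < 100e6:
        return "Small"
    if n < 1e9:
        return "Medium"
    if n <= 10e9:
        return "Large"
    return "Extra Large"


if __name__ == "__main__":
    for label in ["0.5M", "117M", "2.5B", "40B"]:
        print(label, "->", size_category(label))
```

Labels that are not numeric, such as the "unknown" size listed for JanusDNA, raise a `ValueError` rather than being placed in a category.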

By Source Platform

| Platform | Model Count | Examples |
|---|---|---|
| Hugging Face Hub | 25+ | Most models with direct integration |
| ModelScope | 10+ | Alternative source for some models |
| GitHub | 8 | Community-contributed models |
| Academic Journals | 15+ | Peer-reviewed publications |

Usage Guidelines

Fine-tuning Support

  • Natively Supported: 35 models with full fine-tuning capabilities, each based on the model's own implementation
  • Custom Supported: 3 models (LucaOne, megaDNA, JanusDNA) with fine-tuning capabilities based on a custom implementation for sequence classification
  • Not Supported: 2 models (EVO-1, EVO-2), which are inference only
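
The three support levels can be captured in a small lookup. This is an illustrative sketch only: `finetune_support` is a hypothetical helper, not a DNALLM function, and it assumes that any model not named in the custom or inference-only groups is natively supported. It does not check that the name actually appears in the model table.

```python
# Models named explicitly in the support list above.
CUSTOM_SUPPORTED = {"LucaOne", "megaDNA", "JanusDNA"}
INFERENCE_ONLY = {"EVO-1", "EVO-2"}


def finetune_support(model_name: str) -> str:
    """Return 'not supported', 'custom', or 'native' for a model name.

    Assumption: every model outside the two named groups is natively
    supported; unknown names are not validated against the model table.
    """
    if model_name in INFERENCE_ONLY:
        return "not supported"
    if model_name in CUSTOM_SUPPORTED:
        return "custom"
    return "native"


if __name__ == "__main__":
    for name in ["Plant DNABERT", "LucaOne", "EVO-2"]:
        print(f"{name}: {finetune_support(name)}")
```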

Model Selection Tips

  1. For Classification Tasks: Choose BERT-based models (DNABERT, Plant DNABERT)
  2. For Generation Tasks: Use CausalLM models (Plant DNAGPT, GenomeOcean)
  3. For Large-scale Analysis: Consider Nucleotide Transformer or EVO models
  4. For Plant-specific Tasks: Prefer Plant-prefixed models
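
The tips above can be read as a simple lookup from task type to candidate models. The sketch below is one possible encoding under stated assumptions: `suggest_models` and the task keys are hypothetical names chosen for illustration, "EVO models" is expanded to EVO-1 and EVO-2, and the plant preference is implemented by moving Plant-prefixed candidates to the front of the list.

```python
# Candidate models per task, taken from the selection tips above.
# The task keys are hypothetical labels, not DNALLM identifiers.
SUGGESTIONS = {
    "classification": ["DNABERT", "Plant DNABERT"],
    "generation": ["Plant DNAGPT", "GenomeOcean"],
    "large-scale": ["Nucleotide Transformer", "EVO-1", "EVO-2"],
}


def suggest_models(task: str, plant: bool = False) -> list[str]:
    """Return candidate model names for a task.

    When plant=True, Plant-prefixed models are listed first (tip 4).
    """
    if task not in SUGGESTIONS:
        raise ValueError(f"unknown task: {task!r}")
    candidates = list(SUGGESTIONS[task])
    if plant:
        plant_models = [m for m in candidates if m.startswith("Plant ")]
        other_models = [m for m in candidates if not m.startswith("Plant ")]
        return plant_models + other_models
    return candidates


if __name__ == "__main__":
    print(suggest_models("classification", plant=True))
    print(suggest_models("generation"))
```

These are starting points rather than rankings; the best choice for a given dataset still depends on sequence length, species, and available compute.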

Plant Models

The following models are specifically designed for plant genomics:

  • Plant DNABERT: BERT-based model for plant DNA sequence analysis
  • Plant DNAGPT: GPT-based model for plant DNA sequence generation
  • Plant Nucleotide Transformer: ESM-based model for plant genomics
  • Plant DNAGemma: Gemma-based model for plant DNA analysis
  • Plant DNAMamba: Mamba-based model for efficient plant sequence processing
  • Plant DNAModernBert: ModernBert-based model for plant genomics
  • PlantCaduceus: Caduceus-based model for plant sequence analysis

Performance Considerations

  • Small Models (<100M): Fast inference and low memory use, suitable for real-time applications
  • Medium Models (100M-1B): Good balance of predictive performance and speed
  • Large Models (>1B): Typically the strongest predictive performance, but slower inference and higher memory requirements
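
To make the size trade-off concrete, the memory needed just to hold a model's weights is roughly the parameter count multiplied by the bytes per parameter. The sketch below shows that arithmetic; `weight_memory_gb` is a hypothetical helper, and the estimate deliberately ignores activations, optimizer state, and any framework overhead, so real usage (especially during fine-tuning) will be higher.

```python
# Bytes used to store one parameter at common numeric precisions.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2, "int8": 1}


def weight_memory_gb(num_params: float, dtype: str = "float32") -> float:
    """Approximate memory for model weights only, in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


if __name__ == "__main__":
    # A 100M-parameter model in float32 versus a 40B-parameter model in float16.
    print(f"100M @ float32: {weight_memory_gb(100e6):.1f} GB")
    print(f"40B  @ float16: {weight_memory_gb(40e9, 'float16'):.1f} GB")
```

Under these assumptions a 100M-parameter model needs about 0.4 GB for its weights in float32, while a 40B-parameter model needs about 80 GB even in float16, which is why the largest models usually require multiple accelerators.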

Getting Started

To use any of these models with DNALLM:

from dnallm import load_model_and_tokenizer

# Load a supported model
model, tokenizer = load_model_and_tokenizer(
    "zhangtaolab/plant-dnabert-BPE",
    source="huggingface"
)

# For fine-tuning
from dnallm.finetune import DNATrainer
trainer = DNATrainer(model=model, tokenizer=tokenizer)

Contributing New Models

To add support for new DNA language models:

  1. Ensure the model is publicly available
  2. Test compatibility with DNALLM's architecture
  3. Submit a pull request with integration code
  4. Include proper documentation and examples

For detailed integration instructions, see the Development Guide.