# Model Zoo
DNALLM includes most publicly available DNA large language models, as well as several DNA-based deep learning models. These models have been adapted to work seamlessly with the DNALLM package for fine-tuning and inference.
## Model Collection
The following table shows all currently supported models and their fine-tuning/inference capabilities:
| Model Name | Model Type | Architecture | Fine-tuning Support | Author | Model Size | Variants | Source |
|---|---|---|---|---|---|---|---|
| Plant DNABERT | MaskedLM | BERT | ✅ | zhangtaolab | 100M | 1 | Molecular Plant |
| Plant DNAGPT | CausalLM | GPT2 | ✅ | zhangtaolab | 100M | 1 | Molecular Plant |
| Plant Nucleotide Transformer | MaskedLM | ESM | ✅ | zhangtaolab | 100M | 1 | Molecular Plant |
| Plant DNAGemma | CausalLM | Gemma | ✅ | zhangtaolab | 150M | 1 | Molecular Plant |
| Plant DNAMamba | CausalLM | Mamba | ✅ | zhangtaolab | 100M | 1 | Molecular Plant |
| Plant DNAModernBert | MaskedLM | ModernBert | ✅ | zhangtaolab | 100M | 1 | Molecular Plant |
| Nucleotide Transformer | MaskedLM | ESM | ✅ | InstaDeepAI | 50M / 100M / 250M / 500M / 2.5B | 8 | Nature Methods |
| AgroNT | MaskedLM | ESM | ✅ | InstaDeepAI | 1B | 1 | Current Biology |
| Caduceus-Ph | MaskedLM | Caduceus | ✅ | Kuleshov-Group | 0.5M / 2M / 8M | 3 | arXiv |
| Caduceus-Ps | MaskedLM | Caduceus | ✅ | Kuleshov-Group | 0.5M / 2M / 8M | 3 | arXiv |
| PlantCaduceus | MaskedLM | Caduceus | ✅ | Kuleshov-Group | 20M / 40M / 112M / 225M | 4 | PNAS |
| PlantCAD2 | MaskedLM | Caduceus | ✅ | Kuleshov-Group | 88M / 311M / 694M | 3 | bioRxiv |
| DNABERT | MaskedLM | BERT | ✅ | Zhihan1996 | 100M | 4 | Bioinformatics |
| DNABERT-2 | MaskedLM | BERT | ✅ | Zhihan1996 | 117M | 1 | arXiv |
| DNABERT-S | MaskedLM | BERT | ✅ | Zhihan1996 | 117M | 1 | arXiv |
| EVO-1 | CausalLM | StripedHyena | ❌ | togethercomputer | 6.5B | 2 | Science |
| EVO-2 | CausalLM | StripedHyena2 | ❌ | arcinstitute | 1B / 1.5B / 7B / 40B | 4 | bioRxiv |
| GENA-LM | MaskedLM | BERT | ✅ | AIRI-Institute | 150M / 500M | 7 | Nucleic Acids Research |
| GENA-LM-BigBird | MaskedLM | BigBird | ✅ | AIRI-Institute | 150M | 3 | Nucleic Acids Research |
| GENERator | CausalLM | Llama | ✅ | GenerTeam | 1.2B / 3B | 2 | arXiv |
| GENERanno | CausalLM | Generanno | ✅ | GenerTeam | 0.5B | 2 | bioRxiv |
| GenomeOcean | CausalLM | Mistral | ✅ | DOEJGI | 100M / 500M / 4B | 3 | bioRxiv |
| GPN | MaskedLM | ConvNet | ✅ | songlab | 60M | 1 | PNAS |
| GROVER | MaskedLM | BERT | ✅ | PoetschLab | 100M | 1 | Nature Machine Intelligence |
| HyenaDNA | CausalLM | HyenaDNA | ✅ | LongSafari | 0.5M / 0.7M / 2M / 4M / 15M / 30M / 55M | 7 | arXiv |
| LucaOne | MaskedLM | LucaGPLM | ⭕ | LucaGroup | 5.6M / 17.6M / 36M | 3 | Nature Machine Intelligence |
| JanusDNA | MaskedLM | JanusDNA | ⭕ | Qihao-Duan | unknown | 6 | arXiv |
| Jamba-DNA | CausalLM | Jamba | ✅ | RaphaelMourad | 114M | 1 | GitHub |
| Mistral-DNA | CausalLM | Mistral | ✅ | RaphaelMourad | 1M / 17M / 138M / 417M / 422M | 10 | GitHub |
| ModernBert-DNA | MaskedLM | ModernBert | ✅ | RaphaelMourad | 37M | 3 | GitHub |
| megaDNA | CausalLM | MEGADNA | ⭕ | lingxusb | 78M / 145M / 277M | 3 | arXiv |
| MutBERT | MaskedLM | RoPEBert | ✅ | JadenLong | 86M | 3 | bioRxiv |
| OmniNA | CausalLM | Llama | ✅ | XLS | 66M / 220M | 2 | bioRxiv |
| Omni-DNA | CausalLM | OLMoModel | ✅ | zehui127 | 20M / 60M / 116M / 300M / 700M / 1B | 6 | arXiv |
| plant-genomic-jamba | CausalLM | StripedMamba | ✅ | suzuki-2001 | 50M | 1 | GitHub |
| ProkBERT | MaskedLM | MegatronBert | ✅ | neuralbioinfo | 21M / 25M / 27M | 3 | Frontiers in Microbiology |
## Model Categories
### By Architecture Type
#### Masked Language Models (MLM)
- BERT-based: DNABERT, DNABERT-2, DNABERT-S, Plant DNABERT, GENA-LM, GROVER, MutBERT, ProkBERT, Plant DNAModernBert
- ESM-based: Nucleotide Transformer, AgroNT, Plant Nucleotide Transformer
- Caduceus-based: Caduceus-Ph, Caduceus-Ps, PlantCaduceus, PlantCAD2
- Other: GENA-LM-BigBird, GPN, JanusDNA, LucaOne
#### Causal Language Models (CLM)
- Llama-based: GENERator, OmniNA
- Mistral-based: GenomeOcean, Mistral-DNA
- Hyena-based: HyenaDNA, EVO-1, EVO-2
- Other: Jamba-DNA, plant-genomic-jamba, Plant DNAGPT, Plant DNAGemma, Plant DNAMamba, Omni-DNA, megaDNA
### By Model Size
| Size Category | Model Count | Examples |
|---|---|---|
| Small (<100M) | 15 | Caduceus-Ph, HyenaDNA variants, ModernBert-DNA |
| Medium (100M-1B) | 18 | DNABERT series, Plant models, GENA-LM |
| Large (1B-10B) | 8 | Nucleotide Transformer, EVO-1, GENERator |
| Extra Large (>10B) | 3 | EVO-2 (40B) |
### By Source Platform
| Platform | Model Count | Examples |
|---|---|---|
| Hugging Face Hub | 25+ | Most models with direct integration |
| ModelScope | 10+ | Alternative source for some models |
| GitHub | 8 | Community-contributed models |
| Academic Journals | 15+ | Peer-reviewed publications |
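Models mirrored on ModelScope can in principle be fetched by changing the `source` argument of `load_model_and_tokenizer` (the same argument shown in the Getting Started section below). The following is a minimal sketch: the `"modelscope"` value and the availability of this particular repository on ModelScope are assumptions, so check the platform listing for the exact name.

```python
from dnallm import load_model_and_tokenizer

# Minimal sketch: load the same model from an alternative platform.
# Assumption: "modelscope" is an accepted `source` value and
# "zhangtaolab/plant-dnabert-BPE" is mirrored there; verify before use.
model, tokenizer = load_model_and_tokenizer(
    "zhangtaolab/plant-dnabert-BPE",
    source="modelscope"
)
```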
## Usage Guidelines
### Fine-tuning Support
- ✅ Native Support: 35 models with full fine-tuning capabilities via their own model implementations
- ⭕ Custom Support: 3 models (LucaOne, megaDNA, JanusDNA) with fine-tuning enabled by a custom implementation for sequence classification
- ❌ Not Supported: 2 models (EVO-1, EVO-2), inference only
### Model Selection Tips
- For Classification Tasks: Choose BERT-based models (DNABERT, Plant DNABERT)
- For Generation Tasks: Use CausalLM models (Plant DNAGPT, GenomeOcean)
- For Large-scale Analysis: Consider Nucleotide Transformer or EVO models
- For Plant-specific Tasks: Prefer Plant-prefixed models
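For example, a classification workflow would typically load a masked language model, while a generation workflow would load a causal language model. The sketch below uses `load_model_and_tokenizer` as in the Getting Started section; the Plant DNAGPT repository ID is an assumption based on the naming pattern of the Plant DNABERT ID, so substitute the actual name if it differs.

```python
from dnallm import load_model_and_tokenizer

# Classification: a BERT-based masked LM (Plant DNABERT).
cls_model, cls_tokenizer = load_model_and_tokenizer(
    "zhangtaolab/plant-dnabert-BPE",
    source="huggingface"
)

# Generation: a causal LM (Plant DNAGPT).
# Assumption: the repository ID follows the same "-BPE" naming pattern as Plant DNABERT.
gen_model, gen_tokenizer = load_model_and_tokenizer(
    "zhangtaolab/plant-dnagpt-BPE",
    source="huggingface"
)
```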
## Plant Models
The following models are specifically designed for plant genomics:
- Plant DNABERT: BERT-based model for plant DNA sequence analysis
- Plant DNAGPT: GPT-based model for plant DNA sequence generation
- Plant Nucleotide Transformer: ESM-based model for plant genomics
- Plant DNAGemma: Gemma-based model for plant DNA analysis
- Plant DNAMamba: Mamba-based model for efficient plant sequence processing
- Plant DNAModernBert: ModernBert-based model for plant genomics
- PlantCaduceus: Caduceus-based model for plant sequence analysis
## Performance Considerations
- Small Models (<100M): Fast inference, suitable for real-time applications
- Medium Models (100M-1B): Good balance of performance and speed
- Large Models (>1B): Best performance but slower inference
## Getting Started
To use any of these models with DNALLM:
```python
from dnallm import load_model_and_tokenizer

# Load a supported model
model, tokenizer = load_model_and_tokenizer(
    "zhangtaolab/plant-dnabert-BPE",
    source="huggingface"
)

# For fine-tuning
from dnallm.finetune import DNATrainer

trainer = DNATrainer(model=model, tokenizer=tokenizer)
```
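Once loaded, most models in the zoo behave like standard Hugging Face models, so a quick forward pass can be run with the usual tokenizer and model calls. The sketch below continues the example above and assumes the returned objects expose the standard `transformers` interface; the exact output fields depend on the model head.

```python
import torch

# Rough inference sketch, continuing from the `model` and `tokenizer` loaded above.
# Assumption: both objects follow the standard Hugging Face `transformers` API.
sequence = "ATGCATGCATGCATGCATGC"
inputs = tokenizer(sequence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# For a MaskedLM head, logits typically have shape (batch, sequence_length, vocab_size).
print(outputs.logits.shape)
```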
## Contributing New Models
To add support for new DNA language models:
- Ensure the model is publicly available
- Test compatibility with DNALLM's architecture
- Submit a pull request with integration code
- Include proper documentation and examples
For detailed integration instructions, see the Development Guide.