Model Zoo¶
DNALLM includes almost all publicly available DNA Large Language Models and some DNA-based deep learning models. We have adapted these models to work seamlessly with the DNALLM package for fine-tuning and inference.
Model Collection¶
The following table shows all currently supported models and their fine-tuning/inference capabilities:
Model Name | Author | Model Type | Architecture | Model Size | Count | Source | Fine-tuning Support |
---|---|---|---|---|---|---|---|
Nucleotide Transformer | InstaDeepAI | MaskedLM | ESM | 50M / 100M / 250M / 500M / 2.5B | 8 | Nature Methods | ✅ |
AgroNT | InstaDeepAI | MaskedLM | ESM | 1B | 1 | Current Biology | ✅ |
Caduceus-Ph | Kuleshov-Group | MaskedLM | Caduceus | 0.5M / 2M / 8M | 3 | arXiv | ✅ |
Caduceus-Ps | Kuleshov-Group | MaskedLM | Caduceus | 0.5M / 2M / 8M | 3 | arXiv | ✅ |
PlantCaduceus | Kuleshov-Group | MaskedLM | Caduceus | 20M / 40M / 112M / 225M | 4 | bioRxiv | ✅ |
DNABERT | Zhihan1996 | MaskedLM | BERT | 100M | 4 | Bioinformatics | ✅ |
DNABERT-2 | Zhihan1996 | MaskedLM | BERT | 117M | 1 | arXiv | ✅ |
DNABERT-S | Zhihan1996 | MaskedLM | BERT | 117M | 1 | arXiv | ✅ |
GENA-LM | AIRI-Institute | MaskedLM | BERT | 150M / 500M | 7 | Nucleic Acids Research | ✅ |
GENA-LM-BigBird | AIRI-Institute | MaskedLM | BigBird | 150M | 3 | Nucleic Acids Research | ✅ |
GENERator | GenerTeam | CausalLM | Llama | 0.5B / 1.2B / 3B | 4 | arXiv | ✅ |
GenomeOcean | pGenomeOcean | CausalLM | Mistral | 100M / 500M / 4B | 3 | bioRxiv | ✅ |
GPN | songlab | MaskedLM | ConvNet | 60M | 1 | PNAS | ❌ |
GROVER | PoetschLab | MaskedLM | BERT | 100M | 1 | Nature Machine Intelligence | ✅ |
HyenaDNA | LongSafari | CausalLM | HyenaDNA | 0.5M / 0.7M / 2M / 4M / 15M / 30M / 55M | 7 | arXiv | ✅ |
Jamba-DNA | RaphaelMourad | CausalLM | Jamba | 114M | 1 | GitHub | ✅ |
Mistral-DNA | RaphaelMourad | CausalLM | Mistral | 1M / 17M / 138M / 417M / 422M | 10 | GitHub | ✅ |
ModernBert-DNA | RaphaelMourad | MaskedLM | ModernBert | 37M | 3 | GitHub | ✅ |
MutBERT | JadenLong | MaskedLM | RoPEBert | 86M | 3 | bioRxiv | ✅ |
OmniNA | XLS | CausalLM | Llama | 66M / 220M | 2 | bioRxiv | ✅ |
Omni-DNA | zehui127 | CausalLM | OLMoModel | 20M / 60M / 116M / 300M / 700M / 1B | 6 | arXiv | ❌ |
EVO-1 | togethercomputer | CausalLM | StripedHyena | 6.5B | 2 | GitHub | ❌ |
EVO-2 | arcinstitute | CausalLM | StripedHyena2 | 1B / 7B / 40B | 3 | GitHub | ❌ |
ProkBERT | neuralbioinfo | MaskedLM | MegatronBert | 21M / 25M / 27M | 3 | Frontiers in Microbiology | ✅ |
Plant DNABERT | zhangtaolab | MaskedLM | BERT | 100M | 1 | Molecular Plant | ✅ |
Plant DNAGPT | zhangtaolab | CausalLM | GPT2 | 100M | 1 | Molecular Plant | ✅ |
Plant Nucleotide Transformer | zhangtaolab | MaskedLM | ESM | 100M | 1 | Molecular Plant | ✅ |
Plant DNAGemma | zhangtaolab | CausalLM | Gemma | 150M | 1 | Molecular Plant | ✅ |
Plant DNAMamba | zhangtaolab | CausalLM | Mamba | 100M | 1 | Molecular Plant | ✅ |
Plant DNAModernBert | zhangtaolab | MaskedLM | ModernBert | 100M | 1 | Molecular Plant | ✅ |
Model Categories¶
By Architecture Type¶
Masked Language Models (MLM)¶
- BERT-based: DNABERT, DNABERT-2, DNABERT-S, Plant DNABERT, GENA-LM, GROVER, MutBERT, ProkBERT, ModernBert-DNA, Plant DNAModernBert
- ESM-based: Nucleotide Transformer, AgroNT, Plant Nucleotide Transformer
- Caduceus-based: Caduceus-Ph, Caduceus-Ps, PlantCaduceus
- Other: GENA-LM-BigBird, GPN
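For illustration, a masked model from this group can be queried for masked-token predictions through the plain Hugging Face transformers API (not DNALLM's own wrapper). This is a minimal sketch: the checkpoint is the one used in the Getting Started example below, and it assumes the tokenizer defines a standard mask token.

```python
# Minimal sketch: masked-token prediction with a BERT-style DNA model.
# Assumes the checkpoint exposes a standard MaskedLM head and mask token.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_id = "zhangtaolab/plant-dnabert-BPE"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained(model_id, trust_remote_code=True)

# Mask one position in the middle of a short DNA sequence
seq = "ATGGCGTACG" + tokenizer.mask_token + "TTGACCTAGA"
inputs = tokenizer(seq, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Decode the most likely token at the masked position
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
print(tokenizer.decode(logits[0, mask_pos].argmax(dim=-1)))
```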
Causal Language Models (CLM)¶
- Llama-based: GENERator, OmniNA
- Mistral-based: GenomeOcean, Mistral-DNA
- Hyena-based: HyenaDNA, EVO-1, EVO-2
- Other: Jamba-DNA, Plant DNAGPT, Plant DNAGemma, Plant DNAMamba, Omni-DNA
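Causal models, by contrast, are sampled left to right, which makes them the natural choice for sequence generation. The sketch below uses the plain transformers generate API; the checkpoint name is a placeholder and should be replaced with any CausalLM entry from the table above.

```python
# Minimal sketch: autoregressive DNA sequence generation with a CausalLM model.
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "zhangtaolab/plant-dnagpt-BPE"  # placeholder id; pick any CausalLM model
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

prompt = "ATGGCGTACGCTTGACCTAGA"
inputs = tokenizer(prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=64, do_sample=True, temperature=0.8)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```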
By Model Size¶
Size Category | Model Count | Examples |
---|---|---|
Small (<100M) | 15 | Caduceus-Ph, HyenaDNA variants, ModernBert-DNA |
Medium (100M-1B) | 18 | DNABERT series, Plant models, GENA-LM |
Large (1B-10B) | 8 | Nucleotide Transformer, EVO-1, GENERator |
Extra Large (>10B) | 3 | EVO-2 (40B) |
By Source Platform¶
Platform | Model Count | Examples |
---|---|---|
Hugging Face Hub | 25+ | Most models with direct integration |
ModelScope | 10+ | Alternative source for some models |
GitHub | 8 | Community-contributed models |
Academic Journals | 15+ | Peer-reviewed publications |
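When a checkpoint is mirrored on ModelScope, loading presumably only requires changing the source argument of the loader shown in Getting Started. Treat this as an assumption: only source="huggingface" appears on this page, so the "modelscope" value below is illustrative.

```python
# Sketch: loading from an alternative platform.
# source="modelscope" is assumed here; source="huggingface" is the documented value.
from dnallm import load_model_and_tokenizer

model, tokenizer = load_model_and_tokenizer(
    "zhangtaolab/plant-dnabert-BPE",
    source="modelscope",
)
```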
Usage Guidelines¶
Fine-tuning Support¶
- ✅ Supported: 26 of the 30 model families listed above offer full fine-tuning support
- ❌ Not Supported: 4 model families (GPN, Omni-DNA, EVO-1, EVO-2) - inference only
Model Selection Tips¶
- For Classification Tasks: Choose BERT-based models (DNABERT, Plant DNABERT); a minimal sketch follows this list
- For Generation Tasks: Use CausalLM models (Plant DNAGPT, GenomeOcean)
- For Large-scale Analysis: Consider Nucleotide Transformer or EVO models
- For Plant-specific Tasks: Prefer Plant-prefixed models
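As referenced in the classification tip above, a classification workflow attaches a sequence-classification head on top of a masked model. The sketch below uses the plain transformers API rather than DNALLM's DNATrainer, and the two-label setup and checkpoint are purely illustrative.

```python
# Sketch: scoring DNA sequences with a classification head on a BERT-style model.
# The label count (2) and checkpoint are illustrative only.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "zhangtaolab/plant-dnabert-BPE"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForSequenceClassification.from_pretrained(
    model_id, num_labels=2, trust_remote_code=True
)

batch = tokenizer(["ATGGCGTACG", "TTGACCTAGA"], return_tensors="pt", padding=True)
with torch.no_grad():
    probs = model(**batch).logits.softmax(dim=-1)
print(probs)  # per-class probabilities for each sequence
```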
Performance Considerations¶
- Small Models (<100M): Fast inference, suitable for real-time applications
- Medium Models (100M-1B): Good balance of performance and speed
- Large Models (>1B): Best performance but slower inference
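A rough way to see where a given checkpoint falls on this speed spectrum is to time a single forward pass, as in the sketch below (the checkpoint and batch size are illustrative; a real benchmark should warm up the device and average over many batches).

```python
# Sketch: rough single-batch latency check for a DNA model.
import time
import torch
from transformers import AutoTokenizer, AutoModel

model_id = "zhangtaolab/plant-dnabert-BPE"  # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(model_id, trust_remote_code=True).eval()

batch = tokenizer(["ATGGCGTACGCTTGACCTAGA"] * 8, return_tensors="pt", padding=True)
with torch.no_grad():
    start = time.perf_counter()
    model(**batch)
print(f"forward pass: {time.perf_counter() - start:.3f}s")
```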
Getting Started¶
To use any of these models with DNALLM:
```python
from dnallm import load_model_and_tokenizer

# Load a supported model
model, tokenizer = load_model_and_tokenizer(
    "zhangtaolab/plant-dnabert-BPE",
    source="huggingface"
)

# For fine-tuning
from dnallm.finetune import DNATrainer

trainer = DNATrainer(model=model, tokenizer=tokenizer)
```
Contributing New Models¶
To add support for new DNA language models:
- Ensure the model is publicly available
- Test compatibility with DNALLM's architecture
- Submit a pull request with integration code
- Include proper documentation and examples
For detailed integration instructions, see the Development Guide.