Model Zoo¶
DNALLM includes almost all publicly available DNA Large Language Models and some DNA-based deep learning models. We have adapted these models to work seamlessly with the DNALLM package for fine-tuning and inference.
Model Collection¶
The following table shows all currently supported models and their fine-tuning/inference capabilities:
Model Name | Author | Model Type | Architecture | Model Size | Count | Source | Fine-tuning Support |
---|---|---|---|---|---|---|---|
Nucleotide Transformer | InstaDeepAI | MaskedLM | ESM | 50M / 100M / 250M / 500M / 2.5B | 8 | Nature Methods | ✅ |
AgroNT | InstaDeepAI | MaskedLM | ESM | 1B | 1 | Current Biology | ✅ |
Caduceus-Ph | Kuleshov-Group | MaskedLM | Caduceus | 0.5M / 2M / 8M | 3 | arXiv | ✅ |
Caduceus-Ps | Kuleshov-Group | MaskedLM | Caduceus | 0.5M / 2M / 8M | 3 | arXiv | ✅ |
PlantCaduceus | Kuleshov-Group | MaskedLM | Caduceus | 20M / 40M / 112M / 225M | 4 | bioRxiv | ✅ |
DNABERT | Zhihan1996 | MaskedLM | BERT | 100M | 4 | Bioinformatics | ✅ |
DNABERT-2 | Zhihan1996 | MaskedLM | BERT | 117M | 1 | arXiv | ✅ |
DNABERT-S | Zhihan1996 | MaskedLM | BERT | 117M | 1 | arXiv | ✅ |
GENA-LM | AIRI-Institute | MaskedLM | BERT | 150M / 500M | 7 | Nucleic Acids Research | ✅ |
GENA-LM-BigBird | AIRI-Institute | MaskedLM | BigBird | 150M | 3 | Nucleic Acids Research | ✅ |
GENERator | GenerTeam | CausalLM | Llama | 0.5B / 1.2B / 3B | 4 | arXiv | ✅ |
GenomeOcean | pGenomeOcean | CausalLM | Mistral | 100M / 500M / 4B | 3 | bioRxiv | ✅ |
GPN | songlab | MaskedLM | ConvNet | 60M | 1 | PNAS | ❌ |
GROVER | PoetschLab | MaskedLM | BERT | 100M | 1 | Nature Machine Intelligence | ✅ |
HyenaDNA | LongSafari | CausalLM | HyenaDNA | 0.5M / 0.7M / 2M / 4M / 15M / 30M / 55M | 7 | arXiv | ✅ |
Jamba-DNA | RaphaelMourad | CausalLM | Jamba | 114M | 1 | GitHub | ✅ |
Mistral-DNA | RaphaelMourad | CausalLM | Mistral | 1M / 17M / 138M / 417M / 422M | 10 | GitHub | ✅ |
ModernBert-DNA | RaphaelMourad | MaskedLM | ModernBert | 37M | 3 | GitHub | ✅ |
MutBERT | JadenLong | MaskedLM | RoPEBert | 86M | 3 | bioRxiv | ✅ |
OmniNA | XLS | CausalLM | Llama | 66M / 220M | 2 | bioRxiv | ✅ |
Omni-DNA | zehui127 | CausalLM | OLMoModel | 20M / 60M / 116M / 300M / 700M / 1B | 6 | arXiv | ❌ |
EVO-1 | togethercomputer | CausalLM | StripedHyena | 6.5B | 2 | GitHub | ❌ |
EVO-2 | arcinstitute | CausalLM | StripedHyena2 | 1B / 7B / 40B | 3 | GitHub | ❌ |
ProkBERT | neuralbioinfo | MaskedLM | MegatronBert | 21M / 25M / 27M | 3 | Frontiers in Microbiology | ✅ |
Plant DNABERT | zhangtaolab | MaskedLM | BERT | 100M | 1 | Molecular Plant | ✅ |
Plant DNAGPT | zhangtaolab | CausalLM | GPT2 | 100M | 1 | Molecular Plant | ✅ |
Plant Nucleotide Transformer | zhangtaolab | MaskedLM | ESM | 100M | 1 | Molecular Plant | ✅ |
Plant DNAGemma | zhangtaolab | CausalLM | Gemma | 150M | 1 | Molecular Plant | ✅ |
Plant DNAMamba | zhangtaolab | CausalLM | Mamba | 100M | 1 | Molecular Plant | ✅ |
Plant DNAModernBert | zhangtaolab | MaskedLM | ModernBert | 100M | 1 | Molecular Plant | ✅ |
Model Categories¶
By Architecture Type¶
Masked Language Models (MLM)¶
- BERT-based: DNABERT, DNABERT-2, DNABERT-S, Plant DNABERT, GENA-LM, GROVER, MutBERT, ProkBERT, ModernBert-DNA, Plant DNAModernBert
- ESM-based: Nucleotide Transformer, AgroNT, Plant Nucleotide Transformer
- Caduceus-based: Caduceus-Ph, Caduceus-Ps, PlantCaduceus
- Other: GENA-LM-BigBird, GPN
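For illustration, a masked model from this group can be queried for masked-token predictions through the plain Hugging Face transformers API (not DNALLM's own wrapper). This is a minimal sketch: the checkpoint is the one used in the Getting Started example below, and it assumes the tokenizer defines a standard mask token.

```python
# Minimal sketch: masked-token prediction with a BERT-style DNA model.
# Assumes the checkpoint exposes a standard MaskedLM head and mask token.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_id = "zhangtaolab/plant-dnabert-BPE"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained(model_id, trust_remote_code=True)

# Mask one position in the middle of a short DNA sequence
seq = "ATGGCGTACG" + tokenizer.mask_token + "TTGACCTAGA"
inputs = tokenizer(seq, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Decode the most likely token at the masked position
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
print(tokenizer.decode(logits[0, mask_pos].argmax(dim=-1)))
```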
Causal Language Models (CLM)¶
- Llama-based: GENERator, OmniNA
- Mistral-based: GenomeOcean, Mistral-DNA
- Hyena-based: HyenaDNA, EVO-1, EVO-2
- Other: Jamba-DNA, Plant DNAGPT, Plant DNAGemma, Plant DNAMamba, Omni-DNA
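Causal models, by contrast, are sampled left to right, which makes them the natural choice for sequence generation. The sketch below uses the plain transformers generate API; the checkpoint name is a placeholder and should be replaced with any CausalLM entry from the table above.

```python
# Minimal sketch: autoregressive DNA sequence generation with a CausalLM model.
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "zhangtaolab/plant-dnagpt-BPE"  # placeholder id; pick any CausalLM model
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

prompt = "ATGGCGTACGCTTGACCTAGA"
inputs = tokenizer(prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=64, do_sample=True, temperature=0.8)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```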
By Model Size¶
Size Category | Model Count | Examples |
---|---|---|
Small (<100M) | 15 | Caduceus-Ph, HyenaDNA variants, ModernBert-DNA |
Medium (100M-1B) | 18 | DNABERT series, Plant models, GENA-LM |
Large (1B-10B) | 8 | Nucleotide Transformer, EVO-1, GENERator |
Extra Large (>10B) | 3 | EVO-2 (40B) |
By Source Platform¶
Platform | Model Count | Examples |
---|---|---|
Hugging Face Hub | 25+ | Most models with direct integration |
ModelScope | 10+ | Alternative source for some models |
GitHub | 8 | Community-contributed models |
Academic Journals | 15+ | Peer-reviewed publications |
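When a checkpoint is mirrored on ModelScope, loading presumably only requires changing the source argument of the loader shown in Getting Started. Treat this as an assumption: only source="huggingface" appears on this page, so the "modelscope" value below is illustrative.

```python
# Sketch: loading from an alternative platform.
# source="modelscope" is assumed here; source="huggingface" is the documented value.
from dnallm import load_model_and_tokenizer

model, tokenizer = load_model_and_tokenizer(
    "zhangtaolab/plant-dnabert-BPE",
    source="modelscope",
)
```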
Usage Guidelines¶
Fine-tuning Support¶
- ✅ Supported: 26 of the 30 model families listed above offer full fine-tuning support
- ❌ Not Supported: 4 model families (GPN, Omni-DNA, EVO-1, EVO-2) - inference only
Model Selection Tips¶
- For Classification Tasks: Choose BERT-based models (DNABERT, Plant DNABERT); a minimal sketch follows this list
- For Generation Tasks: Use CausalLM models (Plant DNAGPT, GenomeOcean)
- For Large-scale Analysis: Consider Nucleotide Transformer or EVO models
- For Plant-specific Tasks: Prefer Plant-prefixed models
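As referenced in the classification tip above, a classification workflow attaches a sequence-classification head on top of a masked model. The sketch below uses the plain transformers API rather than DNALLM's DNATrainer, and the two-label setup and checkpoint are purely illustrative.

```python
# Sketch: scoring DNA sequences with a classification head on a BERT-style model.
# The label count (2) and checkpoint are illustrative only.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "zhangtaolab/plant-dnabert-BPE"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForSequenceClassification.from_pretrained(
    model_id, num_labels=2, trust_remote_code=True
)

batch = tokenizer(["ATGGCGTACG", "TTGACCTAGA"], return_tensors="pt", padding=True)
with torch.no_grad():
    probs = model(**batch).logits.softmax(dim=-1)
print(probs)  # per-class probabilities for each sequence
```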
Performance Considerations¶
- Small Models (<100M): Fast inference, suitable for real-time applications
- Medium Models (100M-1B): Good balance of performance and speed
- Large Models (>1B): Best performance but slower inference
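A rough way to see where a given checkpoint falls on this speed spectrum is to time a single forward pass, as in the sketch below (the checkpoint and batch size are illustrative; a real benchmark should warm up the device and average over many batches).

```python
# Sketch: rough single-batch latency check for a DNA model.
import time
import torch
from transformers import AutoTokenizer, AutoModel

model_id = "zhangtaolab/plant-dnabert-BPE"  # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(model_id, trust_remote_code=True).eval()

batch = tokenizer(["ATGGCGTACGCTTGACCTAGA"] * 8, return_tensors="pt", padding=True)
with torch.no_grad():
    start = time.perf_counter()
    model(**batch)
print(f"forward pass: {time.perf_counter() - start:.3f}s")
```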
Getting Started¶
To use any of these models with DNALLM:
```python
from dnallm import load_model_and_tokenizer

# Load a supported model
model, tokenizer = load_model_and_tokenizer(
    "zhangtaolab/plant-dnabert-BPE",
    source="huggingface"
)

# For fine-tuning
from dnallm.finetune import DNATrainer

trainer = DNATrainer(model=model, tokenizer=tokenizer)
```
Contributing New Models¶
To add support for new DNA language models:
- Ensure the model is publicly available
- Test compatibility with DNALLM's architecture
- Submit a pull request with integration code
- Include proper documentation and examples
For detailed integration instructions, see the Development Guide.