Model Guides

This page links to the guides for each DNA language model architecture supported by DNALLM and summarizes how to choose and load a model.

Model Architecture Guides

Core Architectures

  • BERT Models: DNABERT, DNABERT-2, and BERT-based models for DNA sequence analysis
  • Caduceus Models: Caduceus-Ph, Caduceus-Ps, and PlantCaduceus models
  • ESM Models: Nucleotide Transformer and ESM-based models
  • Hyena Models: HyenaDNA and Hyena-based architectures
  • Llama Models: GENERator, OmniNA, and Llama-based models

Specialized Architectures

Model Resources

Selection and Troubleshooting

Quick Reference

By Task Type

| Task Type | Recommended Models | Guide |
|---|---|---|
| Classification | DNABERT, Plant DNABERT | BERT Models |
| Generation | Plant DNAGPT, GenomeOcean | Llama Models |
| Long Sequences | EVO-1, EVO-2 | EVO Models |
| Efficient Processing | DNAMamba, Mamba variants | Mamba Models |
| Plant-specific | Plant DNABERT, PlantCaduceus | Plant Models |

By Model Size

| Size Category | Examples | Use Case |
|---|---|---|
| Small (<100M) | Caduceus-Ph, HyenaDNA | Fast inference, real-time applications |
| Medium (100M-1B) | DNABERT, Plant models | Balanced performance and speed |
| Large (1B-10B) | Nucleotide Transformer, EVO-1 | High accuracy, complex tasks |
| Extra Large (>10B) | EVO-2 (40B) | State-of-the-art performance |
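The size categories above can be expressed as a small helper, which is handy when sorting a list of candidate models by footprint. This is a minimal sketch: the function name `size_category` is hypothetical and not part of DNALLM, and treating the upper bound of each range as inclusive (for example, exactly 1B counts as Medium) is an assumption, since the table does not say which side the boundaries fall on.

```python
def size_category(num_params: int) -> str:
    """Map a parameter count to the size categories listed in the table.

    Assumption: the upper bound of each range is inclusive.
    """
    if num_params < 0:
        raise ValueError("num_params must be non-negative")
    if num_params < 100_000_000:        # Small (<100M)
        return "Small"
    if num_params <= 1_000_000_000:     # Medium (100M-1B)
        return "Medium"
    if num_params <= 10_000_000_000:    # Large (1B-10B)
        return "Large"
    return "Extra Large"                # Extra Large (>10B)


if __name__ == "__main__":
    # EVO-2 is listed at 40B parameters in the table.
    print(size_category(40_000_000_000))
```

If the loaded model is a PyTorch module (an assumption about what the loader returns), its parameter count can be obtained with `sum(p.numel() for p in model.parameters())` and passed to this helper.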

Getting Started

Basic Model Loading

from dnallm import load_model_and_tokenizer

# Load a DNA-specific model
model, tokenizer = load_model_and_tokenizer(
    "zhangtaolab/plant-dnabert-BPE",
    source="huggingface"
)
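Loading a large model is usually the slowest step in a script, so it can help to avoid loading the same model twice. The sketch below wraps any loader that follows the call shape shown above (a model name plus a `source` keyword) in a simple in-memory cache. The names `make_cached_loader` and `load_cached` are hypothetical helpers for illustration, not part of DNALLM.

```python
from typing import Any, Callable, Dict, Tuple


def make_cached_loader(
    loader: Callable[..., Tuple[Any, Any]],
) -> Callable[..., Tuple[Any, Any]]:
    """Return a wrapper that calls `loader` once per (model_name, source) pair."""
    cache: Dict[Tuple[str, str], Tuple[Any, Any]] = {}

    def cached(model_name: str, source: str = "huggingface") -> Tuple[Any, Any]:
        key = (model_name, source)
        if key not in cache:
            # First request for this model: delegate to the real loader.
            cache[key] = loader(model_name, source=source)
        return cache[key]

    return cached


# Usage (requires dnallm to be installed):
# from dnallm import load_model_and_tokenizer
# load_cached = make_cached_loader(load_model_and_tokenizer)
# model, tokenizer = load_cached("zhangtaolab/plant-dnabert-BPE")
```

The cache holds references to every loaded model, so in memory-constrained settings it may be preferable to load models one at a time instead.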

Model Selection Tips

  1. For Classification Tasks: Choose BERT-based models (DNABERT, Plant DNABERT)
  2. For Generation Tasks: Use CausalLM models (Plant DNAGPT, GenomeOcean)
  3. For Large-scale Analysis: Consider Nucleotide Transformer or EVO models
  4. For Plant-specific Tasks: Prefer Plant-prefixed models
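The tips above can be sketched as a small lookup function. This is an illustrative sketch only: `suggest_models`, the task keys, and the fallback behavior when no plant-specific model exists for a task are all assumptions made for this example, and the returned strings are the display names used on this page, not Hugging Face model identifiers.

```python
from typing import Dict, List

# Model families per task, taken from the selection tips on this page.
_TASK_TO_MODELS: Dict[str, List[str]] = {
    "classification": ["DNABERT", "Plant DNABERT"],
    "generation": ["Plant DNAGPT", "GenomeOcean"],
    "large-scale analysis": ["Nucleotide Transformer", "EVO"],
}


def suggest_models(task: str, plant_specific: bool = False) -> List[str]:
    """Return candidate model names for a task, preferring Plant-prefixed ones."""
    key = task.strip().lower()
    if key not in _TASK_TO_MODELS:
        raise ValueError(f"Unknown task: {task!r}")
    candidates = _TASK_TO_MODELS[key]
    if plant_specific:
        plant_models = [name for name in candidates if name.startswith("Plant")]
        # Assumption: fall back to the full list if no Plant-prefixed model fits.
        return plant_models or list(candidates)
    return list(candidates)


if __name__ == "__main__":
    print(suggest_models("classification", plant_specific=True))
```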

For details on a specific model architecture and its usage, see the individual model guides in the Resources section.