DNALLM - DNA Large Language Model Toolkit

DNALLM Logo

DNALLM is a comprehensive, open-source toolkit for fine-tuning and running inference with DNA language models. It provides a unified interface for working with a wide range of DNA sequence models, supporting tasks from basic sequence classification to advanced in-silico mutagenesis analysis. Built-in Model Context Protocol (MCP) support enables seamless communication with general-purpose large language models, allowing enhanced integration and interoperability in AI-powered DNA analysis workflows.

🚀 Key Features

  • 🔄 Model Management: Load and switch between 150+ pre-trained DNA language models from Hugging Face and ModelScope
  • 🎯 Multi-Task Support: Binary/multi-class classification, regression, NER, MLM, and generation tasks
  • 📊 Benchmarking: Multi-model performance comparison and evaluation metrics
  • 🔧 Fine-tuning: Comprehensive training pipeline with configurable parameters
  • 📱 Interactive Interfaces: Jupyter notebooks and Marimo-based interactive demos
  • 🌐 MCP Support: Model Context Protocol for server/client deployment with real-time streaming
  • 🧬 Advanced Analysis: In-silico mutagenesis, saturation mutation analysis, and mutation effect visualization
  • 🧪 Comprehensive Testing: 200+ test cases covering all major functionality

🧬 Supported Models

DNALLM supports a wide range of DNA language models, including:

Masked Language Models (MLM)

  • DNABERT Series: Plant DNABERT, DNABERT, DNABERT-2, DNABERT-S
  • Caduceus Series: Caduceus-Ph, Caduceus-PS, PlantCaduceus
  • Specialized Models: AgroNT, GENA-LM, GPN, GROVER, MutBERT, ProkBERT

Causal Language Models (CLM)

  • EVO Series: EVO-1, EVO-2
  • Plant Models: Plant DNAGemma, Plant DNAGPT, Plant DNAMamba
  • Other Models: GENERator, GenomeOcean, HyenaDNA, Jamba-DNA, Mistral-DNA

Model Sources

  • Hugging Face Hub: Primary model repository
  • ModelScope: Alternative model source with additional models
  • Custom Models: Support for locally trained or custom architectures

🚀 Quick Start

DNALLM uses conda for environment management and uv for dependency management and packaging.

  1. Install dependencies (recommended: uv)
# Clone repository
git clone https://github.com/zhangtaolab/DNALLM.git
cd DNALLM

# Create conda environment
conda create -n dnallm python=3.12 -y

# Activate conda environment
conda activate dnallm

# Install uv in conda environment
conda install uv -c conda-forge

# Install DNALLM with base dependencies
uv pip install -e '.[base]'

# For MCP server support (optional)
uv pip install -e '.[mcp]'

# Verify installation
python -c "import dnallm; print('DNALLM installed successfully!')"
  2. Basic Model Loading and Inference
from dnallm import load_config, load_model_and_tokenizer, DNAInference

# Load configuration
configs = load_config("./example/notebooks/inference/inference_config.yaml")

# Load model and tokenizer
model_name = "zhangtaolab/plant-dnagpt-BPE-promoter_strength_protoplast"
model, tokenizer = load_model_and_tokenizer(
    model_name, task_config=configs["task"], source="huggingface"
)

# Initialize inference engine
inference_engine = DNAInference(
    config=configs, model=model, tokenizer=tokenizer
)

# Make inference
sequence = "AATATATTTAATCGGTGTATAATTTCTGTGAAGATCCTCGATACTTCATATAAGAGATTTTGAGAGAGAGAGAGAACCAATTTTCGAATGGGTGAGTTGGCAAAGTATTCACTTTTCAGAACATAATTGGGAAACTAGTCACTTTACTATTCAAAATTTGCAAAGTAGTC"
inference_result = inference_engine.infer(sequence)
print(f"Inference result: {inference_result}")
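The example above loads its task and inference settings from a YAML file. The exact schema is defined by DNALLM; the sketch below is a hypothetical illustration for a regression task, with field names that are assumptions rather than the authoritative schema (see the real files under example/notebooks/inference/):

```yaml
# Hypothetical inference config sketch — field names are illustrative assumptions,
# not DNALLM's authoritative schema.
task:
  task_type: regression        # e.g. binary, multiclass, regression, ner
  num_labels: 1
inference:
  batch_size: 16
  max_length: 512
  device: auto                 # cpu, cuda, or auto-detect
```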
  3. In-silico Mutagenesis Analysis
from dnallm import Mutagenesis

# Initialize mutagenesis analyzer
mutagenesis = Mutagenesis(config=configs, model=model, tokenizer=tokenizer)

# Generate saturation mutations
mutagenesis.mutate_sequence(sequence, replace_mut=True)

# Evaluate mutation effects
predictions = mutagenesis.evaluate(strategy="mean")

# Visualize results
plot = mutagenesis.plot(predictions, save_path="mutation_effects.pdf")
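Conceptually, saturation mutagenesis enumerates every single-base substitution of the input sequence and scores each variant with the model. A minimal, library-independent sketch of the enumeration step (not DNALLM's internal implementation):

```python
def saturation_mutants(sequence: str, alphabet: str = "ACGT"):
    """Yield (position, ref_base, alt_base, mutant_sequence) for every
    single-base substitution of `sequence`."""
    for pos, ref in enumerate(sequence):
        for alt in alphabet:
            if alt != ref:
                yield pos, ref, alt, sequence[:pos] + alt + sequence[pos + 1:]

# A sequence of length L has 3 * L single-base substitutions over ACGT.
mutants = list(saturation_mutants("ACGT"))
print(len(mutants))  # 12 variants for a 4-bp sequence
```

Each variant would then be passed through the inference engine, and the per-position effects aggregated (e.g. with the `strategy="mean"` shown above) before plotting.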
  4. Model Fine-tuning
from dnallm.datahandling import DNADataset
from dnallm.finetune import DNATrainer

# Prepare dataset
dataset = DNADataset.from_huggingface(
    "zhangtaolab/plant-multi-species-core-promoters",
    seq_col="sequence",
    label_col="label",
    tokenizer=tokenizer,
)

# Initialize trainer
trainer = DNATrainer(model=model, config=configs, datasets=dataset)

# Start training
trainer.train()
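`DNADataset.from_huggingface` expects a sequence column and a label column; a local dataset in the same shape can be kept as a plain CSV. The standard-library sketch below only illustrates that shape (whether `DNADataset` exposes a matching CSV loader is an assumption to verify against the docs):

```python
import csv
import io

# Rows mirror the Hugging Face dataset's shape: a sequence column and a label column.
rows = [
    {"sequence": "ACGTACGTACGT", "label": 1},
    {"sequence": "TTTTAAAACCCC", "label": 0},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["sequence", "label"])
writer.writeheader()
writer.writerows(rows)

# Read it back the way a loader would.
records = list(csv.DictReader(io.StringIO(buf.getvalue())))
print(records[0]["sequence"])  # ACGTACGTACGT
```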
  5. MCP Server Deployment
# Start MCP server for real-time DNA sequence prediction
from dnallm.mcp import DNALLMMCPServer

# Initialize MCP server (`await` requires an async context,
# e.g. inside asyncio.run() or a Jupyter cell)
server = DNALLMMCPServer("config/mcp_server_config.yaml")
await server.initialize()

# Start server with SSE transport for real-time streaming
server.start_server(host="0.0.0.0", port=8000, transport="sse")
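The server reads its settings from the YAML file passed to the constructor. The path matches the example above, but the schema below is a hypothetical sketch; the field names are assumptions to check against the DNALLM MCP documentation:

```yaml
# Hypothetical mcp_server_config.yaml sketch — field names are assumptions,
# not the authoritative schema.
server:
  host: 0.0.0.0
  port: 8000
  transport: sse          # SSE enables real-time streaming to clients
models:
  - name: promoter_strength
    source: huggingface
    model_id: zhangtaolab/plant-dnagpt-BPE-promoter_strength_protoplast
```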
  6. Launch Jupyter Lab, Marimo, or a Gradio app for interactive development:

Interactive Demos (Marimo)

# Fine-tuning demo
uv run --no-sync marimo run example/marimo/finetune/finetune_demo.py

# Inference demo
uv run --no-sync marimo run example/marimo/inference/inference_demo.py

# Benchmark demo
uv run --no-sync marimo run example/marimo/benchmark/benchmark_demo.py

Jupyter Notebooks

# Launch Jupyter Lab
uv run --no-sync jupyter lab

# Available notebooks:
# - example/notebooks/finetune_binary/ - Binary classification fine-tuning
# - example/notebooks/finetune_multi_labels/ - Multi-label classification
# - example/notebooks/finetune_NER_task/ - Named Entity Recognition
# - example/notebooks/inference/ - Model inference
# - example/notebooks/in_silico_mutagenesis/ - Mutation analysis
# - example/notebooks/inference_for_tRNA/ - tRNA-specific analysis
# - example/notebooks/generation_evo_models/ - EVO model inference
# - example/notebooks/lora_finetune_inference/ - LoRA fine-tuning
# - example/notebooks/embedding_attention.ipynb - Embedding and attention analysis
# - example/notebooks/finetune_custom_head/ - Custom classification head
# - example/notebooks/finetune_generation/ - Sequence generation
# - example/notebooks/generation/ - Sequence generation examples
# - example/notebooks/generation_megaDNA/ - MegaDNA model inference
# - example/notebooks/interpretation/ - Model interpretation
# - example/notebooks/data_prepare/ - Data preparation examples
# - example/notebooks/benchmark/ - Model evaluation and benchmarking

Web-based UI (Gradio)

# Launch Gradio configuration generator app
uv run --no-sync python ui/run_config_app.py

# Or run the model config generator directly
uv run --no-sync python ui/model_config_generator_app.py

# For generation tasks, we also provide an app
uv run --no-sync python ui/generation_task_app.py

🎯 Supported Task Types

DNALLM supports the following task types:

  • EMBEDDING: Extract embeddings, attention maps, and token probabilities for downstream analysis
  • MASK: Masked language modeling, as used in pre-training
  • GENERATION: Sequence generation with causal language models
  • BINARY: Binary classification with two possible labels
  • MULTICLASS: Multi-class classification over more than two mutually exclusive classes
  • MULTILABEL: Multi-label classification with multiple binary labels per sample
  • REGRESSION: Regression, returning a continuous score
  • NER: Token classification, typically used for Named Entity Recognition
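The task type determines the expected label shape per sample. The small check below illustrates those conventions; it reflects standard ML practice, not a DNALLM API:

```python
def label_matches_task(task: str, label) -> bool:
    """Illustrative per-task label-shape check (standard conventions,
    not a DNALLM function)."""
    if task == "BINARY":
        return label in (0, 1)
    if task == "MULTICLASS":
        return isinstance(label, int) and label >= 0
    if task == "MULTILABEL":
        return isinstance(label, list) and all(v in (0, 1) for v in label)
    if task == "REGRESSION":
        return isinstance(label, float)
    if task == "NER":
        return isinstance(label, list) and all(isinstance(t, str) for t in label)
    return False

print(label_matches_task("BINARY", 1))              # True
print(label_matches_task("MULTILABEL", [1, 0, 1]))  # True
print(label_matches_task("REGRESSION", 0.73))       # True
```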

๐Ÿ—๏ธ Project Structure

DNALLM/
├── dnallm/                  # Core library package
│   ├── cli/                 # Command-line interface
│   ├── configuration/       # Configuration management
│   ├── datahandling/        # Dataset processing
│   ├── finetune/            # Fine-tuning pipeline
│   ├── inference/           # Inference & analysis tools
│   ├── models/              # Model loading & registry
│   ├── tasks/               # Task definitions & metrics
│   ├── utils/               # Utility functions
│   └── mcp/                 # MCP server implementation
├── cli/                     # Legacy CLI scripts (deprecated)
├── example/                 # Examples & tutorials
│   ├── marimo/              # Interactive Marimo apps
│   └── notebooks/           # Jupyter notebooks
├── docs/                    # Documentation
├── tests/                   # Test suite
├── ui/                      # Gradio web interfaces
├── scripts/                 # Development scripts
├── .github/                 # GitHub workflows
├── pyproject.toml           # Project configuration
└── README.md                # This file

📖 Documentation

For more details, please refer to the documentation in the docs/ directory.

๐Ÿค Contributing

We welcome contributions! Please see our Contributing Guide for details on:

  • Code style and standards
  • Testing requirements
  • Pull request process
  • Development setup

📄 License

This project is licensed under the MIT License.

๐Ÿ™ Acknowledgments

  • Hugging Face - Model hosting and transformers library
  • ModelScope - Alternative model repository

📞 Support

For questions or bug reports, please open an issue on the GitHub repository.
DNALLM - Empowering DNA sequence analysis with state-of-the-art language models.