DNALLM - DNA Large Language Model Toolkit¶
DNALLM-Suite is a comprehensive, open-source toolkit designed for fine-tuning and inference with DNA Language Models. It provides a unified interface for working with various DNA sequence models, supporting tasks ranging from basic sequence classification to advanced in-silico mutagenesis analysis. With built-in Model Context Protocol (MCP) support, DNALLM-Suite enables seamless communication with traditional large language models, allowing for enhanced integration and interoperability in AI-powered DNA analysis workflows.
๐ Key Features¶
- ๐ Model Management: Load and switch between 150+ pre-trained DNA language models from Hugging Face and ModelScope
- ๐ฏ Multi-Task Support: Binary/multi-class classification, regression, NER, MLM, and generation tasks
- ๐ Benchmarking: Multi-model performance comparison and evaluation metrics
- ๐ง Fine-tuning: Comprehensive training pipeline with configurable parameters
- ๐ฑ Interactive Interfaces: Jupyter notebooks and Marimo-based interactive demos
- ๐ MCP Support: Model Context Protocol for server/client deployment with real-time streaming
- ๐งฌ Advanced Analysis: In-silico mutagenesis, saturation mutation analysis, and mutation effect visualization
- ๐งช Comprehensive Testing: 200+ test cases covering all major functionality
๐งฌ Supported Models¶
DNALLM-Suite supports a wide range of DNA language models including:
Masked Language Models (MLM)¶
- DNABERT Series: Plant DNABERT, DNABERT, DNABERT-2, DNABERT-S
- Caduceus Series: Caduceus-Ph, Caduceus-PS, PlantCaduceus
- Specialized Models: AgroNT, GENA-LM, GPN, GROVER, MutBERT, ProkBERT
Causal Language Models (CLM)¶
- EVO Series: EVO-1, EVO-2
- Plant Models: Plant DNAGemma, Plant DNAGPT, Plant DNAMamba
- Other Models: GENERator, GenomeOcean, HyenaDNA, Jamba-DNA, Mistral-DNA
Model Sources¶
- Hugging Face Hub: Primary model repository
- ModelScope: Alternative model source with additional models
- Custom Models: Support for locally trained or custom architectures
๐ Quick Start¶
DNALLM-Suite uses conda for environment management and uv for dependency management and packaging.
- Install dependencies (recommended: uv)
# Clone repository
git clone https://github.com/zhangtaolab/DNALLM.git
cd DNALLM
# Create conda environment
conda create -n dnallm python=3.12 -y
# Activate conda environment
conda activate dnallm
# Install uv in conda environment
conda install uv -c conda-forge
# Install DNALLM with base dependencies
uv pip install -e '.[base]'
# For MCP server support (optional)
uv pip install -e '.[mcp]'
# Verify installation
python -c "import dnallm; print('DNALLM installed successfully!')"
- Basic Model Loading and Inference
from dnallm import load_config, load_model_and_tokenizer, DNAInference
# Load configuration
configs = load_config("./example/notebooks/inference/inference_config.yaml")
# Load model and tokenizer
model_name = "zhangtaolab/plant-dnagpt-BPE-promoter_strength_protoplast"
model, tokenizer = load_model_and_tokenizer(
model_name, task_config=configs["task"], source="huggingface"
)
# Initialize inference engine
inference_engine = DNAInference(
config=configs, model=model, tokenizer=tokenizer
)
# Make inference
sequence = "AATATATTTAATCGGTGTATAATTTCTGTGAAGATCCTCGATACTTCATATAAGAGATTTTGAGAGAGAGAGAGAACCAATTTTCGAATGGGTGAGTTGGCAAAGTATTCACTTTTCAGAACATAATTGGGAAACTAGTCACTTTACTATTCAAAATTTGCAAAGTAGTC"
inference_result = inference_engine.infer(sequence)
print(f"Inference result: {inference_result}")
- In-silico Mutagenesis Analysis
from dnallm import Mutagenesis
# Initialize mutagenesis analyzer
mutagenesis = Mutagenesis(config=configs, model=model, tokenizer=tokenizer)
# Generate saturation mutations
mutagenesis.mutate_sequence(sequence, replace_mut=True)
# Evaluate mutation effects
predictions = mutagenesis.evaluate(strategy="mean")
# Visualize results
plot = mutagenesis.plot(predictions, save_path="mutation_effects.pdf")
- Model Fine-tuning
from dnallm.datahandling import DNADataset
from dnallm.finetune import DNATrainer
# Prepare dataset
dataset = DNADataset.from_huggingface(
"zhangtaolab/plant-multi-species-core-promoters",
seq_col="sequence",
label_col="label",
tokenizer=tokenizer,
)
# Initialize trainer
trainer = DNATrainer(model=model, config=configs, datasets=dataset)
# Start training
trainer.train()
- MCP Server Deployment
# Start MCP server for real-time DNA sequence prediction
from dnallm.mcp import DNALLMMCPServer
# Initialize MCP server
server = DNALLMMCPServer("config/mcp_server_config.yaml")
await server.initialize()
# Start server with SSE transport for real-time streaming
server.start_server(host="0.0.0.0", port=8000, transport="sse")
- Launch Jupyter Lab, Marimo or Gradio App for interactive development:
Interactive Demos (Marimo)¶
# Fine-tuning demo
uv run --no-sync marimo run example/marimo/finetune/finetune_demo.py
# Inference demo
uv run --no-sync marimo run example/marimo/inference/inference_demo.py
# Benchmark demo
uv run --no-sync marimo run example/marimo/benchmark/benchmark_demo.py
Jupyter Notebooks¶
# Launch Jupyter Lab
uv run --no-sync jupyter lab
# Available notebooks:
# - example/notebooks/finetune_binary/ - Binary classification fine-tuning
# - example/notebooks/finetune_multi_labels/ - Multi-label classification
# - example/notebooks/finetune_NER_task/ - Named Entity Recognition
# - example/notebooks/inference/ - Model inference
# - example/notebooks/in_silico_mutagenesis/ - Mutation analysis
# - example/notebooks/inference_for_tRNA/ - tRNA-specific analysis
# - example/notebooks/generation_evo_models/ - EVO model inference
# - example/notebooks/lora_finetune_inference/ - LoRA fine-tuning
# - example/notebooks/embedding_attention.ipynb - Embedding and attention analysis
# - example/notebooks/finetune_custom_head/ - Custom classification head
# - example/notebooks/finetune_generation/ - Sequence generation
# - example/notebooks/generation/ - Sequence generation examples
# - example/notebooks/generation_megaDNA/ - MegaDNA model inference
# - example/notebooks/interpretation/ - Model interpretation
# - example/notebooks/data_prepare/ - Data preparation examples
# - example/notebooks/benchmark/ - Model evaluation and benchmarking
Web-based UI (Gradio)¶
# Launch Gradio configuration generator app
uv run --no-sync python ui/run_config_app.py
# Or run the model config generator directly
uv run --no-sync python ui/model_config_generator_app.py
# For Generation, we also provide a app
uv run --no-sync python ui/generation_task_app.py
๐ฏ Supported Task Types¶
DNALLM-Suite supports the following task types:
- EMBEDDING: Extract embeddings, attention maps, and token probabilities for downstream analysis
- MASK: Masked language modeling task for pre-training
- GENERATION: Text generation task for causal language models
- BINARY: Binary classification task with two possible labels
- MULTICLASS: Multi-class classification task that specifies which class the input belongs to (more than two)
- MULTILABEL: Multi-label classification task with multiple binary labels per sample
- REGRESSION: Regression task which returns a continuous score
- NER: Token classification task which is usually for Named Entity Recognition
๐๏ธ Project Structure¶
DNALLM/
โโโ dnallm/ # Core library package
โ โโโ cli/ # Command-line interface
โ โโโ configuration/ # Configuration management
โ โโโ datahandling/ # Dataset processing
โ โโโ finetune/ # Fine-tuning pipeline
โ โโโ inference/ # Inference & analysis tools
โ โโโ models/ # Model loading & registry
โ โโโ tasks/ # Task definitions & metrics
โ โโโ utils/ # Utility functions
โ โโโ mcp/ # MCP server implementation
โโโ cli/ # Legacy CLI scripts (deprecated)
โโโ example/ # Examples & tutorials
โ โโโ marimo/ # Interactive Marimo apps
โ โโโ notebooks/ # Jupyter notebooks
โโโ docs/ # Documentation
โโโ tests/ # Test suite
โโโ ui/ # Gradio web interfaces
โโโ scripts/ # Development scripts
โโโ .github/ # GitHub workflows
โโโ pyproject.toml # Project configuration
โโโ README.md # This file
๐ Documentation¶
For more details, please refer to the following guidelines.
- Getting Started - Installation and basic usage
- API Reference - Detailed function documentation
- Concepts - Core concepts and architecture
-
FAQ - Common questions and solutions
-
DeepWiki - A documentation that can ask
๐ค Contributing¶
We welcome contributions! Please see our Contributing Guide for details on:
- Code style and standards
- Testing requirements
- Pull request process
- Development setup
- Development setup
๐ License¶
This project is licensed under the MIT License.
๐ Acknowledgments¶
- Hugging Face - Model hosting and transformers library
- ModelScope - Alternative model repository
๐ Support¶
- Documentation: docs/
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Examples: Check the
example/directory for working code
DNALLM - Empowering DNA sequence analysis with state-of-the-art language models.