DNALLM - DNA Large Language Model Toolkit
DNALLM is an open-source toolkit designed for large language model (LLM) applications in DNA sequence analysis and bioinformatics. It provides a comprehensive suite for model training, fine-tuning, inference, benchmarking, and evaluation, specifically tailored for DNA and genomics tasks.
Key Features
- Model Training & Fine-tuning: Supports a variety of DNA-related tasks, including classification, regression, named entity recognition (NER), masked language modeling (MLM), and more.
- Inference & Benchmarking: Enables efficient model inference, batch prediction, mutagenesis effect analysis, and multi-model benchmarking with visualization tools.
- Data Processing: Tools for dataset generation, cleaning, formatting, and adaptation to various DNA sequence formats.
- Model Management: Flexible loading of, and switching between, different DNA language models, supporting both native Mamba and transformer-compatible architectures.
- Extensibility: Modular design with utility functions and configuration modules for easy integration and secondary development.
- Protocol Support: Implements Model Context Protocol (MCP) for server/client deployment and integration into larger systems.
- Rich Examples & Documentation: Includes interactive examples (marimo, notebooks) and detailed documentation to help users get started quickly.
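One common form of the mutagenesis effect analysis listed above is in silico saturation mutagenesis: every position of a sequence is replaced by each alternative base, and a model then scores each variant against the reference. The sketch below only illustrates how such variants can be enumerated. It is a generic, standalone example and not DNALLM's API; the function name `single_base_variants` is hypothetical.

```python
from typing import Iterator, Tuple

BASES = "ACGT"


def single_base_variants(seq: str) -> Iterator[Tuple[int, str, str, str]]:
    """Yield (position, ref_base, alt_base, mutated_sequence) for every
    single-base substitution of `seq` (0-based positions).

    Hypothetical helper for illustration; not part of DNALLM.
    """
    seq = seq.upper()
    for pos, ref in enumerate(seq):
        for alt in BASES:
            if alt == ref:
                continue  # skip the unchanged base
            yield pos, ref, alt, seq[:pos] + alt + seq[pos + 1:]


variants = list(single_base_variants("ACG"))
print(len(variants))  # 9 (3 positions x 3 alternative bases)
print(variants[0])    # (0, 'A', 'C', 'CCG')
```

In a real analysis, each mutated sequence would be passed to a model and its prediction compared with that of the unmodified sequence; how that scoring is done depends on the model and task.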
Quick Start
- Install dependencies (recommended: uv)
curl -LsSf https://astral.sh/uv/install.sh | sh
git clone https://github.com/zhangtaolab/DNALLM.git
cd DNALLM
uv venv
source .venv/bin/activate
uv pip install -e '.[base]'
- Launch JupyterLab or marimo for interactive development:
uv run jupyter lab
# or
uv run marimo run xxx.py
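After the installation steps above, one quick sanity check is to confirm that the package can be located from the activated environment. The sketch below assumes the import name is `dnallm`, matching the `dnallm/` directory in the repository; `is_installed` is a hypothetical helper, not part of the toolkit.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


# "dnallm" is an assumed import name, based on the repository's dnallm/ directory.
name = "dnallm"
if is_installed(name):
    print(f"{name}: found")
else:
    print(f"{name}: not found (is the virtual environment activated?)")
```

Run it with the environment's interpreter, for example via `uv run python`, so that the check sees the same packages as the install step.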
Project Structure
- dnallm/: Core library (CLI, configuration, datasets, finetune, inference, models, tasks, utils, MCP)
- example/: Interactive and notebook-based examples
- docs/: Documentation
- scripts/: Utility scripts
- tests/: Test suite
For more details, please refer to the README.md and contribution guidelines.