DNALLM - DNA Large Language Model Toolkit

DNALLM is an open-source toolkit for applying large language models (LLMs) to DNA sequence analysis and bioinformatics. It covers model training, fine-tuning, inference, benchmarking, and evaluation, all tailored to DNA and genomics tasks.

Key Features

Model Training & Fine-tuning: Supports a variety of DNA-related tasks, including classification, regression, named entity recognition (NER), masked language modeling (MLM), and more.
Inference & Benchmarking: Enables efficient model inference, batch prediction, mutagenesis effect analysis, and multi-model benchmarking with visualization tools.
Data Processing: Provides tools for dataset generation, cleaning, formatting, and adaptation to various DNA sequence formats.
Model Management: Supports flexible loading of, and switching between, different DNA language models, including both native Mamba and transformer-compatible architectures.
Extensibility: Modular design with utility functions and configuration modules for easy integration and secondary development.
Protocol Support: Implements the Model Context Protocol (MCP) for server/client deployment and integration into larger systems.
Rich Examples & Documentation: Includes interactive examples (marimo, notebooks) and detailed documentation to help users get started quickly.
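
To make the "mutagenesis effect analysis" feature above more concrete, the following is a minimal sketch of the general idea behind in silico mutagenesis: enumerate every single-nucleotide variant of a sequence so that each mutant can later be scored by a model and compared against the reference. It uses plain Python only and is not DNALLM's API; the function name `single_nucleotide_variants` and the constant `BASES` are hypothetical, and the toolkit's own implementation may differ.

```python
from typing import Iterator, Tuple

BASES = "ACGT"  # assumption: only unambiguous DNA bases are mutated


def single_nucleotide_variants(seq: str) -> Iterator[Tuple[int, str, str, str]]:
    """Yield (position, ref_base, alt_base, mutated_sequence) for each substitution.

    Hypothetical helper for illustration; not part of the DNALLM API.
    """
    seq = seq.upper()
    for pos, ref in enumerate(seq):
        if ref not in BASES:
            continue  # skip ambiguous symbols such as N
        for alt in BASES:
            if alt != ref:
                yield pos, ref, alt, seq[:pos] + alt + seq[pos + 1:]


# A sequence of length n with no ambiguous bases has 3 * n single-base variants.
variants = list(single_nucleotide_variants("ACGT"))
print(len(variants))  # 12
```

In a real analysis, each mutated sequence would be passed to a model, and the change in its prediction relative to the unmutated sequence would be taken as the estimated effect of that substitution.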

Quick Start

Install the dependencies (uv is recommended):

curl -LsSf https://astral.sh/uv/install.sh | sh

git clone https://github.com/zhangtaolab/DNALLM.git

cd DNALLM

uv venv

source .venv/bin/activate

uv pip install -e '.[base]'
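
After the installation steps above, a quick sanity check can confirm that the package is visible to the active environment. The sketch below uses only the Python standard library. It assumes the import name is `dnallm`, inferred from the `dnallm/` directory of the repository; the helper `is_installed` is hypothetical.

```python
import importlib.util


def is_installed(package: str) -> bool:
    """Return True if the named top-level package can be found by the import system."""
    return importlib.util.find_spec(package) is not None


# Assumption: the package is imported as "dnallm" (matching the dnallm/ directory).
print("dnallm importable:", is_installed("dnallm"))
```

If this reports `False`, the virtual environment is probably not activated, or the editable install did not complete.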

Launch Jupyter Lab or Marimo for interactive development:

uv run jupyter lab
# or
uv run marimo run xxx.py

Project Structure

dnallm/ : Core library (CLI, configuration, datasets, finetune, inference, models, tasks, utils, MCP)
example/ : Interactive and notebook-based examples
docs/ : Documentation
scripts/ : Utility scripts
tests/ : Test suite

For more details, see README.md and the contribution guidelines.