
Quick Start

This guide will help you get started with DNALLM quickly. DNALLM is a comprehensive, open-source toolkit designed for fine-tuning and inference with DNA Language Models.

Prerequisites

  • Python 3.10 or higher (Python 3.12 recommended)
  • Git
  • CUDA-compatible GPU (optional, for GPU acceleration)
  • Environment manager (choose one):
      • Python venv (built-in)
      • Conda/Miniconda (recommended for scientific computing)
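
The interpreter and Git requirements above can be checked before installing anything. The following is a minimal sketch using only the Python standard library; the helper name `check_prerequisites` is made up for illustration and is not part of DNALLM.

```python
import shutil
import sys

MIN_PYTHON = (3, 10)  # minimum version stated in the prerequisites


def check_prerequisites() -> dict:
    """Report whether the interpreter and Git meet the listed prerequisites."""
    return {
        "python_ok": sys.version_info >= MIN_PYTHON,
        "python_version": ".".join(map(str, sys.version_info[:3])),
        "git_found": shutil.which("git") is not None,
    }


report = check_prerequisites()
for name, value in report.items():
    print(f"{name}: {value}")
```

If `python_ok` or `git_found` is `False`, fix that first; the installation steps below assume both are available.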

Installation

DNALLM uses uv for dependency management and packaging.

What is uv? uv is a fast Python package manager, reported by its developers to be 10-100x faster than traditional tools such as pip.

Method 1: Using venv + uv

# Clone repository
git clone https://github.com/zhangtaolab/DNALLM.git
cd DNALLM

# Create virtual environment
python -m venv .venv

# Activate virtual environment
source .venv/bin/activate  # Linux/MacOS
# or
.venv\Scripts\activate     # Windows

# Upgrade pip (recommended)
pip install --upgrade pip

# Install uv in virtual environment
pip install uv

# Install DNALLM with base dependencies
uv pip install -e '.[base]'

# Verify installation
python -c "import dnallm; print('DNALLM installed successfully!')"

Method 2: Using conda + uv

# Clone repository
git clone https://github.com/zhangtaolab/DNALLM.git
cd DNALLM

# Create conda environment
conda create -n dnallm python=3.12 -y

# Activate conda environment
conda activate dnallm

# Install uv in conda environment
conda install uv -c conda-forge

# Install DNALLM with base dependencies
uv pip install -e '.[base]'

# Verify installation
python -c "import dnallm; print('DNALLM installed successfully!')"
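
The one-line verification used in both methods only prints a fixed message. A slightly more informative check might look like the sketch below, which uses only the standard library. It assumes the installed distribution is named `dnallm`, the same as the import name; that is an assumption, and the version falls back to "unknown version" if the lookup fails.

```python
import importlib.util
from importlib import metadata


def describe_install(package: str = "dnallm") -> str:
    """Return a short status string for a top-level package name."""
    if importlib.util.find_spec(package) is None:
        return f"{package}: not importable in this environment"
    try:
        # Assumption: the distribution name matches the import name.
        version = metadata.version(package)
    except metadata.PackageNotFoundError:
        version = "unknown version"
    return f"{package}: importable ({version})"


print(describe_install())
```

Run it inside the activated environment; a "not importable" message usually means the wrong environment is active.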

GPU Support

For GPU acceleration, install the extra that matches your CUDA version:

# For venv users: activate virtual environment
source .venv/bin/activate  # Linux/MacOS
# or
.venv\Scripts\activate     # Windows

# For conda users: activate conda environment
# conda activate dnallm

# CUDA 12.4 (recommended for recent GPUs)
uv pip install -e '.[cuda124]'

# Other supported extras: cpu, cuda121, cuda126, cuda128 (for example, CUDA 12.1)
uv pip install -e '.[cuda121]'
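
After installing one of the extras above, it can help to confirm that a GPU is actually visible. The sketch below assumes the selected extra installs PyTorch (an assumption; this page does not list the contents of each extra) and degrades gracefully if it does not. `describe_device` is a hypothetical helper, not a DNALLM function.

```python
def describe_device() -> str:
    """Describe the compute device PyTorch would use, if PyTorch is present."""
    try:
        import torch  # assumption: installed by the chosen extra
    except ImportError:
        return "PyTorch is not installed in this environment"
    if torch.cuda.is_available():
        name = torch.cuda.get_device_name(0)
        return f"CUDA available: {name} (built for CUDA {torch.version.cuda})"
    return "PyTorch is installed, but no CUDA device is visible; running on CPU"


print(describe_device())
```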

Native Mamba Support

The native Mamba implementation runs significantly faster than the transformer-compatible Mamba implementation, but it requires an NVIDIA GPU.

If you need native Mamba support, run the following command after installing the DNALLM dependencies:

# For venv users: activate virtual environment
source .venv/bin/activate  # Linux/MacOS
# or
.venv\Scripts\activate     # Windows

# For conda users: activate conda environment
# conda activate dnallm

# Install Mamba support
uv pip install -e '.[mamba]' --no-cache-dir --no-build-isolation

Make sure your machine can reach GitHub; otherwise, the Mamba dependencies may fail to download.

Basic Usage

1. Basic Model Loading and Inference

from dnallm import load_config, load_model_and_tokenizer
from dnallm.inference import DNAInference

# Load configuration
configs = load_config("./example/notebooks/inference/inference_config.yaml")

# Load model and tokenizer
model_name = "zhangtaolab/plant-dnagpt-BPE-promoter"
model, tokenizer = load_model_and_tokenizer(
    model_name, 
    task_config=configs["task"], 
    source="huggingface"
)

# Initialize inference engine
inference_engine = DNAInference(config=configs, model=model, tokenizer=tokenizer)

# Make inference
sequence = "TCACATCCGGGTGAAACCTCGAGTTCCTATAACCTGCCGACAGGTGGCGGGTCTTATAAAACTGATCACTACAATTCCCAATGGAAAAAAAAAAAAAAAAACCCTTATTTGACTCTCATTATAGATCAACGATGGATCTAGCTCTTCTTTTGTAATTACCTGACTTTTGACCTGACGAACCAAGTTATCGGTTGGGGCCCTGTCAAACGACAGGTCGCTTAGAGGGCATATGTGAGAAAAAGGGTCCTGTTTTTTATCCACGGAGAAAGAAAGCAAGAAGAGGAGAGGTTTTAAAAAAAA"
inference_result = inference_engine.infer(sequence)
print(f"Inference result: {inference_result}")
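
Sequences copied from files or web pages often carry line breaks or lowercase letters. The helper below is a hypothetical pre-processing sketch, not part of DNALLM, which may well handle this itself. It assumes the input should contain only the characters A, C, G, T, and N.

```python
VALID_BASES = frozenset("ACGTN")  # assumed alphabet for this sketch


def clean_sequence(raw: str) -> str:
    """Strip whitespace, uppercase, and reject unexpected characters."""
    seq = "".join(raw.split()).upper()
    if not seq:
        raise ValueError("sequence is empty")
    invalid = sorted(set(seq) - VALID_BASES)
    if invalid:
        raise ValueError(f"unexpected characters: {''.join(invalid)}")
    return seq


print(clean_sequence("tcacat ccggg\nTGAAACC"))
```

The cleaned string can then be passed to `inference_engine.infer(...)` as in the example above.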

2. In-silico Mutagenesis Analysis

from dnallm import load_config, load_model_and_tokenizer
from dnallm.inference import Mutagenesis

# Load configuration
configs = load_config("./example/notebooks/in_silico_mutagenesis/inference_config.yaml")

# Load model and tokenizer
model_name = "zhangtaolab/plant-dnagpt-BPE-promoter_strength_protoplast"
model, tokenizer = load_model_and_tokenizer(
    model_name,
    task_config=configs["task"],
    source="huggingface"
)

# Initialize mutagenesis analyzer
mutagenesis = Mutagenesis(config=configs, model=model, tokenizer=tokenizer)

# Generate saturation mutations
sequence = "AATATATTTAATCGGTGTATAATTTCTGTGAAGATCCTCGATACTTCATATAAGAGATTTTGAGAGAGAGAGAGAACCAATTTTCGAATGGGTGAGTTGGCAAAGTATTCACTTTTCAGAACATAATTGGGAAACTAGTCACTTTACTATTCAAAATTTGCAAAGTAGTC"
mutagenesis.mutate_sequence(sequence, replace_mut=True)

# Evaluate mutation effects
predictions = mutagenesis.evaluate(strategy="mean")

# Visualize results
plot = mutagenesis.plot(predictions, save_path="mutation_effects.pdf")
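
To make the idea of saturation mutagenesis concrete, the sketch below enumerates every single-base substitution of a short sequence in plain Python. It illustrates the concept only: it assumes "saturation mutations" means all single-nucleotide substitutions, and it is not DNALLM's implementation of `mutate_sequence`.

```python
from typing import Iterator, Tuple

BASES = "ACGT"


def single_base_substitutions(seq: str) -> Iterator[Tuple[int, str, str, str]]:
    """Yield (position, ref_base, alt_base, mutated_sequence) for each substitution."""
    seq = seq.upper()
    for pos, ref in enumerate(seq):
        for alt in BASES:
            if alt != ref:
                yield pos, ref, alt, seq[:pos] + alt + seq[pos + 1:]


variants = list(single_base_substitutions("ACG"))
print(len(variants))  # 3 positions x 3 alternative bases = 9
for pos, ref, alt, mutated in variants[:3]:
    print(f"{ref}{pos + 1}{alt}: {mutated}")
```

A sequence of length n yields 3n variants, so the number of model evaluations grows linearly with sequence length.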

3. Model Fine-tuning

from dnallm import load_config, load_model_and_tokenizer
from dnallm.datahandling import DNADataset
from dnallm.finetune import DNATrainer

# Load configuration
configs = load_config("./example/notebooks/finetune_binary/finetune_config.yaml")

# Load model and tokenizer
model_name = "zhangtaolab/plant-dnabert-BPE"
model, tokenizer = load_model_and_tokenizer(
    model_name,
    task_config=configs["task"],
    source="huggingface"
)

# Prepare dataset
dataset = DNADataset.load_local_data(
    file_paths="./tests/test_data/binary_classification/train.csv",
    seq_col="sequence",
    label_col="label",
    tokenizer=tokenizer,
)

# Encode the sequences in the dataset
dataset.encode_sequences()

# Initialize trainer
trainer = DNATrainer(
    config=configs,
    model=model,
    datasets=dataset
)

# Start training
trainer.train()

4. Model Benchmarking

from dnallm import load_config
from dnallm.inference import Benchmark

# Load configuration
configs = load_config("./example/notebooks/benchmark/benchmark_config.yaml")

# Initialize benchmark
benchmark = Benchmark(config=configs)

# Run benchmark
results = benchmark.run()

# Display results
for dataset_name, dataset_results in results.items():
    print(f"\n{dataset_name}:")
    for model_name, metrics in dataset_results.items():
        print(f"  {model_name}:")
        for metric, value in metrics.items():
            if metric not in ["curve", "scatter"]:
                print(f"    {metric}: {value:.4f}")

# Plot metrics
# pbar: bar chart for all the scores, pline: ROC curve
pbar, pline = benchmark.plot(results, save_path="plot.pdf")
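
The nested loop above implies that `results` maps dataset name to model name to a dictionary of metrics. Assuming that shape, the sketch below flattens such a dictionary into rows and CSV text for use in a spreadsheet. The helper names and the example values are placeholders chosen only to show the structure; they are not real benchmark output.

```python
import csv
import io


def results_to_rows(results, skip=("curve", "scatter")):
    """Flatten {dataset: {model: {metric: value}}} into a list of tuples."""
    rows = []
    for dataset_name, dataset_results in results.items():
        for model_name, metrics in dataset_results.items():
            for metric, value in metrics.items():
                if metric not in skip:  # plot data is not a scalar score
                    rows.append((dataset_name, model_name, metric, float(value)))
    return rows


def rows_to_csv(rows):
    """Serialize the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["dataset", "model", "metric", "value"])
    writer.writerows(rows)
    return buffer.getvalue()


# Placeholder structure and numbers, for illustration only.
example = {
    "dataset_a": {
        "model_a": {"accuracy": 0.9, "f1": 0.88, "curve": {"fpr": [], "tpr": []}},
    }
}
rows = results_to_rows(example)
print(rows_to_csv(rows))
```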

Examples and Tutorials

Interactive Demos (Marimo)

# Launch a Marimo app (replace xxx.py with the path to the app)
uv run marimo run xxx.py

# Fine-tuning demo
uv run marimo run example/marimo/finetune/finetune_demo.py

# Inference demo
uv run marimo run example/marimo/inference/inference_demo.py

# Benchmark demo
uv run marimo run example/marimo/benchmark/benchmark_demo.py

Jupyter Notebooks

# Launch Jupyter Lab
uv run jupyter lab

# Available notebooks:
# - example/notebooks/finetune_plant_dnabert/ - Classification fine-tuning
# - example/notebooks/finetune_multi_labels/ - Multi-label classification
# - example/notebooks/finetune_NER_task/ - Named Entity Recognition
# - example/notebooks/inference_and_benchmark/ - Model evaluation
# - example/notebooks/in_silico_mutagenesis/ - Mutation analysis
# - example/notebooks/embedding_attention.ipynb - Embedding and attention analysis

Command Line Interface

DNALLM provides convenient CLI tools:

# Training
dnallm-train --config path/to/config.yaml

# Inference
dnallm-inference --config path/to/config.yaml --input path/to/sequences.txt

# Model configuration generator
dnallm-model-config-generator

# MCP server
dnallm-mcp-server --config path/to/config.yaml
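
If one of these commands is reported as "not found", the environment is probably not activated. The sketch below checks whether the four commands listed above are on the current `PATH`, using only the standard library; `find_cli_tools` is a hypothetical helper name.

```python
import shutil

# Command names as listed in this section.
CLI_TOOLS = (
    "dnallm-train",
    "dnallm-inference",
    "dnallm-model-config-generator",
    "dnallm-mcp-server",
)


def find_cli_tools(names=CLI_TOOLS):
    """Map each command name to its resolved path, or None if it is missing."""
    return {name: shutil.which(name) for name in names}


for name, path in find_cli_tools().items():
    print(f"{name}: {path or 'not found on PATH'}")
```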

Supported Task Types

DNALLM supports the following task types:

  • EMBEDDING: Extract embeddings, attention maps, and token probabilities for downstream analysis
  • MASK: Masked language modeling task for pre-training
  • GENERATION: Text generation task for causal language models
  • BINARY: Binary classification task with two possible labels
  • MULTICLASS: Multi-class classification task that assigns each input to one of more than two classes
  • MULTILABEL: Multi-label classification task with multiple binary labels per sample
  • REGRESSION: Regression task that returns a continuous score
  • NER: Token classification task, typically used for Named Entity Recognition
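
The classification and regression task types differ mainly in how raw model outputs are turned into predictions. The sketch below shows the usual convention in plain Python: softmax plus argmax for BINARY and MULTICLASS, a per-label sigmoid threshold for MULTILABEL, and the raw score for REGRESSION. This is a generic illustration under those assumptions, not DNALLM's own post-processing, and it does not cover EMBEDDING, MASK, GENERATION, or NER.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def decode(task_type, logits, threshold=0.5):
    """Turn raw outputs into a prediction for the given task type."""
    task = task_type.upper()
    if task in ("BINARY", "MULTICLASS"):
        probs = softmax(logits)
        return probs.index(max(probs))  # index of the most likely class
    if task == "MULTILABEL":
        return [int(sigmoid(x) >= threshold) for x in logits]  # one 0/1 per label
    if task == "REGRESSION":
        return logits[0]  # single continuous score
    raise ValueError(f"task type not covered by this sketch: {task_type}")


print(decode("binary", [0.2, 1.5]))
print(decode("multiclass", [2.0, 0.1, -1.0]))
print(decode("multilabel", [3.0, -3.0, 0.0]))
print(decode("regression", [0.73]))
```

The key difference is that MULTICLASS picks exactly one class per input, while MULTILABEL decides each label independently, so any number of labels can be active at once.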

Next Steps

  • Explore the API documentation for detailed function references
  • Check out tutorials for specific use cases
  • Visit the FAQ for common questions
  • Join the community discussions on GitHub

Need Help?

  • Documentation: Browse the complete documentation
  • Issues: Report bugs or request features on GitHub
  • Examples: Check the example notebooks for working code
  • Configuration: Refer to the configuration examples in the docs