# DNALLM End-to-End Tutorial: From Data to MCP Service Deployment
This tutorial walks you through a complete DNA sequence analysis project: data preparation, model training, evaluation, and MCP service deployment.
## Table of Contents
- Tutorial Overview
- Task 1: Binary Classification - Promoter Strength Prediction
- Task 2: Multi-label Classification - Sequence Functional Annotation
- Task 3: Named Entity Recognition - Genomic Element Localization
- Task 4: Using LoRA for Efficient Fine-tuning
- Task 5: Model Inference and Mutagenesis Analysis
- Task 6: MCP Service Deployment
- Advanced Tips and Best Practices
- Frequently Asked Questions
## 1. Tutorial Overview

### 1.1 Learning Objectives
After completing this tutorial, you will be able to:
- Prepare and validate DNA sequence data
- Configure and execute model training
- Evaluate model performance and perform inference
- Use LoRA for parameter-efficient fine-tuning
- Deploy models as MCP services
- Understand best practices for different task types
### 1.2 Project Structure

```
tutorial_project/
├── data/                    # Data directory
│   ├── raw/                 # Raw data
│   ├── processed/           # Processed data
│   └── test.csv             # Test data
├── configs/                 # Configuration file directory
│   ├── finetune_config.yaml
│   └── inference_config.yaml
├── models/                  # Model save directory
│   └── checkpoint-100/
├── mcp_server/              # MCP service configuration
│   ├── mcp_server_config.yaml
│   └── inference_config.yaml
├── notebooks/               # Jupyter notebooks
└── outputs/                 # Training output
```
### 1.3 Environment Setup
First, ensure you have correctly installed DNALLM following the Installation Guide.
```bash
# Activate environment
conda activate dnallm

# Verify installation
python -c "
import dnallm
from dnallm import load_config, load_model_and_tokenizer
from dnallm.datahandling import DNADataset
from dnallm.finetune import DNATrainer
from dnallm.inference import DNAInference
print('✅ All modules imported successfully')
"
```
## 2. Task 1: Promoter Strength Prediction

### 2.1 Task Description

This task trains a binary classification model to predict whether a DNA sequence has promoter activity.

- **Dataset**: zhangtaolab/plant-multi-species-core-promoters
- **Model**: Plant DNABERT BPE
- **Task type**: binary classification
### 2.2 Data Preparation

#### 2.2.1 Download and View Data

```python
from dnallm import load_config
from dnallm.datahandling import DNADataset

# Download dataset from ModelScope
data_name = "zhangtaolab/plant-multi-species-core-promoters"
datasets = DNADataset.from_modelscope(
    data_name,
    seq_col="sequence",
    label_col="label"
)

# View dataset information
print(f"Dataset size: {len(datasets)}")
print(f"Label distribution: {datasets.get_label_distribution()}")

# Data sampling (for quick testing)
sampled_datasets = datasets.sampling(0.1, overwrite=True)
print(f"Sampled size: {len(sampled_datasets)}")
```
#### 2.2.2 Local Data Format

If using local data, ensure the format is as follows:

**CSV format** (`data.csv`):

```csv
sequence,label
ATGCGT...,0
ATCGAT...,1
GCTAGC...,0
...
```

**TSV format** (`data.tsv`):

```tsv
sequence label
ATGCGT... 0
ATCGAT... 1
GCTAGC... 0
```
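If you just want to exercise the pipeline, a few lines of standard-library Python can generate a toy file in this layout. This is a sketch: the sequences are random placeholders rather than real promoter data, and the file name `toy_data.csv` is illustrative.

```python
# Write a toy CSV in the expected two-column layout.
# Sequences are random placeholders, not real promoter data.
import csv
import random

random.seed(42)

rows = [
    ("".join(random.choice("ATGC") for _ in range(60)), i % 2)
    for i in range(6)
]

with open("toy_data.csv", "w", newline="") as fh:
    writer = csv.writer(fh)
    writer.writerow(["sequence", "label"])  # header must match seq_col / label_col
    writer.writerows(rows)
```

The header row must match the `seq_col` and `label_col` arguments you later pass to `DNADataset.load_local_data`.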
#### 2.2.3 Data Quality Check

```python
from dnallm.datahandling import DNADataset

# Load local data
dataset = DNADataset.load_local_data(
    file_paths="./data/train.csv",
    seq_col="sequence",
    label_col="label"
)

# Validate data quality
dataset.validate_sequences(
    min_length=50,
    max_length=1000,
    valid_chars=["A", "T", "G", "C"]
)

# Check label distribution
print(f"Positive samples: {dataset.label_counts.get(1, 0)}")
print(f"Negative samples: {dataset.label_counts.get(0, 0)}")

# Data augmentation (add reverse complement sequences)
dataset.augment_reverse_complement()
print(f"Size after augmentation: {len(dataset)}")
```
### 2.3 Configure Training Parameters

Create the `finetune_config.yaml` configuration file:
```yaml
# File: configs/finetune_config.yaml

# Task configuration (required)
task:
  task_type: "binary"                    # Binary classification task
  num_labels: 2                          # Number of labels
  label_names: ["negative", "positive"]  # Label names
  threshold: 0.5                         # Classification threshold

# Fine-tuning configuration
finetune:
  # Output configuration
  output_dir: "./models/promoter_classifier"

  # Training parameters
  num_train_epochs: 3
  per_device_train_batch_size: 16
  per_device_eval_batch_size: 16
  gradient_accumulation_steps: 1

  # Optimizer parameters
  learning_rate: 2e-5
  weight_decay: 0.01
  adam_beta1: 0.9
  adam_beta2: 0.999
  adam_epsilon: 1e-8
  max_grad_norm: 1.0

  # Learning rate scheduler
  warmup_ratio: 0.1
  lr_scheduler_type: "linear"
  lr_scheduler_kwargs: {}

  # Logging and evaluation
  logging_strategy: "steps"
  logging_steps: 100
  eval_strategy: "steps"
  eval_steps: 500

  # Model saving
  save_strategy: "steps"
  save_steps: 500
  save_total_limit: 3
  save_safetensors: True

  # Performance optimization
  fp16: True                 # Mixed precision training
  load_best_model_at_end: True
  metric_for_best_model: "eval_loss"
  report_to: "tensorboard"
  seed: 42
```
### 2.4 Execute Training

```python
from dnallm import load_config, load_model_and_tokenizer
from dnallm.datahandling import DNADataset
from dnallm.finetune import DNATrainer

# 1. Load configuration
configs = load_config("./configs/finetune_config.yaml")

# 2. Load model and tokenizer
model_name = "zhangtaolab/plant-dnabert-BPE"
model, tokenizer = load_model_and_tokenizer(
    model_name,
    task_config=configs["task"],
    source="modelscope"  # or "huggingface"
)

# 3. Load and process dataset
datasets = DNADataset.from_modelscope(
    data_name="zhangtaolab/plant-multi-species-core-promoters",
    seq_col="sequence",
    label_col="label",
    tokenizer=tokenizer,
    max_length=512
)

# Sampling (optional, for quick testing)
sampled_datasets = datasets.sampling(0.1, overwrite=True)

# Encode sequences
sampled_datasets.encode_sequences(remove_unused_columns=True)

# 4. Initialize trainer
trainer = DNATrainer(
    config=configs,
    model=model,
    datasets=sampled_datasets
)

# 5. Start training
print("🚀 Starting training...")
metrics = trainer.train()
print(f"Training complete! Final metrics: {metrics}")
```
### 2.5 Model Validation

```python
# Continuing from the training session above (trainer and sampled_datasets in scope)

# Load best model
trainer.load_best_model()

# Evaluate on test set
test_metrics = trainer.evaluate(test_dataset=sampled_datasets.test)
print(f"Test set metrics: {test_metrics}")

# Save evaluation report
trainer.save_evaluation_report(test_metrics, save_path="./evaluation_report.json")
```
### 2.6 Complete Training Script

Save the above code as `train_promoter.py`:
```python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
Promoter strength prediction model training script
"""
import argparse

from dnallm import load_config, load_model_and_tokenizer
from dnallm.datahandling import DNADataset
from dnallm.finetune import DNATrainer


def main():
    parser = argparse.ArgumentParser(description='Train promoter prediction model')
    parser.add_argument('--config', type=str, default='./configs/finetune_config.yaml')
    parser.add_argument('--model', type=str, default='zhangtaolab/plant-dnabert-BPE')
    parser.add_argument('--data', type=str, default='zhangtaolab/plant-multi-species-core-promoters')
    parser.add_argument('--sample-ratio', type=float, default=0.1)
    parser.add_argument('--source', type=str, default='modelscope')
    args = parser.parse_args()

    # Load configuration
    configs = load_config(args.config)

    # Load model
    model, tokenizer = load_model_and_tokenizer(
        args.model,
        task_config=configs["task"],
        source=args.source
    )

    # Load data
    datasets = DNADataset.from_modelscope(
        data_name=args.data,
        seq_col="sequence",
        label_col="label",
        tokenizer=tokenizer,
        max_length=512
    )

    # Sampling
    if args.sample_ratio < 1.0:
        datasets = datasets.sampling(args.sample_ratio, overwrite=True)

    # Encode
    datasets.encode_sequences(remove_unused_columns=True)

    # Train
    trainer = DNATrainer(config=configs, model=model, datasets=datasets)
    metrics = trainer.train()

    # Save model
    trainer.save_model("./models/final_model")
    print("✅ Model saved to ./models/final_model")


if __name__ == "__main__":
    main()
```
Run the script:

```bash
python train_promoter.py --config ./configs/finetune_config.yaml --sample-ratio 0.1
```
## 3. Task 2: Multi-label Classification

### 3.1 Task Description

This task trains a multi-label classification model that predicts several functional attributes of a sequence simultaneously (e.g., a region can be both a promoter and an enhancer).

- **Dataset**: custom multi-label dataset
- **Model**: Plant DNABERT BPE
- **Task type**: multilabel classification
### 3.2 Multi-label Data Format

```csv
sequence,promoter,enhancer,repressor,silencer
ATGCGT...,1,0,1,0
ATCGAT...,0,1,0,1
GCTAGC...,1,1,0,0
...
```
### 3.3 Multi-label Configuration

```yaml
# File: configs/multilabel_config.yaml
task:
  task_type: "multilabel"  # Multi-label classification
  num_labels: 4            # Number of labels
  label_names: ["promoter", "enhancer", "repressor", "silencer"]
  threshold: 0.5

finetune:
  output_dir: "./models/multilabel_classifier"
  num_train_epochs: 3
  per_device_train_batch_size: 8
  learning_rate: 2e-5
  # Multi-label specific loss function configuration
  loss_function: "binary_cross_entropy"
```
### 3.4 Multi-label Training Code

```python
from dnallm import load_config, load_model_and_tokenizer
from dnallm.datahandling import DNADataset
from dnallm.finetune import DNATrainer

# Load configuration
configs = load_config("./configs/multilabel_config.yaml")

# Load model
model, tokenizer = load_model_and_tokenizer(
    "zhangtaolab/plant-dnabert-BPE",
    task_config=configs["task"],
    source="modelscope"
)

# Load multi-label data
dataset = DNADataset.load_local_data(
    file_paths="./data/multilabel_data.csv",
    seq_col="sequence",
    label_col=["promoter", "enhancer", "repressor", "silencer"],
    tokenizer=tokenizer,
    max_length=512
)

# Encode
dataset.encode_sequences(remove_unused_columns=True)

# Train
trainer = DNATrainer(config=configs, model=model, datasets=dataset)
metrics = trainer.train()

# Multi-label evaluation metrics
print("Multi-label evaluation metrics:")
print(f" - Hamming Loss: {metrics['eval_hamming_loss']:.4f}")
print(f" - Accuracy: {metrics['eval_accuracy']:.4f}")
print(f" - F1 Score: {metrics['eval_f1']:.4f}")
```
## 4. Task 3: Named Entity Recognition

### 4.1 Task Description

This task trains an NER model to locate genomic elements (e.g., genes, TSS, CDS) within DNA sequences.

- **Dataset**: custom NER dataset
- **Model**: DNABERT-2
- **Task type**: token classification (NER)
### 4.2 NER Data Format

**BIO format:**

```
ATGCGT... O
ATG I-GENE
CTA I-GENE
...
```

**CSV format:**

```csv
sequence,tags
ATGCGT...,["O","O","O","B-GENE","I-GENE","I-GENE","O",...]
```
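If your annotations start out in BIO format, converting them to the CSV layout is a small standard-library script. This is a sketch: the records below are illustrative, not a real annotation set.

```python
# Convert BIO-format (token, tag) pairs into the CSV layout above.
# The records here are illustrative, not a real annotation set.
import csv
import json

bio_records = [
    [("ATG", "B-GENE"), ("CTA", "I-GENE"), ("GGC", "O")],
    [("TTA", "O"), ("ATG", "B-GENE")],
]

with open("ner_data.csv", "w", newline="") as fh:
    writer = csv.writer(fh)
    writer.writerow(["sequence", "tags"])
    for record in bio_records:
        sequence = "".join(token for token, _ in record)
        tags = json.dumps([tag for _, tag in record])  # JSON list, parseable on load
        writer.writerow([sequence, tags])
```

Storing the tags as a JSON list keeps them unambiguous when sequences contain commas in other columns.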
### 4.3 NER Configuration

```yaml
# File: configs/ner_config.yaml
task:
  task_type: "token"  # Token classification (NER)
  num_labels: 5       # Number of labels
  label_names: ["O", "B-GENE", "I-GENE", "B-TS", "I-TS"]

finetune:
  output_dir: "./models/ner_model"
  num_train_epochs: 3
  per_device_train_batch_size: 8
  learning_rate: 3e-5
```
### 4.4 NER Training Code

```python
from dnallm import load_config, load_model_and_tokenizer
from dnallm.datahandling import DNADataset
from dnallm.finetune import DNATrainer

# Load configuration
configs = load_config("./configs/ner_config.yaml")

# Load NER model
model, tokenizer = load_model_and_tokenizer(
    "zhijunliao/dnabert-2-embedded-35m",  # DNABERT-2
    task_config=configs["task"],
    source="huggingface"
)

# Load NER dataset
dataset = DNADataset.load_local_data(
    file_paths="./data/ner_data.csv",
    seq_col="sequence",
    label_col="tags",
    tokenizer=tokenizer,
    max_length=128,
    task_type="token"  # Specify token classification
)

# Encode
dataset.encode_sequences(remove_unused_columns=True)

# Train
trainer = DNATrainer(config=configs, model=model, datasets=dataset)
metrics = trainer.train()

# NER evaluation
print("NER evaluation metrics:")
print(f" - Precision: {metrics['eval_precision']:.4f}")
print(f" - Recall: {metrics['eval_recall']:.4f}")
print(f" - F1: {metrics['eval_f1']:.4f}")
```
## 5. Task 4: Using LoRA for Efficient Fine-tuning

### 5.1 LoRA Introduction

LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning method: the base model stays frozen and only small low-rank update matrices are trained, which can match full-model fine-tuning quality at a fraction of the trainable parameters.
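The core idea fits in a few lines of NumPy: the pretrained weight `W` stays frozen, and only a low-rank update `(alpha / r) * B @ A` is trained. This is a sketch of the math only; the dimensions below are illustrative, not taken from any particular model.

```python
# Sketch of the LoRA update: W is frozen; only A and B are trained.
import numpy as np

d_out, d_in, r, alpha = 768, 768, 8, 32
rng = np.random.default_rng(0)

W = rng.standard_normal((d_out, d_in))     # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01  # trainable, rank r
B = np.zeros((d_out, r))                   # trainable, zero-initialized

# Effective weight at inference; B = 0 means training starts exactly at W
W_adapted = W + (alpha / r) * B @ A

lora_params = A.size + B.size              # r * (d_in + d_out) values
print(f"trainable fraction: {lora_params / W.size:.2%}")
```

With these dimensions the trainable fraction is about 2% of the full matrix, which is why LoRA checkpoints ("adapters") are so small.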
### 5.2 LoRA Configuration

Add a `lora` section to the fine-tuning configuration:
```yaml
# File: configs/lora_config.yaml
task:
  task_type: "binary"
  num_labels: 2
  label_names: ["negative", "positive"]
  threshold: 0.5

finetune:
  output_dir: "./models/lora_promoter"
  num_train_epochs: 3
  per_device_train_batch_size: 16
  learning_rate: 2e-5
  bf16: True              # Enable BF16 acceleration
  report_to: "tensorboard"

# LoRA-specific configuration
lora:
  r: 8                    # LoRA rank
  lora_alpha: 32          # LoRA scaling factor
  lora_dropout: 0.1       # Dropout ratio
  target_modules:         # Target modules
    - "x_proj"
    - "in_proj"
    - "out_proj"
  task_type: "SEQ_CLS"    # Task type
  bias: "none"            # Bias handling
```
### 5.3 LoRA Training Code

```python
from dnallm import load_config, load_model_and_tokenizer
from dnallm.datahandling import DNADataset
from dnallm.finetune import DNATrainer

# Load configuration
configs = load_config("./configs/lora_config.yaml")

# Load model (Mamba, Caduceus, and similar architectures are supported)
model_name = "kuleshov-group/PlantCAD2-Small-l24-d0768"
model, tokenizer = load_model_and_tokenizer(
    model_name,
    task_config=configs["task"],
    source="huggingface"
)

# Load dataset
datasets = DNADataset.from_modelscope(
    data_name="zhangtaolab/plant-multi-species-core-promoters",
    seq_col="sequence",
    label_col="label",
    tokenizer=tokenizer,
    max_length=512
)

# Sampling and encoding
sampled_datasets = datasets.sampling(0.05, overwrite=True)
sampled_datasets.encode_sequences(remove_unused_columns=True)

# Initialize trainer (enable LoRA; lora parameters are read from the config)
trainer = DNATrainer(
    model=model,
    config=configs,
    datasets=sampled_datasets,
    use_lora=True
)

# Train
metrics = trainer.train()
print("LoRA training complete!")
print(f"Trainable parameters ratio: {trainer.get_trainable_parameters_ratio():.2%}")

# Save LoRA adapter
trainer.save_lora_adapter("./models/lora_adapter")
```
### 5.4 Loading a LoRA Model for Inference

```python
from peft import PeftModel

from dnallm import load_config, load_model_and_tokenizer
from dnallm import DNAInference

# Load configuration
configs = load_config("./configs/lora_config.yaml")

# Load base model
model, tokenizer = load_model_and_tokenizer(
    "kuleshov-group/PlantCAD2-Small-l24-d0768",
    task_config=configs["task"],
    source="huggingface"
)

# Load LoRA adapter
model = PeftModel.from_pretrained(model, "./models/lora_adapter")
model = model.merge_and_unload()  # Merge LoRA weights into the base model

# Inference
inference_engine = DNAInference(config=configs, model=model, tokenizer=tokenizer)
result = inference_engine.infer("ATGCGT...")
```
## 6. Task 5: Model Inference and Mutagenesis Analysis

### 6.1 Inference Configuration

Create `inference_config.yaml`:
```yaml
# File: configs/inference_config.yaml
inference:
  batch_size: 16
  device: "auto"      # Auto-select GPU/CPU
  max_length: 512
  num_workers: 4
  output_dir: "./results"
  use_fp16: false

task:
  num_labels: 2
  task_type: binary
  threshold: 0.5
```
### 6.2 Single Sequence Inference

```python
from dnallm import load_config, load_model_and_tokenizer
from dnallm import DNAInference

# Load configuration and model
configs = load_config("./configs/inference_config.yaml")
model, tokenizer = load_model_and_tokenizer(
    "./models/final_model",  # Use the trained model
    task_config=configs["task"]
)

# Initialize inference engine
inference_engine = DNAInference(config=configs, model=model, tokenizer=tokenizer)

# Single sequence inference
sequence = "ATGCGTACGTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGC"
result = inference_engine.infer(sequence)
print(f"Inference result: {result}")

# Get probabilities
print(f"Negative class probability: {result['probabilities'][0]:.4f}")
print(f"Positive class probability: {result['probabilities'][1]:.4f}")
print(f"Predicted label: {result['predicted_label']}")
```
### 6.3 Batch Inference

```python
import pandas as pd

# Load test data (inference_engine from the previous section)
test_data = pd.read_csv("./data/test.csv")
sequences = test_data["sequence"].tolist()

# Batch inference
results = inference_engine.batch_infer(sequences, show_progress=True)

# Save results
results_df = pd.DataFrame(results)
results_df.to_csv("./results/predictions.csv", index=False)
print(f"Batch inference complete! Processed {len(sequences)} sequences")
```
### 6.4 Mutagenesis Analysis (In-silico Mutagenesis)

Mutagenesis analysis identifies how much each position in a sequence contributes to the model's prediction.
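Conceptually, the first step enumerates every single-nucleotide substitution of the input; each variant is then scored by the model and compared with the wild-type prediction. A minimal, library-independent sketch of the enumeration step (this is not a DNALLM API, just an illustration of what `Mutagenesis` does internally):

```python
# Enumerate every single-nucleotide substitution of a sequence.
def single_nucleotide_variants(sequence, alphabet="ACGT"):
    """Yield (position, ref_base, alt_base, mutated_sequence) for each substitution."""
    for pos, ref in enumerate(sequence):
        for alt in alphabet:
            if alt != ref:
                yield pos, ref, alt, sequence[:pos] + alt + sequence[pos + 1:]

variants = list(single_nucleotide_variants("ATGC"))
print(len(variants))  # 4 positions x 3 alternative bases = 12 variants
```

A sequence of length L yields 3L variants, so for long inputs the scoring step dominates and benefits from batched inference.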
```python
from dnallm import load_config, load_model_and_tokenizer
from dnallm.inference import Mutagenesis

# Load configuration and model
configs = load_config("./configs/inference_config.yaml")
model, tokenizer = load_model_and_tokenizer(
    "./models/final_model",
    task_config=configs["task"]
)

# Initialize mutagenesis analyzer
mutagenesis = Mutagenesis(config=configs, model=model, tokenizer=tokenizer)

# Mutagenesis analysis
sequence = "ATGCGTACGTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGC"
mutagenesis.mutate_sequence(sequence, replace_mut=True)

# Evaluate mutation effects
predictions = mutagenesis.evaluate(strategy="mean")

# Visualization
plot = mutagenesis.plot(predictions, save_path="./results/mutation_effects.pdf")

# Get important positions
important_positions = mutagenesis.get_important_positions(top_k=10)
print(f"Top 10 most important positions: {important_positions}")
```
### 6.5 Model Interpretability

```python
from dnallm import load_config, load_model_and_tokenizer
from dnallm.inference import DNAInterpreter

# Load configuration and model
configs = load_config("./configs/inference_config.yaml")
model, tokenizer = load_model_and_tokenizer(
    "./models/final_model",
    task_config=configs["task"]
)

# Initialize interpreter
interpreter = DNAInterpreter(config=configs, model=model, tokenizer=tokenizer)

# Get attention weights
sequence = "ATGCGTACGTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGC"
attention_weights = interpreter.get_attention(sequence)

# Get embedding vectors
embeddings = interpreter.get_embedding(sequence)

# Generate comprehensive report
report = interpreter.generate_report(sequence)
print(report)
```
## 7. Task 6: MCP Service Deployment

### 7.1 MCP Service Overview

MCP (Model Context Protocol) is a standardized protocol for exposing models as services; with it, DNA sequence analysis models can be deployed as API services.

### 7.2 MCP Server Configuration

Create `mcp_server_config.yaml`:
```yaml
# File: mcp_server/mcp_server_config.yaml

# Server configuration
server:
  host: "0.0.0.0"
  port: 8000
  workers: 1
  log_level: "INFO"
  debug: false

# MCP metadata
mcp:
  name: "DNA Sequence Analysis Server"
  version: "0.1.0"
  description: "DNA sequence analysis MCP server; supports promoter prediction, mutagenesis analysis, and more"

# Model configuration - one server can host many models
models:
  promoter_model:
    name: "promoter_model"
    model_name: "Plant DNABERT BPE promoter"
    config_path: "./inference_config.yaml"
    enabled: true
    priority: 1
  ner_model:
    name: "ner_model"
    model_name: "Custom NER Model"
    config_path: "./ner_inference_config.yaml"
    enabled: true
    priority: 2

# Multi-model combined analysis
multi_model:
  comprehensive_analysis:
    name: "comprehensive_analysis"
    description: "Comprehensive sequence analysis"
    models: ["promoter_model"]
    enabled: true

# SSE configuration
sse:
  heartbeat_interval: 30
  max_connections: 100
  connection_timeout: 300
  enable_compression: true
  mount_path: "/mcp"
  cors_origins: ["*"]
  enable_heartbeat: true

# Logging configuration
logging:
  level: "INFO"
  format: "%(asctime)s - %(name)s - %(levelname)s - %(message)s"
  file: "./logs/mcp_server.log"
  max_size: "10MB"
  backup_count: 5
```
Create the corresponding inference configuration file `inference_config.yaml`:
```yaml
# File: mcp_server/inference_config.yaml
inference:
  batch_size: 16
  device: "auto"
  max_length: 512
  num_workers: 4
  output_dir: "./results"
  use_fp16: false

task:
  num_labels: 2
  task_type: binary
  threshold: 0.5
```
### 7.3 Start the MCP Server

#### 7.3.1 Using the Command Line

```bash
# Start with a configuration file
dnallm-mcp-server --config ./mcp_server/mcp_server_config.yaml

# Run in the background
nohup dnallm-mcp-server --config ./mcp_server/mcp_server_config.yaml > ./logs/mcp_server.log 2>&1 &
```
#### 7.3.2 Using Python Code

```python
import asyncio

from dnallm.mcp import DNALLMMCPServer


async def main():
    # Initialize server
    server = DNALLMMCPServer("./mcp_server/mcp_server_config.yaml")
    await server.initialize()

    # Start server
    print("🚀 MCP server starting...")
    await server.start_server(host="0.0.0.0", port=8000, transport="sse")
    print("✅ MCP server started: http://0.0.0.0:8000/mcp")


if __name__ == "__main__":
    asyncio.run(main())
```
### 7.4 MCP Client Usage

#### 7.4.1 Testing with curl

```bash
# Health check
curl http://localhost:8000/health

# Single sequence prediction
curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{"sequence": "ATGCGTACGTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGC", "model": "promoter_model"}'

# SSE real-time prediction
curl -N http://localhost:8000/mcp/stream \
  -H "Content-Type: application/json" \
  -d '{"sequence": "ATGCGTACGTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGC"}'
```
#### 7.4.2 Python Client Example

```python
import asyncio

import aiohttp


class DNALLMClient:
    def __init__(self, base_url="http://localhost:8000"):
        self.base_url = base_url

    async def health_check(self):
        """Health check"""
        async with aiohttp.ClientSession() as session:
            async with session.get(f"{self.base_url}/health") as response:
                return await response.json()

    async def predict(self, sequence, model="promoter_model"):
        """Single sequence prediction"""
        async with aiohttp.ClientSession() as session:
            async with session.post(
                f"{self.base_url}/predict",
                json={"sequence": sequence, "model": model}
            ) as response:
                return await response.json()

    async def batch_predict(self, sequences, model="promoter_model"):
        """Batch prediction"""
        async with aiohttp.ClientSession() as session:
            async with session.post(
                f"{self.base_url}/batch_predict",
                json={"sequences": sequences, "model": model}
            ) as response:
                return await response.json()

    async def list_models(self):
        """List available models"""
        async with aiohttp.ClientSession() as session:
            async with session.get(f"{self.base_url}/models") as response:
                return await response.json()


# Usage example
async def main():
    client = DNALLMClient()

    # Health check
    health = await client.health_check()
    print(f"Server status: {health}")

    # Single sequence prediction
    result = await client.predict(
        "ATGCGTACGTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGC"
    )
    print(f"Prediction result: {result}")

    # Batch prediction
    sequences = [
        "ATGCGTACGTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGC",
        "GCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTA"
    ]
    results = await client.batch_predict(sequences)
    print(f"Batch prediction results: {results}")


asyncio.run(main())
```
### 7.5 Docker Deployment

Create a `Dockerfile`:
```dockerfile
FROM python:3.12-slim

WORKDIR /app

# Install system dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
    build-essential \
    && rm -rf /var/lib/apt/lists/*

# Copy dependency files
COPY pyproject.toml ./
COPY README.md ./

# Install Python dependencies
RUN pip install --no-cache-dir uv && \
    uv pip install -e '.[base,mcp,cuda124]' --system

# Copy application code
COPY dnallm/ ./dnallm/
COPY mcp_server/ ./mcp_server/
# Pre-trained models (inline comments after COPY arguments are not valid)
COPY models/ ./models/

# Expose port
EXPOSE 8000

# Start command
CMD ["dnallm-mcp-server", "--config", "./mcp_server/mcp_server_config.yaml"]
```
Build and run:

```bash
# Build image
docker build -t dnallm-mcp-server .

# Run container
docker run -p 8000:8000 -v ./models:/app/models dnallm-mcp-server
```
## 8. Advanced Tips and Best Practices

### 8.1 Training Optimization Tips

#### 8.1.1 Mixed Precision Training
```yaml
finetune:
  fp16: True   # FP16 mixed precision
  bf16: False
  # or, on newer GPUs (A100, H100, etc.):
  # fp16: False
  # bf16: True
```
#### 8.1.2 Gradient Accumulation

```yaml
finetune:
  per_device_train_batch_size: 4
  gradient_accumulation_steps: 4  # Effective batch size = 4 * 4 = 16
```
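The mechanics can be shown with a framework-free toy loop: per-micro-batch gradients are averaged into a buffer, and the optimizer steps once per accumulated group, so four micro-batches of size 4 behave like one batch of size 16. The gradient values below are illustrative placeholders.

```python
# Toy illustration of gradient accumulation (no ML framework involved).
accumulation_steps = 4
micro_grads = [0.1, 0.2, 0.3, 0.4] * 2  # gradients from 8 micro-batches (toy values)

grad_buffer = 0.0
optimizer_steps = 0
for step, g in enumerate(micro_grads, start=1):
    grad_buffer += g / accumulation_steps  # average over the accumulated group
    if step % accumulation_steps == 0:
        optimizer_steps += 1               # one real parameter update per group
        grad_buffer = 0.0

print(optimizer_steps)  # 8 micro-batches / 4 = 2 optimizer steps
```

Memory usage is governed by the micro-batch size, while the update statistics follow the effective batch size.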
#### 8.1.3 Learning Rate Scheduling

```yaml
finetune:
  learning_rate: 2e-5
  warmup_ratio: 0.1  # Warmup ratio
  lr_scheduler_type: "cosine_with_restarts"
  lr_scheduler_kwargs:
    num_restarts: 2
```
### 8.2 Model Selection Guide
| Task Type | Recommended Model | Features |
|---|---|---|
| Classification/Feature Extraction | DNABERT, Nucleotide Transformer | Strong sequence context understanding |
| Sequence Generation | DNAGPT, Evo | Next token prediction |
| Long sequences (>5kb) | Caduceus, HyenaDNA | High efficiency, low memory usage |
| Multimodal | megaDNA, LucaOne | Multi-species, multimodal support |
### 8.3 Data Augmentation Strategies

```python
from dnallm.datahandling import DNADataset

# Reverse complement augmentation
dataset.augment_reverse_complement()

# Random mutation
dataset.augment_random_mutation(rate=0.01)

# K-mer augmentation
dataset.augment_kmer(k=3)
```
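Reverse-complement augmentation adds, for each training sequence, the same region as read from the opposite strand. The transform itself is a few lines of plain Python:

```python
# Complement each base (A<->T, C<->G), then reverse the sequence.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]

print(reverse_complement("ATGC"))  # GCAT
```

For strand-symmetric tasks such as promoter classification, the label is typically copied unchanged to the augmented copy.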
### 8.4 Distributed Training

```python
from accelerate import Accelerator

from dnallm.finetune import DNATrainer

# Initialize accelerator
accelerator = Accelerator(
    mixed_precision="fp16",
    gradient_accumulation_steps=4
)

# Prepare data and model (model, optimizer, and dataloader created beforehand)
model, optimizer, dataloader = accelerator.prepare(
    model, optimizer, dataloader
)

# Train
trainer = DNATrainer(
    model=model,
    config=configs,
    datasets=datasets,
    accelerator=accelerator
)
```
### 8.5 Model Saving and Loading

```python
# Save full model
trainer.save_model("./models/full_model")

# Save weights only
trainer.save_pretrained("./models/weights_only")

# Save in Safetensors format (recommended: faster and safer)
trainer.save_model("./models/safetensors_model", safe_serialization=True)

# Load model
from dnallm import load_model_and_tokenizer

model, tokenizer = load_model_and_tokenizer(
    "./models/final_model",
    task_config=configs["task"]
)
```
## 9. Frequently Asked Questions

### Q1: What to do if CUDA runs out of memory?
**Solutions:**

1. Decrease `per_device_train_batch_size`
2. Increase `gradient_accumulation_steps`
3. Enable `fp16: true`
4. Use `gradient_checkpointing: true`

```yaml
finetune:
  per_device_train_batch_size: 4
  gradient_accumulation_steps: 8
  fp16: true
  gradient_checkpointing: true
```
### Q2: What to do if model training doesn't converge?

**Solutions:**

1. Check data quality (are the labels correct?)
2. Adjust the learning rate (typically 1e-5 to 1e-4)
3. Increase the amount of training data
4. Use learning rate warmup

```yaml
finetune:
  learning_rate: 5e-5
  warmup_ratio: 0.2
```
### Q3: How to choose the right task type?

**Decision tree:**

```
Task type selection:
├── Binary classification → task_type: "binary"
├── Multi-class classification (mutually exclusive) → task_type: "multiclass"
├── Multi-label classification (labels not mutually exclusive) → task_type: "multilabel"
├── Predict continuous values → task_type: "regression"
├── Sequence labeling → task_type: "token"
├── Feature extraction → task_type: "embedding"
├── Mask prediction → task_type: "mask"
└── Text generation → task_type: "generation"
```
### Q4: How to resume training from a checkpoint?

```python
trainer = DNATrainer(
    model=model,
    config=configs,
    datasets=datasets,
    resume_from_checkpoint="./outputs/checkpoint-1000"
)
```
### Q5: How to monitor the training process?

```yaml
finetune:
  report_to: "wandb"  # or "tensorboard"

  # Logging configuration
  logging_strategy: "steps"
  logging_steps: 100
  eval_strategy: "steps"
  eval_steps: 500
```
### Q6: What to do if the MCP server fails to start?

**Troubleshooting steps:**

1. Check whether the port is occupied: `lsof -i :8000`
2. Check that the model files exist
3. View the log file: `cat ./logs/mcp_server.log`
4. Verify the configuration file syntax: `python -c "import yaml; yaml.safe_load(open('config.yaml'))"`
## Related Resources
- API Documentation - Detailed API documentation
- Configuration Documentation - Configuration file details
- Model Selection Guide - How to choose the right model
- FAQ - More frequently asked questions
💡 Tip: All code in this tutorial has been tested. It is recommended to complete each task in order to gradually master the core functionality of DNALLM.