# Getting Started with Fine-tuning

This guide will walk you through the basics of fine-tuning DNA language models using DNALLM. You'll learn how to set up your first fine-tuning experiment, configure models and datasets, and monitor training progress.
## Overview

Fine-tuning in DNALLM allows you to:

- Adapt pre-trained DNA language models to your specific tasks
- Leverage transfer learning for better performance on small datasets
- Customize models for domain-specific DNA analysis
- Achieve strong results with relatively little task-specific data
## Prerequisites

Ensure you have the following installed and configured:

```bash
# Install DNALLM
pip install dnallm

# Or with uv (recommended)
uv pip install dnallm

# Install additional dependencies for fine-tuning
pip install torch transformers datasets accelerate
```
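Before moving on, it can save time to confirm the environment actually works. A minimal sanity check (this only assumes that `dnallm` and `torch` import cleanly; the GPU probe is standard PyTorch):

```python
import torch

import dnallm  # fails here if the installation is broken

# Not every package exposes __version__, so fall back gracefully.
print(f"DNALLM version: {getattr(dnallm, '__version__', 'unknown')}")
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
```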
## Basic Setup

### 1. Import Required Modules

```python
from dnallm import load_config, load_model_and_tokenizer, DNADataset, DNATrainer
from transformers import TrainingArguments
import torch
```
### 2. Create a Simple Configuration

Create a `finetune_config.yaml` file:

```yaml
# finetune_config.yaml
task:
  task_type: "binary"  # binary, multiclass, multilabel, regression, generation, mask, token
  num_labels: 2
  label_names: ["negative", "positive"]
  threshold: 0.5

finetune:
  output_dir: "./outputs"
  num_train_epochs: 3
  per_device_train_batch_size: 8
  per_device_eval_batch_size: 16
  gradient_accumulation_steps: 1
  learning_rate: 2e-5
  weight_decay: 0.01
  warmup_ratio: 0.1
  logging_strategy: "steps"
  logging_steps: 100
  eval_strategy: "steps"
  eval_steps: 100
  save_strategy: "steps"
  save_steps: 500
  save_total_limit: 3
  load_best_model_at_end: true
  metric_for_best_model: "eval_loss"
  report_to: "tensorboard"
  seed: 42
  bf16: false
  fp16: false
```
### 3. Load Your Data

```python
# Load your dataset
dataset = DNADataset.load_local_data(
    "path/to/your/data.csv",
    seq_col="sequence",
    label_col="label",
    max_length=512
)

# Split data into train/validation/test sets
if not dataset.is_split:
    dataset.split_data(test_size=0.2, val_size=0.1)

print(f"Training samples: {len(dataset.train_data)}")
print(f"Validation samples: {len(dataset.val_data)}")
print(f"Test samples: {len(dataset.test_data)}")
```
### 4. Load Pre-trained Model

```python
# Load configuration
config = load_config("finetune_config.yaml")

# Load pre-trained model and tokenizer
model, tokenizer = load_model_and_tokenizer(
    "zhangtaolab/plant-dnabert-BPE",
    task_config=config['task'],
    source="huggingface"
)

# Move model to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

print(f"Model loaded on device: {device}")
print(f"Model parameters: {sum(p.numel() for p in model.parameters()):,}")
```
### 5. Initialize Trainer and Start Training

```python
# Initialize trainer
trainer = DNATrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset.train_data,
    eval_dataset=dataset.val_data,
    config=config
)

# Start training
print("Starting fine-tuning...")
trainer.train()

# Save the final model
trainer.save_model("./final_model")
tokenizer.save_pretrained("./final_model")
print("Training completed! Model saved to ./final_model")
```
## Command Line Interface

DNALLM also provides a convenient command-line interface:

```bash
# Basic fine-tuning run
dnallm-finetune --config finetune_config.yaml --model zhangtaolab/plant-dnabert-BPE --dataset path/to/data.csv

# Fine-tune with custom parameters
dnallm-finetune --config config.yaml --epochs 5 --batch-size 16 --learning-rate 1e-4

# Resume from checkpoint
dnallm-finetune --config config.yaml --resume-from-checkpoint ./checkpoint-1000
```
## Understanding the Configuration

### Task Configuration

The `task` section defines what type of task you're fine-tuning for:

```yaml
task:
  task_type: "binary"           # Task type (see table below)
  num_labels: 2                 # Number of output classes
  label_names: ["neg", "pos"]   # Human-readable label names
  threshold: 0.5                # Classification threshold
```
| Task Type | Description | Output |
|---|---|---|
| `binary` | Binary classification | Single probability (0-1) |
| `multiclass` | Multi-class classification | Probability distribution |
| `multilabel` | Multi-label classification | Multiple binary outputs |
| `regression` | Continuous value prediction | Single real number |
| `generation` | Sequence generation | Generated text |
| `mask` | Masked language modeling | Predicted tokens |
| `token` | Token classification | Labels per token |
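To make the "Output" column concrete, here is how raw model logits are typically turned into each output type. This is a generic PyTorch sketch, not DNALLM's internal post-processing (which may differ):

```python
import torch

logits = torch.randn(4, 2)  # a fake batch of model outputs: (batch, num_labels)
threshold = 0.5

# binary / multilabel: independent sigmoid per label, then apply the threshold
binary_preds = (torch.sigmoid(logits) > threshold).long()

# multiclass: softmax over labels, take the argmax
multiclass_preds = torch.softmax(logits, dim=-1).argmax(dim=-1)

# regression: the raw output is the predicted value (typically num_labels == 1)
regression_values = logits.squeeze(-1)
```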
### Training Configuration

The `finetune` section controls training parameters:

```yaml
finetune:
  # Basic training settings
  num_train_epochs: 3              # Total training epochs
  per_device_train_batch_size: 8   # Batch size per device
  per_device_eval_batch_size: 16   # Evaluation batch size

  # Optimization
  learning_rate: 2e-5              # Learning rate
  weight_decay: 0.01               # Weight decay
  warmup_ratio: 0.1                # Warmup proportion

  # Training strategy
  gradient_accumulation_steps: 1   # Gradient accumulation
  max_grad_norm: 1.0               # Gradient clipping

  # Monitoring and saving
  logging_strategy: "steps"        # When to log
  logging_steps: 100               # Log every N steps
  eval_strategy: "steps"           # When to evaluate
  eval_steps: 100                  # Evaluate every N steps
  save_strategy: "steps"           # When to save
  save_steps: 500                  # Save every N steps
```
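One detail worth internalizing: the effective batch size is the per-device batch size × `gradient_accumulation_steps` × number of devices, and `warmup_ratio` applies to the resulting total optimizer step count. A quick sketch of the arithmetic, using a hypothetical 10,000-sample dataset on a single GPU:

```python
# Hypothetical numbers for illustration.
num_samples = 10_000
num_devices = 1
per_device_train_batch_size = 8
gradient_accumulation_steps = 1
num_train_epochs = 3
warmup_ratio = 0.1

effective_batch_size = (per_device_train_batch_size
                        * gradient_accumulation_steps * num_devices)
steps_per_epoch = num_samples // effective_batch_size
total_steps = steps_per_epoch * num_train_epochs
warmup_steps = int(total_steps * warmup_ratio)

print(f"Effective batch size: {effective_batch_size}")  # 8
print(f"Total optimizer steps: {total_steps}")          # 3750
print(f"Warmup steps: {warmup_steps}")                  # 375
```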
## Data Format Requirements

Your dataset should be in one of these formats:

### CSV/TSV Format

```
sequence,label
ATCGATCGATCG,1
GCTAGCTAGCTA,0
TATATATATATA,1
```

### JSON Format

```json
[
  {"sequence": "ATCGATCGATCG", "label": 1},
  {"sequence": "GCTAGCTAGCTA", "label": 0}
]
```

### FASTA Format

```
>sequence1|label:1
ATCGATCGATCG
>sequence2|label:0
GCTAGCTAGCTA
```
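If your data starts out as labeled FASTA, converting it to CSV keeps you on the most direct loading path shown earlier. A minimal converter for the `>name|label:X` header convention above (this parsing is illustrative, not DNALLM's built-in FASTA handling):

```python
import csv

def fasta_to_csv(fasta_path, csv_path):
    """Convert '>name|label:X' FASTA records to a sequence,label CSV."""
    records = []
    name, label, seq_parts = None, None, []
    with open(fasta_path) as fh:
        for line in fh:
            line = line.strip()
            if line.startswith(">"):
                if name is not None:  # flush the previous record
                    records.append(("".join(seq_parts), label))
                name, _, label_field = line[1:].partition("|")
                label = label_field.removeprefix("label:")
                seq_parts = []
            elif line:  # sequences may span multiple lines
                seq_parts.append(line)
    if name is not None:
        records.append(("".join(seq_parts), label))
    with open(csv_path, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["sequence", "label"])
        writer.writerows(records)

fasta_to_csv("sequences.fasta", "sequences.csv")
```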
## Example: Complete Fine-tuning Workflow

Here's a complete working example:

```python
import os

from dnallm import load_config, load_model_and_tokenizer, DNADataset, DNATrainer


def run_finetuning():
    # 1. Check data availability
    data_path = "path/to/your/dna_sequences.csv"
    if not os.path.exists(data_path):
        print("Please provide a valid data path")
        return None

    # 2. Load configuration
    config = load_config("finetune_config.yaml")

    # 3. Load and prepare dataset
    dataset = DNADataset.load_local_data(
        data_path,
        seq_col="sequence",
        label_col="label",
        max_length=512
    )

    # Split data
    if not dataset.is_split:
        dataset.split_data(test_size=0.2, val_size=0.1)
    print(f"Dataset loaded: {len(dataset.train_data)} train, {len(dataset.val_data)} val")

    # 4. Load pre-trained model
    model, tokenizer = load_model_and_tokenizer(
        "zhangtaolab/plant-dnabert-BPE",
        task_config=config['task'],
        source="huggingface"
    )

    # 5. Initialize trainer
    trainer = DNATrainer(
        model=model,
        tokenizer=tokenizer,
        train_dataset=dataset.train_data,
        eval_dataset=dataset.val_data,
        config=config
    )

    # 6. Start training
    print("Starting fine-tuning...")
    trainer.train()

    # 7. Evaluate on test set
    test_results = trainer.evaluate(dataset.test_data)
    print(f"Test results: {test_results}")

    # 8. Save model
    output_dir = "./finetuned_model"
    trainer.save_model(output_dir)
    tokenizer.save_pretrained(output_dir)
    print(f"Fine-tuning completed! Model saved to {output_dir}")
    return output_dir


# Run the complete workflow
if __name__ == "__main__":
    model_path = run_finetuning()
```
## Monitoring Training Progress

### TensorBoard Integration

DNALLM logs training metrics to TensorBoard (with `report_to: "tensorboard"` set, as in the configuration above):

```bash
# Start TensorBoard
tensorboard --logdir ./outputs

# Open in browser: http://localhost:6006
```
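If you prefer pulling metrics into a script rather than the browser UI, TensorBoard's event files can be read directly with `EventAccumulator` from the `tensorboard` package. The exact scalar tag names depend on how the trainer logs, so list the available tags before assuming any:

```python
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

# Point at the directory that actually contains the event files
# (this may be a run subdirectory under ./outputs).
acc = EventAccumulator("./outputs")
acc.Reload()

# Discover which scalar tags were actually logged.
tags = acc.Tags()["scalars"]
print(tags)

# Then pull a specific series, e.g. the first tag, as (step, value) pairs.
for event in acc.Scalars(tags[0]):
    print(event.step, event.value)
```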
### Key Metrics to Monitor

- Training Loss: Should decrease over time
- Validation Loss: Should decrease but not overfit
- Learning Rate: Should follow the scheduled curve
- Gradient Norm: Should be stable, staying near the `max_grad_norm` clipping value of 1.0 (see the sketch below)
- Memory Usage: Monitor GPU memory consumption
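If you want to log the global gradient norm yourself (for example, from a custom training loop or callback), it is the same quantity that `max_grad_norm` clips. A standard PyTorch sketch:

```python
def global_grad_norm(model):
    """Compute the L2 norm over all parameter gradients, as clipping sees it."""
    total = 0.0
    for p in model.parameters():
        if p.grad is not None:
            total += p.grad.detach().norm(2).item() ** 2
    return total ** 0.5

# Call after loss.backward() and before optimizer.step():
# print(f"grad norm: {global_grad_norm(model):.3f}")
```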
### Early Stopping

Configure early stopping to prevent overfitting:

```yaml
finetune:
  # ... other settings ...
  early_stopping_patience: 3
  early_stopping_threshold: 0.001
  metric_for_best_model: "eval_loss"
  greater_is_better: false
```
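These settings mirror the semantics of Hugging Face's `EarlyStoppingCallback`: training stops once the monitored metric has failed to improve by more than the threshold for `patience` consecutive evaluations. A minimal sketch of that logic (whether DNATrainer uses that exact callback internally is an assumption):

```python
def should_stop(eval_losses, patience=3, threshold=0.001):
    """Return True once eval loss hasn't improved by > threshold
    for `patience` consecutive evaluations (greater_is_better: false)."""
    best = float("inf")
    bad_rounds = 0
    for loss in eval_losses:
        if best - loss > threshold:
            best = loss
            bad_rounds = 0
        else:
            bad_rounds += 1
        if bad_rounds >= patience:
            return True
    return False

print(should_stop([0.9, 0.7, 0.69, 0.695, 0.70, 0.71]))  # True
```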
## Common Hyperparameters

### Learning Rate

- Conservative: 1e-5 to 5e-5 (good for most cases)
- Aggressive: 5e-5 to 1e-4 (when you have more data)
- Very Small: 1e-6 to 1e-5 (when fine-tuning on very similar data)
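Note that the configured rate is a peak, not a constant: with `warmup_ratio: 0.1`, it ramps linearly from zero over the first 10% of steps and then, under the common linear schedule (the Hugging Face Trainer default, assumed here), decays back toward zero. A sketch with `get_linear_schedule_with_warmup` from `transformers`:

```python
import torch
from transformers import get_linear_schedule_with_warmup

# A toy model/optimizer purely to drive the scheduler.
model = torch.nn.Linear(4, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

total_steps = 3750  # e.g. from the step arithmetic shown earlier
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * total_steps),  # 375 warmup steps
    num_training_steps=total_steps,
)

# Trace the learning rate at a few points along the schedule.
for step in range(total_steps):
    optimizer.step()  # a real loop would compute a loss and backprop first
    scheduler.step()
    if step in (0, 374, 1875, 3749):
        print(step, scheduler.get_last_lr()[0])
```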
### Batch Size

- Small: 4-8 (when memory is limited)
- Medium: 8-16 (good balance)
- Large: 16-32 (when you have sufficient memory)

### Training Epochs

- Short: 1-3 epochs (when data is similar to pre-training)
- Medium: 3-10 epochs (typical fine-tuning)
- Long: 10+ epochs (when data is very different)
## Next Steps

After completing this basic tutorial:

- Explore Task-Specific Guides: Learn about different task types
- Advanced Techniques: Discover custom training strategies
- Configuration Options: Check detailed configuration options
- Real-world Examples: See practical use cases
## Troubleshooting

### Common Issues

**"CUDA out of memory" error**

```yaml
# Reduce batch size
finetune:
  per_device_train_batch_size: 4   # Reduced from 8
  gradient_accumulation_steps: 2   # Compensate for the smaller batch
```

**Training loss not decreasing**

```yaml
# Adjust learning rate
finetune:
  learning_rate: 5e-5   # Increased from 2e-5
  warmup_ratio: 0.2     # Increased warmup
```

**Overfitting (validation loss increases)**

```yaml
# Add regularization
finetune:
  weight_decay: 0.1   # Increased from 0.01
  dropout: 0.2        # Add dropout (if the model config supports it)
```
## Additional Resources

- Task-Specific Guides - Fine-tuning for different tasks
- Advanced Techniques - Custom training and optimization
- Configuration Guide - Detailed configuration options
- Examples and Use Cases - Real-world scenarios
- Troubleshooting - Common problems and solutions

Ready for more? Continue to Task-Specific Guides to learn about fine-tuning for different types of DNA analysis tasks.