# Case Study: Promoter Prediction with DNALLM
This case study demonstrates how to use DNALLM to train a model for promoter prediction, a fundamental task in genomics. We will frame this as a binary classification problem: distinguishing promoter sequences from non-promoter sequences.
## 1. Case Background
A promoter is a region of DNA that initiates the transcription of a particular gene. Identifying promoters is crucial for understanding gene regulation and function. In this example, we will fine-tune a pre-trained DNA foundation model to classify sequences as either "promoter" (positive class) or "non-promoter" (negative class).
## 2. Code
This section provides a complete Python script to perform fine-tuning and inference for promoter prediction. The workflow consists of:

1. Loading a configuration file.
2. Loading a pre-trained model and tokenizer (e.g., Plant-DNABERT).
3. Loading and processing a dataset of DNA sequences with binary labels.
4. Initializing and running the `DNATrainer` to fine-tune the model.
5. Running inference on the test set to evaluate performance.
### Setup
First, ensure you have a YAML configuration file (`finetune_config.yaml`) and your dataset.

`finetune_config.yaml`:
```yaml
# task configuration
task:
  task_type: "sequence_classification"
  num_labels: 2
  label_map:
    0: "non-promoter"
    1: "promoter"

# training configuration
training:
  output_dir: "./outputs"
  num_train_epochs: 3
  per_device_train_batch_size: 32
  per_device_eval_batch_size: 32
  warmup_steps: 500
  weight_decay: 0.01
  logging_dir: "./logs"
  logging_steps: 10
  evaluation_strategy: "steps"
  save_steps: 100
  eval_steps: 100
  load_best_model_at_end: True
  metric_for_best_model: "f1"
  greater_is_better: True
```
Dataset: The training pipeline expects a dataset (e.g., from Hugging Face, ModelScope, or a local CSV/TSV file) with at least two columns: one for the DNA sequences and one for the corresponding binary labels (0 or 1), as in the sketch below.
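If you start from a local file, a minimal CSV in this layout can be produced with pandas. This is only an illustrative sketch: the file name and the toy sequences are made up, and real promoter datasets contain far longer sequences.

```python
import pandas as pd

# Two toy rows in the expected sequence/label layout
df = pd.DataFrame({
    "sequence": ["TATAAATGCCGCTAGCTAGGCTAA", "GGCATTACGTTAACCGGTTCCATG"],
    "label": [1, 0],  # 1 = promoter, 0 = non-promoter
})
df.to_csv("promoters.csv", index=False)
```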
### Python Script
```python
from dnallm import load_config, load_model_and_tokenizer, DNADataset, DNATrainer

# --- 1. Load Configuration ---
# Load settings from the YAML file
configs = load_config("./finetune_config.yaml")

# --- 2. Load Model and Tokenizer ---
# Specify the pre-trained model to use. You can choose from ModelScope or Hugging Face.
# For this example, we use a plant-specific BERT model.
model_name = "zhangtaolab/plant-dnabert-BPE"

# Load the model for sequence classification and its corresponding tokenizer.
# The `task_config` provides the model with the number of labels.
model, tokenizer = load_model_and_tokenizer(
    model_name,
    task_config=configs["task"],
    source="modelscope",  # or "huggingface"
)

# --- 3. Load and Prepare Dataset ---
# Load a dataset from the ModelScope Hub; replace with your own dataset if needed.
data_name = "zhangtaolab/plant-multi-species-core-promoters"

# Create a DNADataset object.
# `seq_col` and `label_col` specify the column names for sequences and labels.
datasets = DNADataset.from_modelscope(
    data_name,
    seq_col="sequence",
    label_col="label",
    tokenizer=tokenizer,
    max_length=512,
)

# Tokenize the sequences
datasets.encode_sequences()

# For demonstration purposes, we'll use a small sample of the data.
# Remove the next line to train on the full dataset.
sampled_datasets = datasets.sampling(0.05, overwrite=True)

# --- 4. Fine-tune the Model ---
# Initialize the trainer with the model, configs, and datasets
trainer = DNATrainer(
    model=model,
    config=configs,
    datasets=sampled_datasets,
)

# Start the training process
print("Starting model fine-tuning...")
train_metrics = trainer.train()
print("Training finished.")
print("Training Metrics:", train_metrics)

# --- 5. Evaluate the Model ---
# Run inference on the test set to get performance metrics
print("Evaluating model on the test set...")
test_metrics = trainer.infer()
print("Evaluation finished.")
print("Test Metrics:", test_metrics)
```
## 3. Expected Results
After running the script, the training process will output logs showing the loss at each step. Upon completion, the `train_metrics` and `test_metrics` dictionaries will be printed.

The `test_metrics` dictionary will contain key performance indicators for binary classification, such as:

- `test_accuracy`: The proportion of correctly classified sequences.
- `test_precision`: The ability of the model to avoid false positives.
- `test_recall`: The ability of the model to find all true positive sequences.
- `test_f1`: The harmonic mean of precision and recall, providing a single score that balances the two.
- `test_mcc`: The Matthews correlation coefficient, a balanced measure that accounts for all four cells of the confusion matrix.
- `test_AUROC`: The area under the receiver operating characteristic curve, which measures the model's ability to distinguish between classes.
- `test_AUPRC`: The area under the precision-recall curve, which is especially useful for imbalanced datasets.
Example output:

```python
{
    'test_loss': 0.547,
    'test_accuracy': 0.740,
    'test_precision': 0.721,
    'test_recall': 0.808,
    'test_f1': 0.762,
    'test_mcc': 0.482,
    'test_AUROC': 0.821,
    'test_AUPRC': 0.810,
    ...
}
```
The fine-tuned model checkpoints are saved to the `output_dir` specified in the config file.
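As a quick sanity check on these numbers: F1 is the harmonic mean of precision and recall, so it can be recomputed directly from the reported values.

```python
# Recompute F1 from the example precision and recall above
precision, recall = 0.721, 0.808
f1 = 2 * precision * recall / (precision + recall)
print(f"F1 = {f1:.3f}")  # F1 = 0.762, matching test_f1
```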
## 4. Tuning Strategies
To improve model performance, consider the following strategies (a config sketch applying several of them follows this list):

- Learning Rate: The default learning rate is `5e-5`. If the model is not converging, try adjusting it in the `training` section of the config file (e.g., `learning_rate: 3e-5`). A good starting point for fine-tuning is often between `1e-5` and `5e-5`.
- Batch Size: `per_device_train_batch_size` can be increased if you have more GPU memory. Larger batch sizes can lead to more stable training.
- Epochs: `num_train_epochs` determines how many times the model sees the entire training dataset. If the model is underfitting, increase the number of epochs; if it is overfitting (validation loss increases), reduce it or use early stopping.
- Model Choice: DNALLM supports various models. A larger or more domain-specific model (e.g., one pre-trained on plant genomes) might yield better results for this task.
- Max Sequence Length: Adjust `max_length` during dataset loading based on the typical length of your promoter sequences. Longer sequences require more memory.
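For example, a `training` section adjusted along these lines might look like the following. The values are illustrative starting points, not recommendations from the DNALLM documentation.

```yaml
training:
  output_dir: "./outputs"
  learning_rate: 3.0e-5   # written with a decimal point so YAML parses it as a float
  num_train_epochs: 5
  per_device_train_batch_size: 64
  per_device_eval_batch_size: 64
```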
## 5. Troubleshooting
- `CUDA out of memory`: This is a common issue. The primary solution is to decrease `per_device_train_batch_size` and `per_device_eval_batch_size` in your `finetune_config.yaml`. You can also reduce `max_length` if your sequences are very long.
- Slow Training: Training large models on large datasets takes time. To speed up development, use the `.sampling()` method on your dataset to test your pipeline on a smaller subset first. For actual training, a GPU is highly recommended.
- Low Performance (Low F1/Accuracy):
    - Ensure your data is clean and correctly labeled.
    - Try tuning the hyperparameters mentioned above (learning rate, epochs, etc.).
    - Consider whether the pre-trained model is a good fit for your specific type of data (e.g., human, plant, bacteria). You may need a different base model.
    - Check for severe class imbalance in your dataset. If one class is much rarer than the other, metrics like `AUPRC` are more informative than `accuracy`, and techniques such as over-sampling or class weights can help (see the sketch below).
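Class weights can be computed with scikit-learn; here is a minimal sketch. How the weights are then passed into training depends on DNALLM's API and is not shown here.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy imbalanced labels: 0 = non-promoter (90%), 1 = promoter (10%)
labels = np.array([0] * 900 + [1] * 100)

# "balanced" weighting: n_samples / (n_classes * count_per_class)
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=labels)
print(dict(zip([0, 1], weights)))  # the rare class gets the larger weight: {0: 0.556, 1: 5.0}
```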