Common Workflows in DNALLM
DNALLM is designed to streamline common tasks in computational genomics. This guide covers two workflows: fine-tuning a model and running in-silico mutagenesis analysis.
1. Fine-tuning a Model
Fine-tuning adapts a pre-trained language model to a specific downstream task, such as classifying promoter sequences.
Workflow Steps
- Prepare a Configuration File: Define the model, dataset, and training parameters in a .yaml file.
- Load Model: Load a pre-trained model and tokenizer.
- Load Data: Use the DNADataset class to load and preprocess your training data. The tokenizer from the previous step is passed to the dataset.
- Initialize Trainer: Create a DNATrainer instance with your configuration, model, and data.
- Start Training: Call the train() method.
Example
This example fine-tunes plant-dnabert-BPE for a binary classification task.
from dnallm import load_config, load_model_and_tokenizer
from dnallm.datahandling import DNADataset
from dnallm.finetune import DNATrainer

# 1. Load configuration from a file
configs = load_config("./example/notebooks/finetune_binary/finetune_config.yaml")

# 2. Load model and tokenizer
model_name = "zhangtaolab/plant-dnabert-BPE"
model, tokenizer = load_model_and_tokenizer(
    model_name,
    task_config=configs["task"],
    source="huggingface",
)

# 3. Prepare dataset
dataset = DNADataset.load_local_data(
    file_paths="./tests/test_data/binary_classification/train.csv",
    seq_col="sequence",
    label_col="label",
    tokenizer=tokenizer,
)
dataset.encode_sequences()  # Tokenize the sequences

# 4. Initialize the trainer
trainer = DNATrainer(
    config=configs,
    model=model,
    datasets=dataset,
)

# 5. Start the fine-tuning process
trainer.train()
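The example reads a CSV file through seq_col="sequence" and label_col="label", so the training file needs those two columns. The following is a minimal sketch of building and checking such a file with the Python standard library only. The records, the helper names write_training_csv and read_training_csv, and the temporary file location are all illustrative assumptions and are not part of DNALLM.

```python
import csv
import os
import tempfile

# Toy records for illustration only; the sequences and labels are made up.
records = [
    ("ATGCGTACGTTAGC", 1),
    ("TTTTAAAACCCCGG", 0),
    ("GGCATCGATCGATA", 1),
]


def write_training_csv(path, rows):
    """Write (sequence, label) pairs under the column names used in the example."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["sequence", "label"])
        writer.writerows(rows)


def read_training_csv(path):
    """Read the file back and reject sequences with non-DNA characters."""
    with open(path, newline="") as handle:
        rows = list(csv.DictReader(handle))
    for row in rows:
        if set(row["sequence"].upper()) - set("ACGTN"):
            raise ValueError(f"Unexpected character in {row['sequence']!r}")
    return rows


# Hypothetical location; point this at your own data directory in practice.
csv_path = os.path.join(tempfile.mkdtemp(), "train.csv")
write_training_csv(csv_path, records)
rows = read_training_csv(csv_path)
print(rows[0])
```

Note that csv.DictReader returns every field as a string, so the labels come back as "0" and "1" until they are converted.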
2. In-silico Mutagenesis Analysis
This workflow systematically introduces mutations into a sequence and evaluates their impact on the model's prediction, which is useful for identifying important nucleotides.
Workflow Steps
- Load a Fine-tuned Model: Use a model that has been trained for a specific task (e.g., predicting promoter strength).
- Initialize Mutagenesis: Create an instance of the Mutagenesis analyzer.
- Generate Mutations: Use mutate_sequence() to create all possible single-nucleotide substitutions.
- Evaluate Effects: Run inference on all mutated sequences.
- Visualize Results: Plot the mutation effects to create a saliency map.
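To make "all possible single-nucleotide substitutions" concrete, here is a small standalone sketch of the idea in plain Python. It is not DNALLM's implementation of mutate_sequence(); the function name single_nucleotide_substitutions and the four-letter alphabet are assumptions made for illustration.

```python
def single_nucleotide_substitutions(sequence, alphabet="ACGT"):
    """Yield (position, ref_base, alt_base, mutated_sequence) for each substitution."""
    sequence = sequence.upper()
    for pos, ref in enumerate(sequence):
        for alt in alphabet:
            if alt == ref:
                continue  # Skip the base that is already present
            yield pos, ref, alt, sequence[:pos] + alt + sequence[pos + 1:]


variants = list(single_nucleotide_substitutions("ACGT"))
print(len(variants))  # 4 positions x 3 alternative bases = 12
```

A sequence of length n yields 3n variants, so the number of inference calls grows linearly with sequence length.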
Example
from dnallm import load_config, load_model_and_tokenizer
from dnallm.inference import Mutagenesis
# 1. Load configuration and a fine-tuned model
configs = load_config("./example/notebooks/in_silico_mutagenesis/inference_config.yaml")
model_name = "zhangtaolab/plant-dnagpt-BPE-promoter_strength_protoplast"
model, tokenizer = load_model_and_tokenizer(model_name, task_config=configs["task"])
# 2. Initialize the mutagenesis analyzer
mutagenesis = Mutagenesis(config=configs, model=model, tokenizer=tokenizer)
# 3. Generate and evaluate mutations for a sequence
sequence = "AATATATTTAATCGGTGTATAATTTCTGTGAAGATCCTCGATACTTCATATAAGAGATTTTGAGAGAGAGAGAGAACCAATTTTCGAATGGGTGAGTTGGCAAAGTATTCACTTTTCAGAACATAATTGGGAAACTAGTCACTTTACTATTCAAAATTTGCAAAGTAGTC"
mutagenesis.mutate_sequence(sequence, replace_mut=True)
predictions = mutagenesis.evaluate(strategy="mean")
# 4. Plot and save the results
plot = mutagenesis.plot(predictions, save_path="mutation_effects.pdf")
print("Mutation analysis complete. Plot saved to mutation_effects.pdf")
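As a rough illustration of how per-position effects can feed a saliency map, the sketch below compares each mutant's score with the reference score and averages the change over the three substitutions at each position. This is one possible aggregation under stated assumptions, not a description of what evaluate(strategy="mean") does internally. The functions toy_score and position_effects are hypothetical, and toy_score (GC fraction) merely stands in for a model prediction.

```python
from statistics import mean


def toy_score(sequence):
    """Stand-in for a model prediction: the GC fraction of the sequence."""
    return sum(base in "GC" for base in sequence) / len(sequence)


def position_effects(sequence, score_fn, alphabet="ACGT"):
    """Return the mean score change over the substitutions at each position."""
    reference = score_fn(sequence)
    effects = []
    for pos, ref in enumerate(sequence):
        deltas = [
            score_fn(sequence[:pos] + alt + sequence[pos + 1:]) - reference
            for alt in alphabet
            if alt != ref
        ]
        effects.append(mean(deltas))
    return effects


effects = position_effects("ACGT", toy_score)
for pos, effect in enumerate(effects):
    print(pos, round(effect, 3))
```

Positions with large absolute effects are the ones where a substitution shifts the score most, which is the signal a saliency plot highlights.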
For more workflows, such as benchmarking and embedding extraction, explore the Tutorials section.