Basic Inference with DNALLM¶
This tutorial walks you through the complete process of running inference using the DNAInference
engine. We will cover loading a model, preparing data, and making predictions on both individual sequences and files.
1. The Core Workflow¶
The inference process in DNALLM follows these steps:
1. Load Configuration: Read the inference_config.yaml file (a hypothetical sketch follows this list).
2. Load Model & Tokenizer: Fetch a pre-trained model and its corresponding tokenizer.
3. Initialize DNAInference: Create an inference engine instance with the model, tokenizer, and config.
4. Run Inference: Use the engine's infer() method to get predictions.
5. Interpret Results: Analyze the output.
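The examples on this page assume an inference_config.yaml already exists. As a rough orientation, a minimal config might look like the sketch below; the field names and nesting are assumptions pieced together from the options discussed later on this page (label_names, task.threshold, device, batch_size), so check your DNALLM version for the exact schema.

# inference_config.yaml (hypothetical sketch, not a verified schema)
task:
  label_names: ["negative", "positive"]   # class names for a binary task
  threshold: 0.5                          # decision threshold for the predicted label
inference:
  device: auto                            # "auto", "cuda", or "cpu"
  batch_size: 16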
2. A Complete Example¶
Let's put everything together in a Python script. This example demonstrates loading a promoter prediction model and using it to classify DNA sequences.
import os
from dnallm import load_config, load_model_and_tokenizer, DNAInference


def main():
    # 1. Load Configuration
    # Assumes 'inference_config.yaml' is in the same directory
    try:
        configs = load_config("inference_config.yaml")
    except FileNotFoundError:
        print("Error: 'inference_config.yaml' not found. Please create it.")
        return

    # 2. Load Model and Tokenizer
    # This example uses a model from ModelScope. You can also use 'huggingface'.
    model_name = "zhangtaolab/plant-dnagpt-BPE-promoter"
    print(f"Loading model '{model_name}'...")
    model, tokenizer = load_model_and_tokenizer(
        model_name,
        task_config=configs['task'],
        source="modelscope"
    )

    # 3. Initialize DNAInference Engine
    print("Initializing inference engine...")
    inference_engine = DNAInference(
        model=model,
        tokenizer=tokenizer,
        config=configs
    )

    # --- 4. Run Inference ---
    # Example 1: Infer from a list of sequences
    print("\n--- Predicting from a list of sequences ---")
    seqs_list = [
        "GCACTTTACTTAAAGTAAAAAGAAAAAAACTGTGCGCTCTCCAACTACCGCAGCAACGTGTCGAGCACAGGAACACGTGTCACTTCAGTTCTTCCAATTGCTGGGGCCCACCACTGTTTACTTCTGTACAGGCAGGTGGCCATGCTGATGACACTCCACACTCCTCGACTTTCGTAGCAGCAAGCCACGCGTGACCGAGAAGCCTCGCG",
        "TTGTCATCACATTTGATCAACTACGATTTATGTTGTACTATTCATCTGTTTTCTCCTTTTTTTTTCCCTTATTGACAGGTTGTGGAGGTTCACAACGAACAGAATACAAGAAATTTTGGTAATCATTTGAGGACTTTCATGGGGTATGAATTGTGTGCTATAATAAATTAA"
    ]
    results_from_list = inference_engine.infer(sequences=seqs_list)
    print("Results:")
    print(results_from_list)

    # Example 2: Infer from a file
    print("\n--- Predicting from a file ---")
    # Create a dummy CSV file for demonstration
    seq_file = 'test_data.csv'
    with open(seq_file, 'w') as f:
        f.write("sequence,label\n")
        f.write(f"{seqs_list[0]},1\n")
        f.write(f"{seqs_list[1]},0\n")

    # Run inference and evaluation
    try:
        results_from_file, metrics = inference_engine.infer(
            file_path=seq_file,
            evaluate=True,      # Enable evaluation since the file has labels
            label_col='label'   # Specify the column containing labels
        )
        print("\nResults from file (first 2 entries):")
        print({k: results_from_file[k] for k in list(results_from_file)[:2]})
        print("\nEvaluation Metrics:")
        print(metrics)
    except FileNotFoundError:
        print(f"Error: The file '{seq_file}' was not found.")
    finally:
        # Clean up the dummy file
        if os.path.exists(seq_file):
            os.remove(seq_file)


if __name__ == "__main__":
    main()
After creating the inference.py script, run the following command to perform inference:
python inference.py
A user-friendly Jupyter Notebook is also provided: example/notebooks/inference/inference.ipynb.
3. Understanding the Output¶
The infer()
method returns a dictionary where each key is the index of a sequence and the value contains its prediction details.
{
    "0": {
        "sequence": "GCACTTTACTTAAAGTA...",
        "label": "positive",
        "scores": {
            "negative": 0.02738,
            "positive": 0.97261
        }
    },
    "1": {
        "sequence": "TTGTCATCACATTTGAT...",
        "label": "negative",
        "scores": {
            "negative": 0.99983,
            "positive": 0.00016
        }
    }
}
- sequence: The input DNA sequence (if keep_seqs is True during data loading).
- label: The final predicted label, based on the task.threshold from your config. For a binary task, this would be one of the label_names.
- scores: The raw probabilities for each class. This gives you a measure of the model's confidence.
If evaluate=True, a second dictionary containing performance metrics (like accuracy, F1-score, AUROC) is also returned.
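As a quick illustration of consuming this output, the loop below walks over the returned dictionary; the per-sequence field names ("label", "scores") mirror the example above and may differ in your version.

# Minimal sketch of reading the infer() output shown above.
# Assumes each entry carries "label" and "scores" fields, as in the example.
results = inference_engine.infer(sequences=seqs_list)
for idx, entry in results.items():
    confidence = max(entry["scores"].values())
    print(f"Sequence {idx}: {entry['label']} (confidence {confidence:.4f})")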
4. Best Practices and Performance¶
Error Handling¶
- FileNotFoundError: Always wrap file-based inference in a try...except block to handle cases where the input file doesn't exist.
- OutOfMemoryError: If you get a CUDA out-of-memory error, the primary solution is to reduce batch_size in your inference_config.yaml (see the sketch after this list).
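A hedged sketch of guarding a call to the engine against CUDA out-of-memory failures: PyTorch typically surfaces them as a RuntimeError whose message mentions "out of memory", and the handling shown here is illustrative rather than part of the DNALLM API.

import torch

# Illustrative guard only: catch CUDA OOM and point at the batch_size setting.
try:
    results = inference_engine.infer(sequences=seqs_list)
except RuntimeError as err:
    if "out of memory" in str(err).lower():
        torch.cuda.empty_cache()  # free cached GPU memory before retrying
        print("CUDA ran out of memory; lower batch_size in inference_config.yaml and rerun.")
    else:
        raise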
Performance Optimization¶
- Use a GPU: For any serious workload, a GPU is essential. Set device: auto or device: cuda.
- Tune batch_size: Find the largest batch_size that fits in your GPU memory to maximize throughput.
- Enable FP16/BF16: If you have a modern NVIDIA GPU (Ampere architecture or newer), setting use_fp16: true or use_bf16: true can provide a significant speedup with minimal impact on accuracy.
- Increase num_workers: If you notice your GPU is often waiting for data, increasing num_workers can help speed up data loading, especially for large files (see the combined example after this list).
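Putting these options together, the inference block of the config might look like the following; as with the earlier sketch, the nesting and key names are assumptions based on the settings named on this page, not a verified schema.

# Hypothetical performance-oriented settings for inference_config.yaml
inference:
  device: cuda
  batch_size: 64      # largest value that fits in GPU memory
  use_fp16: true      # or use_bf16: true on Ampere-or-newer GPUs
  num_workers: 8      # parallel data-loading workers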
5. Common Questions (FAQ)¶
Q: Why are my predictions all the same? A: This can happen if the model is not well-suited for your data or if the input sequences are too different from what it was trained on. Check that the model you loaded is appropriate for your task.
Q: How do I get hidden states or attention weights for model interpretability?
A: The infer() method has output_hidden_states=True and output_attentions=True flags. Setting these will return embeddings and attention scores, which can be accessed via inference_engine.embeddings. Be aware that this consumes a large amount of memory.
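For example, a call might look like the sketch below; the flag names and the embeddings attribute follow the description above, and the exact structure of what gets stored there may vary by version.

# Sketch of requesting hidden states and attentions during inference.
results = inference_engine.infer(
    sequences=seqs_list,
    output_hidden_states=True,
    output_attentions=True,
)
embeddings = inference_engine.embeddings  # hidden states captured during the run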
Q: Can I run inference on a FASTA file?
A: Yes. The infer_file method automatically handles .fasta, .fa, .csv, .tsv, and .txt files. For FASTA, the sequence is read directly. For CSV/TSV, you must specify the seq_col. Other structured formats such as pickle, arrow, parquet, etc. are also supported.
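As an illustration, a FASTA run might look like the sketch below; it reuses the file_path argument from the CSV example earlier on this page and assumes a hypothetical local file named promoters.fasta, so adjust both to match your setup and DNALLM version.

# Sketch of file-based inference on a FASTA file (hypothetical file name).
fasta_results = inference_engine.infer(file_path="promoters.fasta")
print(fasta_results)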