# Sequence Analysis Methods
After training a model and making predictions, the next crucial step is to analyze the results to gain biological insights. DNALLM provides tools to facilitate several key types of sequence analysis.
## 1. Functional Analysis
Functional analysis aims to understand what a DNA sequence does. This is often the direct output of classification or regression tasks.
- **What it is:** Assigning a functional label (e.g., "promoter," "enhancer") or a quantitative value (e.g., expression level) to a sequence.
- **How it's done in DNALLM:** This is the primary goal of the `DNAInference` and `Benchmark` classes. You provide a sequence, and the model predicts its function based on what it learned during training.
**Example:**

```python
# Use a configured DNAInference instance to predict promoter strength
# (a regression task). `inference_engine` is assumed to be set up already.
inference_result = inference_engine.infer(sequence="GATTACA...")
print(f"Predicted Promoter Strength: {inference_result[0]['score']}")
```
## 2. Key Site Identification (In Silico Mutagenesis)
This is one of the most powerful analysis methods. It helps identify which specific nucleotides within a sequence are most critical for its function.
- **What it is:** Systematically mutating each position in a sequence and measuring the impact of that mutation on the model's prediction. A large change in the prediction score indicates a functionally important site. This is a computational proxy for saturation mutagenesis experiments.
- **How it's done in DNALLM:** The `dnallm.Mutagenesis` class is designed specifically for this purpose. It automates the process of creating mutations, running inference, and calculating the effect of each mutation.
**Example:**

```python
from dnallm import Mutagenesis

# Initialize the mutagenesis analyzer; `configs`, `model`, and `tokenizer`
# come from your existing DNALLM setup.
mutagenesis = Mutagenesis(config=configs, model=model, tokenizer=tokenizer)

# Generate every single-base substitution of the input sequence
mutagenesis.mutate_sequence(sequence, replace_mut=True)

# Score each mutant. The resulting `predictions` dictionary contains the
# effect of every single-base substitution on the model's output.
predictions = mutagenesis.evaluate(strategy="mean")
```
The results can be visualized as a heatmap, clearly showing "hotspots" of functional importance.
*An example mutation effect heatmap generated by `mutagenesis.plot()`.*
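In code, that heatmap comes from the `plot()` call named in the caption above; the save step shown here assumes a matplotlib-style figure is returned, which is not confirmed by the source.

```python
# Render the mutation-effect heatmap from the evaluated predictions.
fig = mutagenesis.plot()
# Assumption: plot() returns a matplotlib figure that can be saved.
fig.savefig("mutation_heatmap.png")
```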
## 3. Model Interpretability Analysis
This analysis focuses on understanding how the model makes its decisions, rather than just what the decision is.
### Attention Visualization
- **What it is:** In Transformer-based models, attention mechanisms weigh the importance of every other token when processing a given token. Visualizing these attention weights can show which parts of a sequence the model "focuses on."
- **How it's done in DNALLM:** The `DNAInference.plot_attentions()` method generates heatmaps of attention scores between tokens.
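As a minimal sketch, assuming the method accepts the sequence to visualize and an optional layer index (neither argument is confirmed by the source):

```python
# Visualize which tokens the model attends to for this sequence.
# The `sequence` and `layer` keyword arguments are assumptions about
# the signature of plot_attentions(); consult the DNALLM docs.
inference_engine.plot_attentions(sequence="GATTACA...", layer=-1)
```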
### Embedding Visualization
- **What it is:** A language model converts a sequence into a high-dimensional numerical vector called an embedding. By using dimensionality-reduction techniques such as t-SNE or PCA, we can visualize these embeddings in 2D or 3D. This can reveal whether the model has learned to group sequences with similar functions together in the embedding space.
- **How it's done in DNALLM:** The `DNAInference.plot_hidden_states()` method generates these visualizations.
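A hedged sketch of the workflow, assuming the method accepts a batch of sequences and a choice of dimensionality-reduction technique (both keyword arguments are assumptions about the signature):

```python
# Project sequence embeddings into 2D to see whether functionally
# similar sequences cluster together. The `sequences` and `reducer`
# arguments are assumptions about the signature of plot_hidden_states().
sequences = ["GATTACA...", "CCGGTTAA...", "TTGACAAT..."]
inference_engine.plot_hidden_states(sequences=sequences, reducer="pca")
```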
These analysis methods, supported by DNALLM, allow researchers to move beyond simple predictions and gain deeper, more mechanistic insights into the function of DNA sequences and the behavior of the models themselves.