Common Biological Tasks with DNALLM¶

DNA Language Models can be applied to a wide variety of computational biology problems. These tasks often involve predicting the function or properties of a DNA sequence. DNALLM is designed to handle these tasks through its flexible configuration system.

Here are some of the most common tasks, mapped to their corresponding task_type in DNALLM.

1. Sequence Classification¶

This is the most common category of tasks, where the goal is to assign a label to a given DNA sequence.

Binary Classification¶

task_type: binary
Description: Predict whether a sequence belongs to one of two classes.
Examples:
- Promoter Prediction: Is this sequence a promoter or not?
- Enhancer Identification: Is this sequence an enhancer or a non-enhancer region?
- Splice Site Prediction: Is this position a splice site (donor/acceptor) or not?

Multi-class Classification¶

task_type: multiclass
Description: Assign a sequence to one of several mutually exclusive classes.
Examples:
- Functional Region Classification: Classify a sequence as a promoter, enhancer, or silencer.
- Organism of Origin: Predict whether a viral sequence comes from human, bat, or avian hosts.

Multi-label Classification¶

task_type: multilabel
Description: Assign a sequence to one or more non-exclusive labels.
Examples:
- Transcription Factor Binding: Predict which of several transcription factors (e.g., TCF1, GATA3, RUNX1) can bind to a given sequence.

2. Expression Prediction (Regression)¶

task_type: regression
Description: Predict a continuous numerical value associated with a sequence.
Examples:
- Promoter Strength Prediction: Predict the level of gene expression driven by a promoter sequence.
- Protein-DNA Binding Affinity: Predict the binding strength of a transcription factor to a DNA sequence.

3. Element Mining (Token Classification)¶

task_type: token_classification (also known as Named Entity Recognition or NER)
Description: Assign a label to each token (or nucleotide) within a sequence.
Examples:
- Transcription Factor Binding Site (TFBS) Identification: Pinpoint the exact locations of TFBS motifs within a longer regulatory sequence.
- Gene Finding: Identify the start codons, stop codons, and exon/intron boundaries within a genomic region.

4. New Sequence Generation¶

task_type: generation
Description: Create novel DNA sequences that have desired properties. This is typically done with Causal Language Models (CLMs) like GPT or Evo.
Examples:
- Designing High-Strength Promoters: Generate new promoter sequences that are predicted to drive very high levels of gene expression.
- Creating Synthetic Genes: Design novel genes with specific desired functions.

These tasks form the core of what DNALLM is designed to accomplish. By providing a unified interface for fine-tuning and inference, DNALLM allows researchers to easily apply state-of-the-art language models to these and other biological challenges.

Next: Explore the methods used to analyze the results of these tasks in Sequence Analysis Methods.