Common Biological Tasks with DNALLM

DNA Language Models can be applied to a wide variety of computational biology problems. These tasks often involve predicting the function or properties of a DNA sequence. DNALLM is designed to handle these tasks through its flexible configuration system.

Here are some of the most common tasks, mapped to their corresponding task_type in DNALLM.

1. Sequence Classification

This is the most common category of tasks, where the goal is to assign a label to a given DNA sequence.

Binary Classification

  • task_type: binary
  • Description: Predict whether a sequence belongs to one of two classes.
  • Examples:
    • Promoter Prediction: Is this sequence a promoter or not?
    • Enhancer Identification: Is this sequence an enhancer or a non-enhancer region?
    • Splice Site Prediction: Does this sequence window contain a splice site (donor or acceptor) or not?

Multi-class Classification

  • task_type: multiclass
  • Description: Assign a sequence to one of several mutually exclusive classes.
  • Examples:
    • Functional Region Classification: Classify a sequence as a promoter, enhancer, or silencer.
    • Host of Origin: Predict whether a viral sequence comes from a human, bat, or avian host.

Multi-label Classification

  • task_type: multilabel
  • Description: Assign a sequence to one or more non-exclusive labels.
  • Examples:
    • Transcription Factor Binding: Predict which of several transcription factors (e.g., TCF1, GATA3, RUNX1) can bind to a given sequence.
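
The three classification variants above differ mainly in the shape of the target attached to each sequence. The sketch below illustrates that difference in plain Python. The helper names and label vocabularies are hypothetical, chosen from the examples above; they are not part of the DNALLM API.

```python
# Hypothetical label vocabularies taken from the examples above.
MULTICLASS_LABELS = ["promoter", "enhancer", "silencer"]
MULTILABEL_LABELS = ["TCF1", "GATA3", "RUNX1"]


def encode_binary(is_positive):
    """Binary: a single 0/1 target per sequence."""
    return 1 if is_positive else 0


def encode_multiclass(label, classes=MULTICLASS_LABELS):
    """Multi-class: one integer class index per sequence."""
    return classes.index(label)


def encode_multilabel(active, labels=MULTILABEL_LABELS):
    """Multi-label: a multi-hot vector with one 0/1 entry per label."""
    active = set(active)
    unknown = active - set(labels)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    return [1 if name in active else 0 for name in labels]


binary_target = encode_binary(True)
multiclass_target = encode_multiclass("enhancer")
multilabel_target = encode_multilabel(["GATA3", "RUNX1"])
print(binary_target, multiclass_target, multilabel_target)
```

A multi-class target selects exactly one class, whereas a multi-label target may switch on any number of entries, including none.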

2. Expression Prediction (Regression)

  • task_type: regression
  • Description: Predict a continuous numerical value associated with a sequence.
  • Examples:
    • Promoter Strength Prediction: Predict the level of gene expression driven by a promoter sequence.
    • Protein-DNA Binding Affinity: Predict the binding strength of a transcription factor to a DNA sequence.
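
Because a regression target is a real number rather than a class, predictions are usually compared with measurements through an error or correlation measure. The following is a minimal sketch of two common choices, mean squared error and the Pearson correlation coefficient, in plain Python. The numbers are made up for illustration, and the function names are not DNALLM functions.

```python
import math


def mean_squared_error(y_true, y_pred):
    """Average squared difference between measured and predicted values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def pearson_r(y_true, y_pred):
    """Pearson correlation between measured and predicted values."""
    if len(y_true) != len(y_pred) or len(y_true) < 2:
        raise ValueError("need two equal-length inputs with at least 2 values")
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    if var_t == 0 or var_p == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_t * var_p)


# Made-up promoter strengths (arbitrary units) and model predictions.
measured = [0.5, 1.0, 2.0, 4.0]
predicted = [0.6, 0.9, 2.2, 3.7]

mse = mean_squared_error(measured, predicted)
r = pearson_r(measured, predicted)
print(f"MSE={mse:.4f}  Pearson r={r:.3f}")
```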

3. Element Mining (Token Classification)

  • task_type: token_classification (the DNA analogue of Named Entity Recognition, or NER, in natural language processing)
  • Description: Assign a label to each token (or nucleotide) within a sequence.
  • Examples:
    • Transcription Factor Binding Site (TFBS) Identification: Pinpoint the exact locations of TFBS motifs within a longer regulatory sequence.
    • Gene Finding: Identify the start codons, stop codons, and exon/intron boundaries within a genomic region.
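
For token classification, the training target is a label sequence aligned with the input rather than a single value. A minimal sketch of building per-nucleotide labels from annotated site coordinates follows. The sequence, the coordinates, and the label names "O" and "TFBS" are invented for illustration, and the helper is hypothetical rather than a DNALLM function.

```python
def label_nucleotides(sequence, sites, outside="O", inside="TFBS"):
    """Return one label per nucleotide from half-open [start, end) intervals."""
    labels = [outside] * len(sequence)
    for start, end in sites:
        if not 0 <= start < end <= len(sequence):
            raise ValueError(f"interval ({start}, {end}) is out of range")
        for i in range(start, end):
            labels[i] = inside
    return labels


# Made-up regulatory sequence with one annotated site at positions 4..9.
sequence = "ACGTTGACGTCAAC"
sites = [(4, 10)]

labels = label_nucleotides(sequence, sites)
for base, label in zip(sequence, labels):
    print(base, label)
```

When the tokenizer groups several nucleotides into one token (for example k-mers or byte-pair encoding), these per-nucleotide labels still have to be aligned to token boundaries; that step is omitted here.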

4. New Sequence Generation

  • task_type: generation
  • Description: Create novel DNA sequences that have desired properties. This is typically done with causal language models (CLMs), such as GPT-style models or Evo, which generate a sequence one token at a time.
  • Examples:
    • Designing High-Strength Promoters: Generate new promoter sequences that are predicted to drive very high levels of gene expression.
    • Creating Synthetic Genes: Design novel genes with specific desired functions.
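
The core loop of autoregressive generation is to predict the next token from the tokens emitted so far, sample one, append it, and repeat. The toy sketch below shows that loop with an order-2 Markov chain standing in for a neural causal language model. It is not how DNALLM or any real CLM is implemented, the training sequences are made up, and all function names are hypothetical.

```python
import random
from collections import Counter, defaultdict


def fit_markov(sequences, order=2):
    """Count which base follows each context of `order` bases."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for i in range(len(seq) - order):
            counts[seq[i:i + order]][seq[i + order]] += 1
    return counts


def generate(counts, seed, length, order=2, rng=None):
    """Sample a sequence base by base, conditioning on the last `order` bases."""
    if len(seed) < order:
        raise ValueError("seed must contain at least `order` bases")
    rng = rng or random.Random()
    out = list(seed)
    while len(out) < length:
        context = "".join(out[-order:])
        options = counts.get(context)
        if options:
            bases, weights = zip(*options.items())
        else:
            # Unseen context: fall back to a uniform choice over the alphabet.
            bases, weights = ("A", "C", "G", "T"), (1, 1, 1, 1)
        out.append(rng.choices(bases, weights=weights, k=1)[0])
    return "".join(out)


# Made-up "training" sequences; a real model would learn from a large corpus.
training = ["TATAAAAGGCTATAAATGC", "GCTATAAAAGCGTATAAAT"]
counts = fit_markov(training, order=2)
generated = generate(counts, seed="TA", length=30, rng=random.Random(0))
print(generated)
```

A real CLM replaces the count table with a neural network that scores every possible next token given a much longer context, but the sampling loop has the same shape.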

These tasks form the core of what DNALLM is designed to accomplish. By providing a unified interface for fine-tuning and inference, DNALLM allows researchers to easily apply state-of-the-art language models to these and other biological challenges.


Next: Explore the methods used to analyze the results of these tasks in Sequence Analysis Methods.