Using Caduceus Models in DNALLM

Caduceus is a family of bi-directional, equivariant models designed specifically for long-range DNA sequence modeling. It introduces architectural innovations, such as reverse-complement equivariance, to handle the unique symmetries of DNA, making it particularly powerful for genomics.

DNALLM Examples: Caduceus-Ph, Caduceus-PS, PlantCaduceus, PlantCAD2

1. Architecture Overview

Caduceus models are built on a custom architecture derived from the Mamba state space model (rather than the standard Transformer), with bi-directional and reverse-complement equivariant blocks that better suit DNA.

  • Reverse-Complement Equivariance: The model produces representations that transform consistently when a sequence is swapped for its reverse complement (see the sketch below). This is a natural inductive bias for DNA, since functional elements can reside on either strand.
  • Bi-directional Long-Range Modeling: It processes sequences bi-directionally and is optimized for very long DNA contexts (the released checkpoints handle up to 131k tokens), which is essential for capturing distal regulatory elements.
  • Masked Language Modeling: Like BERT, Caduceus is pre-trained with a masked language modeling objective, learning to predict masked nucleotides from their surrounding context.

These features make Caduceus highly effective for tasks requiring an understanding of long-range dependencies in genomes.
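
To make reverse-complement handling concrete, here is a minimal sketch that embeds a sequence and its reverse complement with the same checkpoint used in the loading example below. Loading directly through transformers (with trust_remote_code=True) and the reverse_complement helper are illustrative assumptions, not part of the DNALLM API.

import torch
from transformers import AutoModel, AutoTokenizer

# Illustrative: load the checkpoint directly via transformers
# (Caduceus ships custom modeling code, hence trust_remote_code=True).
model_name = "kuleshov-group/caduceus-ph_seqlen-131k_d_model-256_n_layer-16"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)
model.eval()

def reverse_complement(seq: str) -> str:
    complement = {"A": "T", "T": "A", "C": "G", "G": "C"}
    return "".join(complement[base] for base in reversed(seq))

seq = "GATTACAGATTACA"
for strand in (seq, reverse_complement(seq)):
    input_ids = tokenizer(strand, return_tensors="pt")["input_ids"]
    with torch.no_grad():
        hidden = model(input_ids=input_ids).last_hidden_state
    print(strand, tuple(hidden.shape))

For the parameter-sharing variant (Caduceus-PS), the two hidden-state tensors are related by the reverse-complement transformation itself, a flip along the sequence and channel dimensions; Caduceus-Ph instead relies on reverse-complement data augmentation and post-hoc averaging over both strands.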

2. Environment and Installation

Caduceus models are supported through the standard transformers library and do not require any special dependencies beyond the core DNALLM installation. Note that the published checkpoints ship custom modeling code on the Hugging Face Hub, so loading them directly with transformers requires trust_remote_code=True.

Installation

A standard DNALLM installation is sufficient.

# Install DNALLM with core dependencies
pip install dnallm

3. Model Loading and Configuration

You can load a Caduceus model using the AutoModel classes from transformers or the DNALLM utility functions.

Loading a Model

Here’s how to load a Caduceus model for a masked language modeling task.

from dnallm.utils.load import load_model_and_tokenizer

# Use a specific Caduceus model
model_name = "kuleshov-group/caduceus-ph_seqlen-131k_d_model-256_n_layer-16"

# Load model and tokenizer
model, tokenizer = load_model_and_tokenizer(
    model_name_or_path=model_name
)

print("Model:", type(model))
print("Tokenizer:", type(tokenizer))

4. Inference Example

Let's use a Caduceus-family model (here, PlantCaduceus) to get embeddings for a DNA sequence.

import torch
from dnallm.utils.load import load_model_and_tokenizer

# 1. Load the pre-trained model and tokenizer
model_name = "kuleshov-group/PlantCaduceus_l20"
model, tokenizer = load_model_and_tokenizer(model_name)
model.eval()

# 2. Prepare and tokenize the DNA sequence
dna_sequence = "GATTACAGATTACAGATTACAGATTACAGATTACAGATTACA"
inputs = tokenizer(dna_sequence, return_tensors="pt")

# 3. Perform inference
with torch.no_grad():
    outputs = model(**inputs)

# The base model exposes token-level embeddings directly; if the loaded
# model carries a language-modeling head instead, the embeddings are
# typically available via outputs.hidden_states (request them with
# output_hidden_states=True).
embeddings = outputs.last_hidden_state
print("Shape of embeddings:", embeddings.shape)