Prediction Data Preparation

This notebook demonstrates how to prepare data for inference/prediction tasks.

Overview

Before running inference, put your input sequences into one of the formats that the DNALLM inference pipeline accepts.

Expected Input Format

DNALLM inference accepts:

Plain text files (.txt, .fasta) with one sequence per line
CSV files with sequence and label columns
JSON files with structured data
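
As a rough illustration of the first two formats, the sketch below writes sequences to a one-per-line text file and to a CSV file using only the Python standard library. The helper names (`write_plain_text`, `write_csv`) are hypothetical and not part of DNALLM. The `sequence` and `label` column names follow the list above, but the exact header the pipeline expects is an assumption to check against the DNALLM documentation.

```python
import csv
from pathlib import Path


def write_plain_text(sequences, path):
    """Write one sequence per line as UTF-8 text (hypothetical helper)."""
    Path(path).write_text("\n".join(sequences) + "\n", encoding="utf-8")


def write_csv(records, path):
    """Write (sequence, label) pairs under a 'sequence,label' header.

    The header names are an assumption based on the format list above.
    """
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["sequence", "label"])
        writer.writerows(records)
```

For example, `write_plain_text(["ACGT", "GGCC"], "sequences.txt")` produces a two-line file that could then be passed as `data_path`.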

Example

from dnallm import load_config, load_model_and_tokenizer
from dnallm.inference import DNAInference

# Load model and configuration
configs = load_config("path/to/config.yaml")
model, tokenizer = load_model_and_tokenizer(
    model_name="zhangtaolab/plant-dnabert-BPE",
    task_config=configs['task']
)

# Initialize inference
inferencer = DNAInference(
    model=model,
    tokenizer=tokenizer,
    device="cuda"  # assumes a CUDA-capable GPU is available
)

# Run inference
results = inferencer.infer(
    data_path="path/to/sequences.txt",
    batch_size=32
)

Data Requirements

Sequences should contain only the standard nucleotide letters (A, T, C, G)
Remove or replace any other characters, such as the IUPAC ambiguity codes N, R, and Y
Ensure consistent sequence lengths if required by your model
Use appropriate file encoding (UTF-8 recommended)
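
The requirements above can be sketched as a small cleaning step. This is a minimal example under stated assumptions, not DNALLM's own preprocessing: `clean_sequence` and `prepare_sequences` are hypothetical helpers, non-standard characters are removed rather than replaced, and sequences that do not match an optional required length are dropped. Other policies (padding, truncation, or rejecting the whole sequence) may suit your model better.

```python
import re

# Matches anything other than the four standard nucleotide letters.
_NON_ACGT = re.compile(r"[^ACGT]")


def clean_sequence(seq: str) -> str:
    """Uppercase the sequence and remove every character except A, C, G, T."""
    return _NON_ACGT.sub("", seq.strip().upper())


def prepare_sequences(raw_sequences, required_length=None):
    """Clean each sequence and optionally keep only those of one length.

    Sequences that are empty after cleaning are skipped. When
    `required_length` is given, sequences of any other length are dropped.
    """
    prepared = []
    for raw in raw_sequences:
        seq = clean_sequence(raw)
        if not seq:
            continue
        if required_length is not None and len(seq) != required_length:
            continue
        prepared.append(seq)
    return prepared
```

The cleaned list can then be written out with UTF-8 encoding, one sequence per line, before it is handed to the inference step.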