Prediction Data Preparation
This notebook demonstrates how to prepare data for inference/prediction tasks.
Overview
Before running inference, put your input sequences in a format that the DNALLM inference pipeline expects.
Expected Input Format
DNALLM inference accepts:
- Plain text files (.txt, .fasta) with one sequence per line
- CSV files with sequence and label columns
- JSON files with structured data
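The three formats listed above can be produced with the Python standard library alone. The sketch below is only illustrative: `write_inputs` is a hypothetical helper, the CSV column names `sequence` and `label` are taken from the wording of the list, and the JSON layout (a list of objects) is an assumption, since the structure DNALLM expects is not spelled out here.

```python
import csv
import json
from pathlib import Path


def write_inputs(records, out_dir):
    """Write (sequence, label) pairs as .txt, .csv, and .json files.

    Hypothetical helper for illustration; not part of DNALLM.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Plain text: one sequence per line, labels are dropped.
    txt_path = out_dir / "sequences.txt"
    txt_path.write_text(
        "".join(seq + "\n" for seq, _ in records), encoding="utf-8"
    )

    # CSV: header row, then one record per row.
    # Column names are assumed from the description above.
    csv_path = out_dir / "sequences.csv"
    with csv_path.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["sequence", "label"])
        writer.writerows(records)

    # JSON: a list of objects (assumed layout).
    json_path = out_dir / "sequences.json"
    payload = [{"sequence": seq, "label": label} for seq, label in records]
    json_path.write_text(json.dumps(payload, indent=2), encoding="utf-8")

    return txt_path, csv_path, json_path
```

For example, `write_inputs([("ACGT", 1), ("GGCC", 0)], "prepared")` would create the three files under a `prepared/` directory. Check the DNALLM documentation for the exact column and key names before relying on this layout.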
Example
from dnallm import load_config, load_model_and_tokenizer
from dnallm.inference import DNAInference

# Load model and configuration
configs = load_config("path/to/config.yaml")
model, tokenizer = load_model_and_tokenizer(
    model_name="zhangtaolab/plant-dnabert-BPE",
    task_config=configs['task']
)

# Initialize inference
inferencer = DNAInference(
    model=model,
    tokenizer=tokenizer,
    device="cuda"
)

# Run inference
results = inferencer.infer(
    data_path="path/to/sequences.txt",
    batch_size=32
)
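The example above reads a file with one sequence per line. If your data starts as a standard FASTA file, where each record has a `>` header line and the sequence may wrap across several lines, a small conversion step may be needed first. The following is a minimal sketch under that assumption. `fasta_to_lines` is a hypothetical helper, and whether DNALLM parses FASTA headers itself is not stated here.

```python
def fasta_to_lines(fasta_text):
    """Return one joined sequence string per FASTA record.

    Header lines (starting with '>') are discarded and wrapped
    sequence lines are concatenated.
    """
    sequences, current = [], []
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new record begins: flush the previous one, if any.
            if current:
                sequences.append("".join(current))
                current = []
        else:
            current.append(line)
    if current:
        sequences.append("".join(current))
    return sequences
```

Writing the result with `"\n".join(fasta_to_lines(text))` gives a plain text file in the one-sequence-per-line layout.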
Data Requirements
- Sequences should contain only the standard nucleotide letters (A, T, C, G)
- Remove or replace any other characters, such as the IUPAC ambiguity codes N, R, and Y
- Ensure consistent sequence lengths if required by your model
- Use appropriate file encoding (UTF-8 recommended)
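The requirements above can be checked with a short cleaning pass before inference. This is a sketch, not DNALLM functionality: `clean_sequence` and `prepare_file` are hypothetical helpers, and the choice to delete non-ACGT characters by default (rather than substitute them) is an assumption you may want to change for your model.

```python
import re

# Matches every character that is not one of the four standard bases.
NON_ACGT = re.compile(r"[^ACGT]")


def clean_sequence(seq, replacement=""):
    """Uppercase a sequence and remove (or replace) non-ACGT characters."""
    return NON_ACGT.sub(replacement, seq.strip().upper())


def prepare_file(in_path, out_path, length=None):
    """Clean a one-sequence-per-line file and write it as UTF-8.

    If `length` is given, cleaned sequences of any other length are
    skipped. Returns the number of sequences written.
    """
    written = 0
    with open(in_path, encoding="utf-8") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            seq = clean_sequence(line)
            if not seq:
                continue  # blank line or nothing left after cleaning
            if length is not None and len(seq) != length:
                continue  # enforce a consistent length when requested
            dst.write(seq + "\n")
            written += 1
    return written
```

Note that deleting characters shortens a sequence, so apply any length check after cleaning, as `prepare_file` does.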