Configuration Guide¶
This guide describes every configuration option available for DNALLM benchmarking, with examples and best practices.
Overview¶
DNALLM benchmarking configuration is defined in YAML format and supports:

- **Model Configuration**: Multiple models from different sources
- **Dataset Configuration**: Various data formats and preprocessing options
- **Evaluation Settings**: Metrics, batch sizes, and hardware options
- **Output Options**: Report formats and visualization settings
Configuration Structure¶
Basic Configuration Schema¶
```yaml
benchmark:
  # Basic information
  name: "string"
  description: "string"

  # Model definitions
  models: []

  # Dataset definitions
  datasets: []

  # Evaluation settings
  evaluation: {}

  # Output configuration
  output: {}

  # Advanced options
  advanced: {}
```
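If you want to inspect a configuration programmatically before handing it to DNALLM, a plain PyYAML round-trip is enough. The sketch below assumes only the top-level schema shown above; it is not DNALLM's own validator.

```python
import yaml

# Load the benchmark configuration and do a quick structural sanity check.
with open("benchmark_config.yaml") as fh:
    config = yaml.safe_load(fh)

benchmark = config["benchmark"]
for key in ("name", "models", "datasets", "evaluation", "output"):
    if key not in benchmark:
        raise KeyError(f"Missing required section: {key}")

print(f"Loaded benchmark '{benchmark['name']}' with "
      f"{len(benchmark['models'])} model(s) and "
      f"{len(benchmark['datasets'])} dataset(s)")
```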
Model Configuration¶
Basic Model Definition¶
```yaml
models:
  - name: "Plant DNABERT"
    path: "zhangtaolab/plant-dnabert-BPE"
    source: "huggingface"
    task_type: "classification"
```
Advanced Model Configuration¶
```yaml
models:
  - name: "Plant DNABERT"
    path: "zhangtaolab/plant-dnabert-BPE"
    source: "huggingface"
    task_type: "classification"
    revision: "main"           # Git branch/tag
    trust_remote_code: true
    torch_dtype: "float16"     # or "float32", "bfloat16"
    device_map: "auto"
    load_in_8bit: false
    load_in_4bit: false

  - name: "Custom Model"
    path: "/path/to/local/model"
    source: "local"
    task_type: "generation"
    model_class: "CustomModelClass"
    tokenizer_class: "CustomTokenizerClass"
```
Model Source Types¶
| Source | Description | Example |
|---|---|---|
| `huggingface` | Hugging Face Hub | `"zhangtaolab/plant-dnabert-BPE"` |
| `modelscope` | ModelScope repository | `"zhangtaolab/plant-dnabert-BPE"` |
| `local` | Local file system | `"/path/to/model"` |
| `s3` | AWS S3 bucket | `"s3://bucket/model"` |
Task Types¶
| Task Type | Description | Use Case |
|---|---|---|
| `classification` | Binary/multi-class classification | Promoter prediction, motif detection |
| `generation` | Sequence generation | DNA synthesis, sequence design |
| `masked` | Masked language modeling | Sequence completion, mutation analysis |
| `embedding` | Feature extraction | Sequence representation, similarity |
| `regression` | Continuous value prediction | Expression level, binding affinity |
Dataset Configuration¶
Basic Dataset Definition¶
```yaml
datasets:
  - name: "promoter_data"
    path: "path/to/promoter_data.csv"
    task: "binary_classification"
    text_column: "sequence"
    label_column: "label"
```
Advanced Dataset Configuration¶
```yaml
datasets:
  - name: "promoter_data"
    path: "path/to/promoter_data.csv"
    task: "binary_classification"
    text_column: "sequence"
    label_column: "label"

    # Preprocessing options
    max_length: 512
    truncation: true
    padding: "max_length"

    # Data splitting
    test_size: 0.2
    val_size: 0.1
    random_state: 42

    # Data filtering (nested so the filter bounds do not clash with the
    # tokenization max_length above; see "Data Preprocessing Options" below)
    filtering:
      min_length: 100
      max_length: 1000
      valid_chars: "ACGT"

    # Data augmentation
    augment: true
    reverse_complement_ratio: 0.5
    random_mutation_ratio: 0.1

    # Custom preprocessing
    preprocessors:
      - "remove_n_bases"
      - "normalize_case"
      - "add_padding"
```
Dataset Formats¶
CSV/TSV Format¶
```yaml
datasets:
  - name: "csv_dataset"
    path: "data.csv"
    format: "csv"
    separator: ","        # or "\t" for TSV
    encoding: "utf-8"
    text_column: "sequence"
    label_column: "label"
    additional_columns: ["metadata", "source"]
```
JSON Format¶
```yaml
datasets:
  - name: "json_dataset"
    path: "data.json"
    format: "json"
    text_key: "sequence"
    label_key: "label"
    nested_path: "data.items"   # For nested JSON structures
```
FASTA Format¶
```yaml
datasets:
  - name: "fasta_dataset"
    path: "sequences.fasta"
    format: "fasta"
    label_parser: "header"                     # Extract label from header
    header_format: "sequence_id|label:value"   # Custom header format
```
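With `header_format: "sequence_id|label:value"`, a header such as `>seq_001|label:1` carries both the ID and the label. A minimal sketch of that parse, assuming exactly this delimiter layout:

```python
def parse_fasta_header(header: str) -> tuple[str, str]:
    """Split '>seq_001|label:1' into ('seq_001', '1')."""
    seq_id, label_field = header.lstrip(">").split("|", 1)
    _, label = label_field.split(":", 1)
    return seq_id, label

print(parse_fasta_header(">seq_001|label:1"))  # ('seq_001', '1')
```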
Arrow/Parquet Format¶
```yaml
datasets:
  - name: "arrow_dataset"
    path: "data.arrow"
    format: "arrow"
    text_column: "sequence"
    label_column: "label"
```
Data Preprocessing Options¶
```yaml
datasets:
  - name: "processed_data"
    path: "raw_data.csv"

    # Sequence processing
    preprocessing:
      remove_n_bases: true
      normalize_case: true
      add_padding: true
      padding_size: 512

    # Quality filtering
    filtering:
      min_length: 200
      max_length: 1000
      min_gc_content: 0.2
      max_gc_content: 0.8
      valid_chars: "ACGT"

    # Data augmentation
    augmentation:
      reverse_complement: true
      random_mutations: true
      mutation_rate: 0.01
      synthetic_samples: 1000
```
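GC-content filtering, reverse complementation, and random mutation are standard sequence operations. This sketch shows what the `filtering` and `augmentation` options above imply, independent of DNALLM's internal implementation:

```python
import random

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def passes_filter(seq: str, min_len=200, max_len=1000,
                  min_gc=0.2, max_gc=0.8) -> bool:
    # Length and GC-content gates mirroring the filtering block above.
    if not (min_len <= len(seq) <= max_len):
        return False
    gc = (seq.count("G") + seq.count("C")) / len(seq)
    return min_gc <= gc <= max_gc

def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]

def mutate(seq: str, rate=0.01) -> str:
    # Replace each base with a random one at the configured mutation rate.
    return "".join(random.choice("ACGT") if random.random() < rate else b
                   for b in seq)
```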
Evaluation Configuration¶
Basic Evaluation Settings¶
```yaml
evaluation:
  batch_size: 32
  max_length: 512
  device: "cuda"
  num_workers: 4
```
Advanced Evaluation Options¶
```yaml
evaluation:
  # Batch processing
  batch_size: 32
  gradient_accumulation_steps: 1

  # Sequence processing
  max_length: 512
  truncation: true
  padding: "max_length"

  # Hardware settings
  device: "cuda"    # or "cpu", "auto"
  num_workers: 4
  pin_memory: true

  # Performance optimization
  use_fp16: true
  use_bf16: false
  mixed_precision: true

  # Memory management
  max_memory: "16GB"
  memory_efficient_attention: true

  # Reproducibility
  seed: 42
  deterministic: true

  # Evaluation strategy
  eval_strategy: "steps"   # or "epoch"
  eval_steps: 100
  eval_accumulation_steps: 1
```
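The `seed` and `deterministic` options map onto the usual PyTorch reproducibility knobs. A sketch of what full determinism typically involves:

```python
import random
import numpy as np
import torch

def set_seed(seed: int = 42, deterministic: bool = True) -> None:
    # Seed every RNG that can influence batching order or dropout.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    if deterministic:
        # Trade speed for bit-for-bit reproducibility on GPU.
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False

set_seed(42)
```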
Device Configuration¶
```yaml
evaluation:
  # Pick exactly one device strategy; alternatives are shown commented out.

  # Single GPU
  device: "cuda:0"

  # Multiple GPUs
  # device: "cuda"
  # parallel_strategy: "data_parallel"

  # CPU only
  # device: "cpu"
  # num_threads: 8

  # Auto device selection
  # device: "auto"
  # device_map: "auto"

  # Mixed precision (applies to any GPU strategy)
  use_fp16: true
  use_bf16: false
  mixed_precision: true
```
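Setting `device: "auto"` boils down to probing what hardware is present. Roughly, and independent of DNALLM's exact resolution order:

```python
import torch

def resolve_device(device: str = "auto") -> torch.device:
    # "auto" picks CUDA if available, then Apple MPS, then CPU.
    if device != "auto":
        return torch.device(device)
    if torch.cuda.is_available():
        return torch.device("cuda")
    if torch.backends.mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")

print(resolve_device("auto"))
```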
Metrics Configuration¶
Basic Metrics¶
```yaml
metrics:
  - "accuracy"
  - "f1_score"
  - "precision"
  - "recall"
  - "roc_auc"
  - "mse"
  - "mae"
```
Advanced Metrics¶
```yaml
metrics:
  # Classification metrics
  - "accuracy"
  - "f1_score"
  - "precision"
  - "recall"
  - "roc_auc"
  - "pr_auc"
  - "matthews_correlation"

  # Regression metrics
  - "mse"
  - "mae"
  - "rmse"
  - "r2_score"
  - "pearson_correlation"
  - "spearman_correlation"

  # Custom metrics
  - name: "gc_content_accuracy"
    class: "GCContentMetric"
    parameters:
      threshold: 0.1
  - name: "conservation_score"
    class: "ConservationMetric"
    parameters:
      window_size: 10
      similarity_threshold: 0.8
```
Custom Metric Configuration¶
```yaml
metrics:
  - name: "custom_dna_metric"
    class: "CustomDNAMetric"
    parameters:
      gc_weight: 0.3
      conservation_weight: 0.4
      motif_weight: 0.3
      threshold: 0.5
    file_path: "path/to/custom_metric.py"
    class_name: "CustomDNAMetric"
```
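DNALLM loads the class named by `class_name` from `file_path`. The interface it expects is not shown here, so the sketch below assumes a simple `compute(predictions, labels)` contract with the `parameters` passed to the constructor; adapt it to the actual base class in your installation.

```python
# path/to/custom_metric.py (hypothetical interface)
import numpy as np

class CustomDNAMetric:
    def __init__(self, gc_weight=0.3, conservation_weight=0.4,
                 motif_weight=0.3, threshold=0.5):
        self.gc_weight = gc_weight
        self.conservation_weight = conservation_weight
        self.motif_weight = motif_weight
        self.threshold = threshold

    def compute(self, predictions: np.ndarray, labels: np.ndarray) -> float:
        # Placeholder scoring: fraction of thresholded predictions matching
        # the labels. Replace with the weighted GC/conservation/motif logic
        # the parameters above describe.
        preds = (predictions >= self.threshold).astype(int)
        return float((preds == labels).mean())
```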
Output Configuration¶
Basic Output Settings¶
```yaml
output:
  format: "html"
  path: "benchmark_results"
  save_predictions: true
  generate_plots: true
```
Advanced Output Options¶
```yaml
output:
  # Output formats
  formats: ["html", "csv", "json", "pdf"]

  # File paths
  path: "benchmark_results"
  predictions_file: "predictions.csv"
  metrics_file: "metrics.json"
  plots_dir: "plots"

  # Content options
  save_predictions: true
  save_embeddings: false
  save_attention_maps: false
  save_token_probabilities: false

  # Visualization
  generate_plots: true
  plot_types: ["bar", "line", "heatmap", "scatter"]
  plot_style: "seaborn"
  plot_colors: ["#1f77b4", "#ff7f0e", "#2ca02c"]

  # Report customization
  report_title: "DNA Model Benchmark Report"
  report_description: "Comprehensive comparison of DNA language models"
  include_summary: true
  include_details: true
  include_recommendations: true

  # Export options
  export_predictions: true
  export_metrics: true
  export_config: true
  export_logs: true
```
Report Customization¶
```yaml
output:
  report:
    title: "DNA Model Benchmark Report"
    subtitle: "Performance Comparison on Promoter Prediction"
    author: "Your Name"
    date: "auto"

    # Sections to include
    sections:
      - "executive_summary"
      - "model_overview"
      - "dataset_description"
      - "results_summary"
      - "detailed_results"
      - "performance_analysis"
      - "recommendations"
      - "appendix"

    # Custom styling
    styling:
      theme: "modern"
      color_scheme: "blue"
      font_family: "Arial"
      font_size: 12

    # Interactive elements
    interactive:
      enable_zoom: true
      enable_hover: true
      enable_selection: true
```
Advanced Configuration¶
Cross-Validation Settings¶
```yaml
advanced:
  cross_validation:
    enabled: true
    method: "k_fold"   # or "stratified_k_fold", "time_series_split"
    n_splits: 5
    shuffle: true
    random_state: 42

    # Stratified options
    stratification:
      enabled: true
      column: "label"
      bins: 10

    # Time series options
    time_series:
      column: "date"
      test_size: 0.2
      gap: 0
```
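`stratified_k_fold` keeps class proportions identical across folds, which matters for imbalanced genomic labels. scikit-learn's implementation shows the mechanics behind these options:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.random.rand(100, 1)          # stand-in features
y = np.array([0] * 80 + [1] * 20)   # imbalanced labels (80/20)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, test_idx) in enumerate(skf.split(X, y)):
    # Each test fold preserves the 80/20 class ratio.
    print(f"fold {fold}: positive ratio = {y[test_idx].mean():.2f}")
```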
Performance Profiling¶
```yaml
advanced:
  performance_profiling:
    enabled: true

    # Memory profiling
    memory:
      track_gpu: true
      track_cpu: true
      track_peak: true
      profile_allocations: true

    # Time profiling
    timing:
      track_inference: true
      track_preprocessing: true
      track_postprocessing: true
      warmup_runs: 10

    # Resource monitoring
    resources:
      track_cpu_usage: true
      track_gpu_usage: true
      track_io: true
      sampling_interval: 0.1
```
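`warmup_runs` and the GPU-memory trackers correspond to the usual measurement pattern: discard the first runs (which pay one-off compilation and caching costs), then time steady-state inference and read peak allocation. A sketch in plain PyTorch, assuming a model that takes keyword tensors:

```python
import time
import torch

def profile_inference(model, batch, warmup_runs=10, timed_runs=50):
    model.eval()
    with torch.no_grad():
        for _ in range(warmup_runs):   # discard compilation/caching cost
            model(**batch)
        if torch.cuda.is_available():
            torch.cuda.reset_peak_memory_stats()
            torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(timed_runs):
            model(**batch)
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        latency = (time.perf_counter() - start) / timed_runs
    peak = torch.cuda.max_memory_allocated() if torch.cuda.is_available() else 0
    return {"latency_s": latency, "peak_gpu_bytes": peak}
```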
Custom Evaluation Pipeline¶
```yaml
advanced:
  custom_pipeline:
    enabled: true
    pipeline_file: "path/to/custom_pipeline.py"

    # Pipeline steps
    steps:
      - name: "data_preprocessing"
        function: "custom_preprocess"
        parameters:
          normalize: true
          augment: false
      - name: "model_evaluation"
        function: "custom_evaluate"
        parameters:
          metric: "custom_metric"
          threshold: 0.5
      - name: "result_aggregation"
        function: "custom_aggregate"
        parameters:
          method: "weighted_average"
          weights: [0.4, 0.3, 0.3]
```
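Each `function` name refers to a callable in `pipeline_file`. The exact signatures DNALLM expects are not documented here, so this sketch assumes each step receives its inputs plus the configured `parameters` as keyword arguments:

```python
# path/to/custom_pipeline.py (hypothetical signatures)

def custom_preprocess(dataset, normalize=True, augment=False):
    """Return the transformed dataset; runs once before evaluation."""
    if normalize:
        dataset = [seq.upper() for seq in dataset]
    return dataset

def custom_evaluate(model, dataset, metric="custom_metric", threshold=0.5):
    """Return a mapping of metric name -> score for one model/dataset pair."""
    scores = [model(seq) for seq in dataset]
    hits = sum(score >= threshold for score in scores)
    return {metric: hits / len(dataset)}

def custom_aggregate(results, method="weighted_average", weights=None):
    """Combine per-step scores into the final reported number."""
    weights = weights or [1 / len(results)] * len(results)
    return sum(w * r for w, r in zip(weights, results))
```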
Configuration Examples¶
Complete Example: Promoter Prediction¶
```yaml
benchmark:
  name: "Promoter Prediction Benchmark"
  description: "Comparing DNA language models on promoter prediction tasks"

  models:
    - name: "Plant DNABERT"
      path: "zhangtaolab/plant-dnabert-BPE"
      source: "huggingface"
      task_type: "classification"
    - name: "Plant DNAGPT"
      path: "zhangtaolab/plant-dnagpt-BPE"
      source: "huggingface"
      task_type: "generation"
    - name: "Nucleotide Transformer"
      path: "InstaDeepAI/nucleotide-transformer-500m-human-ref"
      source: "huggingface"
      task_type: "classification"

  datasets:
    - name: "promoter_strength"
      path: "data/promoter_strength.csv"
      task: "binary_classification"
      text_column: "sequence"
      label_column: "label"
      max_length: 512
      test_size: 0.2
      val_size: 0.1
    - name: "open_chromatin"
      path: "data/open_chromatin.csv"
      task: "binary_classification"
      text_column: "sequence"
      label_column: "label"
      max_length: 512

  metrics:
    - "accuracy"
    - "f1_score"
    - "precision"
    - "recall"
    - "roc_auc"
    - name: "gc_content_accuracy"
      class: "GCContentMetric"

  evaluation:
    batch_size: 32
    max_length: 512
    device: "cuda"
    num_workers: 4
    use_fp16: true
    seed: 42

  output:
    format: "html"
    path: "promoter_benchmark_results"
    save_predictions: true
    generate_plots: true
    report_title: "Promoter Prediction Model Comparison"

  advanced:
    cross_validation:
      enabled: true
      method: "stratified_k_fold"
      n_splits: 5
    performance_profiling:
      enabled: true
      memory:
        track_gpu: true
        track_peak: true
```
Minimal Example¶
```yaml
benchmark:
  name: "Quick Model Test"

  models:
    - name: "Test Model"
      path: "zhangtaolab/plant-dnabert-BPE"
      source: "huggingface"
      task_type: "classification"

  datasets:
    - name: "test_data"
      path: "test.csv"
      task: "binary_classification"
      text_column: "sequence"
      label_column: "label"

  metrics:
    - "accuracy"
    - "f1_score"

  evaluation:
    batch_size: 16
    device: "cuda"

  output:
    format: "csv"
    path: "quick_test_results"
```
Configuration Validation¶
Schema Validation¶
DNALLM automatically validates your configuration:
```python
from dnallm import validate_config
# ValidationError's import path may differ by DNALLM version;
# adjust this to wherever your installation exposes it.
from dnallm import ValidationError

# Validate configuration
try:
    validate_config("benchmark_config.yaml")
    print("Configuration is valid!")
except ValidationError as e:
    print(f"Configuration error: {e}")
```
Common Validation Errors¶
| Error | Cause | Solution |
|---|---|---|
| Model not found | Invalid model path | Check that the model exists on the specified source |
| Invalid task type | Unsupported task | Use one of the supported task types |
| Missing required field | Incomplete configuration | Add the missing required fields |
| Invalid metric name | Unknown metric | Use supported metric names |
| Path not found | Invalid file path | Check that the file exists and is accessible |
Best Practices¶
1. Configuration Organization¶
```yaml
# Use descriptive names
benchmark:
  name: "Comprehensive DNA Model Evaluation 2024"

  # Group related settings
  evaluation:
    # Hardware settings
    device: "cuda"
    num_workers: 4

    # Performance settings
    batch_size: 32
    use_fp16: true
```
2. Environment-Specific Configs¶
Keep development and production settings in separate files rather than in one YAML document, since the two `evaluation` blocks would otherwise collide:

```yaml
# Development config
evaluation:
  batch_size: 8
  device: "cpu"
```

```yaml
# Production config
evaluation:
  batch_size: 64
  device: "cuda"
  use_fp16: true
```
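One way to wire this up is a base config plus a per-environment override merged at load time. A sketch, where the file names `base.yaml` and `<env>.yaml` and the `DNALLM_ENV` variable are illustrative conventions, not part of DNALLM:

```python
import os
import yaml

def load_config(env=None):
    """Merge a base config with per-environment overrides (dev/prod)."""
    env = env or os.environ.get("DNALLM_ENV", "development")
    with open("base.yaml") as fh:
        config = yaml.safe_load(fh)
    with open(f"{env}.yaml") as fh:
        overrides = yaml.safe_load(fh) or {}
    # Shallow merge: keys in the override file win over the base config.
    config["benchmark"]["evaluation"].update(overrides.get("evaluation", {}))
    return config
```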
3. Version Control¶
```yaml
# Include version information
benchmark:
  version: "1.0.0"
  config_version: "2024.1"
  created_by: "Your Name"
  created_date: "2024-01-15"
```
Next Steps¶
After configuring your benchmark:
- Run Your Benchmark: Follow the Getting Started guide
- Explore Advanced Features: Learn about Advanced Techniques
- See Real Examples: Check Examples and Use Cases
- Troubleshoot Issues: Visit Troubleshooting
Need help with configuration? Check our FAQ or open an issue on GitHub.