Configuration Generator¶
The DNALLM Configuration Generator is an interactive CLI tool that helps you create configuration files for various DNALLM tasks without manually writing YAML files.
Features¶
- Interactive Configuration: Step-by-step prompts guide you through configuration options
- Three Configuration Types: Support for fine-tuning, inference, and benchmark configurations
- Smart Defaults: Sensible default values for common use cases
- Validation: Built-in validation to ensure configuration correctness
- Flexible Output: Save configurations to custom file paths
Usage¶
Basic Usage¶
# Generate configuration interactively
dnallm config-generator
# Generate specific configuration type
dnallm config-generator --type finetune
dnallm config-generator --type inference
dnallm config-generator --type benchmark
# Specify output file
dnallm config-generator --output my_config.yaml
Command Line Options¶
--type, -t
: Specify configuration type (finetune, inference, benchmark)

--output, -o
: Specify output file path (default: auto-generated based on type)
Configuration Types¶
1. Fine-tuning Configuration¶
Generates configuration for training/fine-tuning DNA language models.
Includes:

- Task configuration (task type, labels, threshold)
- Training parameters (epochs, batch size, learning rate)
- Optimization settings (weight decay, warmup ratio)
- Logging and evaluation settings
- Inference settings for evaluation
Example Output:
task:
  task_type: binary
  num_labels: 2
  threshold: 0.5
finetune:
  output_dir: ./outputs
  num_train_epochs: 3
  per_device_train_batch_size: 8
  learning_rate: 2e-5
  weight_decay: 0.01
  warmup_ratio: 0.1
  logging_steps: 100
  eval_steps: 100
  save_steps: 500
  seed: 42
inference:
  batch_size: 16
  max_length: 512
  device: auto
  num_workers: 4
  output_dir: ./results
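Because the generated file is plain YAML, you can also adjust it programmatically. A minimal Python sketch using PyYAML (the input file name matches the integration examples below; the output name is illustrative):

import yaml

# Load the generated fine-tuning configuration
with open("finetune_config.yaml") as f:
    config = yaml.safe_load(f)

# Override a couple of hyperparameters, e.g. for a quick smoke test
config["finetune"]["num_train_epochs"] = 1
config["finetune"]["per_device_train_batch_size"] = 4

# Write the adjusted copy to a new file (name is illustrative)
with open("finetune_smoke_test.yaml", "w") as f:
    yaml.safe_dump(config, f, sort_keys=False)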
2. Inference Configuration¶
Generates configuration for running inference with trained models.
Includes:

- Task configuration
- Inference parameters (batch size, sequence length)
- Hardware settings (device, workers)
- Output configuration
Example Output:
task:
  task_type: binary
  num_labels: 2
  threshold: 0.5
inference:
  batch_size: 16
  max_length: 512
  device: auto
  num_workers: 4
  use_fp16: false
  output_dir: ./results
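Since use_fp16 only pays off on a GPU, one option is to set it conditionally when preparing the config. A small sketch using PyYAML and PyTorch, assuming the file name from the integration examples below:

import torch
import yaml

with open("inference_config.yaml") as f:
    config = yaml.safe_load(f)

# Enable half precision only when a CUDA GPU is actually present
config["inference"]["use_fp16"] = torch.cuda.is_available()

with open("inference_config.yaml", "w") as f:
    yaml.safe_dump(config, f, sort_keys=False)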
3. Benchmark Configuration¶
Generates configuration for benchmarking multiple models.
Includes:

- Benchmark metadata (name, description)
- Model configurations (multiple models with sources)
- Dataset configurations (multiple datasets with formats)
- Evaluation metrics
- Performance settings
- Output and reporting options
Example Output:
benchmark:
  name: DNA Model Benchmark
  description: Comparing DNA language models
models:
  - name: Plant DNABERT
    path: zhangtaolab/plant-dnabert-BPE-promoter
    source: huggingface
    task_type: classification
  - name: Plant DNAGPT
    path: zhangtaolab/plant-dnagpt-BPE-promoter
    source: huggingface
    task_type: generation
datasets:
  - name: promoter_data
    path: data/promoters.csv
    format: csv
    task: binary_classification
    text_column: sequence
    label_column: label
metrics:
  - accuracy
  - f1_score
  - precision
  - recall
evaluation:
  batch_size: 32
  max_length: 512
  device: auto
  num_workers: 4
  seed: 42
output:
  format: html
  path: benchmark_results
  save_predictions: true
  generate_plots: true
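Adding another model later does not require rerunning the generator; the file can be edited directly or in code. A sketch appending an entry to the models list, assuming models sits at the top level as in the example above (the file name matches the benchmark example below):

import yaml

with open("model_comparison.yaml") as f:
    config = yaml.safe_load(f)

# Append another entry with the same fields the generator emits
config["models"].append({
    "name": "plant-dnamamba-6mer-open_chromatin",
    "path": "zhangtaolab/plant-dnamamba-6mer-open_chromatin",
    "source": "huggingface",
    "task_type": "classification",
})

with open("model_comparison.yaml", "w") as f:
    yaml.safe_dump(config, f, sort_keys=False)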
Interactive Prompts¶
The tool will guide you through each configuration section with helpful prompts:
Task Configuration¶
- Task Type: Choose from supported task types
- Number of Labels: For classification tasks
- Threshold: For binary/multilabel classification
- Label Names: Optional human-readable labels
Training Configuration¶
- Basic Settings: Output directory, epochs, batch sizes
- Learning Parameters: Learning rate, weight decay, warmup
- Advanced Options: Gradient accumulation, scheduler, precision
- Logging: Steps for logging, evaluation, and saving
Model Configuration (Benchmark)¶
- Model Details: Name, path, source
- Source Types: Hugging Face, ModelScope, local files
- Task Types: Classification, generation, embedding, etc.
- Advanced Settings: Revision, data types, trust settings
Dataset Configuration (Benchmark)¶
- Dataset Info: Name, file path, format
- Format Support: CSV, TSV, JSON, FASTA, Arrow, Parquet
- Task Types: Binary/multiclass classification, regression
- Preprocessing: Sequence length, truncation, padding
- Data Splitting: Test/validation ratios, random seed
Evaluation Configuration¶
- Performance: Batch size, sequence length, workers
- Hardware: Device selection (CPU, GPU, auto)
- Optimization: Mixed precision, memory efficiency
- Reproducibility: Random seed, deterministic mode
Output Configuration¶
- Formats: HTML, CSV, JSON, PDF reports
- Content: Predictions, embeddings, attention maps
- Visualization: Plots, charts, interactive elements
- Customization: Report titles, sections, recommendations
Examples¶
Quick Fine-tuning Setup¶
# Generate fine-tuning config with defaults
dnallm config-generator --type finetune --output my_training.yaml
# Customize specific parameters
dnallm config-generator --type finetune
# Follow prompts to set custom values
Benchmark Multiple Models¶
# Generate benchmark config
dnallm config-generator --type benchmark --output model_comparison.yaml
# Add multiple models and datasets interactively
# Configure evaluation metrics and output format
Inference Configuration¶
# Generate inference config
dnallm config-generator --type inference --output inference_config.yaml
# Set batch size, device, and output options
Integration with DNALLM¶
Generated configurations can be used directly with DNALLM commands:
# Use generated config for training
dnallm train --config finetune_config.yaml
# Use generated config for inference
dnallm inference --config inference_config.yaml
# Use generated config for benchmarking
dnallm benchmark --config benchmark_config.yaml
Tips and Best Practices¶
- Start with Defaults: Use default values for initial setup, then customize as needed
- Validate Paths: Ensure all file paths in the configuration exist (a path-check sketch follows this list)
- Hardware Considerations: Choose appropriate batch sizes and devices for your hardware
- Task Alignment: Ensure model task types match your dataset and evaluation goals
- Save Templates: Keep generated configs as templates for similar future tasks
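For the path check in particular, a short script can catch missing files before a long run. A rough sketch that walks a loaded config and flags path-like values that do not exist; the key-name heuristic is an assumption, not part of DNALLM:

import yaml
from pathlib import Path

def check_paths(node, prefix=""):
    """Recursively flag config values that look like local paths but don't exist."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key in ("path", "output_dir") or key.endswith("_dir"):
                # Note: Hugging Face model IDs stored under 'path' will be
                # flagged too; this is only a rough pre-flight check.
                if isinstance(value, str) and not Path(value).exists():
                    print(f"Missing: {prefix}{key} -> {value}")
            else:
                check_paths(value, f"{prefix}{key}.")
    elif isinstance(node, list):
        for i, item in enumerate(node):
            check_paths(item, f"{prefix}{i}.")

with open("benchmark_config.yaml") as f:
    check_paths(yaml.safe_load(f))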
Troubleshooting¶
Common Issues¶
- Invalid Task Type: Ensure task type matches your model and data
- Path Errors: Verify all file paths exist and are accessible
- Memory Issues: Reduce batch sizes for large models or limited memory
- Device Errors: Check GPU availability and CUDA installation (see the diagnostic below)
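For device errors, a quick look at what PyTorch can see usually narrows things down:

import torch

# Quick diagnostic for 'device: auto' and GPU-related errors
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device count:", torch.cuda.device_count())
    print("Device name:", torch.cuda.get_device_name(0))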
Getting Help¶
- Review the generated configuration file for any obvious errors
- Check DNALLM documentation for parameter descriptions
- Use smaller datasets for testing configurations
- Verify model compatibility with your chosen task type
Advanced Usage¶
Custom Metrics¶
Add custom evaluation metrics in benchmark configurations:
metrics:
  - name: custom_dna_metric
    class: CustomDNAMetric
    parameters:
      threshold: 0.5
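The interface DNALLM expects from a custom metric class is not shown here, so the following is purely a hypothetical sketch assuming a simple callable that maps predictions and labels to a score; check the DNALLM documentation for the actual base class or protocol:

# Hypothetical sketch only: DNALLM's actual metric interface may differ.
class CustomDNAMetric:
    def __init__(self, threshold: float = 0.5):
        self.threshold = threshold

    def __call__(self, predictions, labels):
        # Binarize scores at the configured threshold, then score accuracy
        predicted = [1 if p >= self.threshold else 0 for p in predictions]
        correct = sum(p == y for p, y in zip(predicted, labels))
        return correct / len(labels)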
Model Variants¶
Configure multiple variants of the same model:
models:
  - name: plant-dnamamba-6mer-open_chromatin
    path: zhangtaolab/plant-dnamamba-6mer-open_chromatin
    source: huggingface
    task_type: classification
  - name: plant-dnabert-BPE-open_chromatin
    path: zhangtaolab/plant-dnabert-BPE-open_chromatin
    source: huggingface
    task_type: classification
Data Augmentation¶
Enable data augmentation for training:
dataset:
  preprocessing:
    augment: true
    reverse_complement_ratio: 0.5
    random_mutation_ratio: 0.1
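To make the two ratios concrete, this is roughly what the augmentations do to a sequence, sketched in plain Python (the function names are illustrative, not DNALLM internals):

import random

COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq: str) -> str:
    """Return the reverse complement, e.g. 'ATGC' -> 'GCAT'."""
    return seq.translate(COMPLEMENT)[::-1]

def random_mutation(seq: str, ratio: float = 0.1) -> str:
    """Randomly substitute roughly `ratio` of the bases."""
    bases = list(seq)
    for i in range(len(bases)):
        if random.random() < ratio:
            bases[i] = random.choice("ACGT")
    return "".join(bases)

print(reverse_complement("ATGCGT"))   # ACGCAT
print(random_mutation("ATGCGTATGC", ratio=0.1))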
The Configuration Generator makes it easy to create comprehensive, validated configurations for all your DNALLM tasks!