
Configuration Guide

This guide provides detailed information about all configuration options available for DNALLM fine-tuning, including examples and best practices.

Overview

DNALLM fine-tuning configuration is defined in YAML format and supports:

- Task Configuration: Task type, labels, and thresholds
- Training Configuration: Learning rates, batch sizes, and optimization
- Model Configuration: Architecture, tokenizer, and source settings
- Advanced Options: Custom training, monitoring, and deployment
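
A configuration file following this structure can be loaded as a plain Python dictionary for inspection before training. This is a minimal sketch using PyYAML and assumes the file is saved as finetune_config.yaml:

import yaml  # PyYAML

# Load the fine-tuning configuration into a nested dictionary
with open("finetune_config.yaml") as f:
    config = yaml.safe_load(f)

# Each top-level key maps to a section of the schema below
print(config["task"]["task_type"])
print(config["finetune"]["learning_rate"])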

Configuration Structure

Basic Configuration Schema

# finetune_config.yaml
task:
  # Task-specific settings
  task_type: "string"
  num_labels: integer
  label_names: []
  threshold: float

finetune:
  # Training parameters
  output_dir: "string"
  num_train_epochs: integer
  per_device_train_batch_size: integer
  learning_rate: float

  # Optimization settings
  weight_decay: float
  warmup_ratio: float
  gradient_accumulation_steps: integer

  # Monitoring and saving
  logging_strategy: "string"
  eval_strategy: "string"
  save_strategy: "string"

  # Advanced training options
  bf16: boolean
  fp16: boolean
  load_best_model_at_end: boolean

Task Configuration

Basic Task Settings

task:
  task_type: "binary"           # Required: task type
  num_labels: 2                 # Required: number of output classes
  label_names: ["neg", "pos"]   # Optional: human-readable labels
  threshold: 0.5                # Optional: classification threshold

Task Types and Requirements

| Task Type | Required Fields | Optional Fields | Description |
| --- | --- | --- | --- |
| binary | num_labels: 2 | label_names, threshold | Binary classification |
| multiclass | num_labels: >2 | label_names | Multi-class classification |
| multilabel | num_labels: >1 | label_names, threshold | Multi-label classification |
| regression | num_labels: 1 | None | Continuous value prediction |
| generation | None | None | Sequence generation |
| mask | None | None | Masked language modeling |
| token | num_labels: >1 | label_names | Token classification |
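
A quick consistency check between num_labels and label_names can catch a common misconfiguration before training starts. The helper below is a hypothetical illustration, not part of the DNALLM API:

def check_task_config(task: dict) -> None:
    """Sanity-check a task section loaded from the YAML config (illustrative only)."""
    num_labels = task.get("num_labels")
    label_names = task.get("label_names")
    if num_labels is not None and label_names is not None:
        if len(label_names) != num_labels:
            raise ValueError(
                f"num_labels ({num_labels}) does not match "
                f"len(label_names) ({len(label_names)})"
            )

# Example: a binary task with two labels passes the check
check_task_config({"task_type": "binary", "num_labels": 2, "label_names": ["neg", "pos"]})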

Task Configuration Examples

Binary Classification

task:
  task_type: "binary"
  num_labels: 2
  label_names: ["negative", "positive"]
  threshold: 0.5

Multi-class Classification

task:
  task_type: "multiclass"
  num_labels: 4
  label_names: ["enzyme", "receptor", "structural", "regulatory"]

Multi-label Classification

task:
  task_type: "multilabel"
  num_labels: 5
  label_names: ["promoter", "enhancer", "silencer", "insulator", "locus_control"]
  threshold: 0.5

Regression

task:
  task_type: "regression"
  num_labels: 1

Generation

task:
  task_type: "generation"
  # No additional fields needed

Training Configuration

Basic Training Settings

finetune:
  # Output and logging
  output_dir: "./outputs"
  report_to: "tensorboard"

  # Training duration
  num_train_epochs: 3
  max_steps: -1  # -1 means use epochs

  # Batch sizes
  per_device_train_batch_size: 8
  per_device_eval_batch_size: 16
  gradient_accumulation_steps: 1
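
The effective batch size seen by the optimizer is the per-device batch size multiplied by the number of gradient accumulation steps and the number of devices. The short calculation below makes this concrete; num_devices is an assumed value for the example:

# Effective batch size = per-device batch x accumulation steps x number of devices
per_device_train_batch_size = 8
gradient_accumulation_steps = 1
num_devices = 1  # assumed single-GPU setup for this example

effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)
print(effective_batch_size)  # 8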

Optimization Settings

finetune:
  # Learning rate and scheduling
  learning_rate: 2e-5
  lr_scheduler_type: "linear"  # linear, cosine, cosine_with_restarts, polynomial
  warmup_ratio: 0.1
  warmup_steps: 0  # Alternative to warmup_ratio

  # Optimizer settings
  weight_decay: 0.01
  adam_beta1: 0.9
  adam_beta2: 0.999
  adam_epsilon: 1e-8

  # Gradient handling
  max_grad_norm: 1.0
  gradient_accumulation_steps: 1

Learning Rate Schedulers

Linear Scheduler

finetune:
  lr_scheduler_type: "linear"
  warmup_ratio: 0.1
  # Learning rate warms up linearly, then decreases linearly to 0
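
To make the shape of this schedule concrete, the function below sketches a linear schedule with warmup as a multiplier on the base learning rate. It mirrors the usual warmup-then-linear-decay behaviour and is for illustration only:

def linear_schedule_with_warmup(step: int, total_steps: int, warmup_ratio: float = 0.1) -> float:
    """Return the learning-rate multiplier at a given step (illustrative sketch)."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return step / max(1, warmup_steps)  # ramp up from 0 to 1
    # decay linearly from 1 to 0 over the remaining steps
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

# Multiplier at a few points of a 1000-step run
print([round(linear_schedule_with_warmup(s, 1000), 2) for s in (0, 50, 100, 550, 1000)])
# [0.0, 0.5, 1.0, 0.5, 0.0]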

Cosine Scheduler

finetune:
  lr_scheduler_type: "cosine"
  warmup_ratio: 0.1
  # Learning rate follows a cosine decay curve after warmup

Cosine with Restarts

finetune:
  lr_scheduler_type: "cosine_with_restarts"
  warmup_ratio: 0.1
  num_train_epochs: 6
  # Learning rate periodically restarts; the number of cycles depends on the scheduler settings

Polynomial Scheduler

finetune:
  lr_scheduler_type: "polynomial"
  warmup_ratio: 0.1
  power: 1.0  # Polynomial decay power (1.0 is equivalent to linear decay)
  # Learning rate decreases polynomially from the end of warmup to 0

Monitoring and Evaluation

finetune:
  # Logging
  logging_strategy: "steps"  # steps, epoch, no
  logging_steps: 100
  logging_first_step: true

  # Evaluation
  eval_strategy: "steps"  # steps, epoch, no
  eval_steps: 100
  eval_delay: 0

  # Saving
  save_strategy: "steps"  # steps, epoch, no
  save_steps: 500
  save_total_limit: 3
  save_safetensors: true
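
When logging, evaluation, and saving are all step-based, it helps to know how many optimizer steps an epoch takes. The back-of-the-envelope calculation below assumes a hypothetical dataset of 10,000 training examples:

import math

# Assumed values for illustration
num_train_examples = 10_000
per_device_train_batch_size = 8
gradient_accumulation_steps = 1
num_train_epochs = 3

steps_per_epoch = math.ceil(
    num_train_examples / (per_device_train_batch_size * gradient_accumulation_steps)
)
total_steps = steps_per_epoch * num_train_epochs

print(steps_per_epoch)     # 1250 optimizer steps per epoch
print(total_steps)         # 3750 steps in total
print(total_steps // 100)  # ~37 evaluations with eval_steps: 100
print(total_steps // 500)  # ~7 checkpoints with save_steps: 500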

Model Selection and Checkpointing

finetune:
  # Model selection
  load_best_model_at_end: true
  metric_for_best_model: "eval_loss"  # or "eval_accuracy", "eval_f1"
  greater_is_better: false  # false for loss, true for accuracy/f1

  # Checkpointing
  save_total_limit: 3
  save_safetensors: true
  resume_from_checkpoint: null  # Path to resume from

  # Early stopping
  early_stopping_patience: 3
  early_stopping_threshold: 0.001
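
If DNALLM delegates training to the Hugging Face Trainer (an assumption here), the early-stopping fields map naturally onto transformers.EarlyStoppingCallback, as in this sketch:

from transformers import EarlyStoppingCallback

# Mirrors early_stopping_patience / early_stopping_threshold from the config above;
# the callback stops training when the monitored metric stops improving.
early_stopping = EarlyStoppingCallback(
    early_stopping_patience=3,
    early_stopping_threshold=0.001,
)
# trainer = Trainer(..., callbacks=[early_stopping])  # passed to a Trainer instance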

Advanced Training Options

Mixed Precision Training

finetune:
  # Mixed precision options
  fp16: false
  bf16: false

  # FP16 specific settings
  fp16_full_eval: false
  fp16_eval: false

  # BF16 specific settings
  bf16_full_eval: false
  bf16_eval: false
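
bf16 requires hardware support (for example, NVIDIA Ampere or newer GPUs), so it is worth checking before enabling it. The snippet below uses standard PyTorch calls to pick a precision setting:

import torch

# Prefer bf16 when the GPU supports it, fall back to fp16 on older CUDA GPUs, else full precision
if torch.cuda.is_available() and torch.cuda.is_bf16_supported():
    precision = {"bf16": True, "fp16": False}
elif torch.cuda.is_available():
    precision = {"bf16": False, "fp16": True}
else:
    precision = {"bf16": False, "fp16": False}

print(precision)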

Memory Optimization

finetune:
  # Memory optimization
  dataloader_pin_memory: true
  dataloader_num_workers: 4

  # Gradient checkpointing
  gradient_checkpointing: false

  # Memory efficient attention
  memory_efficient_attention: false

Reproducibility

finetune:
  # Reproducibility
  seed: 42
  deterministic: true

  # Data loading
  dataloader_drop_last: false
  remove_unused_columns: true

  # Training
  group_by_length: false
  length_column_name: "length"
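
For reference, the standard way to seed the common sources of randomness in a Python/PyTorch stack is shown below; DNALLM's seed option presumably does something equivalent internally, but that is an assumption:

import random

import numpy as np
import torch

def set_seed(seed: int = 42) -> None:
    """Seed Python, NumPy, and PyTorch RNGs for reproducible runs."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # no-op if CUDA is unavailable

set_seed(42)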

Model Configuration

Model Loading

model:
  # Model source
  source: "huggingface"  # huggingface, modelscope, local

  # Model path
  path: "zhangtaolab/plant-dnabert-BPE"

  # Model options
  revision: "main"
  trust_remote_code: true
  torch_dtype: "float32"  # float32, float16, bfloat16
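
For models hosted on the Hugging Face Hub, these options correspond to the standard from_pretrained arguments. The sketch below assumes a sequence-classification head with two labels and is illustrative rather than DNALLM's own loading code:

import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "zhangtaolab/plant-dnabert-BPE",
    revision="main",
    trust_remote_code=True,
    torch_dtype=torch.float32,
    num_labels=2,  # assumed binary task for this example
)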

Tokenizer Configuration

tokenizer:
  # Tokenizer options
  use_fast: true
  model_max_length: 512

  # Special tokens
  pad_token: "[PAD]"
  unk_token: "[UNK]"
  mask_token: "[MASK]"
  sep_token: "[SEP]"
  cls_token: "[CLS]"
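
If the tokenizer is a standard Hugging Face tokenizer (an assumption), these options correspond to the usual AutoTokenizer arguments:

from transformers import AutoTokenizer

# Load the tokenizer that ships with the model; options mirror the config above
tokenizer = AutoTokenizer.from_pretrained(
    "zhangtaolab/plant-dnabert-BPE",
    use_fast=True,
    model_max_length=512,
    trust_remote_code=True,
)
print(tokenizer("ACGTACGT")["input_ids"])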

Data Configuration

Dataset Settings

dataset:
  # Data loading
  max_length: 512
  truncation: true
  padding: "max_length"

  # Data splitting
  test_size: 0.2
  val_size: 0.1
  random_state: 42

  # Data augmentation
  augment: true
  reverse_complement_ratio: 0.5
  random_mutation_ratio: 0.1
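
Reverse-complement augmentation is the most common DNA-specific transform. A minimal implementation looks like this (illustrative, not DNALLM's internal code):

import random

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(_COMPLEMENT)[::-1]

def maybe_augment(seq: str, reverse_complement_ratio: float = 0.5) -> str:
    """Replace a sequence with its reverse complement with the given probability."""
    if random.random() < reverse_complement_ratio:
        return reverse_complement(seq)
    return seq

print(reverse_complement("ACGTT"))  # AACGT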

Data Preprocessing

dataset:
  preprocessing:
    # Sequence processing
    remove_n_bases: true
    normalize_case: true
    add_padding: true
    padding_size: 512

    # Quality filtering
    min_length: 100
    max_length: 1000
    valid_chars: "ACGT"

    # Data augmentation
    reverse_complement: true
    random_mutations: true
    mutation_rate: 0.01
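
The quality-filtering options above amount to a simple per-sequence predicate. A sketch under the same assumptions (min/max length and an ACGT alphabet) is shown below:

def passes_quality_filter(
    seq: str,
    min_length: int = 100,
    max_length: int = 1000,
    valid_chars: str = "ACGT",
) -> bool:
    """Return True if a sequence satisfies the length and alphabet constraints."""
    seq = seq.upper()
    if not (min_length <= len(seq) <= max_length):
        return False
    return all(base in valid_chars for base in seq)

print(passes_quality_filter("ACGT" * 30))  # True  (120 bp, valid alphabet)
print(passes_quality_filter("ACGN" * 30))  # False (contains N)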

Complete Configuration Examples

Binary Classification Example

task:
  task_type: "binary"
  num_labels: 2
  label_names: ["negative", "positive"]
  threshold: 0.5

finetune:
  output_dir: "./promoter_classification"
  num_train_epochs: 5
  per_device_train_batch_size: 16
  per_device_eval_batch_size: 32
  gradient_accumulation_steps: 1

  # Optimization
  learning_rate: 2e-5
  weight_decay: 0.01
  warmup_ratio: 0.1
  lr_scheduler_type: "linear"

  # Monitoring
  logging_strategy: "steps"
  logging_steps: 100
  eval_strategy: "steps"
  eval_steps: 100
  save_strategy: "steps"
  save_steps: 500
  save_total_limit: 3

  # Model selection
  load_best_model_at_end: true
  metric_for_best_model: "eval_f1"
  greater_is_better: true

  # Mixed precision
  bf16: true

  # Reproducibility
  seed: 42
  deterministic: true

  # Reporting
  report_to: "tensorboard"

Multi-class Classification Example

task:
  task_type: "multiclass"
  num_labels: 4
  label_names: ["enzyme", "receptor", "structural", "regulatory"]

finetune:
  output_dir: "./functional_annotation"
  num_train_epochs: 8
  per_device_train_batch_size: 16
  per_device_eval_batch_size: 32
  gradient_accumulation_steps: 2

  # Higher learning rate for multi-class
  learning_rate: 3e-5
  weight_decay: 0.02
  warmup_ratio: 0.15
  lr_scheduler_type: "cosine"

  # Monitoring
  logging_strategy: "steps"
  logging_steps: 200
  eval_strategy: "steps"
  eval_steps: 200
  save_strategy: "steps"
  save_steps: 1000
  save_total_limit: 5

  # Model selection
  load_best_model_at_end: true
  metric_for_best_model: "eval_accuracy"
  greater_is_better: true

  # Mixed precision
  fp16: true

  # Reproducibility
  seed: 42
  deterministic: true

Generation Task Example

task:
  task_type: "generation"

finetune:
  output_dir: "./sequence_generation"
  num_train_epochs: 15
  per_device_train_batch_size: 8
  per_device_eval_batch_size: 16
  gradient_accumulation_steps: 2

  # Higher learning rate for generation
  learning_rate: 5e-5
  weight_decay: 0.01
  warmup_ratio: 0.2
  lr_scheduler_type: "cosine_with_restarts"

  # Monitoring
  logging_strategy: "steps"
  logging_steps: 500
  eval_strategy: "steps"
  eval_steps: 500
  save_strategy: "steps"
  save_steps: 2000
  save_total_limit: 3

  # Model selection
  load_best_model_at_end: true
  metric_for_best_model: "eval_loss"
  greater_is_better: false

  # Generation settings
  generation_max_length: 512
  generation_num_beams: 4
  generation_early_stopping: true

  # Mixed precision
  bf16: true

  # Reproducibility
  seed: 42
  deterministic: true

Regression Task Example

task:
  task_type: "regression"
  num_labels: 1

finetune:
  output_dir: "./expression_prediction"
  num_train_epochs: 10
  per_device_train_batch_size: 16
  per_device_eval_batch_size: 32
  gradient_accumulation_steps: 1

  # Higher learning rate for regression
  learning_rate: 1e-4
  weight_decay: 0.05
  warmup_ratio: 0.1
  lr_scheduler_type: "polynomial"

  # Monitoring
  logging_strategy: "steps"
  logging_steps: 100
  eval_strategy: "steps"
  eval_steps: 100
  save_strategy: "steps"
  save_steps: 500
  save_total_limit: 3

  # Model selection
  load_best_model_at_end: true
  metric_for_best_model: "eval_rmse"
  greater_is_better: false

  # Mixed precision
  fp16: true

  # Reproducibility
  seed: 42
  deterministic: true

Environment-Specific Configurations

Development Configuration

finetune:
  # Development settings
  num_train_epochs: 1
  per_device_train_batch_size: 4
  logging_steps: 10
  eval_steps: 50
  save_steps: 100

  # Quick testing
  max_steps: 100
  eval_delay: 0

Production Configuration

finetune:
  # Production settings
  num_train_epochs: 10
  per_device_train_batch_size: 32
  gradient_accumulation_steps: 2

  # Robust training
  early_stopping_patience: 5
  save_total_limit: 10

  # Monitoring
  logging_steps: 500
  eval_steps: 500
  save_steps: 2000

GPU Memory Optimization

finetune:
  # Memory optimization
  per_device_train_batch_size: 8
  gradient_accumulation_steps: 4
  gradient_checkpointing: true
  memory_efficient_attention: true

  # Mixed precision
  bf16: true

  # Data loading
  dataloader_num_workers: 2
  dataloader_pin_memory: false
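
This setup keeps the same effective batch size as a larger-batch configuration (8 per device x 4 accumulation steps = 32 sequences per optimizer step) while lowering peak memory. Available GPU memory can be checked with standard PyTorch calls:

import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    total_gb = props.total_memory / 1024**3
    print(f"{props.name}: {total_gb:.1f} GiB total")
else:
    print("No CUDA device available")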

Configuration Validation

Schema Validation

DNALLM automatically validates your configuration:

from dnallm import validate_config

# Validate the configuration file; an exception is raised if it is invalid
try:
    validate_config("finetune_config.yaml")
    print("Configuration is valid!")
except Exception as e:  # e.g. a validation error raised by validate_config
    print(f"Configuration error: {e}")

Common Validation Errors

| Error | Cause | Solution |
| --- | --- | --- |
| Invalid task type | Unsupported task type | Use one of the supported task types |
| Missing required field | Incomplete configuration | Add the missing required field |
| Invalid learning rate | Learning rate too high or too low | Use a reasonable value (1e-6 to 1e-3) |
| Invalid batch size | Batch size too large | Reduce the batch size or use gradient accumulation |

Best Practices

1. Configuration Organization

# Use descriptive names
finetune:
  output_dir: "./promoter_classification_2024"

# Group related settings
finetune:
  # Training duration
  num_train_epochs: 5
  max_steps: -1

  # Batch processing
  per_device_train_batch_size: 16
  gradient_accumulation_steps: 1

2. Environment-Specific Configs

# Development config
finetune:
  num_train_epochs: 1
  per_device_train_batch_size: 4

# Production config
finetune:
  num_train_epochs: 10
  per_device_train_batch_size: 32

3. Version Control

# Include version information
config_version: "1.0.0"
created_by: "Your Name"
created_date: "2024-01-15"
experiment_name: "promoter_classification_v1"

4. Hyperparameter Tuning

# Use consistent naming for experiments
finetune:
  output_dir: "./experiments/lr_{learning_rate}_bs_{per_device_train_batch_size}"

# Document hyperparameter choices
notes: "Testing different learning rates for promoter classification"
hyperparameters:
  learning_rate: "2e-5"
  batch_size: "16"
  scheduler: "linear"
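
One way to keep experiment names consistent with the output_dir pattern above is to generate one config file per hyperparameter combination from a base config. The script below is a hypothetical sketch using PyYAML and itertools, and assumes finetune_config.yaml exists with the schema described in this guide:

import itertools
from pathlib import Path

import yaml

# Hypothetical sweep over learning rate and batch size
learning_rates = [1e-5, 2e-5, 5e-5]
batch_sizes = [8, 16]

base = yaml.safe_load(Path("finetune_config.yaml").read_text())

for lr, bs in itertools.product(learning_rates, batch_sizes):
    cfg = dict(base)  # shallow copy is enough because the finetune subdict is replaced wholesale
    cfg["finetune"] = {
        **base["finetune"],
        "learning_rate": lr,
        "per_device_train_batch_size": bs,
        "output_dir": f"./experiments/lr_{lr}_bs_{bs}",
    }
    out = Path(f"experiments/config_lr_{lr}_bs_{bs}.yaml")
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(yaml.safe_dump(cfg))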

Next Steps

After configuring your fine-tuning:

  1. Run Your Training: Follow the Getting Started guide
  2. Explore Task-Specific Guides: Check Task-Specific Guides
  3. Advanced Techniques: Learn about Advanced Techniques
  4. Real-world Examples: See Examples and Use Cases

Need help with configuration? Check our FAQ or open an issue on GitHub.