# Performance Guide
Optimizing performance is key to working efficiently with large DNA models. This guide provides practical tips to speed up your training and inference workflows and reduce memory consumption.
## 1. Speeding Up Training

### Use Mixed-Precision Training (FP16/BF16)
- What it is: Using 16-bit floating-point numbers instead of 32-bit for model weights and computations.
- Why it helps: Modern GPUs have specialized Tensor Cores that process 16-bit operations much faster. It also cuts memory usage in half.
- How to use it: In your `config.yaml`, enable `fp16` or `bf16`.

```yaml
training_args:
  fp16: true   # For most GPUs
  # bf16: true # For NVIDIA Ampere (A100, 30xx) or newer
```
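For intuition, enabling `fp16` roughly corresponds to the automatic mixed-precision pattern sketched below. The trainer handles this internally; `model`, `optimizer`, and `loader` are assumed placeholders, and the `.loss` attribute assumes a Hugging Face-style model output.

```python
import torch

# Illustrative sketch of what fp16 training does under the hood;
# `model`, `optimizer`, and `loader` are assumed to exist already.
scaler = torch.cuda.amp.GradScaler()

for batch in loader:
    optimizer.zero_grad()
    # Run the forward pass in 16-bit where it is numerically safe
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = model(**batch).loss
    # Scale the loss so small fp16 gradients do not underflow, then step
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```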
### Use Multiple GPUs
- What it is: Distributing your training job across all available GPUs on a machine.
- Why it helps: It parallelizes data processing, leading to a near-linear speedup with the number of GPUs.
- How to use it: Launch your training script with `torchrun`.

```bash
# Example for a machine with 4 GPUs
torchrun --nproc_per_node=4 -m dnallm.cli.finetune --config_file your_config.yaml
```
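Keep in mind that with data-parallel training the effective batch size grows with the number of processes (`per_device_train_batch_size` × number of GPUs), so you may want to adjust the per-device batch size or learning rate accordingly.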
### Enable Flash Attention
- What it is: A highly optimized implementation of the attention mechanism.
- Why it helps: It is faster and more memory-efficient than the standard attention implementation, especially for longer sequences.
- How to use it: Install `flash-attn` and enable it in your model configuration.

```bash
pip install flash-attn
```

Note: This is only supported by certain model architectures, such as LLaMA and Evo.

```yaml
# In your config.yaml
model_args:
  attn_implementation: "flash_attention_2"
```
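If you load a checkpoint directly through Hugging Face `transformers` rather than via the config file, the same option can usually be passed to `from_pretrained`. The sketch below is an assumption based on the standard `transformers` API; the model class and checkpoint name are placeholders reused from this guide, and Flash Attention 2 additionally requires half-precision weights.

```python
import torch
from transformers import AutoModelForCausalLM

# Assumes the checkpoint's architecture supports Flash Attention 2
model = AutoModelForCausalLM.from_pretrained(
    "GenerTeam/GENERator-eukaryote-1.2b-base",
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,  # Flash Attention requires fp16/bf16 weights
)
```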
## 2. Reducing Memory Usage (VRAM)

### Use Gradient Accumulation
- What it is: Simulating a large batch size by accumulating gradients over several smaller forward/backward passes before updating the model weights.
- Why it helps: This is the most effective way to train large models on GPUs with limited VRAM. It drastically reduces memory requirements with a minimal impact on speed.
- How to use it: Set `gradient_accumulation_steps` in your training configuration.

```yaml
training_args:
  per_device_train_batch_size: 2   # A small size that fits in memory
  gradient_accumulation_steps: 16  # Effective batch size = 2 * 16 = 32
```
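The setting above is equivalent to the following pattern inside the training loop. The trainer does this for you; the sketch only illustrates the mechanics, with `model`, `optimizer`, and `loader` as assumed placeholders.

```python
accumulation_steps = 16

optimizer.zero_grad()
for step, batch in enumerate(loader):
    # Average the loss so the accumulated gradient matches one large batch
    loss = model(**batch).loss / accumulation_steps
    loss.backward()  # Gradients accumulate in .grad across iterations
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()       # One weight update per 16 small batches
        optimizer.zero_grad()
```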
### Use Model Quantization (QLoRA)
- What it is: Loading the model with its weights converted to a lower-precision format, like 4-bit integers.
- Why it helps: It dramatically reduces the model's memory footprint, allowing you to fine-tune very large models on consumer-grade GPUs.
- How to use it: Use the `load_in_4bit` or `load_in_8bit` flags when loading the model. This is typically used with LoRA (Low-Rank Adaptation).

```python
# Example of loading a model in 4-bit for QLoRA fine-tuning
model, tokenizer = load_model_and_tokenizer(
    "GenerTeam/GENERator-eukaryote-1.2b-base",
    load_in_4bit=True,
    device_map="auto"
)
```
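The 4-bit base model is then wrapped with LoRA adapters so that only a small set of low-rank weights is trained. Below is a minimal sketch using the `peft` library; the hyperparameters and `target_modules` names are illustrative assumptions and depend on the model architecture.

```python
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Prepare the quantized model for training (e.g. casts some layers to fp32)
model = prepare_model_for_kbit_training(model)

# Illustrative LoRA settings, not tuned values
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # placeholder: depends on the architecture
    task_type="CAUSAL_LM",
)

# Only the adapter weights receive gradients; the 4-bit base stays frozen
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```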
## 3. Speeding Up Inference

### Use Batching
- What it is: Grouping multiple sequences together and processing them in a single batch.
- Why it helps: It maximizes GPU utilization and is much more efficient than processing sequences one by one.
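A minimal sketch of batched inference, assuming a Hugging Face-style `tokenizer` and `model` as loaded elsewhere in this guide; `sequences` is a placeholder list of DNA strings.

```python
import torch

sequences = ["ATGCGTAC", "GGCTAAGT", "TTACGCAA"]  # placeholder DNA sequences

# Tokenize the whole batch at once, padding to the longest sequence
# (some tokenizers require tokenizer.pad_token to be set first)
inputs = tokenizer(sequences, padding=True, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model(**inputs)  # one forward pass for the entire batch
```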
### Use `torch.compile`
- What it is: A feature in PyTorch 2.0+ that JIT-compiles your model into optimized code.
- Why it helps: It can provide a significant speedup (up to 2x) with a single line of code, especially for inference.
```python
import torch

# model = your loaded model
compiled_model = torch.compile(model)

# Use compiled_model for inference
```
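Note that the first calls to a compiled model are slower because compilation happens lazily at run time; the speedup shows up on subsequent calls with similarly shaped inputs.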