finetune/trainer API

DNA Language Model Trainer Module.

This module implements training process management for DNA language models. Main features:

1. DNATrainer Class
   - Unified management of model training, evaluation, and prediction
   - Support for multiple task types (classification, regression, masked language modeling)
   - Integration of task-specific prediction heads
   - Training parameter configuration
   - Training process monitoring and model saving

2. Core Features
   - Model initialization and device management
   - Training parameter configuration
   - Training loop control
   - Evaluation metrics calculation
   - Model saving and loading
   - Prediction result generation

3. Supported Training Features
   - Automatic evaluation and best-model saving
   - Training log recording
   - Flexible batch size settings
   - Learning rate and weight decay configuration
   - Distributed training support
   - LoRA (Low-Rank Adaptation) for efficient fine-tuning
Usage Example

```python
trainer = DNATrainer(
    model=model,
    config=config,
    datasets=datasets
)
metrics = trainer.train()
```
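
The feature list above mentions LoRA fine-tuning; here is a minimal sketch of enabling it. The `lora` config section is unpacked directly into `peft.LoraConfig`, so its keys must be valid `LoraConfig` arguments; the specific values and `target_modules` names below are illustrative assumptions, not defaults from this library.

```python
# Illustrative LoRA settings; the dict is passed straight to peft.LoraConfig.
config["lora"] = {
    "r": 8,                                # low-rank update dimension
    "lora_alpha": 16,                      # scaling factor
    "lora_dropout": 0.05,
    "target_modules": ["query", "value"],  # assumed attention module names
}

trainer = DNATrainer(
    model=model,
    config=config,
    datasets=datasets,
    use_lora=True,  # wraps the model via peft.get_peft_model
)
metrics = trainer.train()
```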

DNATrainer

DNA Language Model Trainer that supports multiple model types.

This trainer class provides a unified interface for training, evaluating, and predicting with DNA language models. It supports various task types including classification, regression, and masked language modeling.

Attributes:

| Name | Description |
| --- | --- |
| `model` | The DNA language model to be trained |
| `task_config` | Configuration for the specific task |
| `train_config` | Configuration for training parameters |
| `datasets` | Dataset for training and evaluation |
| `extra_args` | Additional training arguments |
| `trainer` | HuggingFace Trainer instance |
| `training_args` | Training arguments configuration |
| `data_split` | Available dataset splits |

Examples:

```python
trainer = DNATrainer(
    model=model,
    config=config,
    datasets=datasets
)
metrics = trainer.train()
```
Source code in dnallm/finetune/trainer.py
````python
class DNATrainer:
    """DNA Language Model Trainer that supports multiple model types.

    This trainer class provides a unified interface for training, evaluating, and predicting
    with DNA language models. It supports various task types including classification,
    regression, and masked language modeling.

    Attributes:
        model: The DNA language model to be trained
        task_config: Configuration for the specific task
        train_config: Configuration for training parameters
        datasets: Dataset for training and evaluation
        extra_args: Additional training arguments
        trainer: HuggingFace Trainer instance
        training_args: Training arguments configuration
        data_split: Available dataset splits

    Examples:
        ```python
        trainer = DNATrainer(
            model=model,
            config=config,
            datasets=datasets
        )
        metrics = trainer.train()
        ```
    """

    def __init__(
        self,
        model: Any,
        config: dict,
        datasets: Optional[DNADataset] = None,
        extra_args: Optional[Dict] = None,
        use_lora: bool = False,
    ):
        """Initialize the DNA trainer.

        Args:
            model: The DNA language model to be trained
            config: Configuration dictionary containing task and training settings
            datasets: Dataset for training and evaluation
            extra_args: Additional training arguments to override defaults
            use_lora: Whether to use LoRA for efficient fine-tuning
        """
        self.model = model
        self.task_config = config['task']
        self.train_config = config['finetune']
        self.datasets = datasets
        self.extra_args = extra_args

        # LoRA
        if use_lora:
            print("[Info] Applying LoRA to the model...")
            lora_config = LoraConfig(
                **config["lora"]
            )
            self.model = get_peft_model(self.model, lora_config)
            self.model.print_trainable_parameters()

        # Multi-GPU support
        if torch.cuda.device_count() > 1:
            print(f"[Info] Using {torch.cuda.device_count()} GPUs.")
            self.model = torch.nn.DataParallel(self.model)

        self.set_up_trainer()

    def set_up_trainer(self):
        """Set up the HuggingFace Trainer with appropriate configurations.

        This method configures the training environment by:
        1. Setting up training arguments from configuration
        2. Configuring dataset splits (train/eval/test)
        3. Setting up task-specific metrics computation
        4. Configuring appropriate data collator for different task types
        5. Initializing the HuggingFace Trainer instance

        The method automatically handles:
        - Dataset split detection and validation
        - Task-specific data collator selection
        - Evaluation strategy configuration
        - Metrics computation setup
        """
        # Setup training arguments
        training_args = self.train_config.model_dump()
        if self.extra_args:
            training_args.update(self.extra_args)
        self.training_args = TrainingArguments(
            **training_args,
        )
        # Check if the dataset has been split
        if isinstance(self.datasets.dataset, DatasetDict):        
            self.data_split = self.datasets.dataset.keys()
        else:
            self.data_split = []
        # Get datasets
        if "train" in self.data_split:
            train_dataset = self.datasets.dataset["train"]
        else:
            if len(self.data_split) == 0:
                train_dataset = self.datasets.dataset
            else:
                raise KeyError("Cannot find train data.")
        eval_key = [x for x in self.data_split if x not in ['train', 'test']]
        if eval_key:
            eval_dataset = self.datasets.dataset[eval_key[0]]
        elif "test" in self.data_split:
            eval_dataset = self.datasets.dataset['test']
        else:
            eval_dataset = None
            self.training_args.eval_strategy = "no"

        # Get compute metrics
        compute_metrics = self.compute_task_metrics()
        # Set data collator
        if self.task_config.task_type == "mask":
            from transformers import DataCollatorForLanguageModeling
            mlm_probability = self.task_config.mlm_probability
            mlm_probability = mlm_probability if mlm_probability else 0.15
            data_collator = DataCollatorForLanguageModeling(
                tokenizer=self.datasets.tokenizer,
                mlm=True, mlm_probability=mlm_probability
            )
        elif self.task_config.task_type == "generation":
            from transformers import DataCollatorForLanguageModeling
            data_collator = DataCollatorForLanguageModeling(
                tokenizer=self.datasets.tokenizer,
                mlm=False
            )
        else:
            data_collator = None
        # Initialize trainer
        self.trainer = Trainer(
            model=self.model,
            args=self.training_args,
            train_dataset=train_dataset,
            eval_dataset=eval_dataset,
            compute_metrics=compute_metrics,
            data_collator=data_collator,
        )

    def compute_task_metrics(self) -> Callable:
        """Compute task-specific evaluation metrics.

        This method returns a callable function that computes appropriate metrics
        for the specific task type (classification, regression, etc.).

        Returns:
            Callable: A function that computes metrics for the specific task type
        """
        return compute_metrics(self.task_config)

    def train(self, save_tokenizer: bool = True) -> Dict[str, float]:
        """Train the model and return training metrics.

        This method executes the training process using the configured HuggingFace Trainer,
        automatically saving the best model and optionally the tokenizer.

        Args:
            save_tokenizer: Whether to save the tokenizer along with the model, default True

        Returns:
            Dictionary containing training metrics including loss, learning rate, etc.
        """
        self.model.train()
        train_result = self.trainer.train()
        metrics = train_result.metrics
        # Save the model
        self.trainer.save_model()
        if save_tokenizer:
            self.datasets.tokenizer.save_pretrained(self.train_config.output_dir)
        return metrics

    def evaluate(self) -> Dict[str, float]:
        """Evaluate the model on the evaluation dataset.

        This method runs evaluation on the configured evaluation dataset and returns
        task-specific metrics.

        Returns:
            Dictionary containing evaluation metrics for the current model state
        """
        self.model.eval()
        result = self.trainer.evaluate()
        return result

    def predict(self) -> Dict[str, float]:
        """Generate predictions on the test dataset.

        This method generates predictions on the test dataset if available and returns
        both predictions and evaluation metrics.

        Returns:
            Dictionary containing prediction results and metrics if test dataset exists,
            otherwise empty dictionary
        """
        self.model.eval()
        result = {}
        if "test" in self.data_split:
            test_dataset = self.datasets.dataset['test']
            result = self.trainer.predict(test_dataset)
        return result
````
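
As the constructor shows, `config` must provide `task` and `finetune` entries, plus `lora` when `use_lora=True`. A hedged sketch of the expected shape, with hypothetical placeholder objects:

```python
# Hypothetical placeholders: task_config must expose .task_type (and
# .mlm_probability for masked LM tasks); finetune_config must be a pydantic
# model whose .model_dump() yields valid transformers.TrainingArguments
# kwargs (output_dir, learning_rate, num_train_epochs, ...).
config = {
    "task": task_config,
    "finetune": finetune_config,
    "lora": {"r": 8, "lora_alpha": 16},  # only read when use_lora=True
}
```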

__init__(model, config, datasets=None, extra_args=None, use_lora=False)

Initialize the DNA trainer.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `model` | `Any` | The DNA language model to be trained | *required* |
| `config` | `dict` | Configuration dictionary containing task and training settings | *required* |
| `datasets` | `Optional[DNADataset]` | Dataset for training and evaluation | `None` |
| `extra_args` | `Optional[Dict]` | Additional training arguments to override defaults | `None` |
| `use_lora` | `bool` | Whether to use LoRA for efficient fine-tuning | `False` |

Source code in dnallm/finetune/trainer.py
```python
def __init__(
    self,
    model: Any,
    config: dict,
    datasets: Optional[DNADataset] = None,
    extra_args: Optional[Dict] = None,
    use_lora: bool = False,
):
    """Initialize the DNA trainer.

    Args:
        model: The DNA language model to be trained
        config: Configuration dictionary containing task and training settings
        datasets: Dataset for training and evaluation
        extra_args: Additional training arguments to override defaults
        use_lora: Whether to use LoRA for efficient fine-tuning
    """
    self.model = model
    self.task_config = config['task']
    self.train_config = config['finetune']
    self.datasets = datasets
    self.extra_args = extra_args

    # LoRA
    if use_lora:
        print("[Info] Applying LoRA to the model...")
        lora_config = LoraConfig(
            **config["lora"]
        )
        self.model = get_peft_model(self.model, lora_config)
        self.model.print_trainable_parameters()

    # Multi-GPU support
    if torch.cuda.device_count() > 1:
        print(f"[Info] Using {torch.cuda.device_count()} GPUs.")
        self.model = torch.nn.DataParallel(self.model)

    self.set_up_trainer()
```
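
Because `extra_args` is merged over the dumped finetune config before `TrainingArguments` is built, each key must be a valid `transformers.TrainingArguments` field. A hedged construction sketch with illustrative override values:

```python
trainer = DNATrainer(
    model=model,
    config=config,
    datasets=datasets,
    extra_args={
        "num_train_epochs": 3,             # override the configured epochs
        "per_device_train_batch_size": 16,
        "learning_rate": 5e-5,
    },
)
```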

compute_task_metrics()

Compute task-specific evaluation metrics.

This method returns a callable function that computes appropriate metrics for the specific task type (classification, regression, etc.).

Returns:

| Type | Description |
| --- | --- |
| `Callable` | A function that computes metrics for the specific task type |

Source code in dnallm/finetune/trainer.py
```python
def compute_task_metrics(self) -> Callable:
    """Compute task-specific evaluation metrics.

    This method returns a callable function that computes appropriate metrics
    for the specific task type (classification, regression, etc.).

    Returns:
        Callable: A function that computes metrics for the specific task type
    """
    return compute_metrics(self.task_config)
```
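
The returned callable follows the standard `compute_metrics` contract of the HuggingFace `Trainer`. As an illustration of that contract only (not the actual `compute_metrics` implementation from `dnallm`), a classification example might look like:

```python
import numpy as np
from transformers import EvalPrediction

def example_classification_metrics(eval_pred: EvalPrediction) -> dict:
    """Toy example of the Trainer's compute_metrics contract."""
    logits, labels = eval_pred.predictions, eval_pred.label_ids
    preds = np.argmax(logits, axis=-1)  # class ids from logits
    return {"accuracy": float((preds == labels).mean())}
```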

evaluate()

Evaluate the model on the evaluation dataset.

This method runs evaluation on the configured evaluation dataset and returns task-specific metrics.

Returns:

| Type | Description |
| --- | --- |
| `Dict[str, float]` | Dictionary containing evaluation metrics for the current model state |

Source code in dnallm/finetune/trainer.py
```python
def evaluate(self) -> Dict[str, float]:
    """Evaluate the model on the evaluation dataset.

    This method runs evaluation on the configured evaluation dataset and returns
    task-specific metrics.

    Returns:
        Dictionary containing evaluation metrics for the current model state
    """
    self.model.eval()
    result = self.trainer.evaluate()
    return result
```
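
A minimal usage sketch; the exact metric names depend on the task configuration, but the HuggingFace `Trainer` prefixes them with `eval_`:

```python
eval_metrics = trainer.evaluate()
# e.g. {"eval_loss": 0.31, "eval_accuracy": 0.91, ...} for classification
for name, value in eval_metrics.items():
    print(f"{name}: {value}")
```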

predict()

Generate predictions on the test dataset.

This method generates predictions on the test dataset if available and returns both predictions and evaluation metrics.

Returns:

| Type | Description |
| --- | --- |
| `Dict[str, float]` | Dictionary containing prediction results and metrics if the test dataset exists, otherwise an empty dictionary |

Source code in dnallm/finetune/trainer.py
```python
def predict(self) -> Dict[str, float]:
    """Generate predictions on the test dataset.

    This method generates predictions on the test dataset if available and returns
    both predictions and evaluation metrics.

    Returns:
        Dictionary containing prediction results and metrics if test dataset exists,
        otherwise empty dictionary
    """
    self.model.eval()
    result = {}
    if "test" in self.data_split:
        test_dataset = self.datasets.dataset['test']
        result = self.trainer.predict(test_dataset)
    return result
```
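
Note that when a `test` split exists, the underlying `Trainer.predict` actually returns a `transformers.trainer_utils.PredictionOutput` named tuple (fields `predictions`, `label_ids`, `metrics`) rather than a plain dict. A hedged usage sketch for a classification task:

```python
output = trainer.predict()
if output:  # empty dict when the dataset has no "test" split
    preds = output.predictions.argmax(axis=-1)  # class ids from logits
    print(output.metrics)                       # "test_"-prefixed metrics
```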

set_up_trainer()

Set up the HuggingFace Trainer with appropriate configurations.

This method configures the training environment by:

1. Setting up training arguments from configuration
2. Configuring dataset splits (train/eval/test)
3. Setting up task-specific metrics computation
4. Configuring the appropriate data collator for different task types
5. Initializing the HuggingFace Trainer instance

The method automatically handles:

- Dataset split detection and validation
- Task-specific data collator selection
- Evaluation strategy configuration
- Metrics computation setup

Source code in dnallm/finetune/trainer.py
```python
def set_up_trainer(self):
    """Set up the HuggingFace Trainer with appropriate configurations.

    This method configures the training environment by:
    1. Setting up training arguments from configuration
    2. Configuring dataset splits (train/eval/test)
    3. Setting up task-specific metrics computation
    4. Configuring appropriate data collator for different task types
    5. Initializing the HuggingFace Trainer instance

    The method automatically handles:
    - Dataset split detection and validation
    - Task-specific data collator selection
    - Evaluation strategy configuration
    - Metrics computation setup
    """
    # Setup training arguments
    training_args = self.train_config.model_dump()
    if self.extra_args:
        training_args.update(self.extra_args)
    self.training_args = TrainingArguments(
        **training_args,
    )
    # Check if the dataset has been split
    if isinstance(self.datasets.dataset, DatasetDict):        
        self.data_split = self.datasets.dataset.keys()
    else:
        self.data_split = []
    # Get datasets
    if "train" in self.data_split:
        train_dataset = self.datasets.dataset["train"]
    else:
        if len(self.data_split) == 0:
            train_dataset = self.datasets.dataset
        else:
            raise KeyError("Cannot find train data.")
    eval_key = [x for x in self.data_split if x not in ['train', 'test']]
    if eval_key:
        eval_dataset = self.datasets.dataset[eval_key[0]]
    elif "test" in self.data_split:
        eval_dataset = self.datasets.dataset['test']
    else:
        eval_dataset = None
        self.training_args.eval_strategy = "no"

    # Get compute metrics
    compute_metrics = self.compute_task_metrics()
    # Set data collator
    if self.task_config.task_type == "mask":
        from transformers import DataCollatorForLanguageModeling
        mlm_probability = self.task_config.mlm_probability
        mlm_probability = mlm_probability if mlm_probability else 0.15
        data_collator = DataCollatorForLanguageModeling(
            tokenizer=self.datasets.tokenizer,
            mlm=True, mlm_probability=mlm_probability
        )
    elif self.task_config.task_type == "generation":
        from transformers import DataCollatorForLanguageModeling
        data_collator = DataCollatorForLanguageModeling(
            tokenizer=self.datasets.tokenizer,
            mlm=False
        )
    else:
        data_collator = None
    # Initialize trainer
    self.trainer = Trainer(
        model=self.model,
        args=self.training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        compute_metrics=compute_metrics,
        data_collator=data_collator,
    )
```
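
For reference, a sketch of the split layout this method expects, built with hypothetical toy data. Any split key other than `train` and `test` (here `validation`) is picked as the evaluation set, `test` serves as a fallback, and with no eval split at all the evaluation strategy is set to `"no"`:

```python
from datasets import Dataset, DatasetDict

splits = DatasetDict({
    "train": Dataset.from_dict({"sequence": ["ACGT"], "labels": [1]}),
    "validation": Dataset.from_dict({"sequence": ["TGCA"], "labels": [0]}),
    "test": Dataset.from_dict({"sequence": ["GATC"], "labels": [1]}),
})
```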

train(save_tokenizer=True)

Train the model and return training metrics.

This method executes the training process using the configured HuggingFace Trainer, automatically saving the best model and optionally the tokenizer.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `save_tokenizer` | `bool` | Whether to save the tokenizer along with the model | `True` |

Returns:

| Type | Description |
| --- | --- |
| `Dict[str, float]` | Dictionary containing training metrics including loss, learning rate, etc. |

Source code in dnallm/finetune/trainer.py
```python
def train(self, save_tokenizer: bool = True) -> Dict[str, float]:
    """Train the model and return training metrics.

    This method executes the training process using the configured HuggingFace Trainer,
    automatically saving the best model and optionally the tokenizer.

    Args:
        save_tokenizer: Whether to save the tokenizer along with the model, default True

    Returns:
        Dictionary containing training metrics including loss, learning rate, etc.
    """
    self.model.train()
    train_result = self.trainer.train()
    metrics = train_result.metrics
    # Save the model
    self.trainer.save_model()
    if save_tokenizer:
        self.datasets.tokenizer.save_pretrained(self.train_config.output_dir)
    return metrics
```
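
A short usage sketch; the model is written to `train_config.output_dir` via `trainer.save_model()`, and the returned metrics typically include `train_loss`, `train_runtime`, and `epoch`:

```python
metrics = trainer.train(save_tokenizer=False)  # skip the tokenizer export
print(metrics["train_loss"], metrics["epoch"])
```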