Evaluation Metrics¶

dnallm.tasks.metrics ¶

DNA Language Model Evaluation Metrics Module.

This module provides comprehensive evaluation metrics for DNA language models across various task types including classification, regression, and token classification.

Supported task types:

- Binary classification: accuracy, precision, recall, F1, MCC, AUROC, AUPRC
- Multi-class classification: macro/micro/weighted metrics
- Multi-label classification: per-label and overall metrics
- Regression: MSE, MAE, R², Spearman correlation
- Token classification: sequence-level accuracy, precision, recall, F1

The module combines scikit-learn metrics with the Hugging Face evaluate library for model evaluation.

Functions¶

calculate_metric_with_sklearn ¶

calculate_metric_with_sklearn(eval_pred)

Calculate basic classification metrics using scikit-learn.

This function computes standard classification metrics for token classification tasks, with precision, recall, and F1 macro-averaged. It flattens 3D logits to 2D and excludes positions labeled -100, the value it treats as padding.

Args:
    eval_pred: Tuple containing (logits, labels)

Returns:
    Dictionary containing accuracy, F1, Matthews correlation,
    precision, and recall
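
The masking step can be illustrated in isolation. The sketch below is not the module's code: it assumes NumPy inputs, uses a hypothetical helper name `masked_token_accuracy`, and computes only accuracy. It shows how flattening the logits and the labels together keeps the -100 mask aligned with the predictions.

```python
import numpy as np

IGNORE_INDEX = -100  # label value treated as padding, as in the function above


def masked_token_accuracy(logits: np.ndarray, labels: np.ndarray) -> float:
    """Accuracy over tokens whose label is not IGNORE_INDEX (hypothetical helper)."""
    # Flatten (batch, seq_len, num_classes) -> (batch * seq_len, num_classes)
    flat_logits = logits.reshape(-1, logits.shape[-1])
    # Flatten the labels the same way so both arrays share one index space
    flat_labels = labels.reshape(-1)
    predictions = flat_logits.argmax(axis=-1)
    valid = flat_labels != IGNORE_INDEX
    return float((predictions[valid] == flat_labels[valid]).mean())


# Two sequences of three tokens and two classes; the last token of each is padding.
logits = np.array(
    [
        [[2.0, 0.1], [0.2, 1.5], [0.0, 0.0]],
        [[0.3, 0.9], [1.1, 0.4], [0.0, 0.0]],
    ]
)
labels = np.array([[0, 1, -100], [0, 0, -100]])
accuracy = masked_token_accuracy(logits, labels)
print(accuracy)
```

Three of the four unpadded tokens are predicted correctly here, so the padded positions do not dilute the score.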

Source code in dnallm/tasks/metrics.py

def calculate_metric_with_sklearn(eval_pred):
    """Calculate basic classification metrics using scikit-learn.

    This function computes standard classification metrics for token
    classification tasks, handling padding tokens and reshaping logits
    as needed.

    Args:
        eval_pred: Tuple containing (logits, labels)

    Returns:
        Dictionary containing accuracy, F1, Matthews correlation,
        precision, and recall
    """
    logits, labels = eval_pred
    if isinstance(logits, tuple):  # Unpack logits if it's a tuple
        logits = logits[0]
    labels = np.asarray(labels)
    if logits.ndim == 3:
        # Flatten logits to (batch * seq_len, num_classes) and flatten the
        # labels to match, so the mask below indexes a 1D prediction array
        logits = logits.reshape(-1, logits.shape[-1])
        labels = labels.reshape(-1)
    predictions = np.argmax(logits, axis=-1)
    # Exclude padding positions (-100 is assumed to mark padded labels)
    valid_mask = labels != -100
    valid_predictions = predictions[valid_mask]
    valid_labels = labels[valid_mask]
    return {
        "accuracy": accuracy_score(valid_labels, valid_predictions),
        "f1": f1_score(
            valid_labels, valid_predictions, average="macro", zero_division=0
        ),
        "matthews_correlation": matthews_corrcoef(
            valid_labels, valid_predictions
        ),
        "precision": precision_score(
            valid_labels, valid_predictions, average="macro", zero_division=0
        ),
        "recall": recall_score(
            valid_labels, valid_predictions, average="macro", zero_division=0
        ),
    }

classification_metrics ¶

classification_metrics(plot=False)

Create metrics computation function for binary classification tasks.

This function returns a callable that computes comprehensive binary classification metrics including accuracy, precision, recall, F1, MCC, AUROC, AUPRC, and confusion matrix derived metrics.

Args:
    plot: Whether to include curve data for plotting (ROC and PR
        curves)

Returns:
    Callable function that computes binary classification metrics
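
As a rough illustration of the confusion-matrix-derived rates, the following sketch recomputes TPR, TNR, FPR, and FNR with plain NumPy. The helper names `softmax` and `binary_rates` and the toy logits are assumptions made for this example; the module itself relies on scikit-learn for these quantities.

```python
import numpy as np


def softmax(logits: np.ndarray) -> np.ndarray:
    """Row-wise softmax, subtracting the row maximum for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def binary_rates(labels: np.ndarray, predictions: np.ndarray) -> dict:
    """TPR, TNR, FPR and FNR from hard 0/1 predictions (hypothetical helper)."""
    tp = int(((predictions == 1) & (labels == 1)).sum())
    tn = int(((predictions == 0) & (labels == 0)).sum())
    fp = int(((predictions == 1) & (labels == 0)).sum())
    fn = int(((predictions == 0) & (labels == 1)).sum())
    return {
        # Each rate falls back to 0 when its denominator is empty
        "TPR": tp / (tp + fn) if (tp + fn) > 0 else 0.0,
        "TNR": tn / (tn + fp) if (tn + fp) > 0 else 0.0,
        "FPR": fp / (fp + tn) if (fp + tn) > 0 else 0.0,
        "FNR": fn / (fn + tp) if (fn + tp) > 0 else 0.0,
    }


logits = np.array([[2.0, -1.0], [0.5, 1.5], [-0.3, 0.8], [1.2, 0.1], [0.2, 0.4]])
labels = np.array([0, 1, 0, 0, 1])

positive_scores = softmax(logits)[:, 1]  # scores used for AUROC and AUPRC
predictions = logits.argmax(axis=-1)     # hard labels used for the rates
rates = binary_rates(labels, predictions)
print(rates)
```

The threshold-free metrics (AUROC, AUPRC) consume `positive_scores`, while accuracy, F1, MCC, and the four rates consume the hard `predictions`.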

Source code in dnallm/tasks/metrics.py

def classification_metrics(plot: bool = False):
    """Create metrics computation function for binary classification tasks.

    This function returns a callable that computes comprehensive binary
    classification metrics including accuracy, precision, recall, F1, MCC,
    AUROC, AUPRC, and confusion matrix derived metrics.

    Args:
        plot: Whether to include curve data for plotting (ROC and PR
            curves)

    Returns:
        Callable function that computes binary classification metrics
    """
    # clf_metrics = evaluate.combine(
    #     [
    #         metrics_path + "accuracy/accuracy.py",
    #         metrics_path + "f1/f1.py",
    #         metrics_path + "precision/precision.py",
    #         metrics_path + "recall/recall.py",
    #         metrics_path + "matthews_correlation/matthews_correlation.py",
    #     ]
    # )
    # auc_metric = evaluate.load(metrics_path + "roc_auc/roc_auc.py", "binary")

    def compute_metrics(eval_pred: tuple) -> dict:
        logits, labels = eval_pred
        logits = logits[0] if isinstance(logits, tuple) else logits
        predictions = np.argmax(logits, axis=-1)
        pred_probs = softmax(logits, axis=1)
        # Handle NaN and infinite values in pred_probs
        pred_probs = np.nan_to_num(pred_probs, nan=0.0, posinf=1.0, neginf=0.0)
        # metrics = clf_metrics.compute(predictions=predictions,
        # references=labels)
        # roc_auc = auc_metric.compute(references=labels,
        # prediction_scores=pred_probs[:, 1])
        # pr_auc = average_precision_score(y_true=labels, y_score=pred_probs[:,
        # 1])
        # metrics["AUROC"] = roc_auc["roc_auc"]
        # metrics["AUPRC"] = pr_auc
        metrics = {}
        metrics["accuracy"] = accuracy_score(labels, predictions)
        metrics["precision"] = precision_score(labels, predictions)
        metrics["recall"] = recall_score(labels, predictions)
        metrics["f1"] = f1_score(labels, predictions)
        metrics["mcc"] = matthews_corrcoef(labels, predictions)
        metrics["AUROC"] = roc_auc_score(labels, pred_probs[:, 1])
        metrics["AUPRC"] = average_precision_score(labels, pred_probs[:, 1])
        tn, fp, fn, tp = sklearn.metrics.confusion_matrix(
            labels, predictions
        ).ravel()
        metrics["TPR"] = tp / (tp + fn) if (tp + fn) > 0 else 0
        metrics["TNR"] = tn / (tn + fp) if (tn + fp) > 0 else 0
        metrics["FPR"] = fp / (fp + tn) if (fp + tn) > 0 else 0
        metrics["FNR"] = fn / (fn + tp) if (fn + tp) > 0 else 0
        if plot:
            fpr, tpr, _ = roc_curve(labels, pred_probs[:, 1])
            precision, recall, _ = precision_recall_curve(
                labels, pred_probs[:, 1]
            )
            metrics["curve"] = {
                "fpr": fpr,
                "tpr": tpr,
                "precision": precision,
                "recall": recall,
            }
        return metrics

    return compute_metrics

compute_metrics ¶

compute_metrics(task_config, plot=False)

Compute metrics based on task type.

This function serves as the main entry point for metrics computation, automatically selecting the appropriate metrics function based on the task configuration.

Parameters:

Name	Type	Description	Default
`task_config`	`TaskConfig`	Task configuration object containing task type and parameters	required
`plot`	`bool`	Whether to include plotting data for visualization	`False`

Returns:

Type	Description
	Callable function that computes appropriate metrics for the task type

Raises:

Type	Description
`ValueError`	If task type is not supported for evaluation
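
The selection logic can be sketched without the library installed. In the example below, `make_binary`, `make_multiclass`, `FACTORIES`, and `select_metrics` are hypothetical stand-ins, a `SimpleNamespace` plays the role of `TaskConfig`, and a dictionary lookup is used in place of the if/elif chain shown in the source. Only the `task_type` and `label_names` attributes are taken from the original.

```python
from types import SimpleNamespace


def make_binary(cfg):
    # Hypothetical stand-in for a real metrics factory
    return lambda eval_pred: {"task": "binary"}


def make_multiclass(cfg):
    # Hypothetical stand-in that also captures the label names
    return lambda eval_pred: {"task": "multiclass", "labels": cfg.label_names}


FACTORIES = {"binary": make_binary, "multiclass": make_multiclass}


def select_metrics(task_config):
    """Return a metrics callable for task_config.task_type."""
    factory = FACTORIES.get(task_config.task_type)
    if factory is None:
        raise ValueError(
            f"Unsupported task type for evaluation: {task_config.task_type}"
        )
    return factory(task_config)


config = SimpleNamespace(
    task_type="multiclass", label_names=["promoter", "enhancer"]
)
metric_fn = select_metrics(config)
result = metric_fn((None, None))  # the stand-in ignores its input
print(result)
```

The key point is that the entry point returns a function rather than a metrics dictionary; the returned callable is what later receives `(logits, labels)`.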

Source code in dnallm/tasks/metrics.py

def compute_metrics(task_config: TaskConfig, plot: bool = False):
    """Compute metrics based on task type.

    This function serves as the main entry point for metrics computation,
    automatically selecting the appropriate metrics function based on the
    task configuration.

    Args:
        task_config: Task configuration object containing task type and
            parameters
        plot: Whether to include plotting data for visualization

    Returns:
        Callable function that computes appropriate metrics for the task type

    Raises:
        ValueError: If task type is not supported for evaluation
    """
    if task_config.task_type == "binary":
        return classification_metrics(plot=plot)
    elif task_config.task_type == "multiclass":
        return multi_classification_metrics(task_config.label_names, plot=plot)
    elif task_config.task_type == "multilabel":
        return multi_labels_metrics(task_config.label_names, plot=plot)
    elif task_config.task_type == "regression":
        return regression_metrics(plot=plot)
    elif task_config.task_type == "token":
        return token_classification_metrics(task_config.label_names, plot=plot)
    else:
        raise ValueError(
            f"Unsupported task type for evaluation: {task_config.task_type}"
        )

metrics_for_dnabert2 ¶

metrics_for_dnabert2(task)

Create metrics computation function for DNABERT2 model evaluation.

This function provides specialized metrics computation for DNABERT2 models, supporting both regression and classification tasks with appropriate metric selection for each task type.

Parameters:

Name	Type	Description	Default
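`task`	`str`	Task type: 'regression' or 'classification'. Any other value selects the multi-class branch, which reports micro-averaged precision, recall, and F1 plus MCC and one-vs-rest and one-vs-one AUROC.	required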

Returns:

Type	Description
`tuple[callable, callable]`	Tuple of (`compute_metrics`, `preprocess_logits_for_metrics`): the function for computing task-specific metrics and the function for preprocessing logits

Source code in dnallm/tasks/metrics.py

def metrics_for_dnabert2(task: str) -> tuple[callable, callable]:
    """Create metrics computation function for DNABERT2 model evaluation.

    This function provides specialized metrics computation for DNABERT2 models,
    supporting both regression and classification tasks with appropriate
    metric selection for each task type.

    Args:
        task: Task type ('regression' or 'classification')

    Returns:
        Tuple containing:
            - compute_metrics: Function for computing task-specific metrics
            - preprocess_logits_for_metrics: Function for preprocessing logits
    """
    r2_metric = evaluate.load("r_squared")
    spm_metric = evaluate.load("spearmanr")
    clf_metrics = evaluate.combine([
        "accuracy",
        "f1",
        "precision",
        "recall",
        "matthews_correlation",
    ])
    metric1 = evaluate.load("precision")
    metric2 = evaluate.load("recall")
    metric3 = evaluate.load("f1")
    metric4 = evaluate.load("matthews_correlation")
    roc_metric = evaluate.load("roc_auc", "multiclass")

    def compute_metrics(eval_pred: tuple) -> dict[str, Any]:
        logits, labels = eval_pred
        if task.lower() == "regression":
            r2 = r2_metric.compute(references=labels, predictions=logits[0])
            spearman = spm_metric.compute(
                references=labels, predictions=logits[0]
            )
            return {"r2": r2, "spearmanr": spearman["spearmanr"]}
        else:
            if task.lower() == "classification":
                predictions = torch.argmax(torch.from_numpy(logits[0]), dim=-1)
                return cast(
                    dict[str, Any],
                    clf_metrics.compute(
                        predictions=predictions, references=labels
                    ),
                )
            else:
                pred_probs = softmax(logits[0], axis=1)
                pred_list: list[int] = [
                    x.tolist().index(max(x)) for x in pred_probs
                ]
                precision = metric1.compute(
                    predictions=pred_list, references=labels, average="micro"
                )
                recall = metric2.compute(
                    predictions=pred_list, references=labels, average="micro"
                )
                f1 = metric3.compute(
                    predictions=pred_list, references=labels, average="micro"
                )
                mcc = metric4.compute(predictions=pred_list, references=labels)
                roc_auc_ovr = roc_metric.compute(
                    references=labels,
                    prediction_scores=pred_probs,
                    multi_class="ovr",
                )
                roc_auc_ovo = roc_metric.compute(
                    references=labels,
                    prediction_scores=pred_probs,
                    multi_class="ovo",
                )
                return {
                    **precision,
                    **recall,
                    **f1,
                    **mcc,
                    "AUROC_ovr": roc_auc_ovr["roc_auc"],
                    "AUROC_ovo": roc_auc_ovo["roc_auc"],
                }

    preprocessing = preprocess_logits_for_metrics

    return compute_metrics, preprocessing

multi_classification_metrics ¶

multi_classification_metrics(label_list, plot=False)

Create metrics computation function for multi-class classification tasks.

This function returns a callable that computes comprehensive multi-class classification metrics including accuracy, precision, recall, F1, MCC, AUROC, AUPRC, and confusion matrix derived metrics with multiple averaging strategies.

Parameters:

Name	Type	Description	Default
`label_list`	`list`	List of class labels	required
`plot`	`bool`	Whether to include curve data for plotting (ROC and PR curves)	`False`

Returns:

Type	Description
	Callable function that computes multi-class classification metrics
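
The difference between the macro, micro, and weighted averaging strategies is easiest to see on a small example. The sketch below recomputes precision by hand with NumPy; the helper name `precision_by_average` and the toy labels are assumptions for illustration, and the module itself obtains these values from scikit-learn.

```python
import numpy as np


def precision_by_average(
    labels: np.ndarray, predictions: np.ndarray, num_classes: int
) -> dict:
    """Macro, micro and weighted precision for single-label multi-class data."""
    tp = np.zeros(num_classes)
    predicted = np.zeros(num_classes)
    support = np.zeros(num_classes)
    for c in range(num_classes):
        tp[c] = np.sum((predictions == c) & (labels == c))
        predicted[c] = np.sum(predictions == c)
        support[c] = np.sum(labels == c)
    # Per-class precision, defined as 0 when a class is never predicted
    per_class = np.divide(
        tp, predicted, out=np.zeros(num_classes), where=predicted > 0
    )
    return {
        "macro": float(per_class.mean()),           # every class counts equally
        "micro": float(tp.sum() / predicted.sum()), # every sample counts equally
        "weighted": float(np.average(per_class, weights=support)),
    }


labels = np.array([0, 0, 0, 0, 1, 1, 2, 2])
predictions = np.array([0, 0, 0, 1, 1, 1, 2, 0])
scores = precision_by_average(labels, predictions, num_classes=3)
print(scores)
```

With imbalanced classes the three values diverge, which is why the returned dictionary reports all of them rather than a single precision figure.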

Source code in dnallm/tasks/metrics.py

def multi_classification_metrics(label_list: list, plot: bool = False):
    """Create metrics computation function for multi-class classification
    tasks.

    This function returns a callable that computes comprehensive multi-class
    classification metrics including accuracy, precision, recall, F1, MCC,
    AUROC, AUPRC, and confusion matrix derived metrics with multiple
    averaging strategies.

    Args:
        label_list: List of class labels
        plot: Whether to include curve data for plotting (ROC and PR curves)

    Returns:
        Callable function that computes multi-class classification metrics
    """
    # metric0 = evaluate.load(metrics_path + "accuracy/accuracy.py")
    # metric1 = evaluate.load(metrics_path + "precision/precision.py")
    # metric2 = evaluate.load(metrics_path + "recall/recall.py")
    # metric3 = evaluate.load(metrics_path + "f1/f1.py")
    # metric4 = evaluate.load(metrics_path +
    # "matthews_correlation/matthews_correlation.py")
    # roc_metric = evaluate.load(metrics_path + "roc_auc/roc_auc.py",
    # "multiclass")

    def compute_metrics(eval_pred: tuple) -> dict:
        logits, labels = eval_pred
        logits = logits[0] if isinstance(logits, tuple) else logits
        if logits.ndim == 3:
            # Adjust this to fit the model architecture
            logits = logits[:, 0, :]
        pred_probs = softmax(logits, axis=1)
        # Handle NaN and infinite values in pred_probs
        pred_probs = np.nan_to_num(pred_probs, nan=0.0, posinf=1.0, neginf=0.0)
        predictions = np.argmax(logits, axis=-1)

        metrics = {}
        metrics["accuracy"] = accuracy_score(labels, predictions)
        metrics["precision"] = precision_score(
            labels, predictions, average="macro"
        )
        metrics["recall"] = recall_score(labels, predictions, average="macro")
        metrics["f1"] = f1_score(labels, predictions, average="macro")
        metrics["precision_micro"] = precision_score(
            labels, predictions, average="micro"
        )
        metrics["recall_micro"] = recall_score(
            labels, predictions, average="micro"
        )
        metrics["f1_micro"] = f1_score(labels, predictions, average="micro")
        metrics["precision_weighted"] = precision_score(
            labels, predictions, average="weighted"
        )
        metrics["recall_weighted"] = recall_score(
            labels, predictions, average="weighted"
        )
        metrics["f1_weighted"] = f1_score(
            labels, predictions, average="weighted"
        )
        metrics["mcc"] = matthews_corrcoef(labels, predictions)
        metrics["AUROC"] = roc_auc_score(
            labels, pred_probs, average="macro", multi_class="ovr"
        )
        metrics["AUPRC"] = average_precision_score(
            labels, pred_probs, average="macro"
        )
        tpr_list, tnr_list, fpr_list, fnr_list = [], [], [], []
        for label_cnt in multilabel_confusion_matrix(labels, predictions):
            tn, fp, fn, tp = label_cnt.ravel()
            # Avoid division by zero
            tpr_list.append(tp / (tp + fn) if (tp + fn) > 0 else 0)
            tnr_list.append(tn / (tn + fp) if (tn + fp) > 0 else 0)
            fpr_list.append(fp / (fp + tn) if (fp + tn) > 0 else 0)
            fnr_list.append(fn / (fn + tp) if (fn + tp) > 0 else 0)
        metrics["TPR"] = float(np.mean(tpr_list))
        metrics["TNR"] = float(np.mean(tnr_list))
        metrics["FPR"] = float(np.mean(fpr_list))
        metrics["FNR"] = float(np.mean(fnr_list))
        # accuracy = metric0.compute(predictions=predictions,
        # references=labels)
        # precision = metric1.compute(predictions=predictions,
        # references=labels, average="micro")
        # recall = metric2.compute(predictions=predictions, references=labels,
        # average="micro")
        # f1 = metric3.compute(predictions=predictions, references=labels,
        # average="micro")
        # mcc = metric4.compute(predictions=predictions, references=labels)
        # roc_auc_ovr = roc_metric.compute(references=labels,
        #                                  prediction_scores=pred_probs,
        #                                  multi_class='ovr')
        # roc_auc_ovo = roc_metric.compute(references=labels,
        #                                  prediction_scores=pred_probs,
        #                                  multi_class='ovo')
        # metrics = {**accuracy, **precision, **recall, **f1, **mcc,
        # "AUROC_ovr": roc_auc_ovr['roc_auc'], "AUROC_ovo":
        # roc_auc_ovo['roc_auc']}
        # metrics["AUROC_ovr"] = roc_auc_ovr['roc_auc']
        # metrics["AUROC_ovo"] = roc_auc_ovo['roc_auc']
        if plot:
            label_curves = {}
            for i, label_name in enumerate(label_list):
                fpr, tpr, _ = roc_curve(
                    np.array(labels) == i, pred_probs[:, i]
                )
                prec, rec, _ = precision_recall_curve(
                    np.array(labels) == i, pred_probs[:, i]
                )
                label_curves[label_name] = {
                    "fpr": fpr,
                    "tpr": tpr,
                    "precision": prec,
                    "recall": rec,
                }
            # Macro-average the per-label ROC and precision-recall curves by
            # interpolating each curve onto a shared grid
            all_fpr = np.unique(
                np.concatenate([
                    label_curves[label]["fpr"] for label in label_list
                ])
            )
            mean_tpr = np.zeros_like(all_fpr)
            for label in label_list:
                mean_tpr += np.interp(
                    all_fpr,
                    label_curves[label]["fpr"],
                    label_curves[label]["tpr"],
                )
            mean_tpr /= len(label_list)
            all_recall = np.unique(
                np.concatenate([
                    label_curves[label]["recall"] for label in label_list
                ])
            )
            mean_precision = np.zeros_like(all_recall)
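            for label in label_list:
                # precision_recall_curve returns recall in decreasing order,
                # but np.interp needs increasing x-coordinates, so reverse both
                mean_precision += np.interp(
                    all_recall,
                    label_curves[label]["recall"][::-1],
                    label_curves[label]["precision"][::-1],
                )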
            mean_precision /= len(label_list)

            fpr = all_fpr
            tpr = mean_tpr
            precision = mean_precision
            recall = all_recall
            metrics["curve"] = {
                "fpr": fpr,
                "tpr": tpr,
                "precision": precision,
                "recall": recall,
            }
        return metrics

    return compute_metrics
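
The macro-averaged ROC curve built when `plot=True` can be sketched on its own. The example below assumes two hand-written per-label curves (the label names "A" and "B" and their points are made up) and shows the interpolation onto a shared FPR grid followed by averaging.

```python
import numpy as np

# Hypothetical per-label ROC curves; real ones would come from roc_curve
curves = {
    "A": {"fpr": np.array([0.0, 0.5, 1.0]), "tpr": np.array([0.0, 1.0, 1.0])},
    "B": {"fpr": np.array([0.0, 0.25, 1.0]), "tpr": np.array([0.0, 0.5, 1.0])},
}

# Shared grid: every FPR value that appears in any per-label curve
all_fpr = np.unique(np.concatenate([c["fpr"] for c in curves.values()]))

# Interpolate each curve onto the grid, then average point by point
mean_tpr = np.zeros_like(all_fpr)
for curve in curves.values():
    mean_tpr += np.interp(all_fpr, curve["fpr"], curve["tpr"])
mean_tpr /= len(curves)

print(all_fpr)
print(mean_tpr)
```

`np.interp` expects increasing x-coordinates, which holds for the FPR values of a ROC curve.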

multi_labels_metrics ¶

multi_labels_metrics(label_list, plot=False)

Create metrics computation function for multi-label classification tasks.

This function returns a callable that computes comprehensive multi-label classification metrics including per-label and overall metrics, with support for ROC curves and precision-recall curves for each label.

Parameters:

Name	Type	Description	Default
`label_list`	`list`	List of label names for multi-label classification	required
`plot`	`bool`	Whether to include curve data for plotting (ROC and PR curves)	`False`

Returns:

Type	Description
	Callable function that computes multi-label classification metrics
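
For multi-label outputs, each logit is passed through a sigmoid and thresholded at 0.5 independently. The sketch below uses made-up logits and labels to show that step, and contrasts exact-match (subset) accuracy, which is what scikit-learn's `accuracy_score` reports for multi-label indicator arrays, with per-label accuracy.

```python
import numpy as np


def sigmoid(x: np.ndarray) -> np.ndarray:
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-x))


logits = np.array([[2.0, -1.0, 0.5], [-0.5, 1.5, -2.0], [0.3, 0.2, -0.1]])
labels = np.array([[1, 0, 1], [0, 1, 0], [1, 0, 0]])

probabilities = sigmoid(logits)
predictions = (probabilities > 0.5).astype(int)  # one decision per label

# Exact-match accuracy: a sample counts only if every label is right
subset_accuracy = float((predictions == labels).all(axis=1).mean())
# Per-label accuracy: fraction of correct decisions in each column
per_label_accuracy = (predictions == labels).mean(axis=0)

print(subset_accuracy)
print(per_label_accuracy)
```

A single wrong label makes the whole third sample count as incorrect, so subset accuracy is typically lower than the per-label figures.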

Source code in dnallm/tasks/metrics.py

def multi_labels_metrics(label_list: list, plot: bool = False):
    """Create metrics computation function for multi-label classification
    tasks.

    This function returns a callable that computes comprehensive multi-label
    classification metrics including per-label and overall metrics, with
    support for ROC curves and precision-recall curves for each label.

    Args:
        label_list: List of label names for multi-label classification
        plot: Whether to include curve data for plotting (ROC and PR curves)

    Returns:
        Callable function that computes multi-label classification metrics
    """
    # metric0 = evaluate.load(metrics_path + "accuracy/accuracy.py")
    # metric1 = evaluate.load(metrics_path + "precision/precision.py")
    # metric2 = evaluate.load(metrics_path + "recall/recall.py")
    # metric3 = evaluate.load(metrics_path + "f1/f1.py")

    def sigmoid(x):
        return 1 / (1 + np.exp(-x))

    def compute_metrics(eval_pred: tuple) -> dict:
        logits, labels = eval_pred
        if hasattr(logits, "numpy"):
            logits = logits.numpy()
        if hasattr(labels, "numpy"):
            labels = labels.numpy()
        pred_probs = sigmoid(logits)
        # Handle NaN and infinite values in pred_probs
        pred_probs = np.nan_to_num(pred_probs, nan=0.0, posinf=1.0, neginf=0.0)
        raw_pred = (pred_probs > 0.5).astype(int)
        # predictions = raw_pred.reshape(-1) # Not used in current
        # implementation
        # y_true = labels.astype(int).reshape(-1) # Not used in current
        # implementation

        # accuracy = metric0.compute(predictions=predictions,
        # references=y_true)
        # precision = metric1.compute(predictions=predictions,
        # references=y_true, average="macro")
        # recall = metric2.compute(predictions=predictions, references=y_true,
        # average="macro")
        # f1 = metric3.compute(predictions=predictions, references=y_true,
        # average="macro")
        # metrics = {**accuracy, **precision, **recall, **f1}
        metrics = {}
        metrics["accuracy"] = accuracy_score(labels, raw_pred)
        metrics["precision"] = precision_score(
            labels, raw_pred, average="macro"
        )
        metrics["recall"] = recall_score(labels, raw_pred, average="macro")
        metrics["f1"] = f1_score(labels, raw_pred, average="macro")
        metrics["precision_micro"] = precision_score(
            labels, raw_pred, average="micro"
        )
        metrics["recall_micro"] = recall_score(
            labels, raw_pred, average="micro"
        )
        metrics["f1_micro"] = f1_score(labels, raw_pred, average="micro")
        metrics["precision_weighted"] = precision_score(
            labels, raw_pred, average="weighted"
        )
        metrics["recall_weighted"] = recall_score(
            labels, raw_pred, average="weighted"
        )
        metrics["f1_weighted"] = f1_score(labels, raw_pred, average="weighted")
        metrics["precision_samples"] = precision_score(
            labels, raw_pred, average="samples"
        )
        metrics["recall_samples"] = recall_score(
            labels, raw_pred, average="samples"
        )
        metrics["f1_samples"] = f1_score(labels, raw_pred, average="samples")
        mcc_per_label = {}
        roc_data, roc_auc = {}, {}
        pr_data, pr_auc = {}, {}
        for i in range(labels.shape[1]):
            # Compute matthews correlation coefficient for each class
            mcc_per_label[label_list[i]] = matthews_corrcoef(
                labels[:, i], raw_pred[:, i]
            )
            # Compute ROC curve and ROC area for each class
            fpr, tpr, _ = roc_curve(labels[:, i], pred_probs[:, i])
            auc = roc_auc_score(labels[:, i], pred_probs[:, i])
            roc_data[label_list[i]] = (fpr, tpr)
            roc_auc[label_list[i]] = auc
            # Compute PR curve and PR area for each class
            prec, rec, _ = precision_recall_curve(
                labels[:, i], pred_probs[:, i]
            )
            ap = average_precision_score(labels[:, i], pred_probs[:, i])
            pr_data[label_list[i]] = (prec, rec)
            pr_auc[label_list[i]] = ap
        metrics["mcc"] = np.mean(list(mcc_per_label.values()))
        metrics["AUROC"] = np.mean(list(roc_auc.values()))
        metrics["AUPRC"] = np.mean(list(pr_auc.values()))
        tpr_list, tnr_list, fpr_list, fnr_list = [], [], [], []
        for label_cnt in multilabel_confusion_matrix(labels, raw_pred):
            tn, fp, fn, tp = label_cnt.ravel()
            # Avoid division by zero
            tpr_list.append(tp / (tp + fn) if (tp + fn) > 0 else 0)
            tnr_list.append(tn / (tn + fp) if (tn + fp) > 0 else 0)
            fpr_list.append(fp / (fp + tn) if (fp + tn) > 0 else 0)
            fnr_list.append(fn / (fn + tp) if (fn + tp) > 0 else 0)
        metrics["TPR"] = float(np.mean(tpr_list))
        metrics["TNR"] = float(np.mean(tnr_list))
        metrics["FPR"] = float(np.mean(fpr_list))
        metrics["FNR"] = float(np.mean(fnr_list))

        if plot:
            metrics["curve"] = {}
            for label in label_list:
                metrics["curve"][label] = {
                    "fpr": roc_data[label][0],
                    "tpr": roc_data[label][1],
                    "AUROC": roc_auc[label],
                    "precision": pr_data[label][0],
                    "recall": pr_data[label][1],
                    "AUPRC": pr_auc[label],
                }
        return metrics

    return compute_metrics

preprocess_logits_for_metrics ¶

preprocess_logits_for_metrics(logits, labels)

Preprocess logits for metrics computation.

This function keeps only the logits when the model returns a tuple of outputs, to avoid memory leaks in the original Trainer implementation.

Parameters:

Name	Type	Description	Default
`logits`	`Tensor`	Model output logits	required
`labels`	`Tensor`	Ground truth labels	required

Returns:

Type	Description
`Tensor`	Processed logits; `labels` is accepted but not returned
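
A minimal sketch of the unwrapping behaviour, using NumPy arrays in place of tensors so it runs without PyTorch. The name `first_output` and the `hidden_states` array are hypothetical; the Hugging Face `Trainer` accepts a function of this kind through its `preprocess_logits_for_metrics` argument.

```python
import numpy as np


def first_output(outputs, labels=None):
    """Return the logits when a model hands back (logits, extra, ...)."""
    # labels is accepted for signature compatibility but is not used
    return outputs[0] if isinstance(outputs, tuple) else outputs


logits = np.zeros((4, 2))
hidden_states = np.zeros((4, 16, 8))  # hypothetical extra model output

unwrapped = first_output((logits, hidden_states))
unchanged = first_output(logits)

print(unwrapped.shape, unchanged.shape)
```

Dropping the extra outputs before they are gathered means only the logits are kept for metric computation.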

Source code in dnallm/tasks/metrics.py

def preprocess_logits_for_metrics(
    logits: torch.Tensor, labels: torch.Tensor
) -> torch.Tensor:
    """Preprocess logits for metrics computation.

    This function handles logits preprocessing to avoid memory leaks
    in the original Trainer implementation.

    Args:
        logits: Model output logits
        labels: Ground truth labels

    Returns:
        Processed logits (labels are accepted but not returned)
    """
    logits = logits[0] if isinstance(logits, tuple) else logits

    return logits

regression_metrics ¶

regression_metrics(plot=False)

Create metrics computation function for regression tasks.

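This function returns a callable that computes regression metrics including MSE, MAE, R², and Spearman correlation. For multi-output regression, it uses scikit-learn metrics directly and also reports Pearson correlation, with both correlations averaged across outputs. Scatter plot data is returned only in the single-output case.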

Parameters:

Name	Type	Description	Default
`plot`		Whether to include scatter plot data for visualization	`False`

Returns:

Type	Description
	Callable function that computes regression metrics
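
The per-output averaging used for multi-output correlation can be sketched as follows. This version assumes NumPy arrays and uses `np.corrcoef` rather than the evaluate library that the source loads, so it is an approximation of the approach and not the module's implementation.

```python
import numpy as np


def pearson_macro(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Mean Pearson r over output columns, skipping constant targets."""
    correlations = []
    for k in range(y_true.shape[1]):
        if np.std(y_true[:, k]) == 0:
            continue  # correlation is undefined for a constant target
        correlations.append(np.corrcoef(y_true[:, k], y_pred[:, k])[0, 1])
    return float(np.mean(correlations))


# Three outputs: a perfectly correlated one, a constant one, and a noisy one
y_true = np.array(
    [[1.0, 5.0, 2.0], [2.0, 5.0, 1.0], [3.0, 5.0, 4.0], [4.0, 5.0, 3.0]]
)
y_pred = np.array(
    [[2.0, 4.9, 4.0], [4.0, 5.1, 3.0], [6.0, 5.0, 2.0], [8.0, 5.2, 1.0]]
)
score = pearson_macro(y_true, y_pred)
print(score)
```

The constant middle column is skipped, so the result averages only the first column (r = 1) and the third (r = -0.6).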

Source code in dnallm/tasks/metrics.py

def regression_metrics(plot=False):
    """Create metrics computation function for regression tasks.

    This function returns a callable that computes regression metrics including
    MSE, MAE, R2, and Spearman correlation. For multi-output regression,
    it uses scikit-learn metrics directly.

    Args:
        plot: Whether to include scatter plot data for visualization

    Returns:
        Callable function that computes regression metrics
    """
    mse_metric = evaluate.load(metrics_path + "mse/mse.py")
    mae_metric = evaluate.load(metrics_path + "mae/mae.py")
    r2_metric = evaluate.load(metrics_path + "r_squared/r_squared.py")
    pearson_metric = evaluate.load(metrics_path + "pearsonr/pearsonr.py")
    spm_metric = evaluate.load(metrics_path + "spearmanr/spearmanr.py")

    def pearson_macro(y_true, y_pred):
        rs = []
        for k in range(y_true.shape[1]):
            yt = y_true[:, k]
            yp = y_pred[:, k]
            if np.std(yt) == 0:
                continue
            r = pearson_metric.compute(
                predictions=yp.tolist(), references=yt.tolist()
            )["pearsonr"]
            rs.append(r)
        return np.mean(rs), rs

    def spearman_macro(y_true, y_pred):
        rs = []
        for k in range(y_true.shape[1]):
            yt = y_true[:, k]
            yp = y_pred[:, k]
            if np.std(yt) == 0:
                continue
            r = spm_metric.compute(
                predictions=yp.tolist(), references=yt.tolist()
            )["spearmanr"]
            rs.append(r)
        return np.mean(rs), rs

    def compute_metrics(eval_pred: tuple) -> dict[str, float]:
        logits, labels = eval_pred
        logits = logits[0] if isinstance(logits, tuple) else logits
        num_outputs = logits.shape[1]
        if num_outputs > 1:
            mse = mean_squared_error(labels, logits)
            mae = mean_absolute_error(labels, logits)
            r2 = r2_score(labels, logits, multioutput="uniform_average")
            pearsonr, _ = pearson_macro(labels, logits)
            spearmanr, _ = spearman_macro(labels, logits)
            metrics = {
                "mse": mse,
                "mae": mae,
                "r2": r2,
                "pearsonr": pearsonr,
                "spearmanr": spearmanr,
            }
        else:
            mse = mse_metric.compute(references=labels, predictions=logits)
            mae = mae_metric.compute(references=labels, predictions=logits)
            r2 = r2_metric.compute(references=labels, predictions=logits)
            spearmanr = spm_metric.compute(
                references=labels, predictions=logits
            )
            metrics = {**mse, **mae, "r2": r2, **spearmanr}
            if plot:
                # logits may be a tensor or a NumPy array;
                # call .numpy() only when it is available
                if hasattr(logits, "numpy"):
                    predicted = logits.numpy().flatten()
                else:
                    predicted = logits.flatten()
                metrics["scatter"] = {
                    "predicted": predicted,
                    "experiment": labels,
                }
        return metrics

    return compute_metrics

token_classification_metrics ¶

token_classification_metrics(
    label_list, plot=False, scheme="IOB2"
)

Create metrics computation function for token classification tasks.

This function returns a callable that computes sequence-level metrics for token classification tasks using the seqeval library, supporting various tagging schemes like IOB2.

Parameters:

Name	Type	Description	Default
`label_list`	`list`	List of label names for token classification	required
`plot`	`bool`	Whether to include plotting data (currently not implemented)	`False`
`scheme`	`str`	Tagging scheme for sequence evaluation (e.g., "IOB2", "BIOES")	`'IOB2'`

Returns:

Type	Description
	Callable function that computes token classification metrics
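
Before sequence-level scoring, predicted and reference IDs are mapped back to tag strings and positions labeled -100 are dropped. The sketch below shows that alignment step only; the helper name `align_labels` and the tag set are hypothetical, and the seqeval scoring itself is left out.

```python
IGNORE_INDEX = -100  # label value marking positions to skip


def align_labels(predictions, labels, label_list):
    """Map IDs to tag strings, keeping only positions with a real label."""
    true_predictions, true_labels = [], []
    for pred_row, label_row in zip(predictions, labels):
        kept = [
            (p, lab) for p, lab in zip(pred_row, label_row) if lab != IGNORE_INDEX
        ]
        true_predictions.append([label_list[p] for p, _ in kept])
        true_labels.append([label_list[lab] for _, lab in kept])
    return true_predictions, true_labels


label_list = ["O", "B-GENE", "I-GENE"]  # hypothetical IOB2 tag set
predictions = [[0, 1, 2, 0], [1, 2, 0, 0]]
labels = [[-100, 1, 2, -100], [1, 1, 0, -100]]

pred_tags, true_tags = align_labels(predictions, labels, label_list)
print(pred_tags)
print(true_tags)
```

Both outputs stay aligned position by position, which a sequence-level scorer needs in order to compare predicted and reference spans.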

Source code in dnallm/tasks/metrics.py

def token_classification_metrics(
    label_list: list, plot: bool = False, scheme: str = "IOB2"
):
    """Create metrics computation function for token classification tasks.

    This function returns a callable that computes sequence-level metrics for
    token classification tasks using the seqeval library, supporting various
    tagging schemes like IOB2.

    Args:
        label_list: List of label names for token classification
        plot: Whether to include plotting data (currently not implemented)
        scheme: Tagging scheme for sequence evaluation (e.g., "IOB2", "BIOES")

    Returns:
        Callable function that computes token classification metrics
    """
    seqeval = evaluate.load(metrics_path + "seqeval/seqeval.py")

    def compute_metrics(pred: tuple) -> dict:
        predictions, labels = pred
        predictions = np.argmax(predictions, axis=-1)

        # Convert label IDs back to their original string labels
        true_predictions = [
            [
                label_list[p]
                for p, label_id in zip(prediction, label, strict=False)
                if label_id != -100
            ]
            for prediction, label in zip(predictions, labels, strict=False)
        ]

        true_labels = [
            [
                label_list[label_id]
                for p, label_id in zip(prediction, label, strict=False)
                if label_id != -100
            ]
            for prediction, label in zip(predictions, labels, strict=False)
        ]

        result = seqeval.compute(
            predictions=true_predictions,
            references=true_labels,
            mode="strict",
            scheme=scheme,
        )

        return {
            "accuracy": result["overall_accuracy"],
            "precision": result["overall_precision"],
            "recall": result["overall_recall"],
            "f1": result["overall_f1"],
        }

    return compute_metrics