Data Quality Control
"Garbage in, garbage out." This saying is especially true for training deep learning models. Ensuring your DNA data is clean, consistent, and free of errors is a crucial step before training. This guide outlines common quality control checks.
1. Common Data Issues and Solutions
Invalid Characters
Problem: Your DNA sequences might contain characters other than A, C, G, T, and N (for unknown nucleotides). These can come from parsing errors or malformed source files. Most tokenizers will fail or produce incorrect tokens if they encounter unexpected characters like U, R, Y, or punctuation.
Solution: Sanitize your sequences to ensure they only contain valid characters.
The dnallm.datahandling.data.clean_sequence function is designed for this purpose.
from dnallm.datahandling.data import clean_sequence
# Apply this function to every sequence in your dataset before saving.
# For a DNADataset object:
# dna_ds.validate_sequences(valid_chars="ACGTN")
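To make the idea concrete without relying on the library, here is a minimal standalone sketch of the same kind of check in plain Python. It is not the implementation of clean_sequence; the helper names sanitize_sequence and is_valid_sequence are hypothetical, and replacing unexpected characters with N (rather than dropping the sequence) is an assumption you may want to change.

```python
import re

# Matches any character that is not one of the allowed nucleotide codes.
INVALID_CHARS = re.compile(r"[^ACGTN]")


def is_valid_sequence(seq: str) -> bool:
    """Return True if seq is non-empty and contains only A, C, G, T, or N."""
    return bool(seq) and INVALID_CHARS.search(seq) is None


def sanitize_sequence(seq: str) -> str:
    """Hypothetical helper: uppercase, trim whitespace, and replace
    every unexpected character with N (an assumed policy)."""
    return INVALID_CHARS.sub("N", seq.strip().upper())


if __name__ == "__main__":
    raw = ["acgt", "ACGU-T", " ACGTN\n"]
    for seq in raw:
        print(repr(seq), "->", sanitize_sequence(seq))
```

Masking with N keeps sequence length and position intact, which matters for fixed-length tasks. If unexpected characters signal a corrupted record, discarding the sequence may be the safer choice.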
Inconsistent Sequence Lengths
Problem: For some biological tasks, all sequences are expected to be the same length (e.g., classifying 150 bp promoter regions). Drastic variations in length might indicate data collection errors.
Solution: Analyze the distribution of sequence lengths in your dataset.
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv("your_dataset.csv")
lengths = df["sequence"].str.len()
print(lengths.describe())
# Plot a histogram to visualize the distribution
lengths.hist(bins=50)
plt.title("Distribution of Sequence Lengths")
plt.xlabel("Length")
plt.ylabel("Frequency")
plt.show()
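Once the distribution has been inspected, a follow-up step for fixed-length tasks is to separate the rows that match the expected length from those that do not. The sketch below assumes a DataFrame with a "sequence" column, as in the snippet above; the function name split_by_length and the choice to set mismatched rows aside instead of padding or truncating them are illustrative assumptions.

```python
import pandas as pd


def split_by_length(df: pd.DataFrame, expected_length: int,
                    column: str = "sequence"):
    """Split df into rows whose sequence has the expected length
    and rows that do not (missing values land in the second group)."""
    lengths = df[column].str.len()
    mask = lengths == expected_length
    return df[mask], df[~mask]


if __name__ == "__main__":
    # Toy data for illustration only.
    demo = pd.DataFrame({"sequence": ["ACGT", "ACG", "ACGTA", "TTTT"]})
    kept, rejected = split_by_length(demo, expected_length=4)
    print(f"kept {len(kept)} rows, set aside {len(rejected)} rows")
```

Reviewing the rejected rows before deleting them is usually worthwhile, since a cluster of off-length sequences can point to a specific upstream parsing problem.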
Label Imbalance
Problem: In classification tasks, having a severe imbalance between classes (e.g., 99% negative examples and 1% positive examples) can cause the model to simply predict the majority class every time.
Solution:
- Check Class Distribution: Use df['label'].value_counts() to see the number of samples per class.
- Resampling:
  - Oversampling: Randomly duplicate samples from the minority class.
  - Undersampling: Randomly remove samples from the majority class.
- Weighted Loss: During training, you can assign a higher weight to the minority class in the loss function. The DNALLM finetune command can handle this if class weights are provided.
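The resampling and weighting options above can be sketched in plain Python. This is an illustration under stated assumptions, not DNALLM's own code: the function names are hypothetical, the weights use the common inverse-frequency heuristic n_samples / (n_classes * class_count), and how the resulting weights are passed to the finetune command is not shown here because it depends on the DNALLM configuration.

```python
import random
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * k) for cls, k in counts.items()}


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority-class rows until every class
    has as many samples as the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, count in counts.items():
        pool = [r for r, l in zip(rows, labels) if l == cls]
        extra = rng.choices(pool, k=target - count)  # sampling with replacement
        out_rows.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_rows, out_labels


if __name__ == "__main__":
    # Toy data: nine negatives, one positive.
    labels = [0] * 9 + [1]
    rows = [f"seq_{i}" for i in range(10)]
    print(balanced_class_weights(labels))
    _, new_labels = random_oversample(rows, labels)
    print(Counter(new_labels))
```

Apply resampling to the training split only. Duplicating samples before splitting can place copies of the same sequence in both the training and evaluation sets, which inflates evaluation metrics.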