NER Data Generation
In [1]:
# First install two dependencies for generating NER datasets.
# Please make sure bedtools is installed on your system, e.g.
# !conda install -c bioconda bedtools
# or on macOS: brew install bedtools
!brew install bedtools
# Skip these steps if you have already installed the packages
!uv pip install pyfastx pybedtools
==> Fetching downloads for: bedtools
==> Pouring bedtools--2.31.1.arm64_tahoe.bottle.tar.gz
🍺  /opt/homebrew/Cellar/bedtools/2.31.1: 43 files, 1.7MB
Using Python 3.12.8 environment at: /Users/forrest/GitHub/DNALLM/.venv
Resolved 9 packages in 984ms
Installed 3 packages in 17ms
 + pybedtools==0.12.0
 + pyfastx==2.2.0
 + pysam==0.23.3
In [2]:
# verify intersectBed
!which intersectBed
/Users/forrest/miniconda3/bin/intersectBed
In [3]:
import gzip
import random
import numpy as np
from pyfastx import Fasta
from pybedtools import BedTool
from tqdm import tqdm
from collections import defaultdict
import pickle
from dnallm import load_config, load_model_and_tokenizer
In [4]:
# Set random seed
random.seed(42)
In [5]:
# Set the minimum and maximum flanking extension length around a gene
min_ext = 50
max_ext = 100
ext_list = [[random.randint(min_ext, max_ext), random.randint(min_ext, max_ext)] for x in range(60000)]
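Each gene sampled later gets its own random upstream/downstream flank drawn from this list. A quick sanity check (a minimal sketch using only the variables defined above) confirms the flank lengths stay within the configured bounds:

```python
# Sanity check on the pre-drawn flank extensions (uses min_ext, max_ext, ext_list from above)
assert len(ext_list) == 60000
assert all(min_ext <= left <= max_ext and min_ext <= right <= max_ext
           for left, right in ext_list)
print(ext_list[:3])  # first three [left, right] flank pairs (values depend on the seed)
```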
In [6]:
# Define the Named Entity Recognition (NER) tags and their corresponding ids
# NER tagging includes the IO, IOB, IOE, IOBES, BI, IE and BIES schemes; here we use the IOB scheme
# Example:
# ..........[ exon1 ]-----[ exon2 ]-------[ exon3 ]........
# 000000000012222222234444122222222344444412222222200000000
named_entities = {
'intergenic': 'O',
'exon0': 'B-EXON',
'exon1': 'I-EXON',
'intron0': 'B-INTRON',
'intron1': 'I-INTRON',
}
tags_id = {
'O': 0,
'B-EXON': 1,
'I-EXON': 2,
'B-INTRON': 3,
'I-INTRON': 4,
}
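To make the IOB scheme above concrete, here is a small illustrative sketch (not part of the pipeline) that applies the same begin/inside rule used later in `tokens_to_nerdata`: a token opens a new entity (`B-`) when its feature differs from the previous token's feature, and continues it (`I-`) otherwise.

```python
# Illustrative only: map a toy run of per-token feature names to IOB tag ids,
# using the named_entities and tags_id dictionaries defined above.
features = ["intergenic", "exon", "exon", "intron", "intron", "exon", "intergenic"]

labels = []
last = None
for name in features:
    if name == "intergenic":
        label = named_entities["intergenic"]   # always 'O'
    elif name == last:
        label = named_entities[name + "1"]     # inside: I-EXON / I-INTRON
    else:
        label = named_entities[name + "0"]     # begin: B-EXON / B-INTRON
    labels.append(tags_id[label])
    last = name

print(labels)  # [0, 1, 2, 3, 4, 1, 0]
```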
In [7]:
def get_gene_annotation(gene_anno):
cnt = 0
gene_info = {}
for gene in gene_anno:
gene_info[gene] = []
chrom = gene_anno[gene]["chrom"]
start = gene_anno[gene]["start"]
end = gene_anno[gene]["end"]
strand = gene_anno[gene]["strand"]
isoforms = gene_anno[gene]["isoform"]
# Get the representative isoform (the longest one)
if not isoforms:
continue
iso_lens = [(iso, sum([(x[2]-x[1]) for x in isoforms[iso]])) for iso in isoforms]
representative = sorted(iso_lens, key=lambda x:x[1])[-1][0]
isoform_info = isoforms[representative]
iso_start = min([x[1] for x in isoform_info])
iso_end = max([x[2] for x in isoform_info])
if iso_start == start and iso_end == end:
is_reverse = False if strand == "+" else True
# Get intron annotation
last = 0
for region in sorted(isoform_info, key=lambda x:x[1], reverse=is_reverse):
if strand == "+":
if last:
intron = [chrom, last, region[1], "intron", strand]
if intron[1] < intron[2]:
gene_info[gene].append(intron)
last = region[2]
else:
if last:
intron = [chrom, region[2], last, "intron", strand]
if intron[1] < intron[2]:
gene_info[gene].append(intron)
last = region[1]
gene_info[gene].append([chrom, region[1], region[2], region[0], strand])
cnt += 1
return gene_info
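A minimal sketch of what `get_gene_annotation` produces, using one hypothetical two-exon gene (names and coordinates are made up): the longest isoform is chosen as representative, introns are filled in between its exons, and each feature is returned as `[chrom, start, end, feature, strand]`.

```python
# Illustrative only: a toy two-exon gene on the forward strand.
toy_anno = {
    "GeneA": {
        "chrom": "Chr1", "start": 100, "end": 400, "strand": "+",
        "isoform": {"GeneA.1": [["exon", 100, 200], ["exon", 300, 400]]},
    }
}

toy_info = get_gene_annotation(toy_anno)
print(toy_info["GeneA"])
# [['Chr1', 100, 200, 'exon', '+'],
#  ['Chr1', 200, 300, 'intron', '+'],
#  ['Chr1', 300, 400, 'exon', '+']]
```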
In [8]:
def tokenization(genome, gene_anno, gene_info, tokenizer, outfile, ext_list, sampling=1e7):
"""
For each gene in `gene_anno`, extract the annotated exonic (and flanking) DNA subsequences,
run the tokenizer once per subsequence, reconstruct per-token character offsets,
and record the (genomic_start, genomic_end, token) tuples for all non-special tokens.
- genome: pyfastx Fasta object (genome[chrom][start:end].seq gives the forward sequence, .antisense the reverse complement)
- gene_anno: dict mapping gene_name → { "chrom": str, "strand": "+" or "-", ... }
- gene_info: dict mapping gene_name → list of [chrom, start, end, feature, strand] records (exons and introns)
- tokenizer: a HuggingFace-style tokenizer
- outfile: path of the BED-like file where the token positions are written
- ext_list: list of (left_extension, right_extension) pairs indexed by gene order
- sampling: randomly sample at most this many genes for tokenization
"""
# 1) Precompute special-tokens set for O(1) membership checks
sp_tokens = set(tokenizer.special_tokens_map.values())
token_pos = {}
# Since gene_anno is likely a dict, we need a stable way of iterating + indexing ext_list.
# We'll assume ext_list[i] corresponds to the i-th gene in `list(gene_anno.keys())`.
gene_list = list(gene_anno.keys())
if len(gene_list) > sampling:
gene_list = random.sample(gene_list, int(sampling))
for num, gene in enumerate(tqdm(gene_list, desc="Genes")):
chrom = gene_anno[gene]["chrom"]
strand = gene_anno[gene]["strand"]
# Skip genes not in gene_info or with empty annotation
if gene not in gene_info or not gene_info[gene]:
continue
# Determine exon‐range and extended boundaries
exon_coords = gene_info[gene]
# start = minimum exon_start; end = maximum exon_end
start = min(exon[1] for exon in exon_coords)
end = max(exon[2] for exon in exon_coords)
left_ext, right_ext = ext_list[num]
ext_start = max(0, start - left_ext)
ext_end = end + right_ext
# Shortcut: grab the full chromosome record once
chrom_record = genome[chrom]
# Build a list of (genomic_anchor, seq_string) for "+" or "-" strand
seqinfo = []
if strand == "+":
# 1) upstream flank
try:
upstream_seq = chrom_record[ext_start:start].seq
except Exception:
# If slicing fails, log and skip
print(f"ERROR: {chrom}\t{ext_start}\t{start}")
upstream_seq = ""
seqinfo.append((ext_start, str(upstream_seq)))
# 2) each annotated feature (exon or intron)
for feature in exon_coords:
exon_start = feature[1]
exon_end = feature[2]
if exon_start >= exon_end:
continue
seq = chrom_record[exon_start:exon_end].seq
seqinfo.append((exon_start, str(seq)))
# 3) downstream flank
downstream_seq = chrom_record[end:ext_end].seq
seqinfo.append((end, str(downstream_seq)))
else: # strand == "-"
# On the reverse‐strand, we want the reverse complement ("antisense").
# Note: .antisense on a pyfastx sequence slice returns the reverse complement.
# We still record the genomic anchor as if it were the left index on the + strand.
# But because the sequence is reversed, offset_mapping will need to be mapped differently.
# 1) “upstream” on reverse strand = (end → ext_end) in forward coords, but take antisense
try:
flank_seq = chrom_record[end:ext_end].antisense
except Exception:
print(f"ERROR (rev): {chrom}\t{end}\t{ext_end}")
flank_seq = ""
seqinfo.append((ext_end, str(flank_seq)))
# 2) each annotated feature (reverse complement)
for feature in exon_coords:
exon_start = feature[1]
exon_end = feature[2]
if exon_start >= exon_end:
continue
seq = chrom_record[exon_start:exon_end].antisense
# For mapping, we’ll anchor each token by the 5′-most position on the minus strand,
# but because the sequence is reversed, the “first character” of seq actually corresponds
# to genomic position = exon_end - 1 in forward coordinates, and the “last character” ↦ exon_start.
seqinfo.append((exon_end, str(seq)))
# 3) downstream on reverse strand = (ext_start → start) in forward coords, but antisense
flank_seq = chrom_record[ext_start:start].antisense
seqinfo.append((start, str(flank_seq)))
# Initialize the list for this gene
token_pos[gene] = []
# For each (anchor, raw_seq), run a single tokenizer(...) call
for anchor, raw_seq in seqinfo:
if not raw_seq:
continue
# 1) Tokenize with offsets (add_special_tokens=False so we skip [CLS], [SEP], etc.)
# “offset_mapping” is a list of (char_start, char_end) for each token in raw_seq.
# encoding = tokenizer(
# raw_seq,
# return_offsets_mapping=True,
# add_special_tokens=False
# )
# offsets = encoding["offset_mapping"]
# token_ids = encoding["input_ids"]
token_ids = tokenizer.encode(raw_seq, add_special_tokens=False)
tok_strs = tokenizer.convert_ids_to_tokens(token_ids)
offsets = []
cursor = 0
for tok in tok_strs:
char_start = cursor
char_end = cursor + len(tok)
offsets.append((char_start, char_end))
cursor = char_end
if len(offsets) != len(token_ids):
# This should never happen in a well‐formed tokenizer, but just in case:
raise RuntimeError("Offset mapping length ≠ token_ids length")
# 2) Iterate through each token + offset, skip special tokens, then map back to genome coords
for idx, (token_id, (char_start, char_end)) in enumerate(zip(token_ids, offsets)):
token_str = tokenizer.convert_ids_to_tokens(token_id)
# Skip if it’s one of the special tokens (“[PAD]”, “[CLS]”, etc.)
if token_str in sp_tokens:
continue
if strand == "+":
# On the forward strand, raw_seq[0] ↦ genomic position “anchor”.
# So any token covering raw_seq[char_start:char_end] ↦ genome positions [anchor+char_start : anchor+char_end]
g_start = anchor + char_start
g_end = anchor + char_end
else:
# On the reverse strand, raw_seq was already antisense (reverse), and “anchor” is the forward‐strand coordinate
# of the first character in raw_seq. That first character of raw_seq is actually genome position (anchor-1),
# and the last character of raw_seq is genome position (anchor - len(raw_seq)).
# More generally, for raw_seq index i, the corresponding forward‐strand position is:
# g_pos = anchor - 1 - i
#
# Thus, if the token covers raw_seq[char_start:char_end] (i.e. from i = char_start to i = char_end-1),
# its genomic coordinates (inclusive‐exclusive) on the forward strand are:
# g_end = (anchor - 1 - char_start) + 1 = anchor - char_start
# g_start = (anchor - 1 - (char_end - 1)) = anchor - char_end
#
# We want to store them as [g_start, g_end] with g_start < g_end. So:
g_start = anchor - char_end
g_end = anchor - char_start
token_pos[gene].append([g_start, g_end, token_str])
# save sequences and tokens
with open(outfile, "w") as outf:
for gene in tqdm(token_pos, desc="Save token positions"):
chrom = gene_anno[gene]["chrom"]
strand = gene_anno[gene]["strand"]
for token in token_pos[gene]:
print(chrom, token[0], token[1], token[2], gene, strand, sep="\t", file=outf)
return token_pos
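The delicate step in `tokenization` is mapping token character offsets back to forward-strand coordinates on the minus strand, where the sequence handed to the tokenizer is already the reverse complement. A self-contained sketch of that arithmetic (no tokenizer involved, just the formulas used above):

```python
# Illustrative only: reverse-strand offset arithmetic from the function above.
# Suppose an exon spans forward-strand coordinates [10, 16) and its reverse
# complement was tokenized; the anchor recorded for this piece is the exon end (16).
anchor = 16

# A token covering raw_seq[char_start:char_end] maps back to the forward strand as
#   g_start = anchor - char_end
#   g_end   = anchor - char_start
for char_start, char_end in [(0, 3), (3, 6)]:
    g_start = anchor - char_end
    g_end = anchor - char_start
    print((char_start, char_end), "->", (g_start, g_end))
# (0, 3) -> (13, 16)  first 3 antisense bases = last 3 exon bases on the forward strand
# (3, 6) -> (10, 13)
```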
In [9]:
def tokens_to_nerdata(tokens_bed, annotation_bed, outfile, named_entities, tags_id):
"""
Build a token‐level NER dataset by intersecting `tokens_bed` with `annotation_bed`.
Returns a dict: { 'id': [...geneIDs...], 'sequence': [[token1, token2, …], …],
'labels': [[label1, label2, …], …] }
and also writes two files:
1) “outfile” as a pickle of ner_info,
2) a companion “.token_sizes” file (same stem as outfile) containing “gene<TAB>token_count” for each gene.
"""
ne = named_entities
# 1) Split named_entities into “begin” (…0) and “inside” (…1) lookup maps
zero_map = {}
one_map = {}
for base_name, ner_label in ne.items():
# “intergenic” maps to 'O' no matter whether we’re at a “start” or “inside” —
# so we do it for both 'intergenic0' and 'intergenic1'.
if base_name == "intergenic":
zero_map["intergenic0"] = ner_label
one_map["intergenic1"] = ner_label
continue
# base_name will be something like “exon0” or “exon1”, “intron0”, “intron1”
# We want to know, whenever the token’s name is exactly “exon” and we’re at a “start” boundary,
# pick the B-EXON label. If the name is “exon” but it matched the previous gene-level “name”,
# then we call named_entities["exon1"] to get “I-EXON”.
if base_name.endswith("0"):
zero_map[base_name] = ner_label
else:
one_map[base_name] = ner_label
# 2) Perform the intersection once (-loj = “left outer join”) so we keep every token
intersection = BedTool(tokens_bed).intersect(annotation_bed, loj=True)
# 3) Prepare our output containers
ner_info = {
"id": [], # list of gene IDs (in the same order as we append)
"sequence": [], # each element is a list-of-strings (tokens)
"labels": [] # each element is a list-of-ints (NER tags)
}
# We'll accumulate (gene, token_count) pairs in-memory, then write them in bulk
sizes_buffer = []
# 4) Use defaultdict(set) to track “which token‐IDs we’ve already seen for each gene”
token_seen = defaultdict(set)
current_gene = None
tokens_list = []
labels_list = []
last_name = None # to know if “name == last_name” (inside vs start)
# 5) Iterate through every interval from the intersection
# We rely on intersect(..., loj=True) reporting intervals in the order of tokens_bed,
# so within each gene the tokens appear in the order written by tokenization().
for iv in intersection:
# Instead of “str(iv).split('\t')”, do:
chrom = iv.chrom
start = iv.start # integer
end = iv.end # integer
token = iv.name # 4th column of tokens_bed
gene = iv.fields[4] # 5th column of tokens_bed (original gene ID)
gene2 = iv.fields[9] # 10th field: gene from the annotation record (unused here)
name = iv.fields[10] # 11th field = the annotation “name”
# Build a unique‐ID for this token instance
token_id = (start, end)
# 6) When we see a new gene (i.e. “gene != current_gene”), we flush the previous gene’s data
if gene != current_gene:
# flush old gene if it exists
if current_gene is not None:
# Only append if we actually collected ≥1 token for current_gene
if tokens_list:
sizes_buffer.append((current_gene, len(tokens_list)))
ner_info["id"].append(current_gene)
ner_info["sequence"].append(tokens_list)
ner_info["labels"].append(labels_list)
count = len(ner_info["id"])
if count % 100 == 0:
print(count)
# Reset for the new gene
current_gene = gene
tokens_list = []
labels_list = []
last_name = None
# 7) If we’ve already seen this exact (start,end) “token_id” for this gene, skip
if token_id in token_seen[gene]:
continue
token_seen[gene].add(token_id)
# 8) Determine the correct NER‐tag (integer) for this token
# - If name == "-1" → treat as “intergenic”
# - If name == last_name → we pick “inside” (use one_map[name + "1"])
# - else → we pick “start” (use zero_map[name + "0"])
if name == "-1":
base_name = "intergenic"
ner_label = ne[base_name] # always “O”
else:
# If it matched the previous token’s annotation name, choose inside
if name == last_name:
lookup_key = name + "1" # e.g. “exon1” → I-EXON
ner_label = one_map.get(lookup_key)
# If somehow it’s missing, fall back to “start” logic
if ner_label is None:
ner_label = zero_map[name + "0"]
else:
# new annotation segment → start
lookup_key = name + "0" # e.g. “exon0” → B-EXON
ner_label = zero_map.get(lookup_key)
# If it’s missing, fall back to “intergenic”
if ner_label is None:
ner_label = ne["intergenic"]
ner_tag = tags_id[ner_label]
last_name = name
# 9) Append the token string + numeric label
tokens_list.append(token)
labels_list.append(ner_tag)
# 10) Don’t forget to flush the final gene once the loop ends
if current_gene is not None and tokens_list:
sizes_buffer.append((current_gene, len(tokens_list)))
ner_info["id"].append(current_gene)
ner_info["sequence"].append(tokens_list)
ner_info["labels"].append(labels_list)
print(".", end="")
# 11) Write out the token_sizes file in one go
sizes_file = outfile.rsplit(".", 1)[0] + ".token_sizes"
with open(sizes_file, "w") as tsf:
for gene_name, count in sizes_buffer:
tsf.write(f"{gene_name}\t{count}\n")
# 12) Finally, pickle‐dump ner_info
with open(outfile, "wb") as handle:
pickle.dump(ner_info, handle, protocol=pickle.HIGHEST_PROTOCOL)
return ner_info
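`tokens_to_nerdata` joins two BED-like files: `tokens_bed` written by `tokenization` (chrom, start, end, token, gene, strand) and `annotation_bed` with the exon/intron records (chrom, start, end, gene, feature, strand). After the left outer join, `fields[4]` is the token's gene and `fields[10]` is the overlapping feature name; the function treats "-1" in that column as "no overlapping feature" and labels such tokens `O`. A minimal sketch of the expected column layout, with made-up intervals:

```python
# Illustrative only: column layout expected from the left outer join (-loj).
from pybedtools import BedTool

toy_tokens = BedTool("Chr1\t100\t106\tATGCGA\tGeneA\t+\n", from_string=True)
toy_anno = BedTool("Chr1\t100\t200\tGeneA\texon\t+\n", from_string=True)

for iv in toy_tokens.intersect(toy_anno, loj=True):
    # iv.name is the token string; fields 6-11 come from the annotation record
    print(iv.name, iv.fields[4], iv.fields[10])   # -> ATGCGA GeneA exon
```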
In [10]:
# Download genome and gene annotation (make sure you have wget command in your path)
!wget -c https://rice.uga.edu/osa1r7_download/osa1_r7.asm.fa.gz
!wget -c https://rice.uga.edu/osa1r7_download/osa1_r7.all_models.gff3.gz
--2025-12-28 14:42:11--  https://rice.uga.edu/osa1r7_download/osa1_r7.asm.fa.gz
Resolving rice.uga.edu (rice.uga.edu)... 128.192.162.131
Connecting to rice.uga.edu (rice.uga.edu)|128.192.162.131|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 115858754 (110M) [application/x-gzip]
Saving to: ‘osa1_r7.asm.fa.gz’
osa1_r7.asm.fa.gz 100%[===================>] 110.49M 290KB/s in 3m 0s
2025-12-28 14:45:13 (629 KB/s) - ‘osa1_r7.asm.fa.gz’ saved [115858754/115858754]
--2025-12-28 14:45:13--  https://rice.uga.edu/osa1r7_download/osa1_r7.all_models.gff3.gz
Resolving rice.uga.edu (rice.uga.edu)... 128.192.162.131
Connecting to rice.uga.edu (rice.uga.edu)|128.192.162.131|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 8901387 (8.5M) [application/x-gzip]
Saving to: ‘osa1_r7.all_models.gff3.gz’
osa1_r7.all_models. 100%[===================>] 8.49M 1.49MB/s in 10s
2025-12-28 14:45:25 (850 KB/s) - ‘osa1_r7.all_models.gff3.gz’ saved [8901387/8901387]
In [11]:
# Load genome sequence
genome_file = "osa1_r7.asm.fa.gz"
genome = Fasta(genome_file)
# Load annotation
gene_anno = {}
with gzip.open("osa1_r7.all_models.gff3.gz", "rt") as infile:
for line in tqdm(infile):
if line.startswith("#") or line.startswith("\n"):
continue
info = line.strip().split("\t")
chrom = info[0]
datatype = info[2]
start = int(info[3]) - 1
end = int(info[4])
strand = info[6]
description = info[8].split(";")
if datatype == "gene":
for item in description:
if item.startswith("Name="):
gene = item[5:]
if gene not in gene_anno:
gene_anno[gene] = {}
gene_anno[gene]["chrom"] = chrom
gene_anno[gene]["start"] = start
gene_anno[gene]["end"] = end
gene_anno[gene]["strand"] = strand
gene_anno[gene]["isoform"] = {}
elif datatype in ["exon"]:
for item in description:
if item.startswith("Parent="):
isoform = item[7:].split(',')[0]
if isoform not in gene_anno[gene]["isoform"]:
gene_anno[gene]["isoform"][isoform] = []
gene_anno[gene]["isoform"][isoform].append([datatype, start, end])
# Get full gene annotation information and save
gene_info = get_gene_annotation(gene_anno)
annotation_bed = "rice_annotation.bed"
with open(annotation_bed, "w") as outf:
for gene in sorted(gene_anno, key=lambda x: (gene_anno[x]["chrom"], gene_anno[x]["start"])):
chrom = gene_anno[gene]["chrom"]
strand = gene_anno[gene]["strand"]
if strand == "+":
for item in gene_info[gene]:
print(item[0], item[1], item[2], gene, item[3], item[4], sep="\t", file=outf)
else:
for item in gene_info[gene][::-1]:
print(item[0], item[1], item[2], gene, item[3], item[4], sep="\t", file=outf)
813791it [00:01, 467221.72it/s]
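The parsing loop above converts GFF3's 1-based, inclusive coordinates to the 0-based, half-open convention used by BED files and Python slicing (`start = int(info[3]) - 1`, `end` unchanged). A small illustrative example on a single made-up `gene` line:

```python
# Illustrative only: 1-based GFF3 coordinates -> 0-based half-open coordinates,
# exactly as done in the parsing loop above (the GFF3 line itself is made up).
line = "Chr1\tsource\tgene\t1001\t2000\t.\t+\t.\tID=GeneA;Name=GeneA"
info = line.strip().split("\t")
start = int(info[3]) - 1   # 1001 (1-based, inclusive) -> 1000 (0-based)
end = int(info[4])         # 2000 stays 2000 (half-open end)
print(info[0], start, end, info[6])   # Chr1 1000 2000 +
```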
In [12]:
# Load configs, model and tokenizer
configs = load_config("./ner_task_config.yaml")
model_name = "zhangtaolab/plant-dnagpt-6mer"
model, tokenizer = load_model_and_tokenizer(model_name, task_config=configs['task'], source="modelscope")
Downloading Model from https://www.modelscope.cn to directory: /Users/forrest/.cache/modelscope/hub/models/zhangtaolab/plant-dnagpt-6mer
2025-12-28 14:45:31,636 - modelscope - INFO - Got 8 files, start to download ...
2025-12-28 14:46:04,983 - modelscope - INFO - Download model 'zhangtaolab/plant-dnagpt-6mer' successfully.
14:46:04 - dnallm.utils.support - INFO - Model files are stored in /Users/forrest/.cache/modelscope/hub/models/zhangtaolab/plant-dnagpt-6mer
Some weights of GPT2ForTokenClassification were not initialized from the model checkpoint at /Users/forrest/.cache/modelscope/hub/models/zhangtaolab/plant-dnagpt-6mer and are newly initialized: ['classifier.bias', 'classifier.weight'] You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
In [13]:
# Tokenize gene sequences and generate NER dataset format
print("# Performing sequence tokenization...")
tokens_bed = "rice_genes_tokens.bed"
token_pos = tokenization(genome, gene_anno, gene_info, tokenizer, tokens_bed, ext_list, sampling=2000)
# Performing sequence tokenization...
Genes: 100%|██████████| 2000/2000 [00:30<00:00, 64.84it/s]
Token indices sequence length is longer than the specified maximum sequence length for this model (606 > 512). Running this sequence through the model will result in indexing errors
Save token positions: 100%|██████████| 1926/1926 [00:01<00:00, 1243.50it/s]
In [14]:
print("# Generate NER dataset...")
dataset = 'rice_gene_ner.pkl'
ner_info = tokens_to_nerdata(tokens_bed, annotation_bed, dataset, named_entities, tags_id)
print("# Generate NER dataset...")
dataset = 'rice_gene_ner.pkl'
ner_info = tokens_to_nerdata(tokens_bed, annotation_bed, dataset, named_entities, tags_id)
# Generate NER dataset... 100 200 300 400 500 600 700 800 900 1000 1100 1200 1300 1400 1500 1600 1700 1800 1900 .
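Before handing the pickle to `DNADataset` below, it can help to peek at its structure: `tokens_to_nerdata` stores parallel lists under the keys `id`, `sequence`, and `labels`. A small sketch reading the file just written:

```python
# Inspect the generated NER dataset: parallel lists of gene ids, token lists, and label lists.
import pickle

with open("rice_gene_ner.pkl", "rb") as handle:
    ner_check = pickle.load(handle)

print(list(ner_check.keys()))         # ['id', 'sequence', 'labels']
print(ner_check["sequence"][0][:5])   # first five tokens of the first gene
print(ner_check["labels"][0][:5])     # their numeric IOB tags
print(len(ner_check["id"]), "genes in total")
```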
In [15]:
from dnallm import DNADataset, DNATrainer
In [16]:
# Load the datasets
datasets = DNADataset.load_local_data("./rice_gene_ner.pkl", seq_col="sequence", label_col="labels", tokenizer=tokenizer, max_length=1024)
# Encode the sequences with given task's data collator
datasets.encode_sequences(task=configs['task'].task_type, remove_unused_columns=True)
# Split the dataset into train, test, and validation sets
datasets.split_data()
Format labels: 0%| | 0/1926 [00:00<?, ? examples/s]
Encoding inputs: 0%| | 0/1926 [00:00<?, ? examples/s]
In [17]:
# check the dataset
if hasattr(datasets.dataset, 'keys'):
for split_name in datasets.dataset.keys():
print(f"{split_name}: {len(datasets.dataset[split_name])} samples")
train: 1348 samples
test: 385 samples
val: 193 samples
In [18]:
# Initialize the trainer
trainer = DNATrainer(
model=model,
config=configs,
datasets=datasets
)
In [19]:
# Start training
metrics = trainer.train()
print(metrics)
[1011/1011 47:50, Epoch 3/3]
| Step | Training Loss | Validation Loss | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|---|
| 500 | 0.494600 | 0.288843 | 0.905063 | 0.585748 | 0.515507 | 0.548387 |
| 1000 | 0.180700 | 0.193303 | 0.931075 | 0.598708 | 0.574776 | 0.586498 |
{'train_runtime': 2882.0198, 'train_samples_per_second': 1.403, 'train_steps_per_second': 0.351, 'total_flos': 2113463702642688.0, 'train_loss': 0.33529989231234136, 'epoch': 3.0}
In [20]:
# Do prediction on the test set
predictions = trainer.infer()
print(predictions.metrics)
{'test_loss': 0.21177801489830017, 'test_accuracy': 0.9327967958079203, 'test_precision': 0.6445709638064254, 'test_recall': 0.6024325351577346, 'test_f1': 0.6227897838899803, 'test_runtime': 75.823, 'test_samples_per_second': 5.078, 'test_steps_per_second': 0.33}