NER Data Generation
In [1]:
# First install two dependencies for generating NER datasets.
# Please make sure bedtools is installed on your system, e.g.
# !conda install -c bioconda bedtools
# or on macOS: brew install bedtools
!brew install bedtools
# Skip these steps if you have already installed the packages
!uv pip install pyfastx pybedtools
==> Fetching downloads for: bedtools
==> Pouring bedtools--2.31.1.arm64_tahoe.bottle.tar.gz
🍺  /opt/homebrew/Cellar/bedtools/2.31.1: 43 files, 1.7MB
Using Python 3.12.8 environment at: /Users/forrest/GitHub/DNALLM/.venv
Resolved 9 packages in 984ms
Installed 3 packages in 17ms
 + pybedtools==0.12.0
 + pyfastx==2.2.0
 + pysam==0.23.3
In [2]:
# verify intersectBed
!which intersectBed
/Users/forrest/miniconda3/bin/intersectBed
In [3]:
import gzip
import random
import numpy as np
from pyfastx import Fasta
from pybedtools import BedTool
from tqdm import tqdm
from collections import defaultdict
import pickle
from dnallm import load_config, load_model_and_tokenizer
In [4]:
# Set random seed
random.seed(42)
In [5]:
# Set the minimum and maximum flanking extension length around a gene
min_ext = 50
max_ext = 100
ext_list = [[random.randint(min_ext, max_ext), random.randint(min_ext, max_ext)] for x in range(60000)]
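Each gene sampled later gets its own random upstream/downstream flank drawn from this list. A quick sanity check (a minimal sketch using only the variables defined above) confirms the flank lengths stay within the configured bounds:

```python
# Sanity check on the pre-drawn flank extensions (uses min_ext, max_ext, ext_list from above)
assert len(ext_list) == 60000
assert all(min_ext <= left <= max_ext and min_ext <= right <= max_ext
           for left, right in ext_list)
print(ext_list[:3])  # first three [left, right] flank pairs (values depend on the seed)
```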
In [6]:
# Define the Named Entity Recognition (NER) tags and their corresponding ids
# NER tagging includes the IO, IOB, IOE, IOBES, BI, IE and BIES schemes; here we use the IOB scheme
# Example:
# ..........[ exon1 ]-----[ exon2 ]-------[ exon3 ]........
# 000000000012222222234444122222222344444412222222200000000
named_entities = {
'intergenic': 'O',
'exon0': 'B-EXON',
'exon1': 'I-EXON',
'intron0': 'B-INTRON',
'intron1': 'I-INTRON',
}
tags_id = {
'O': 0,
'B-EXON': 1,
'I-EXON': 2,
'B-INTRON': 3,
'I-INTRON': 4,
}
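To make the IOB scheme above concrete, here is a small illustrative sketch (not part of the pipeline) that applies the same begin/inside rule used later in `tokens_to_nerdata`: a token opens a new entity (`B-`) when its feature differs from the previous token's feature, and continues it (`I-`) otherwise.

```python
# Illustrative only: map a toy run of per-token feature names to IOB tag ids,
# using the named_entities and tags_id dictionaries defined above.
features = ["intergenic", "exon", "exon", "intron", "intron", "exon", "intergenic"]

labels = []
last = None
for name in features:
    if name == "intergenic":
        label = named_entities["intergenic"]   # always 'O'
    elif name == last:
        label = named_entities[name + "1"]     # inside: I-EXON / I-INTRON
    else:
        label = named_entities[name + "0"]     # begin: B-EXON / B-INTRON
    labels.append(tags_id[label])
    last = name

print(labels)  # [0, 1, 2, 3, 4, 1, 0]
```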
In [7]:
def get_gene_annotation(gene_anno):
cnt = 0
gene_info = {}
for gene in gene_anno:
gene_info[gene] = []
chrom = gene_anno[gene]["chrom"]
start = gene_anno[gene]["start"]
end = gene_anno[gene]["end"]
strand = gene_anno[gene]["strand"]
isoforms = gene_anno[gene]["isoform"]
# Get the representative isoform (the longest one)
if not isoforms:
continue
iso_lens = [(iso, sum([(x[2]-x[1]) for x in isoforms[iso]])) for iso in isoforms]
representative = sorted(iso_lens, key=lambda x:x[1])[-1][0]
isoform_info = isoforms[representative]
iso_start = min([x[1] for x in isoform_info])
iso_end = max([x[2] for x in isoform_info])
if iso_start == start and iso_end == end:
is_reverse = False if strand == "+" else True
# Get intron annotation
last = 0
for region in sorted(isoform_info, key=lambda x:x[1], reverse=is_reverse):
if strand == "+":
if last:
intron = [chrom, last, region[1], "intron", strand]
if intron[1] < intron[2]:
gene_info[gene].append(intron)
last = region[2]
else:
if last:
intron = [chrom, region[2], last, "intron", strand]
if intron[1] < intron[2]:
gene_info[gene].append(intron)
last = region[1]
gene_info[gene].append([chrom, region[1], region[2], region[0], strand])
cnt += 1
return gene_info
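A minimal sketch of what `get_gene_annotation` produces, using one hypothetical two-exon gene (names and coordinates are made up): the longest isoform is chosen as representative, introns are filled in between its exons, and each feature is returned as `[chrom, start, end, feature, strand]`.

```python
# Illustrative only: a toy two-exon gene on the forward strand.
toy_anno = {
    "GeneA": {
        "chrom": "Chr1", "start": 100, "end": 400, "strand": "+",
        "isoform": {"GeneA.1": [["exon", 100, 200], ["exon", 300, 400]]},
    }
}

toy_info = get_gene_annotation(toy_anno)
print(toy_info["GeneA"])
# [['Chr1', 100, 200, 'exon', '+'],
#  ['Chr1', 200, 300, 'intron', '+'],
#  ['Chr1', 300, 400, 'exon', '+']]
```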
In [8]:
def tokenization(genome, gene_anno, gene_info, tokenizer, outfile, ext_list, sampling=1e7):
"""
For each gene in `gene_anno`, extract the annotated exonic (and flanking) DNA subsequences,
run the tokenizer once per subsequence, reconstruct per-token character offsets,
and record the (genomic_start, genomic_end, token) tuples for all non-special tokens.
- genome: pyfastx Fasta object (genome[chrom][start:end].seq gives the forward sequence, .antisense the reverse complement)
- gene_anno: dict mapping gene_name → { "chrom": str, "strand": "+" or "-", ... }
- gene_info: dict mapping gene_name → list of [chrom, start, end, feature, strand] records (exons and introns)
- tokenizer: a HuggingFace-style tokenizer
- outfile: path of the BED-like file where the token positions are written
- ext_list: list of (left_extension, right_extension) pairs indexed by gene order
- sampling: randomly sample at most this many genes for tokenization
"""
# 1) Precompute special-tokens set for O(1) membership checks
sp_tokens = set(tokenizer.special_tokens_map.values())
token_pos = {}
# Since gene_anno is likely a dict, we need a stable way of iterating + indexing ext_list.
# We'll assume ext_list[i] corresponds to the i-th gene in `list(gene_anno.keys())`.
gene_list = list(gene_anno.keys())
if len(gene_list) > sampling:
gene_list = random.sample(gene_list, int(sampling))
for num, gene in enumerate(tqdm(gene_list, desc="Genes")):
chrom = gene_anno[gene]["chrom"]
strand = gene_anno[gene]["strand"]
# Skip genes not in gene_info or with empty annotation
if gene not in gene_info or not gene_info[gene]:
continue
# Determine exon‐range and extended boundaries
exon_coords = gene_info[gene]
# start = minimum exon_start; end = maximum exon_end
start = min(exon[1] for exon in exon_coords)
end = max(exon[2] for exon in exon_coords)
left_ext, right_ext = ext_list[num]
ext_start = max(0, start - left_ext)
ext_end = end + right_ext
# Shortcut: grab the full chromosome record once
chrom_record = genome[chrom]
# Build a list of (genomic_anchor, seq_string) for "+" or "-" strand
seqinfo = []
if strand == "+":
# 1) upstream flank
try:
upstream_seq = chrom_record[ext_start:start].seq
except Exception:
# If slicing fails, log and skip
print(f"ERROR: {chrom}\t{ext_start}\t{start}")
upstream_seq = ""
seqinfo.append((ext_start, str(upstream_seq)))
# 2) each annotated feature (exon or intron)
for feature in exon_coords:
exon_start = feature[1]
exon_end = feature[2]
if exon_start >= exon_end:
continue
seq = chrom_record[exon_start:exon_end].seq
seqinfo.append((exon_start, str(seq)))
# 3) downstream flank
downstream_seq = chrom_record[end:ext_end].seq
seqinfo.append((end, str(downstream_seq)))
else: # strand == "-"
# On the reverse‐strand, we want the reverse complement ("antisense").
# Note: .antisense on a pyfastx sequence slice returns the reverse complement.
# We still record the genomic anchor as if it were the left index on the + strand.
# But because the sequence is reversed, offset_mapping will need to be mapped differently.
# 1) “upstream” on reverse strand = (end → ext_end) in forward coords, but take antisense
try:
flank_seq = chrom_record[end:ext_end].antisense
except Exception:
print(f"ERROR (rev): {chrom}\t{end}\t{ext_end}")
flank_seq = ""
seqinfo.append((ext_end, str(flank_seq)))
# 2) each annotated feature (reverse complement)
for feature in exon_coords:
exon_start = feature[1]
exon_end = feature[2]
if exon_start >= exon_end:
continue
seq = chrom_record[exon_start:exon_end].antisense
# For mapping, we’ll anchor each token by the 5′-most position on the minus strand,
# but because the sequence is reversed, the “first character” of seq actually corresponds
# to genomic position = exon_end - 1 in forward coordinates, and the “last character” ↦ exon_start.
seqinfo.append((exon_end, str(seq)))
# 3) downstream on reverse strand = (ext_start → start) in forward coords, but antisense
flank_seq = chrom_record[ext_start:start].antisense
seqinfo.append((start, str(flank_seq)))
# Initialize the list for this gene
token_pos[gene] = []
# For each (anchor, raw_seq), run a single tokenizer(...) call
for anchor, raw_seq in seqinfo:
if not raw_seq:
continue
# 1) Tokenize with offsets (add_special_tokens=False so we skip [CLS], [SEP], etc.)
# “offset_mapping” is a list of (char_start, char_end) for each token in raw_seq.
# encoding = tokenizer(
# raw_seq,
# return_offsets_mapping=True,
# add_special_tokens=False
# )
# offsets = encoding["offset_mapping"]
# token_ids = encoding["input_ids"]
token_ids = tokenizer.encode(raw_seq, add_special_tokens=False)
tok_strs = tokenizer.convert_ids_to_tokens(token_ids)
offsets = []
cursor = 0
for tok in tok_strs:
char_start = cursor
char_end = cursor + len(tok)
offsets.append((char_start, char_end))
cursor = char_end
if len(offsets) != len(token_ids):
# This should never happen in a well‐formed tokenizer, but just in case:
raise RuntimeError("Offset mapping length ≠ token_ids length")
# 2) Iterate through each token + offset, skip special tokens, then map back to genome coords
for idx, (token_id, (char_start, char_end)) in enumerate(zip(token_ids, offsets)):
token_str = tokenizer.convert_ids_to_tokens(token_id)
# Skip if it’s one of the special tokens (“[PAD]”, “[CLS]”, etc.)
if token_str in sp_tokens:
continue
if strand == "+":
# On the forward strand, raw_seq[0] ↦ genomic position “anchor”.
# So any token covering raw_seq[char_start:char_end] ↦ genome positions [anchor+char_start : anchor+char_end]
g_start = anchor + char_start
g_end = anchor + char_end
else:
# On the reverse strand, raw_seq was already antisense (reverse), and “anchor” is the forward‐strand coordinate
# of the first character in raw_seq. That first character of raw_seq is actually genome position (anchor-1),
# and the last character of raw_seq is genome position (anchor - len(raw_seq)).
# More generally, for raw_seq index i, the corresponding forward‐strand position is:
# g_pos = anchor - 1 - i
#
# Thus, if the token covers raw_seq[char_start:char_end] (i.e. from i = char_start to i = char_end-1),
# its genomic coordinates (inclusive‐exclusive) on the forward strand are:
# g_end = (anchor - 1 - char_start) + 1 = anchor - char_start
# g_start = (anchor - 1 - (char_end - 1)) = anchor - char_end
#
# We want to store them as [g_start, g_end] with g_start < g_end. So:
g_start = anchor - char_end
g_end = anchor - char_start
token_pos[gene].append([g_start, g_end, token_str])
# save sequences and tokens
with open(outfile, "w") as outf:
for gene in tqdm(token_pos, desc="Save token positions"):
chrom = gene_anno[gene]["chrom"]
strand = gene_anno[gene]["strand"]
for token in token_pos[gene]:
print(chrom, token[0], token[1], token[2], gene, strand, sep="\t", file=outf)
return token_pos
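The delicate step in `tokenization` is mapping token character offsets back to forward-strand coordinates on the minus strand, where the sequence handed to the tokenizer is already the reverse complement. A self-contained sketch of that arithmetic (no tokenizer involved, just the formulas used above):

```python
# Illustrative only: reverse-strand offset arithmetic from the function above.
# Suppose an exon spans forward-strand coordinates [10, 16) and its reverse
# complement was tokenized; the anchor recorded for this piece is the exon end (16).
anchor = 16

# A token covering raw_seq[char_start:char_end] maps back to the forward strand as
#   g_start = anchor - char_end
#   g_end   = anchor - char_start
for char_start, char_end in [(0, 3), (3, 6)]:
    g_start = anchor - char_end
    g_end = anchor - char_start
    print((char_start, char_end), "->", (g_start, g_end))
# (0, 3) -> (13, 16)  first 3 antisense bases = last 3 exon bases on the forward strand
# (3, 6) -> (10, 13)
```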
In [9]:
def tokens_to_nerdata(tokens_bed, annotation_bed, outfile, named_entities, tags_id):
"""
Build a token‐level NER dataset by intersecting `tokens_bed` with `annotation_bed`.
Returns a dict: { 'id': [...geneIDs...], 'sequence': [[token1, token2, …], …],
'labels': [[label1, label2, …], …] }
and also writes two files:
1) “outfile” as a pickle of ner_info,
2) a companion “.token_sizes” file (same stem as outfile) containing “gene<TAB>token_count” for each gene.
"""
ne = named_entities
# 1) Split named_entities into “begin” (…0) and “inside” (…1) lookup maps
zero_map = {}
one_map = {}
for base_name, ner_label in ne.items():
# “intergenic” maps to 'O' no matter whether we’re at a “start” or “inside” —
# so we do it for both 'intergenic0' and 'intergenic1'.
if base_name == "intergenic":
zero_map["intergenic0"] = ner_label
one_map["intergenic1"] = ner_label
continue
# base_name will be something like “exon0” or “exon1”, “intron0”, “intron1”
# We want to know, whenever the token’s name is exactly “exon” and we’re at a “start” boundary,
# pick the B-EXON label. If the name is “exon” but it matched the previous gene-level “name”,
# then we call named_entities["exon1"] to get “I-EXON”.
if base_name.endswith("0"):
zero_map[base_name] = ner_label
else:
one_map[base_name] = ner_label
# 2) Perform the intersection once (-loj = “left outer join”) so we keep every token
intersection = BedTool(tokens_bed).intersect(annotation_bed, loj=True)
# 3) Prepare our output containers
ner_info = {
"id": [], # list of gene IDs (in the same order as we append)
"sequence": [], # each element is a list-of-strings (tokens)
"labels": [] # each element is a list-of-ints (NER tags)
}
# We'll accumulate (gene, token_count) pairs in-memory, then write them in bulk
sizes_buffer = []
# 4) Use defaultdict(set) to track “which token‐IDs we’ve already seen for each gene”
token_seen = defaultdict(set)
current_gene = None
tokens_list = []
labels_list = []
last_name = None # to know if “name == last_name” (inside vs start)
# 5) Iterate through every interval from the intersection
# We rely on intersect(..., loj=True) reporting intervals in the order of tokens_bed,
# so within each gene the tokens appear in the order written by tokenization().
for iv in intersection:
# Instead of “str(iv).split('\t')”, do:
chrom = iv.chrom
start = iv.start # integer
end = iv.end # integer
token = iv.name # 4th column of tokens_bed
gene = iv.fields[4] # 5th column of tokens_bed (original gene ID)
gene2 = iv.fields[9] # 10th field: gene from the annotation record (unused here)
name = iv.fields[10] # 11th field = the annotation “name”
# Build a unique‐ID for this token instance
token_id = (start, end)
# 6) When we see a new gene (i.e. “gene != current_gene”), we flush the previous gene’s data
if gene != current_gene:
# flush old gene if it exists
if current_gene is not None:
# Only append if we actually collected ≥1 token for current_gene
if tokens_list:
sizes_buffer.append((current_gene, len(tokens_list)))
ner_info["id"].append(current_gene)
ner_info["sequence"].append(tokens_list)
ner_info["labels"].append(labels_list)
count = len(ner_info["id"])
if count % 100 == 0:
print(count)
# Reset for the new gene
current_gene = gene
tokens_list = []
labels_list = []
last_name = None
# 7) If we’ve already seen this exact (start,end) “token_id” for this gene, skip
if token_id in token_seen[gene]:
continue
token_seen[gene].add(token_id)
# 8) Determine the correct NER‐tag (integer) for this token
# - If name == "-1" → treat as “intergenic”
# - If name == last_name → we pick “inside” (use one_map[name + "1"])
# - else → we pick “start” (use zero_map[name + "0"])
if name == "-1":
base_name = "intergenic"
ner_label = ne[base_name] # always “O”
else:
# If it matched the previous token’s annotation name, choose inside
if name == last_name:
lookup_key = name + "1" # e.g. “exon1” → I-EXON
ner_label = one_map.get(lookup_key)
# If somehow it’s missing, fall back to “start” logic
if ner_label is None:
ner_label = zero_map[name + "0"]
else:
# new annotation segment → start
lookup_key = name + "0" # e.g. “exon0” → B-EXON
ner_label = zero_map.get(lookup_key)
# If it’s missing, fall back to “intergenic”
if ner_label is None:
ner_label = ne["intergenic"]
ner_tag = tags_id[ner_label]
last_name = name
# 9) Append the token string + numeric label
tokens_list.append(token)
labels_list.append(ner_tag)
# 10) Don’t forget to flush the final gene once the loop ends
if current_gene is not None and tokens_list:
sizes_buffer.append((current_gene, len(tokens_list)))
ner_info["id"].append(current_gene)
ner_info["sequence"].append(tokens_list)
ner_info["labels"].append(labels_list)
print(".", end="")
# 11) Write out the token_sizes file in one go
sizes_file = outfile.rsplit(".", 1)[0] + ".token_sizes"
with open(sizes_file, "w") as tsf:
for gene_name, count in sizes_buffer:
tsf.write(f"{gene_name}\t{count}\n")
# 12) Finally, pickle‐dump ner_info
with open(outfile, "wb") as handle:
pickle.dump(ner_info, handle, protocol=pickle.HIGHEST_PROTOCOL)
return ner_info
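`tokens_to_nerdata` joins two BED-like files: `tokens_bed` written by `tokenization` (chrom, start, end, token, gene, strand) and `annotation_bed` with the exon/intron records (chrom, start, end, gene, feature, strand). After the left outer join, `fields[4]` is the token's gene and `fields[10]` is the overlapping feature name; the function treats "-1" in that column as "no overlapping feature" and labels such tokens `O`. A minimal sketch of the expected column layout, with made-up intervals:

```python
# Illustrative only: column layout expected from the left outer join (-loj).
from pybedtools import BedTool

toy_tokens = BedTool("Chr1\t100\t106\tATGCGA\tGeneA\t+\n", from_string=True)
toy_anno = BedTool("Chr1\t100\t200\tGeneA\texon\t+\n", from_string=True)

for iv in toy_tokens.intersect(toy_anno, loj=True):
    # iv.name is the token string; fields 6-11 come from the annotation record
    print(iv.name, iv.fields[4], iv.fields[10])   # -> ATGCGA GeneA exon
```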
In [10]:
# Download genome and gene annotation (make sure you have wget command in your path)
!wget -c https://rice.uga.edu/osa1r7_download/osa1_r7.asm.fa.gz
!wget -c https://rice.uga.edu/osa1r7_download/osa1_r7.all_models.gff3.gz
--2025-12-28 14:42:11--  https://rice.uga.edu/osa1r7_download/osa1_r7.asm.fa.gz
Resolving rice.uga.edu (rice.uga.edu)... 128.192.162.131
Connecting to rice.uga.edu (rice.uga.edu)|128.192.162.131|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 115858754 (110M) [application/x-gzip]
Saving to: ‘osa1_r7.asm.fa.gz’
osa1_r7.asm.fa.gz 100%[===================>] 110.49M 290KB/s in 3m 0s
2025-12-28 14:45:13 (629 KB/s) - ‘osa1_r7.asm.fa.gz’ saved [115858754/115858754]
--2025-12-28 14:45:13--  https://rice.uga.edu/osa1r7_download/osa1_r7.all_models.gff3.gz
Resolving rice.uga.edu (rice.uga.edu)... 128.192.162.131
Connecting to rice.uga.edu (rice.uga.edu)|128.192.162.131|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 8901387 (8.5M) [application/x-gzip]
Saving to: ‘osa1_r7.all_models.gff3.gz’
osa1_r7.all_models. 100%[===================>] 8.49M 1.49MB/s in 10s
2025-12-28 14:45:25 (850 KB/s) - ‘osa1_r7.all_models.gff3.gz’ saved [8901387/8901387]
In [11]:
# Load genome sequence
genome_file = "osa1_r7.asm.fa.gz"
genome = Fasta(genome_file)
# Load annotation
gene_anno = {}
with gzip.open("osa1_r7.all_models.gff3.gz", "rt") as infile:
for line in tqdm(infile):
if line.startswith("#") or line.startswith("\n"):
continue
info = line.strip().split("\t")
chrom = info[0]
datatype = info[2]
start = int(info[3]) - 1
end = int(info[4])
strand = info[6]
description = info[8].split(";")
if datatype == "gene":
for item in description:
if item.startswith("Name="):
gene = item[5:]
if gene not in gene_anno:
gene_anno[gene] = {}
gene_anno[gene]["chrom"] = chrom
gene_anno[gene]["start"] = start
gene_anno[gene]["end"] = end
gene_anno[gene]["strand"] = strand
gene_anno[gene]["isoform"] = {}
elif datatype in ["exon"]:
for item in description:
if item.startswith("Parent="):
isoform = item[7:].split(',')[0]
if isoform not in gene_anno[gene]["isoform"]:
gene_anno[gene]["isoform"][isoform] = []
gene_anno[gene]["isoform"][isoform].append([datatype, start, end])
# Get full gene annotation information and save
gene_info = get_gene_annotation(gene_anno)
annotation_bed = "rice_annotation.bed"
with open(annotation_bed, "w") as outf:
for gene in sorted(gene_anno, key=lambda x: (gene_anno[x]["chrom"], gene_anno[x]["start"])):
chrom = gene_anno[gene]["chrom"]
strand = gene_anno[gene]["strand"]
if strand == "+":
for item in gene_info[gene]:
print(item[0], item[1], item[2], gene, item[3], item[4], sep="\t", file=outf)
else:
for item in gene_info[gene][::-1]:
print(item[0], item[1], item[2], gene, item[3], item[4], sep="\t", file=outf)
813791it [00:01, 467221.72it/s]
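The parsing loop above converts GFF3's 1-based, inclusive coordinates to the 0-based, half-open convention used by BED files and Python slicing (`start = int(info[3]) - 1`, `end` unchanged). A small illustrative example on a single made-up `gene` line:

```python
# Illustrative only: 1-based GFF3 coordinates -> 0-based half-open coordinates,
# exactly as done in the parsing loop above (the GFF3 line itself is made up).
line = "Chr1\tsource\tgene\t1001\t2000\t.\t+\t.\tID=GeneA;Name=GeneA"
info = line.strip().split("\t")
start = int(info[3]) - 1   # 1001 (1-based, inclusive) -> 1000 (0-based)
end = int(info[4])         # 2000 stays 2000 (half-open end)
print(info[0], start, end, info[6])   # Chr1 1000 2000 +
```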
In [12]:
# Load configs, model and tokenizer
configs = load_config("./ner_task_config.yaml")
model_name = "zhangtaolab/plant-dnagpt-6mer"
model, tokenizer = load_model_and_tokenizer(model_name, task_config=configs['task'], source="modelscope")
Downloading Model from https://www.modelscope.cn to directory: /Users/forrest/.cache/modelscope/hub/models/zhangtaolab/plant-dnagpt-6mer
2025-12-28 14:45:31,636 - modelscope - INFO - Got 8 files, start to download ...
2025-12-28 14:46:04,983 - modelscope - INFO - Download model 'zhangtaolab/plant-dnagpt-6mer' successfully.
14:46:04 - dnallm.utils.support - INFO - Model files are stored in /Users/forrest/.cache/modelscope/hub/models/zhangtaolab/plant-dnagpt-6mer
Some weights of GPT2ForTokenClassification were not initialized from the model checkpoint at /Users/forrest/.cache/modelscope/hub/models/zhangtaolab/plant-dnagpt-6mer and are newly initialized: ['classifier.bias', 'classifier.weight'] You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
In [13]:
# Tokenize gene sequences and generate NER dataset format
print("# Performing sequence tokenization...")
tokens_bed = "rice_genes_tokens.bed"
token_pos = tokenization(genome, gene_anno, gene_info, tokenizer, tokens_bed, ext_list, sampling=2000)
# Performing sequence tokenization...
Genes: 100%|██████████| 2000/2000 [00:30<00:00, 64.84it/s]
Token indices sequence length is longer than the specified maximum sequence length for this model (606 > 512). Running this sequence through the model will result in indexing errors
Save token positions: 100%|██████████| 1926/1926 [00:01<00:00, 1243.50it/s]
In [14]:
print("# Generate NER dataset...")
dataset = 'rice_gene_ner.pkl'
ner_info = tokens_to_nerdata(tokens_bed, annotation_bed, dataset, named_entities, tags_id)
print("# Generate NER dataset...")
dataset = 'rice_gene_ner.pkl'
ner_info = tokens_to_nerdata(tokens_bed, annotation_bed, dataset, named_entities, tags_id)
# Generate NER dataset... 100 200 300 400 500 600 700 800 900 1000 1100 1200 1300 1400 1500 1600 1700 1800 1900 .
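Before handing the pickle to `DNADataset` below, it can help to peek at its structure: `tokens_to_nerdata` stores parallel lists under the keys `id`, `sequence`, and `labels`. A small sketch reading the file just written:

```python
# Inspect the generated NER dataset: parallel lists of gene ids, token lists, and label lists.
import pickle

with open("rice_gene_ner.pkl", "rb") as handle:
    ner_check = pickle.load(handle)

print(list(ner_check.keys()))         # ['id', 'sequence', 'labels']
print(ner_check["sequence"][0][:5])   # first five tokens of the first gene
print(ner_check["labels"][0][:5])     # their numeric IOB tags
print(len(ner_check["id"]), "genes in total")
```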
In [15]:
from dnallm import DNADataset, DNATrainer
In [16]:
# Load the datasets
datasets = DNADataset.load_local_data("./rice_gene_ner.pkl", seq_col="sequence", label_col="labels", tokenizer=tokenizer, max_length=1024)
# Encode the sequences with given task's data collator
datasets.encode_sequences(task=configs['task'].task_type, remove_unused_columns=True)
# Split the dataset into train, test, and validation sets
datasets.split_data()
Format labels: 0%| | 0/1926 [00:00<?, ? examples/s]
Encoding inputs: 0%| | 0/1926 [00:00<?, ? examples/s]
In [17]:
# check the dataset
if hasattr(datasets.dataset, 'keys'):
for split_name in datasets.dataset.keys():
print(f"{split_name}: {len(datasets.dataset[split_name])} samples")
train: 1348 samples
test: 385 samples
val: 193 samples
In [18]:
# Initialize the trainer
trainer = DNATrainer(
model=model,
config=configs,
datasets=datasets
)
In [19]:
# Start training
metrics = trainer.train()
print(metrics)
[1011/1011 47:50, Epoch 3/3]
| Step | Training Loss | Validation Loss | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|---|
| 500 | 0.494600 | 0.288843 | 0.905063 | 0.585748 | 0.515507 | 0.548387 |
| 1000 | 0.180700 | 0.193303 | 0.931075 | 0.598708 | 0.574776 | 0.586498 |
{'train_runtime': 2882.0198, 'train_samples_per_second': 1.403, 'train_steps_per_second': 0.351, 'total_flos': 2113463702642688.0, 'train_loss': 0.33529989231234136, 'epoch': 3.0}
In [20]:
# Do prediction on the test set
predictions = trainer.infer()
print(predictions.metrics)
{'test_loss': 0.21177801489830017, 'test_accuracy': 0.9327967958079203, 'test_precision': 0.6445709638064254, 'test_recall': 0.6024325351577346, 'test_f1': 0.6227897838899803, 'test_runtime': 75.823, 'test_samples_per_second': 5.078, 'test_steps_per_second': 0.33}