Counting seqs in FASTAs with Python

Posted on Jan 27, 2025

I was working with a script in mgnify-pipelines-toolkit, and I bumped into some issues running the test suite locally on my laptop (running macOS).

The problem was caused by how the script counted the number of reads in FASTA and FASTQ files. It used the good ol' subprocess.Popen way, running zcat, grep and wc -l to get the count. The subprocess calls expected the GNU coreutils versions of those tools, and there was a missing flag of sorts in zcat. I decided to look for a pure Python implementation of this.

The biopython way

Using biopython, it’s fairly straightforward:

# /// script
# dependencies = [
#   "biopython",
# ]
# ///

from Bio import SeqIO
import gzip

with gzip.open("test.fasta.gz", "rt") as handle:
    record_count = sum(1 for _ in SeqIO.parse(handle, "fasta"))
    print(record_count)

This is fine; you get the number of records. But it’s a bit slow compared to just running grep -c '>'.

$ time uv run test_biopython.py
224191
uv run test_biopython.py  1.28s user 0.04s system 97% cpu 1.356 total

$ time zgrep -c '>' test.fasta.gz
224191
zgrep -c '>' test.fasta.gz  0.90s user 0.02s system 99% cpu 0.918 total

I’m using uv to bootstrap the dependencies (highly recommended). The time I reported was measured after the biopython package had been cached. I know this is not a perfect benchmark, but it’s close enough to prove my point.

Another thing that annoys me is that I need to give a file handle to SeqIO.parse, and I need to check if the file is compressed or not (to either pass an open() or gzip.open() file handle). I think the ergonomics of this API are not the best.
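One way to paper over this is a small helper that peeks at the first two bytes of the file and picks the right opener. This is only a minimal sketch using the standard library; open_maybe_gzip is a name I made up for illustration and is not part of biopython. It relies on every gzip stream starting with the magic bytes 0x1f 0x8b.

```python
import gzip
from pathlib import Path

# Every gzip stream starts with these two magic bytes.
GZIP_MAGIC = b"\x1f\x8b"


def open_maybe_gzip(path):
    """Return a text-mode handle for a plain or gzip-compressed file.

    Hypothetical helper: it sniffs the magic bytes instead of trusting
    the file extension.
    """
    path = Path(path)
    with path.open("rb") as raw:
        is_gzip = raw.read(2) == GZIP_MAGIC
    if is_gzip:
        return gzip.open(path, "rt")
    return path.open("rt")
```

The returned handle can then be used in a with block and passed to SeqIO.parse(handle, "fasta") exactly like the gzip.open() handle above. It works, but it is still boilerplate I would rather not carry around.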

Pyfastx

Dissatisfied with this, I started to look for a better solution and bumped into Pyfastx.

# /// script
# dependencies = [
#   "pyfastx",
# ]
# ///

import pyfastx

fasta = pyfastx.Fasta("test.fasta.gz", build_index=False)
count = sum(1 for _ in fasta)
print(count)

So much simpler: it automatically handles gzip files, and it’s noticeably faster than biopython (0.84s vs 1.36s total in these runs, pretty comparable to grep).

$ time uv run test_pyfastx.py
224191
uv run test_pyfastx.py  0.81s user 0.03s system 99% cpu 0.839 total
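For completeness, if adding a dependency is not an option, the zcat, grep and wc -l approach can be approximated with just the standard library. This is a rough sketch, not something I benchmarked: the function names are made up for illustration, the input is assumed to be gzip-compressed, and the FASTQ count assumes the usual four lines per record with no wrapped sequences and no trailing blank lines.

```python
import gzip


def count_fasta_records(path):
    """Count header lines (lines starting with '>') in a gzipped FASTA.

    Roughly what `zgrep -c '^>'` does; reading bytes avoids decoding
    every line to str.
    """
    with gzip.open(path, "rb") as handle:
        return sum(1 for line in handle if line.startswith(b">"))


def count_fastq_records(path):
    """Count records in a gzipped FASTQ, assuming 4 lines per record.

    Counting '@' lines is not reliable because quality strings may also
    start with '@', so count all lines and divide by four instead.
    """
    with gzip.open(path, "rb") as handle:
        n_lines = sum(1 for _ in handle)
    return n_lines // 4
```

Unlike pyfastx or biopython, this does no validation at all, so a malformed file will silently give a wrong count.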