Hi everyone, I have been working on a small bioinformatics project that
stores FASTQ (http://en.wikipedia.org/wiki/FASTQ_format) data in a SQLite
database.

The app works by reading the FASTQ file (through Boost memory-mapped
regions), running the parser over it, binding the parsed values using
sqlite3_bind_xxx, and inserting them via a prepared statement wrapped
within BEGIN...END statements, roughly as sketched below.
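
To make that concrete, the insert path looks roughly like this (a minimal
sketch, not my exact code: the "reads" table and the 3-column Record
struct stand in for my real 13-member struct, and error checking is
omitted):

#include <sqlite3.h>
#include <string>
#include <vector>

struct Record {                         // stand-in for my 13-member struct
    std::string id;
    int length;
    std::vector<unsigned char> packed;  // the ~151-byte BLOB
};

void insert_all(sqlite3 *db, const std::vector<Record> &records) {
    sqlite3_exec(db, "BEGIN", nullptr, nullptr, nullptr);
    sqlite3_stmt *stmt = nullptr;
    sqlite3_prepare_v2(db, "INSERT INTO reads VALUES (?1, ?2, ?3)",
                       -1, &stmt, nullptr);
    for (const Record &r : records) {
        sqlite3_bind_text(stmt, 1, r.id.c_str(), -1, SQLITE_STATIC);
        sqlite3_bind_int(stmt, 2, r.length);
        // SQLITE_STATIC is fine here: r.packed outlives the step
        sqlite3_bind_blob(stmt, 3, r.packed.data(),
                          static_cast<int>(r.packed.size()), SQLITE_STATIC);
        sqlite3_step(stmt);   // run the INSERT
        sqlite3_reset(stmt);  // reuse the statement for the next row
    }
    sqlite3_finalize(stmt);
    sqlite3_exec(db, "COMMIT", nullptr, nullptr, nullptr);
}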

The FASTQ file is parsed using a Boost::Spirit parser I wrote that emits
structs with 13 members, which get bound to the fields of a sqlite3
prepared statement. One of those fields is a BLOB column that holds about
151 bytes of data per row (if you look at the FASTQ format, I basically
combined the quality and sequence lines using a simple proprietary
"compression" scheme that achieves 50% compression; the general idea is
sketched below). There are no indexes other than the default rowid.
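
To give a flavour of the packing (the real scheme is proprietary, so this
is only an illustrative sketch of how folding each base/quality pair into
one byte yields the 50% figure - the 2-bit base code and 6-bit quality
bin here are assumptions, not my actual layout):

#include <cstdint>
#include <string>
#include <vector>

// Fold each (base, quality) pair into one byte: 2N input bytes become N
// output bytes, i.e. 50% compression for a 151-base read.
std::vector<uint8_t> pack(const std::string &seq, const std::string &qual) {
    std::vector<uint8_t> out;
    out.reserve(seq.size());
    for (size_t i = 0; i < seq.size(); ++i) {
        uint8_t base;
        switch (seq[i]) {               // 'N' handling omitted in this sketch
            case 'A': base = 0; break;
            case 'C': base = 1; break;
            case 'G': base = 2; break;
            default:  base = 3; break;  // 'T'
        }
        // Phred+33: subtract '!' and clamp into 6 bits (scores 0-63)
        uint8_t q = static_cast<uint8_t>(qual[i] - '!') & 0x3F;
        out.push_back(static_cast<uint8_t>((base << 6) | q));
    }
    return out;
}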

I tested this on several datasets of 100K, 500K, and 91 million sequences
(each sequence is made up of 4 lines of ASCII text in the original file).
The data is paired-end, meaning the dataset comes in pairs: if one FASTQ
file contains 100K sequences, there is a paired FASTQ file containing
another 100K from the "other end" of the DNA fragment (so the 182M-row
figure below covers both files of the largest pair). I have currently
achieved:

100K rows - 1.11s
500K rows - 5.12s
182M rows - 1772s

which works out to approximately 90K-100K inserts per second. The testing
platform is currently an i7 laptop with an SSD, and it might transition to
a computing cluster in a university environment if the project works out.
I am trying to see if it is possible to achieve even higher throughput -
my professor was adamant that he once managed to insert 3 million rows in
7 seconds (which works out to over 420K inserts per second), though he
could not find the code that did it or any record of those insertions.
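
In case it is relevant, my connection setup for the bulk load is along
these lines (a sketch of commonly suggested settings - I am not claiming
these are what my professor used, and I have not verified that each one
helps on my workload):

#include <sqlite3.h>

void tune_for_bulk_load(sqlite3 *db) {
    // Skip the fsync on commit - trades durability for speed.
    sqlite3_exec(db, "PRAGMA synchronous = OFF", nullptr, nullptr, nullptr);
    // Keep the rollback journal in memory instead of on disk.
    sqlite3_exec(db, "PRAGMA journal_mode = MEMORY", nullptr, nullptr, nullptr);
    // Negative cache_size is in KiB, so this asks for a ~64 MB page cache.
    sqlite3_exec(db, "PRAGMA cache_size = -64000", nullptr, nullptr, nullptr);
}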

When I profiled the application (using Instruments) and inverted the call
tree, the time spent within the program broke down as follows
(-> means "called by"):

2170ms  15.1% - sqlite3VdbeExec -> sqlite3_step -> insert function
2142ms  14.9% - pwrite -> unixWrite -> pager_write/pager_write_pagelist
1925ms  13.4% - std::string::insert -> boost::spirit
 539ms   3.7% - pack (my compression function)
...

The performance profile seems to suggest that I am already spending the
majority of the time inside SQLite's own insert path - I don't really see
any way to improve insert throughput further.

Hope someone more knowledgeable in this area can help me out with this!

Kevin Xu