Hi Shane,

The SQLite page_size is at its default setting (1024 bytes). While I have not dug up the details on my specific SSD (Apple SSD SD512E, in the early-2013 Retina MBPs), a cursory Google search reveals that SSDs report a 512-byte sector size for compatibility reasons while internally using a different native sector size (the term is unintuitive anyway, since SSDs aren't cylindrical and thus have no physical sectors).
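For what it's worth, here is a minimal sketch of how I would check and bump the page size before creating any tables (the 4096-byte value and the file name are just illustrative assumptions, not measurements of my drive; note the pragma only takes effect before the first write to the file, or after a VACUUM):

```python
import os
import sqlite3
import tempfile

# Hypothetical path for illustration; the real db name comes from my loader.
path = os.path.join(tempfile.mkdtemp(), "fastq.db")
conn = sqlite3.connect(path)

# page_size must be set before the first table is created
# (otherwise it is ignored until the database is VACUUMed).
conn.execute("PRAGMA page_size = 4096")
conn.execute("CREATE TABLE reads (id INTEGER PRIMARY KEY, seq BLOB)")
conn.commit()

print(conn.execute("PRAGMA page_size").fetchone()[0])  # → 4096
```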
I have put up my insertion code here: http://pastebin.com/yjRW2mh3 and a FASTQ sample here: http://pastebin.com/5bRmVjb8

From the sample and my insertion code, you can see that the majority of the binds are ints, with three strings: the instrument name (the hardware that sequenced the genome), the flow cell name, and the index (a 6-10 character ATCG sequence that identifies the original DNA strand… I think). The blob field is ~150 bytes long.

I do try to reduce the size of the records I am inserting; the blob is already compressed (albeit in my naive way). However, since I am reading the FASTQ file sequence-by-sequence and dumping each record into the SQLite db as soon as possible, I do not do much normalization during the load (insert triggers were really slow for me, and tracking the key/value mappings myself might mean holding a possibly large amount of data in memory, though that is possible). Once all the data is inserted, a separate function normalizes and vacuums the database.

From my insertion code above, you can see I did not define an explicit primary key, so "WITHOUT ROWID" would not apply. (I assume a rowid performs better than the composite primary key I would otherwise have to define; please correct me if not.)

Yes, I have been tasked with investigating the performance profile of this application on the cluster, which is why I have been slightly wary of over-optimizing page/cache sizes for my laptop's hardware. However, my preliminary testing seems to reveal a similar profile: approximately the same times, slightly slower since the NAS on the cluster is, to say the least, bad. You can read the specifications of the cluster I am going to be running this on here: http://aciss-computing.uoregon.edu/about-us/hardware/ (I don't think the NAS is performing up to spec; other faculty have given me negative impressions of its performance).

Ideally, a bulk load will be done when the scientist gets enough raw FASTQ data from the sequencer for his/her specimen.
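To make the post-load normalization step concrete, here is a rough sketch of the kind of thing my cleanup function does, following your integer-enumeration suggestion: build a lookup table of the distinct instrument strings and swap the strings for ids. The table and column names (and the instrument values) are made up for illustration; my real schema is in the pastebin above, and the real function also VACUUMs afterward.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Simplified stand-in for the real reads table; instrument names
    -- here are invented examples.
    CREATE TABLE reads (instrument TEXT, tile INTEGER);
    INSERT INTO reads VALUES ('INSTR-A', 1), ('INSTR-A', 2), ('INSTR-B', 1);

    -- Build an integer enumeration of the distinct instrument names.
    CREATE TABLE instruments (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    INSERT INTO instruments (name) SELECT DISTINCT instrument FROM reads;

    -- Replace the repeated string with its integer id; the old string
    -- column would then be rebuilt away before vacuuming.
    ALTER TABLE reads ADD COLUMN instrument_id INTEGER;
    UPDATE reads SET instrument_id =
        (SELECT id FROM instruments WHERE name = reads.instrument);
""")

rows = conn.execute(
    "SELECT r.tile, i.name FROM reads r "
    "JOIN instruments i ON i.id = r.instrument_id"
).fetchall()
print(rows)
```

This keeps the load loop itself dumb and fast, at the cost of a second pass over the data.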
Theoretically, the scientist will only need to create this db once for every fragment of DNA from his target specimen, and then use it as a data source to feed into subsequent tools that manipulate the data. However, the data is often humongous, and new data is often collected by other researchers in the same lab and has to be merged with existing specimens. So yes, the data is loaded once and read many times, though each scientist has this habit of loading their own data instead of using someone else's, and the faster he/she can load new data, the better.

Kevin

On Apr 2, 2014, at 9:52 AM, Shane Harrelson <shane.harrel...@gmail.com> wrote:

> What is your page size? Make sure it matches the sector size of your
> device. SSDs can be picky about write sizes. As an experiment, you might
> try using larger page sizes that are multiples of the sector size.
>
> Try to reduce the size of the records you are writing. Ie. can you map any
> string data to an integer enumeration? Perhaps you can "normalize" some
> of the columns to de-duplicate the amount of data stored. Can you
> compress your blob data? In general, the less bytes you have to write the
> better.
>
> If possible, make sure your records fit "nicely" into your page size as
> this should make writes more efficient. The sqlite-analyzer might give you
> some clues about the "shape" of your database.
> http://www.sqlite.org/download.html
>
> Do you have a non-integer column defined as a "PRIMARY KEY"? Perhaps the
> "WITHOUT ROWID" optimization will help.
> http://www.sqlite.org/withoutrowid.html
>
> You mentioned that this work might get moved to a "computing cluster". As
> a caution, your tuning for your mac laptop/ssd combination might make
> little difference (or even make it worse) on a machine with different
> cpu/memory/io characteristics. Ie. you might be severely CPU bound there.
>
> Are you really going to be doing bulk loads like this all the time?
> Usually data is loaded once, then read many times.
> If not, perhaps you should be focused on improving your query performance.
>
> HTH.
> -Shane

_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users