Hi Shane,

The SQLite page_size is at its default setting (1024 bytes). While I have not dug up the details on my specific SSD (Apple SSD SD512E, in the early-2013 Retina MBPs), a cursory Google search reveals that SSDs report a 512-byte sector size for compatibility reasons while internally using a different native sector size (the term is unintuitive anyway, since SSDs aren't cylindrical and thus have no physical sectors).
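For what it's worth, here is a minimal sketch of how I would check and bump the page size before creating any tables (the 4096-byte value and the file name are just illustrative assumptions, not measurements of my drive; note the pragma only takes effect before the first write to the file, or after a VACUUM):

```python
import os
import sqlite3
import tempfile

# Hypothetical path for illustration; the real db name comes from my loader.
path = os.path.join(tempfile.mkdtemp(), "fastq.db")
conn = sqlite3.connect(path)

# page_size must be set before the first table is created
# (otherwise it is ignored until the database is VACUUMed).
conn.execute("PRAGMA page_size = 4096")
conn.execute("CREATE TABLE reads (id INTEGER PRIMARY KEY, seq BLOB)")
conn.commit()

print(conn.execute("PRAGMA page_size").fetchone()[0])  # → 4096
```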
I have put up my insertion code here: http://pastebin.com/yjRW2mh3 and a FASTQ sample here: http://pastebin.com/5bRmVjb8

From the sample and my insertion code, you can see that the majority of the binds are ints, with three strings: the instrument name (the hardware that sequenced the genome), the flow cell name, and the index (a 6-10 character ATCG sequence that identifies the original DNA strand… I think). The blob field is ~150 bytes long.

I do try to reduce the size of the records I am inserting; the blob is already compressed (albeit in my naive way). However, since I am reading the FASTQ file sequence-by-sequence and dumping each record into the SQLite db as soon as possible, I do not do much normalization during the load (insert triggers were really slow for me, and tracking the key/value mappings myself might mean holding a possibly large amount of data in memory, though that is possible). Once all the data is inserted, a separate function normalizes and vacuums the database.

From my insertion code above, you can see I did not define an explicit primary key, so "WITHOUT ROWID" would not apply. (I assume a rowid performs better than the composite primary key I would otherwise have to define; please correct me if not.)

Yes, I have been tasked with investigating the performance profile of this application on the cluster, which is why I have been slightly wary of over-optimizing page/cache sizes for my laptop's hardware. However, my preliminary testing seems to reveal a similar profile: approximately the same times, slightly slower since the NAS on the cluster is, to say the least, bad. You can read the specifications of the cluster I am going to be running this on here: http://aciss-computing.uoregon.edu/about-us/hardware/ (I don't think the NAS is performing up to spec; other faculty have given me negative impressions of its performance).

Ideally, a bulk load will be done when the scientist gets enough raw FASTQ data from the sequencer for his/her specimen.
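To make the post-load normalization step concrete, here is a rough sketch of the kind of thing my cleanup function does, following your integer-enumeration suggestion: build a lookup table of the distinct instrument strings and swap the strings for ids. The table and column names (and the instrument values) are made up for illustration; my real schema is in the pastebin above, and the real function also VACUUMs afterward.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Simplified stand-in for the real reads table; instrument names
    -- here are invented examples.
    CREATE TABLE reads (instrument TEXT, tile INTEGER);
    INSERT INTO reads VALUES ('INSTR-A', 1), ('INSTR-A', 2), ('INSTR-B', 1);

    -- Build an integer enumeration of the distinct instrument names.
    CREATE TABLE instruments (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    INSERT INTO instruments (name) SELECT DISTINCT instrument FROM reads;

    -- Replace the repeated string with its integer id; the old string
    -- column would then be rebuilt away before vacuuming.
    ALTER TABLE reads ADD COLUMN instrument_id INTEGER;
    UPDATE reads SET instrument_id =
        (SELECT id FROM instruments WHERE name = reads.instrument);
""")

rows = conn.execute(
    "SELECT r.tile, i.name FROM reads r "
    "JOIN instruments i ON i.id = r.instrument_id"
).fetchall()
print(rows)
```

This keeps the load loop itself dumb and fast, at the cost of a second pass over the data.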
Theoretically, the scientist will only need to create this db once for every fragment of DNA from his target specimen, and then use it as a data source to feed into subsequent tools that manipulate the data. However, the data is often humongous, and new data is often collected by other researchers in the same lab and has to be merged with existing specimens. So yes, the data is loaded once and read many times, though each scientist has this habit of loading their own data instead of using someone else's, and the faster he/she can load new data, the better.

Kevin

On Apr 2, 2014, at 9:52 AM, Shane Harrelson <shane.harrel...@gmail.com> wrote:

> What is your page size? Make sure it matches the sector size of your
> device. SSDs can be picky about write sizes. As an experiment, you might
> try using larger page sizes that are multiples of the sector size.
>
> Try to reduce the size of the records you are writing. Ie. can you map any
> string data to an integer enumeration? Perhaps you can "normalize" some
> of the columns to de-duplicate the amount of data stored. Can you
> compress your blob data? In general, the less bytes you have to write the
> better.
>
> If possible, make sure your records fit "nicely" into your page size as
> this should make writes more efficient. The sqlite-analyzer might give you
> some clues about the "shape" of your database.
> http://www.sqlite.org/download.html
>
> Do you have a non-integer column defined as a "PRIMARY KEY"? Perhaps the
> "WITHOUT ROWID" optimization will help.
> http://www.sqlite.org/withoutrowid.html
>
> You mentioned that this work might get moved to a "computing cluster". As
> a caution, your tuning for your mac laptop/ssd combination might make
> little difference (or even make it worse) on a machine with different
> cpu/memory/io characteristics. Ie. you might be severely CPU bound there.
>
> Are you really going to be doing bulk loads like this all the time?
> Usually data is loaded once, then read many times.
> If not, perhaps you should be focused on improving your query performance.
>
> HTH.
> -Shane

_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users