Daniel Meyer wrote:
We are interested in using SQLite as a read-only, in-memory, parallel-access database. We have database files on the order of 100 GB that we are loading into memory. We have found great performance when reading from a single thread, but we need to scale up to many parallel reader threads. Once the DB is created it never needs to be modified. How can we allow many reader threads on an in-memory, write-once-read-many database and achieve multi-core performance? Is this possible with SQLite?

Thanks for all the helpful responses. I have moved forward experimenting with parallel readers on an in-memory SQLite database. I have found that I get true concurrency (multi-core speed-up) when I create a new connection to my database file on disk in every thread. I've verified this by running a single long query and then running the same query in several threads, checking that the net time is the same as the single-thread query time.

To get parallel readers on an in-memory database, I first loaded the file into memory with:

    rc = sqlite3_open_v2("file::memory:?cache=shared", &db, SQLITE_OPEN_URI | SQLITE_OPEN_READWRITE, NULL);

I hold onto that db reference in the main thread after loading the data and don't close the connection until all the worker threads are done running my long query. I then spawn N threads, each creating its own connection like so:

    rc = sqlite3_open_v2("file::memory:?cache=shared", &db, SQLITE_OPEN_URI | SQLITE_OPEN_READONLY | SQLITE_OPEN_NOMUTEX, NULL);

I have experimented with the various flags; however, everything I do gives me serial performance. Some kind of mutex is locking the database, so that running my query, say, two times takes twice as long as running it once. With the disk-based approach, using this call in each thread:

    rc = sqlite3_open("data.sl3", &db);

the total time to run the query twice is the same as running it just once.
I am hypothesizing that since we are using 'cache=shared', every thread shares the same cache, and hence each read requires locking the database. What I would like is the same behavior as "file::memory:?cache=shared", wherein every new connection points at the same memory, but without actually sharing the cache, so that no global mutex locks the database on every read.

I have put my test code in a gist. My C code is here: https://gist.github.com/danielrmeyer/fae54d5993f2800626c616e72782b5eb

I generate the 1.5 GB test database with this Python 3.4 script: https://gist.github.com/danielrmeyer/bfa415256502471d1512f2155e76adc2

I compiled the C code on my system with the following command:

    gcc -std=gnu99 test.c -o test -lsqlite3 -lpthread

(I did download the amalgamation and copied the sqlite3.h and sqlite3.o files into my cwd after building.) I apologize if the C code is not as clean as it could be. I'm primarily a Python programmer, but I figured I'd be more likely to get help with a C test case, so I did my best to hack this together. The Python GIL was confusing the situation in any case.

A little background on what I am doing: I have several large datasets that I wish to serve up to customers to generate custom reports based on unique slices of the data. I am using a cluster of machines with 0.5 TB of memory each, so loading all the data into memory is reasonable in my case. Against my production workload I get massive speedups in single-threaded tests with the in-memory database relative to the disk version. In fact, the single-threaded in-memory SQLite tests are faster than all the other database solutions I've looked at, so I am very excited about using SQLite; nevertheless, I really need to scale to many cores. Also, my workload is highly random, so a cache is not much help; I really want the data in memory. Any help is greatly appreciated.
I have started experimenting with memory-mapped I/O; however, I have not had much luck so far.
Use LMDB in SQLightning. Since LMDB reads acquire no locks, they scale perfectly linearly across arbitrarily many CPUs. No other DB engine will do what you're looking for. Everything else based on locking will bottleneck as soon as you extend beyond a single CPU socket.
I would not mind creating a fork of SQLite on GitHub and hacking the code if someone could give me pointers on what needs to be modified to get this working. Certainly, if there is an extra flag or URI parameter I need to use to get concurrent in-memory read access, that would be great, but I'm willing to try to modify the source code and share it with the community if I can figure out how to get this going.
--
Howard Chu
CTO, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc/
Chief Architect, OpenLDAP http://www.openldap.org/project/

