Wanted to just let you all know that I really appreciated the help you offered when I asked this a few months ago.
Since then, I've jettisoned the idea of using multiple threads and moved to using multiple processes instead. I have built out an API that accepts http requests for queries, chunks them into deterministic jobs (which improves the probability of a cache hit), which are combined with the operation that needs to be run on the job and hashed, sent via 0mq to the many dbnode processes that are running, where the dbnode checks the cache for the hash in question, and if it does not exist, then queries the Sqlite databases for the result. From there the subqueries are sent back to the coordinator in the API node and combined into the result set for the client. It's really fast. It wouldn't be great if I had to write data, but as a data slicer against fixed data, it's pretty slick. Just want to thank those who responded and also those who have created and contributed to SQLite as it's a pretty fantastic way to filter through a few terabytes of information - many billions of rows, in mere milliseconds - even more so when parallelized like this. From: Andrew Brown Sent: Friday, March 3, 2017 12:14 PM To: 'sqlite-users@mailinglists.sqlite.org' <sqlite-users@mailinglists.sqlite.org> Subject: Massively multithreaded SQLite queries Hello, Based on my reading of the documentation it seems like SQLite is a great candidate for cases where you have a large number of threads that only need to read a database simultaneously, in our case for fast data lookup, aggregation, etc. I've been able to generate SQL queries that do this, but once we start running them on a large server with a lot of multithreading going on, I find that we spend a lot of time in __raw_spin_lock - perhaps 75%+ of the time (In one case, 87% of 350 seconds x 72 cores was spent in __raw_spin_lock). This is being run on 64 and 72 core machines, and the more cores I run it on, the slower it ends up going. To give a bit more detail, I'm working with dotnet core, have written a custom sqlite wrapper (since the dotnet core one lacks the ability to set connection flags beyond readonly, and doesn't have sqlite_prepare_v2() implemented), and I'm running on linux against a bunch of SQLite files in the 2gb-400gb size range. Individual queries are wicked fast, but once I start spreading the load over all the cores by running simultaneous queries I lose the performance advantage and it actually becomes significantly slower. Have tried setting SQLITE_OPEN_NOMUTEX, SQLITE_OPEN_FULLMUTEX. Tried shared cache, read uncommitted. Tried without shared cache, read uncommitted. Tried WAL. If I write a less efficient query, I spend less time in __raw_spin_lock, but of course then it takes longer for the queries themselves to return. Any tips to handle massively multithreaded side by side chunked queries on the same database? Thank you in advance, Andrew Brown _______________________________________________ sqlite-users mailing list sqlite-users@mailinglists.sqlite.org http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users