I am developing a multithreaded C++ library which uses SQLite as an embedded database alongside the MySQL client library, behind a simple SQL API. Both databases share a common interface which supports statements, prepared statements, recordsets, records, and transactions.
As part of some research and testing I have been playing around with creating a small filesystem-based search engine index. Each document within the index is stored as a file in a folder tree, but the contents of the documents are catalogued in two SQLite databases at the root of the tree. As each document is added to the index, either or both of the SQLite databases are updated. The goal is to index on the order of 10 to 100 documents a second, which would allow indexing of around 8 million documents per day on a single machine. As such I have been investigating how to get the most performance out of SQLite.

I began by wrapping the entire indexing process in a transaction. This immediately produced a huge performance gain, but the process was still entirely synchronous, indexing only one document at a time. The next logical step was to add multiple threads to make better use of the operating system and interleave the various indexing tasks. At this point I ran into the limitations which SQLite places on threading: after several crashes, and error messages concerning locks and routines called out of sequence, I modified the API to ensure that each thread was given its own sqlite3 * handle. However, in doing this I was penalised by no longer being able to take advantage of the application-wide transaction (or at least that was my understanding), so I then had to increase the granularity of the transactions to fit within the normal thread synchronisation blocks. This decreased performance to such a degree that it counteracted any gains from using multiple threads.

I also attempted to improve performance by using multi-row inserts, but discovered that SQLite does not support the same multiple sets of values that MySQL does: "insert into table(a,b) values (1,2),(3,4),(5,6)".
I then tried to use prepared statements, but suffered several scary-looking crashes deep within the SQLite code responsible for closing file handles, which I have yet to debug properly. I then discovered the pragma directives such as "PRAGMA synchronous=OFF" and tried those, but did not get the performance gain I was looking for either. My feeling is that the major gains I am missing would come from having an application-wide transaction across multiple threads, and from being able to execute multi-row inserts using the VALUES syntax or something similar.

Having read the information on file locking and concurrency (http://www.sqlite.org/lockingv3.html) and browsed the wikis and various other online forums, I have some questions regarding the above situation. The first question is why database locking has been enforced on a per-thread basis instead of per-process, so that the normal methods of thread synchronisation (mutexes, critical sections) cannot be used for maintaining consistency within the application. The programmer can easily ensure that no two statements execute at the same time, just as they would ensure that no more than one operation on a linked list is performed at the same time. I read somewhere that there is a technical reason for this behaviour in SQLite, such as the fcntl() function taking out per-thread locks, but I don't understand why file locking is used at all. Given that all platforms, and indeed file systems, have different locking semantics, would it not be easier either to use a .lock file (in a similar way to the -journal file) if you really needed to support concurrent access from other machines, or to use a much faster IPC mechanism like shared memory and assume that all access will be from multiple processes on the same machine (a reasonable assumption for an embedded database)?
And the second question is simply: how hard would it be to support the multi-row insert syntax discussed above, or is there nothing to be gained because the normal insert statement is already fast enough? Surely not... Also, are there any other optimisation techniques you can suggest which I might be missing?

Apologies for the length; hopefully I'm not covering well-trodden ground here, although I suspect I am.

Emerson