----- Original Message ----
> From: Christian Smith <[EMAIL PROTECTED]>

> But SQLite depends on the OS caching abilities for much of it's
> performance. Removing it is like saying SQLite is rubbish on Intel
> processors after testing on a i486.

yes and no: while it's nice to be able to rely somewhat on the OS cache, the OS 
cannot guess what the usage pattern of the app is going to be.
Most of the time the OS will guess wrong, too: a typical case is random access 
of a file, where the OS does readahead that is not beneficial (and can 
actually slow things down).
I always try to give the best hints possible to the underlying subsystem (just 
as I never rely 100% on the optimizer to know what to do with a bunch of code, 
because it won't).
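To illustrate the kind of hint I mean, here is a minimal sketch (not SQLite code; the temp file is just a stand-in for the database file) that uses posix_fadvise to tell the kernel a file will be read randomly, so it can skip readahead. It assumes a POSIX platform:

```python
import os
import tempfile

# A minimal sketch: advise the kernel that this file will be read
# randomly, so it can skip its default readahead. The temp file is a
# stand-in for the real database file.
fd0, path = tempfile.mkstemp()
os.close(fd0)
fd = os.open(path, os.O_RDONLY)
try:
    # offset=0, length=0 means "apply the advice to the whole file"
    os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_RANDOM)
finally:
    os.close(fd)
    os.unlink(path)
```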

When dealing with big databases (that typically don't fit in memory), the OS 
will most likely cache the wrong kind of data, whereas the lib could cache 
important information such as some key offsets in the file (the cache would 
contain some elements from the btree so that lookups can resume directly from there).

> The SQLite cache is very restricted. It only caches data over a single
> transaction, discarding it on COMMIT. This is because another process may
> update the file between transactions.
Oh, I didn't know the db cache was that short-lived!


> The file format has a file generation number, which could be used to delay
> the discarding of the SQLite cache between transactions, but I believe it
> is as yet unimplemented.

Oh, I see: the cache problem is probably related to multiple-process access.
In my case I have a single multithreaded process using its own locks to access 
the db (that way I never get a busy database, either).

> >I was very excited about this result, so I tried async access on my
> >actual app (that holds 8M+ records), but I was disapointed to see that
> >async actually even slow things down?!
> By much? This surprises me. Perhaps there is another bottleneck in the app
> that isn't covered by this test case?

Well, this test focuses on accessing one database, while my app accesses several.
It was not much slower, maybe a 10% penalty, but I was expecting a dramatic 
gain in performance :)

On the other hand, I tried to make better use of the cache: if I run my 1M 
inserts in 10 transactions of 100,000, things get a bit slower than with 100 
transactions of 10,000 inserts.
I tried one transaction of 1,000,000 inserts and the test app has been hanging 
at 100% cpu for over 30 minutes now; I'm not sure what is going on there.
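For reference, the experiment above can be sketched like this (table and column names are made up, and the :memory: database just keeps the sketch self-contained):

```python
import sqlite3

# Sketch of the batching experiment: N inserts split into transactions
# of a configurable size. Table/column names are illustrative only.
def insert_batched(conn, rows, batch_size):
    for i in range(0, len(rows), batch_size):
        conn.execute("BEGIN")
        conn.executemany("INSERT INTO t(k, v) VALUES (?, ?)",
                         rows[i:i + batch_size])
        conn.execute("COMMIT")

# isolation_level=None: we manage BEGIN/COMMIT ourselves
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE t(k INTEGER, v TEXT)")
insert_batched(conn, [(i, "x") for i in range(1000)], batch_size=100)
print(conn.execute("SELECT COUNT(*) FROM t").fetchone()[0])  # prints 1000
```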

> >Oh also, a way to improve performance:
> >reads can be reordered, so if sqlite was doing async reads of several
> >places in the db file at the same time, the OS/drive would reorder things
> >the same way they are reordered when writing for a good boost in
> >performance.

> This would add complication to SQLite. We depend on the OS cache to shield
> us from having to do async IO. It's difficult to share SIGIO, so an app
> that also did async IO would have problems with SQLite. In general, read
> performance is not a problem in SQLite. The biggest bottleneck is
> transaction COMMIT, due to the synchronous nature of the operation.
On unix/linux systems, SIGIO is only one way of doing async I/O (I never use it 
because it doesn't play well with threads).
Using poll or select will not interfere with other parts of the application.
I do believe that batching requests can dramatically improve performance:
a typical join reads 2 tables on disk, causing a lot of seeks.
Read performance is important when dealing with files that are large relative 
to RAM size; I believe quite a few people use sqlite with multi-GB files, so 
I don't think this is a stupid thing to look into.
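To show what I mean about select() not interfering: it only watches the descriptors you pass it, unlike SIGIO, which is delivered process-wide. A tiny sketch, demonstrated on a socket pair since regular file reads on Unix always report ready:

```python
import select
import socket

# select() only watches the descriptors passed to it, so it cannot
# clash with other parts of the application the way SIGIO can.
# (Demonstrated on a socket pair; plain files always report ready.)
a, b = socket.socketpair()
b.send(b"hello")
readable, _, _ = select.select([a], [], [], 1.0)  # block at most 1s
data = a.recv(5) if a in readable else b""
print(data)  # prints b'hello'
a.close()
b.close()
```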

> General observations/questions:
> - What did you expect from SQLite? Can you give indication of performance
>   expectations?
> - What did you use previously? How does SQLite compare on performance?
I was expecting much better performance than what I am getting right now.
Before, I was using the filesystem as a DB (basically, folders + filenames -> 
data).
Notably, reiserfs performs pretty well for this kind of thing (millions of 
records).
But that was wasting a lot of disk space and hitting all sorts of limits in the 
various OSes.

I tried BDB a few years ago as a replacement, but "wedged" databases and 
licensing issues kept us away from it.
Then I read about sqlite delivering on the order of 25,000 inserts/second, so 
I thought I could probably get around 10,000 inserts/second on bigger 
databases.

Right now, sqlite shows performance on par with a simple filesystem 
structure (except that it's much faster to back up, because traversing a 
multimillion-file structure takes several hours).
I was expecting a few things from moving to sqlite:
* a better ramp-up (cold cache) performance than a dumb filesystem 
structure;
* a [much] higher throughput (so that I can consolidate multiple servers 
into one), since the records are small and there is no fragmentation of the 
folders here;
* consistent latencies (filesystems tend to vary a lot).

> - Have you tested other databases? What sort of performance did you get
>   from those?
I am in the process of setting up MySQL with our app; I will keep you posted on 
the results.

> - If you're not batching, or can't batch updates, then SQLite may very
>   well not be performant enough due to inherent restrictions of it's
>   architecture.
You are not the first person to mention batching: what does that mean exactly? 
I have been using inserts inside transactions; isn't that it?
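Here is what I understand batching to be, as a sketch: many inserts per COMMIT instead of one COMMIT per insert. The schema is made up, and the timings will vary by machine and disk:

```python
import os
import sqlite3
import tempfile
import time

# Compare one COMMIT per insert against one COMMIT for the whole batch.
# Schema and row counts are illustrative only.
def timed_inserts(n, batched):
    fd, path = tempfile.mkstemp()
    os.close(fd)
    conn = sqlite3.connect(path, isolation_level=None)
    conn.execute("CREATE TABLE t(k INTEGER)")
    t0 = time.perf_counter()
    if batched:
        conn.execute("BEGIN")
    for i in range(n):
        if not batched:
            conn.execute("BEGIN")
        conn.execute("INSERT INTO t VALUES (?)", (i,))
        if not batched:
            conn.execute("COMMIT")  # one disk sync per row
    if batched:
        conn.execute("COMMIT")      # one disk sync for all rows
    elapsed = time.perf_counter() - t0
    conn.close()
    os.unlink(path)
    return elapsed

unbatched = timed_inserts(500, batched=False)
batched = timed_inserts(500, batched=True)
print(f"unbatched: {unbatched:.3f}s, batched: {batched:.3f}s")
```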

> - An idea of what your app does in general would be of assistance in
>   helping you find your bottleneck. A specific test case is good if this
>   is indeed the issue, but it doesn't sound like it is.
I am focusing right now on the time it takes to run a simple 
import-from-foreign-format kind of task:
one thread only, doing simple selects (retrieve user profiles) and replaces 
(to write back the same profiles).
To accommodate a lot of users, I have one main DB that maps all users -> 
unique ID.
The other dbs are really one partitioned db, so that later threads conflict only 
when accessing users within the same range and the dbs don't grow out of 
proportion (because blobs are used, I thought a very large number of records 
could become a performance bottleneck).
If I precache the big name -> uid table, the import of 650,000 records takes 
about 2 minutes.
If I don't precache it, it takes about 8 minutes.
I start/stop transactions every X seconds (right now, I have it set to 10 
seconds).
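The time-based transaction boundary can be sketched like this (the schema is a stand-in for the real profile tables, and I shrink the interval so the sketch finishes quickly):

```python
import sqlite3
import time

# Sketch of time-based batching: commit the open transaction whenever
# more than `interval` seconds have elapsed since the last commit.
# (Schema and data are stand-ins for the real profile records.)
def import_records(conn, records, interval=10.0):
    conn.execute("BEGIN")
    last_commit = time.monotonic()
    for uid, blob in records:
        conn.execute("REPLACE INTO profiles(uid, data) VALUES (?, ?)",
                     (uid, blob))
        now = time.monotonic()
        if now - last_commit >= interval:
            conn.execute("COMMIT")
            conn.execute("BEGIN")
            last_commit = now
    conn.execute("COMMIT")

conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE profiles(uid INTEGER PRIMARY KEY, data BLOB)")
import_records(conn, [(i, b"p") for i in range(1000)], interval=0.01)
print(conn.execute("SELECT COUNT(*) FROM profiles").fetchone()[0])  # prints 1000
```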

Cheers

Nicolas
