Matthew O'Keefe wrote:
> We wanted to post to the mailing list to see if there are any obvious,
> first-order things we can try to improve performance for such a large
> table.

Generally speaking, the problem with slow inserts is cache misses.
Imagine that each new insert into the index lands at a random position. After
some time the cache hit ratio drops close to zero, which means that on an HDD
you spend ~10ms on each B-tree page read. Even if you have no flushing at all,
that limits you to roughly 15 search/insert operations per second (assuming a
tree ~6 levels deep: 6 reads x 10ms = 60ms per operation). The more data you
have, the closer you get to that limit.

SQLite's page cache has several inefficiencies that come from its simplicity,
and splitting the data across several tables won't help. Splitting it across
several files just spreads the cache between connections. A more efficient
cache, better defragmentation, or a different data layout could help, but not
radically: that only moves the wall, raising the floor from, say, 15 ops/s to
50 ops/s.

So what you should do instead:
1. Make the SQLite cache as large as possible.
2. Compact your data as much as possible - one option is to convert the string
key to an int64 hash value, which compacts the data radically (see the sketch
after this list).
3. Improve SQLite's caching and send a patch to D. Richard Hipp :)
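
For point 2, here is a minimal sketch of registering such a hash as a SQL
function named CustomHashFunction, so the SAMPLE below works as written. The
choice of FNV-1a and the name registerCustomHash are my assumptions, not
something prescribed by SQLite:

/* Register a 64-bit FNV-1a hash as CustomHashFunction(text).
** FNV-1a is only one possible choice of 64-bit hash. */
#include <sqlite3.h>

static void customHashFunc(sqlite3_context *ctx, int argc, sqlite3_value **argv){
  const unsigned char *z = sqlite3_value_text(argv[0]);
  int n = sqlite3_value_bytes(argv[0]);
  sqlite3_uint64 h = 14695981039346656037ULL;   /* FNV-1a offset basis */
  int i;
  (void)argc;
  if( z==0 ){ sqlite3_result_null(ctx); return; }
  for(i=0; i<n; i++){
    h ^= z[i];
    h *= 1099511628211ULL;                      /* FNV-1a 64-bit prime */
  }
  /* SQLite integers are signed 64-bit, so just reinterpret the bits. */
  sqlite3_result_int64(ctx, (sqlite3_int64)h);
}

/* Call once per connection, right after sqlite3_open(). */
int registerCustomHash(sqlite3 *db){
  return sqlite3_create_function(db, "CustomHashFunction", 1, SQLITE_UTF8,
                                 0, customHashFunc, 0, 0);
}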

If the cache is larger than the DB, almost every insert will be very fast.
But with your requirement of 1B records you will end up with something like
10-20 GB of SQLite database. I don't know whether SQLite allows buffers that
large on 64-bit builds, and I don't know whether you have that much RAM
either.
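
Note that cache_size is counted in pages, not bytes, so you have to divide the
RAM you are willing to dedicate by the page size. A rough sketch (the RAM
budget and page size you pass in are your call; page_size only takes effect on
a fresh database or after VACUUM):

#include <sqlite3.h>
#include <stdio.h>

/* Set page_size and size the page cache from a RAM budget in bytes. */
int setLargeCache(sqlite3 *db, sqlite3_int64 ramBytes, int pageSize){
  char sql[128];
  int rc;

  snprintf(sql, sizeof(sql), "PRAGMA page_size = %d;", pageSize);
  rc = sqlite3_exec(db, sql, 0, 0, 0);   /* only effective before data exists */
  if( rc != SQLITE_OK ) return rc;

  snprintf(sql, sizeof(sql), "PRAGMA cache_size = %lld;",
           (long long)(ramBytes / pageSize));
  return sqlite3_exec(db, sql, 0, 0, 0); /* cache_size is a page count */
}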

SOLUTION PROPOSAL

The first proposal is to feed the data sorted by key. That will always be fast.

Another approach is delayed inserts. If your inserts come in peaks, create a
similar, empty staging table and place new rows there. Queries then UNION the
two tables, and after a while, once you have, say, 10000 rows in the staging
table, you insert them into the main table all at once, sorted by key, and
empty the staging table again. That will be MUCH faster.


SAMPLE:

PRAGMA cache_size = <maximum available memory, in pages (bytes / page_size)>

CREATE TABLE chunks (nameHash integer primary key, pid integer not null);
CREATE TABLE chunksUpdates (nameHash integer primary key, pid integer not null);

1. Insert with
INSERT INTO chunksUpdates VALUES(
  CustomHashFunction('some 88 bytes long key for whatever reason I need it'),
  34234234);

2. Select with
SELECT * FROM chunks
 WHERE nameHash = CustomHashFunction('some 88 bytes long key for whatever reason I need it')
UNION
SELECT * FROM chunksUpdates
 WHERE nameHash = CustomHashFunction('some 88 bytes long key for whatever reason I need it')
LIMIT 1;

3. From time to time, when chunksUpdates has grown to something like 10000
rows, do the following (an application-side sketch follows the SQL):

BEGIN EXCLUSIVE;
INSERT INTO chunks SELECT * FROM chunksUpdates ORDER BY nameHash;
DELETE FROM chunksUpdates;
END;
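
Here is how that flush might look driven from C application code; the
10000-row threshold is just the example figure from above, and the function
names are mine:

#include <sqlite3.h>

static int countUpdates(sqlite3 *db){
  sqlite3_stmt *stmt = 0;
  int n = -1;
  if( sqlite3_prepare_v2(db, "SELECT count(*) FROM chunksUpdates", -1,
                         &stmt, 0) == SQLITE_OK
      && sqlite3_step(stmt) == SQLITE_ROW ){
    n = sqlite3_column_int(stmt, 0);
  }
  sqlite3_finalize(stmt);
  return n;
}

/* Merge the staging table into chunks once it is big enough. */
int maybeFlushUpdates(sqlite3 *db){
  if( countUpdates(db) < 10000 ) return SQLITE_OK;
  return sqlite3_exec(db,
      "BEGIN EXCLUSIVE;"
      "INSERT INTO chunks SELECT * FROM chunksUpdates ORDER BY nameHash;"
      "DELETE FROM chunksUpdates;"
      "END;",
      0, 0, 0);
}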

Updates and deletes are a different story, but the same principles apply...

A custom hash function is really important here, and it really should be
64-bit; with anything smaller you are practically guaranteed duplicates in a
collection of 1B items. (Even 64 bits is not collision-proof at that scale,
so be prepared to detect the rare clash.)
