Hi, I am new to the list and to using sqlite - I am looking at
using sqlite for a two-way lookup table in my application.
I have a home-grown, file-based solution, but was hoping to
find better performance and more functionality by putting
the lookup tables into sqlite.
I am looking for the fastest way possible to create a large
two-way lookup dictionary for an application. While I am
creating the database, I don't need it to be in a consistent
or usable state; I just need it to be that way when I am done.
If something bad happens along the way, that is OK - in that
case, the database can be recreated from scratch. I have been
playing around with options (page size, turning off journaling,
different indexes) that have improved the database creation
phase quite a bit, and am wondering if the group can provide
additional insight:
More Detail
-----------
The tables simply provide a fast lookup to return the left-side
string when presented with its right-side pair, and vice versa.
Both strings (left and right) belong to a group of entries I'll
call 'group'. The groups are all independent of one another.
The database contains two of these cross-reference lookup tables
and a bunch of smaller stuff that helps make sense of things,
but doesn't matter much in terms of size or performance.
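(In case it helps, a lookup in one direction is just something
like the following - a rough, untested sketch against the foo
table defined below; the reverse lookup simply swaps the left
and right columns.)

#include <sqlite3.h>
#include <string>

/* Rough sketch of one lookup direction: given a right-side
   string, return its left-side pair from foo.
   Error handling omitted. */
static std::string lookup_left(sqlite3 *db, const std::string &right)
{
    const char *sql = "SELECT left FROM foo WHERE right = ?1;";
    sqlite3_stmt *stmt = 0;
    std::string left;

    sqlite3_prepare_v2(db, sql, -1, &stmt, 0);
    sqlite3_bind_text(stmt, 1, right.c_str(), -1, SQLITE_TRANSIENT);
    if (sqlite3_step(stmt) == SQLITE_ROW) {
        const unsigned char *txt = sqlite3_column_text(stmt, 0);
        if (txt)
            left = reinterpret_cast<const char *>(txt);
    }
    sqlite3_finalize(stmt);
    return left;
}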
The two dictionary tables are pretty big -- the test case I am
using has one table with 43 million entries and another table
with 7 million entries.
To figure out the performance of this system under sqlite,
I am just putting the two dictionaries into a
sqlite database that looks like this:
First Try
---------
/* Create the tables ("group" is quoted because GROUP is a
   reserved word in SQLite): */
CREATE TABLE foo(left string, right string, "group" int, flags int);
CREATE TABLE bar(left string, right string, "group" int);
/* Create the indices: */
CREATE INDEX left_foo_dict on foo ( left );
CREATE INDEX right_foo_dict on foo ( right );
CREATE INDEX left_bar_dict on bar ( left );
CREATE INDEX right_bar_dict on bar ( right );
/* Import the data: */
.separator ,
.import foo.csv foo
.import bar.csv bar
The strings in the data range in size from about 2 characters
up to about 60, but they are, theoretically, unlimited in size.
My ASCII input CSV files are about 5GB for foo.csv and about
500MB for bar.csv. My old database format requires about 7GB
to store the database and takes about 30-40 minutes to create.
So I ran the above script:
$ time sqlite3 db.sql < script1
real 2h26m47.68s
user 49m59.09s
sys 4m47.75s
The resulting db.sql database file is about 14GB. OK, that is a
starting point: 5-6x the creation time and twice the space of my
original format. On the bright side, opening the sqlite database
is vastly faster than with my old format (instant vs. ~16 min),
and lookups are about the same speed, maybe slightly faster in
some cases.
Second Try
----------
I looked at the docs and found a few interesting pragmas.
Also, since the CSV data is naturally ordered on the group field,
I decided to see if creating a compound key leading with group
would help in any way:
/* First I tried some pragmas that looked promising */
PRAGMA journal_mode = OFF;
PRAGMA legacy_file_format = OFF;
PRAGMA page_size=32768;
/* Create the tables: */
CREATE TABLE foo(left string, right string, "group" int, flags int);
CREATE TABLE bar(left string, right string, "group" int);
/* Next, I tried adding an ascending key on group */
CREATE INDEX left_foo_dict on foo ( "group" asc, left );
CREATE INDEX right_foo_dict on foo ( "group" asc, right );
CREATE INDEX left_bar_dict on bar ( "group" asc, left );
CREATE INDEX right_bar_dict on bar ( "group" asc, right );
/* Import the data: */
.separator ,
.import foo.csv foo
.import bar.csv bar
Slight improvement in size - down to about 13GB now. Run time
improved a lot - down to about 1 hour - though I don't really
know which of the changes was responsible for the improvement;
I'll try them one at a time at some point to see which actually
helped.
Now the Question
----------------
Can anyone suggest additional things I might try to:
1) reduce db creation time
- or -
2) reduce the size of the file?
I could actually live with a file larger than 7GB if I could
beat the 30-40 minute runtime of my original solution by a
reasonable margin. Also, I am assuming that importing a .csv
file with .import will be similar in speed to creating the
database through the C++ interface, using the
sqlite3_prepare/sqlite3_bind/sqlite3_step/sqlite3_reset
calls -- is that a valid assumption?
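For reference, the kind of insert loop I have in mind for the C++
path is roughly the following (an untested sketch with error
handling mostly omitted; wrapping the whole load in one big
transaction is an assumption on my part, not something I have
measured):

#include <sqlite3.h>
#include <cstdio>

/* Untested sketch: load rows into foo with one prepared
   statement, binding/stepping/resetting per row, all inside a
   single transaction since the database does not need to be
   consistent until the load is finished. */
static int load_foo(sqlite3 *db,
                    const char *const *lefts,
                    const char *const *rights,
                    const int *groups,
                    const int *flags,
                    int nrows)
{
    sqlite3_stmt *ins = 0;
    const char *sql =
        "INSERT INTO foo(left, right, \"group\", flags) "
        "VALUES(?1, ?2, ?3, ?4);";

    if (sqlite3_prepare_v2(db, sql, -1, &ins, 0) != SQLITE_OK)
        return -1;

    sqlite3_exec(db, "BEGIN;", 0, 0, 0);
    for (int i = 0; i < nrows; i++) {
        sqlite3_bind_text(ins, 1, lefts[i],  -1, SQLITE_STATIC);
        sqlite3_bind_text(ins, 2, rights[i], -1, SQLITE_STATIC);
        sqlite3_bind_int(ins, 3, groups[i]);
        sqlite3_bind_int(ins, 4, flags[i]);
        if (sqlite3_step(ins) != SQLITE_DONE) {
            std::fprintf(stderr, "insert failed: %s\n",
                         sqlite3_errmsg(db));
            break;
        }
        sqlite3_reset(ins);
    }
    sqlite3_exec(db, "COMMIT;", 0, 0, 0);
    sqlite3_finalize(ins);
    return 0;
}

That is the style of loop I am assuming .import is doing under
the hood.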
Phil