Creation time would be a bit faster if you don't create the indexes until
AFTER importing the data. That way SQLite doesn't have to keep each
index in sorted order on every insert.
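
For example, reordering your first script so the bulk load happens
before the CREATE INDEX statements (an untested sketch, reusing your
table and index names):

/* Create the tables first: */
CREATE TABLE foo(left text, right text, "group" int, flags int);
CREATE TABLE bar(left text, right text, "group" int);

/* Bulk load while the tables carry no indices: */
.separator ,
.import foo.csv foo
.import bar.csv bar

/* Build each index in one pass over the loaded data: */
CREATE INDEX left_foo_dict on foo ( left );
CREATE INDEX right_foo_dict on foo ( right );
CREATE INDEX left_bar_dict on bar( left );
CREATE INDEX right_bar_dict on bar( right );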

-Andrea

-----Original Message-----
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of Brooks, Phil
Sent: Thursday, June 26, 2008 1:58 PM
To: sqlite-users@sqlite.org
Subject: [sqlite] sqlite DB creation performance

Hi,  I am new to the list and to using sqlite - I am looking at
using sqlite for a two-way lookup table for my application.
I have a home-grown, file-based solution, but was hoping to find better
performance and more functionality by putting the lookup tables into
sqlite.
 
I am looking for the fastest way possible to create a large two-way
lookup dictionary for an application.  While I am creating the
database, I don't need it to be in a consistent or usable state; I just
need it to be that way when I am done.  If something bad happens along
the way, that is OK - the database can be recreated from scratch.  I
have been playing around with options (page size, turning off
journaling, different indexes) that have improved the database creation
phase quite a bit, and am wondering if the group can provide additional
insight:
 
More Detail
-----------
The tables simply provide fast lookup to return a string on the
left side when presented with its right-side pair, and vice versa.  Both
strings (left and right) belong to a group of entries I'll call 'group'.
The groups are all independent of one another.  The database contains
two of these cross-reference lookup tables and a bunch of smaller stuff
that helps make sense of things, but doesn't really matter much in terms
of size and performance.
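
In SQL terms, the lookups amount to queries of this form (table and
column names match the schema below; the literal is just a placeholder):

SELECT right FROM foo WHERE left = 'some-string';
SELECT left FROM foo WHERE right = 'some-string';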
 
The two dictionary tables are pretty big -- the test case I am using has
one table with 43 million entries and another table with 7 million
entries.
 
To figure out the performance of this system under sqlite, I am just
putting the two dictionaries into a sqlite database that looks like
this:
 
First Try
---------
/* Create the tables: */
CREATE TABLE foo(left text, right text, "group" int, flags int);
CREATE TABLE bar(left text, right text, "group" int);

/* Create the indices: */
CREATE INDEX left_foo_dict on foo ( left );
CREATE INDEX right_foo_dict on foo ( right );
CREATE INDEX left_bar_dict on bar( left );
CREATE INDEX right_bar_dict on bar( right );
 
/* Import the data: */
.separator ,
.import foo.csv foo
.import bar.csv bar
 
The strings in the data range in size from about 2 characters up to
about 60, but they are, theoretically, unlimited in size.

My ASCII input csv files are about 5GB for foo.csv and about 500MB for
bar.csv.  My old database format requires about 7GB to store the
database and takes about 30-40 minutes to create.
 
So I ran the above script:
 
$ time sqlite3 db.sql < script1

real    2h26m47.68s
user    49m59.09s
sys     4m47.75s

The resulting database file is about 14GB.  OK, that is a starting
point: 5-6x the creation time and twice the space of my original
format.  On the bright side, opening the sqlite database is vastly
faster than with my old format (instant vs. ~16 min) and lookups are
about the same, maybe slightly faster in some cases.
 
Second Try
----------
I looked at the docs and found a few interesting pragmas.  Also, since
the csv data is naturally ordered on the group field, I decided to see
if a composite key would help in any way:
 
/* First I tried some pragmas that looked promising: */
PRAGMA journal_mode = OFF;
PRAGMA legacy_file_format = OFF;
PRAGMA page_size = 32768;
 
/* Create the tables: */
CREATE TABLE foo(left text, right text, "group" int, flags int);
CREATE TABLE bar(left text, right text, "group" int);

/* Next, I tried adding an ascending key on group: */
CREATE INDEX left_foo_dict on foo ( "group" asc, left );
CREATE INDEX right_foo_dict on foo ( "group" asc, right );
CREATE INDEX left_bar_dict on bar( "group" asc, left );
CREATE INDEX right_bar_dict on bar( "group" asc, right );
 
/* Import the data: */
.separator ,
.import foo.csv foo
.import bar.csv bar
 
Slight improvement in size - down to about 13GB now.  Run time improved
a lot - down to about 1 hour - though I don't really know which of the
things I tried was responsible for the improvement.  I'll try them one
at a time at some point to see which actually helped.
 
Now the Question
----------------
Can anyone suggest additional things I might try to: 

    1) reduce db creation time
- or - 
    2) reduce the size of the file?  

I could actually live with a file larger than 7GB if I could beat the
30-40 minute runtime of my original solution by a reasonable margin.
Also, I am assuming that importing a .csv file with .import will be
similar in speed to creating the database through the C++ interface
with the sqlite3_prepare/sqlite3_bind/sqlite3_step/sqlite3_reset
calls -- is that a valid assumption?
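
For reference, here is what I assume the C++ loop would effectively be
doing, expressed at the shell level (hand-written INSERTs inside a
single transaction; a sketch only, the values are placeholders):

BEGIN;
INSERT INTO foo VALUES('l1', 'r1', 1, 0);
INSERT INTO foo VALUES('l2', 'r2', 1, 0);
/* ... one INSERT per csv row ... */
COMMIT;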
 
Phil
_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users