Hi.

I have a database with a table having a text column that averages 7.1 
characters per row; there are 11 million rows in this table.  In one 
version of the source file that is used to populate this table, the file 
is laid out such that the column data is populated in sorted order; in 
another version it is not sorted.  In the sorted version, creating the 
index takes about a minute; in the unsorted version it takes over 8 
hours (I killed it before it finished).  I've tried this with a couple 
different versions of SQLite (3.6.1 and 3.6.7) and with various PRAGMA 
options (cache_size up to 1M, synchronous OFF, etc).  I've tried this on 
linux 2.6.24 and Darwin 9.6 (MacOS X 10.5.6), and Windows XP with the 
same result.  Interestingly, on Vista, it's much faster, perhaps 10 
minutes to create the index.  I suspect this is because of it's superior 
disk caching.  By increasing the cache_size to 1M, I can get it to index 
in about 2.5 minutes, but the process consumes about 600 MB of RAM, and 
decreasing the cache_size after doesn't cause sqlite to release all of 
the memory it has used.

This TODO item is of course interesting:

  * Develop a new sort implementation that does much less disk seeking. 
Use to improve indexing performance on large tables.


Is there any thought as to exactly how/when this might happen?

Any other ideas?

Thanks a bunch!

-c



Accessions is the table of interest:

CREATE TABLE AccessionToGoID (accession INT, goid INT);
CREATE TABLE Accessions (id INT IDENTITY, accession VARCHAR(30) NOT 
NULL, taxonomy INT);
CREATE TABLE GoID (id INT IDENTITY, qualifier VARCHAR(20), goid INT, 
evidence TINYINT, aspect CHAR(1), source INT);
CREATE TABLE Location (url TEXT);
CREATE TABLE Sources (id INT, name VARCHAR(20));
CREATE INDEX ACESSIONS_STRING ON ACCESSIONS("accession");
CREATE INDEX ACESSIONS_TO_GOID ON ACCESSIONTOGOID("accession");
CREATE INDEX GOID_ID ON GOID("id");



Page size in bytes.................... 1024
Pages in the whole file (measured).... 2201999
Pages in the whole file (calculated).. 2201998
Pages that store data................. 2201998    100.000%
Pages on the freelist (per header).... 0            0.0%
Pages on the freelist (calculated).... 1            0.0%
Pages of auto-vacuum overhead......... 0            0.0%
Number of tables in the database...... 6
Number of indices..................... 0
Number of named indices............... 0
Automatically generated indices....... 0
Size of the file in bytes............. 2254846976
Bytes of user payload stored.......... 1406889293  62.4%

*** Page counts for all tables with their indices ********************

ACCESSIONTOGOID....................... 1159789     52.7%
GOID.................................. 775451      35.2%
ACCESSIONS............................ 266755      12.1%
LOCATION.............................. 1            0.0%
SOURCES............................... 1            0.0%
SQLITE_MASTER......................... 1            0.0%

*** All tables *******************************************************

Percentage of total database.......... 100.000%
Number of entries..................... 113096707
Bytes of storage consumed............. -2040121344
Bytes of payload...................... 1406889777 -69.0%
Average payload per entry............. 12.44
Average unused bytes per entry........ 0.21
Average fanout........................ 90.00
Fragmentation.........................   7.8%
Maximum payload per entry............. 135
Entries that use overflow............. 0            0.0%
Index pages used...................... 24343
Primary pages used.................... 2177655
Overflow pages used................... 0
Total pages used...................... 2201998
Unused bytes on index pages........... 2982024     12.0%
Unused bytes on primary pages......... 20271723    -1.0%
Unused bytes on overflow pages........ 0
Unused bytes on all pages............. 23253747    -1.1%

*** Table ACCESSIONTOGOID ********************************************

Percentage of total database..........  52.7%
Number of entries..................... 68037552
Bytes of storage consumed............. 1187623936
Bytes of payload...................... 678849900   57.2%
Average payload per entry............. 9.98
Average unused bytes per entry........ 0.20
Average fanout........................ 90.00
Fragmentation.........................   6.2%
Maximum payload per entry............. 11
Entries that use overflow............. 0            0.0%
Index pages used...................... 12856
Primary pages used.................... 1146933
Overflow pages used................... 0
Total pages used...................... 1159789
Unused bytes on index pages........... 1572145     11.9%
Unused bytes on primary pages......... 12284825     1.0%
Unused bytes on overflow pages........ 0
Unused bytes on all pages............. 13856970     1.2%

*** Table ACCESSIONS *************************************************

Percentage of total database..........  12.1%
Number of entries..................... 11040350
Bytes of storage consumed............. 273157120
Bytes of payload...................... 189809909   69.5%
Average payload per entry............. 17.19
Average unused bytes per entry........ 0.31
Average fanout........................ 91.00
Fragmentation.........................  15.7%
Maximum payload per entry............. 41
Entries that use overflow............. 0            0.0%
Index pages used...................... 2916
Primary pages used.................... 263839
Overflow pages used................... 0
Total pages used...................... 266755
Unused bytes on index pages........... 360811      12.1%
Unused bytes on primary pages......... 3081726      1.1%
Unused bytes on overflow pages........ 0
Unused bytes on all pages............. 3442537      1.3%

*** Table GOID *******************************************************

Percentage of total database..........  35.2%
Number of entries..................... 34018776
Bytes of storage consumed............. 794061824
Bytes of payload...................... 538229202   67.8%
Average payload per entry............. 15.82
Average unused bytes per entry........ 0.17
Average fanout........................ 90.00
Fragmentation.........................   7.6%
Maximum payload per entry............. 34
Entries that use overflow............. 0            0.0%
Index pages used...................... 8571
Primary pages used.................... 766880
Overflow pages used................... 0
Total pages used...................... 775451
Unused bytes on index pages........... 1049068     12.0%
Unused bytes on primary pages......... 4903107      0.62%
Unused bytes on overflow pages........ 0
Unused bytes on all pages............. 5952175      0.75%


-- 
Christopher Mason   Proteome Software    (503) 244-6027
_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

Reply via email to