Hi. I have a database with a table having a text column that averages 7.1 characters per row; there are 11 million rows in this table. In one version of the source file that is used to populate this table, the file is laid out such that the column data is populated in sorted order; in another version it is not sorted. In the sorted version, creating the index takes about a minute; in the unsorted version it takes over 8 hours (I killed it before it finished). I've tried this with a couple different versions of SQLite (3.6.1 and 3.6.7) and with various PRAGMA options (cache_size up to 1M, synchronous OFF, etc). I've tried this on linux 2.6.24 and Darwin 9.6 (MacOS X 10.5.6), and Windows XP with the same result. Interestingly, on Vista, it's much faster, perhaps 10 minutes to create the index. I suspect this is because of it's superior disk caching. By increasing the cache_size to 1M, I can get it to index in about 2.5 minutes, but the process consumes about 600 MB of RAM, and decreasing the cache_size after doesn't cause sqlite to release all of the memory it has used.
This TODO item is of course interesting: * Develop a new sort implementation that does much less disk seeking. Use to improve indexing performance on large tables. Is there any thought as to exactly how/when this might happen? Any other ideas? Thanks a bunch! -c Accessions is the table of interest: CREATE TABLE AccessionToGoID (accession INT, goid INT); CREATE TABLE Accessions (id INT IDENTITY, accession VARCHAR(30) NOT NULL, taxonomy INT); CREATE TABLE GoID (id INT IDENTITY, qualifier VARCHAR(20), goid INT, evidence TINYINT, aspect CHAR(1), source INT); CREATE TABLE Location (url TEXT); CREATE TABLE Sources (id INT, name VARCHAR(20)); CREATE INDEX ACESSIONS_STRING ON ACCESSIONS("accession"); CREATE INDEX ACESSIONS_TO_GOID ON ACCESSIONTOGOID("accession"); CREATE INDEX GOID_ID ON GOID("id"); Page size in bytes.................... 1024 Pages in the whole file (measured).... 2201999 Pages in the whole file (calculated).. 2201998 Pages that store data................. 2201998 100.000% Pages on the freelist (per header).... 0 0.0% Pages on the freelist (calculated).... 1 0.0% Pages of auto-vacuum overhead......... 0 0.0% Number of tables in the database...... 6 Number of indices..................... 0 Number of named indices............... 0 Automatically generated indices....... 0 Size of the file in bytes............. 2254846976 Bytes of user payload stored.......... 1406889293 62.4% *** Page counts for all tables with their indices ******************** ACCESSIONTOGOID....................... 1159789 52.7% GOID.................................. 775451 35.2% ACCESSIONS............................ 266755 12.1% LOCATION.............................. 1 0.0% SOURCES............................... 1 0.0% SQLITE_MASTER......................... 1 0.0% *** All tables ******************************************************* Percentage of total database.......... 100.000% Number of entries..................... 113096707 Bytes of storage consumed............. -2040121344 Bytes of payload...................... 1406889777 -69.0% Average payload per entry............. 12.44 Average unused bytes per entry........ 0.21 Average fanout........................ 90.00 Fragmentation......................... 7.8% Maximum payload per entry............. 135 Entries that use overflow............. 0 0.0% Index pages used...................... 24343 Primary pages used.................... 2177655 Overflow pages used................... 0 Total pages used...................... 2201998 Unused bytes on index pages........... 2982024 12.0% Unused bytes on primary pages......... 20271723 -1.0% Unused bytes on overflow pages........ 0 Unused bytes on all pages............. 23253747 -1.1% *** Table ACCESSIONTOGOID ******************************************** Percentage of total database.......... 52.7% Number of entries..................... 68037552 Bytes of storage consumed............. 1187623936 Bytes of payload...................... 678849900 57.2% Average payload per entry............. 9.98 Average unused bytes per entry........ 0.20 Average fanout........................ 90.00 Fragmentation......................... 6.2% Maximum payload per entry............. 11 Entries that use overflow............. 0 0.0% Index pages used...................... 12856 Primary pages used.................... 1146933 Overflow pages used................... 0 Total pages used...................... 1159789 Unused bytes on index pages........... 1572145 11.9% Unused bytes on primary pages......... 12284825 1.0% Unused bytes on overflow pages........ 0 Unused bytes on all pages............. 13856970 1.2% *** Table ACCESSIONS ************************************************* Percentage of total database.......... 12.1% Number of entries..................... 11040350 Bytes of storage consumed............. 273157120 Bytes of payload...................... 189809909 69.5% Average payload per entry............. 17.19 Average unused bytes per entry........ 0.31 Average fanout........................ 91.00 Fragmentation......................... 15.7% Maximum payload per entry............. 41 Entries that use overflow............. 0 0.0% Index pages used...................... 2916 Primary pages used.................... 263839 Overflow pages used................... 0 Total pages used...................... 266755 Unused bytes on index pages........... 360811 12.1% Unused bytes on primary pages......... 3081726 1.1% Unused bytes on overflow pages........ 0 Unused bytes on all pages............. 3442537 1.3% *** Table GOID ******************************************************* Percentage of total database.......... 35.2% Number of entries..................... 34018776 Bytes of storage consumed............. 794061824 Bytes of payload...................... 538229202 67.8% Average payload per entry............. 15.82 Average unused bytes per entry........ 0.17 Average fanout........................ 90.00 Fragmentation......................... 7.6% Maximum payload per entry............. 34 Entries that use overflow............. 0 0.0% Index pages used...................... 8571 Primary pages used.................... 766880 Overflow pages used................... 0 Total pages used...................... 775451 Unused bytes on index pages........... 1049068 12.0% Unused bytes on primary pages......... 4903107 0.62% Unused bytes on overflow pages........ 0 Unused bytes on all pages............. 5952175 0.75% -- Christopher Mason Proteome Software (503) 244-6027 _______________________________________________ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users