Can someone explain to me why, after populating a new table in a new database file, with no indexes, that the "overhead" is around 50%?
That's the question.. to understand my justification for the figure, I've explained it below. Not a complaint, just would like to understand why and how I may possibly help the situation. Thanks for any help! // CHRIS .. .. **************** * * * BACKGROUND * * * **************** I've written a program called Power-Grab that downloads binary files from usenet newsgroups. I keep all my data in basically flat files of custom-serialized objects. For the most part, it's simple and suffices. One problem I face is the sorting of headers (by Subject:) after I scan a newsgroup. I wrote my own implementation of QuickSort that sorts objects that actually point to records on disk; so it's basically a "disk-based" QuickSort, with some very simple caching to help performance. I've always wanted to find a good relatively fast database engine that I could embed into my program without a lot of overhead or cost. Offerings I found were either free and not useful or useful and expensive (eg, CodeBase.) Looks like SQLite is exactly what I've been waiting for (over 4 years now!) I just started playing around with it, and I'm trying to determine if I can really use it (performance-wise.) For Reference: System: Windows 2000 Pro (With SP4), P4 @ 2.4 GHz, 1GB RAM Language: Visual C++ 7.0 SQLite: Version 2.8.13, using sqlite.dll from www.sqlite.org **************** * * * THE PROBLEM * * * **************** I have a data file that represents the real data from one of the larger newsgroups: Filename: alt.binaries.cd.image.xbox.XOV Size: 376,878,446 Records: 1,932,225 The file is basically the raw output from an "XOVER" command to a news server. It consists of records like this: { int FLAGS, int ID, int LINES, int BYTES, char[] SUBJECT, char[] SENDER, char[] MSGID } The goal is relatively simple: (1) Create a database file (2) Dump all records to the database file (3) Create an index **************** * * * FIRST TRY * * * **************** < Created Database File "XOVER.DB" > CREATE TABLE XOVER (ID INT, FLAGS INT, LINES INT, BYTES INT, SUBJECT TEXT, SENDER TEXT, MSGID TEXT); PRAGMA synchronous = OFF; (repeat 1,932,225 times) INSERT INTO XOVER (ID, FLAGS, LINES, BYTES, SUBJECT, SENDER, MSGID) VALUES (%d,%d,%d,%d,'%s','%s','%s'); CREATE INDEX IDX_SUBJ ON XOVER (SUBJECT); **************** * * * RESULTS * * * **************** First, reading the XOV data file and doing all the formatting but NOT calling SQLite, it took between 48 and 51 seconds (I tested 3 times.) So we can safely subtract that from the run-time to get an estimate of how long it takes SQLite to populate the table. INSERT 1,932,225 RECORDS TIME: 3 Hours, 39 Minutes, 24 Seconds (13,164 Sec is about 146.78 INSERT/sec) CREATE INDEX IDX_SUBJ ON XOVER (SUBJECT) TIME: 10 Minutes, 24 Seconds SELECT * FROM XOVER; (using Callback function that only does "++nRecords;") TIME: 4 Minutes, 31 Seconds The SQLite database file was a bit larget than I expected, after all said an done: 728,733,696 XOVER.DB (Before Index Built) 980,514,816 XOVER.DB (After Index Built) Compared to the original file: 376,878,446 alt.binaries.cd.image.xbox.XOV Not excited about the time to create the table; about 3.5 hours to add about 2 million records. **************** * * * SECOND TRY * * * **************** Same as the FIRST TRY, but wrapped in a transaction: < Created Database File "XOVER.DB" > CREATE TABLE XOVER (ID INT, FLAGS INT, LINES INT, BYTES INT, SUBJECT TEXT, SENDER TEXT, MSGID TEXT); PRAGMA synchronous = OFF; BEGIN; (repeat 1,932,225 times) INSERT INTO XOVER (ID, FLAGS, LINES, BYTES, SUBJECT, SENDER, MSGID) VALUES (%d,%d,%d,%d,'%s','%s','%s'); COMMIT; BEGIN; CREATE INDEX IDX_SUBJ ON XOVER (SUBJECT); COMMIT; **************** * * * RESULTS * * * **************** Great- Much better performance! INSERT 1,932,225 RECORDS TIME: 3 Minutes, 49 Seconds (229 Sec ==> 8,437 INSERT/sec) CREATE INDEX IDX_SUBJ ON XOVER (SUBJECT) TIME: 7 Minutes, 14 Seconds SELECT * FROM XOVER; (using Callback function that only does "++nRecords;") TIME: 2 Minutes, 36 Seconds **************** * * * CONCLUSION * * * **************** As I said, the SQLite database file was a bit larger than I expected, can anyone shed some light on this? The original file: 376,878,446 alt.binaries.cd.image.xbox.XOV Is composed mainly of text. I realize that SQLite stores everything as text (eg, numbers are in ascii text format). But I didn't expect such a large increase in size. Some rough calculations.. The original (XOV) file consists of 1,932,225 records of the following format: { DWORD dwFlags ; Flags DWORD dwID ; News Server Generated Message Number DWORD dwLines ; Number of Lines In Article DWORD dwBytes ; Number of Bytes in Article DWORD dwSL_Subj (=J) ; String Length Of Subject (No NULL Termination) DWORD dwSL_Sender (=K) ; String Length Of Sender (No NULL Termination) DWORD dwSL_MSGID (=L) ; String Length Of Message-ID (No NULL Termination) CHAR[J] szSubject ; Article Subject CHAR[K] szSender ; Sender Name CHAR[L] szMSGID ; Value of "Message-ID:" Header } The file starts with two DWORDS, which give the MIN(dwID) and MAX(dwID) of all the records in the file. With a file size of 376,878,446, we take away 8 bytes for the file header and have 376,878,438 bytes of record data. There are 1,932,225 records so, counting the 7 DWORDS in each record, there are 28 bytes of binary DWORD data per record. That's 1,932,225 * 28 is 54,102,300 bytes of DWORD data... and the rest, 322,776,138 bytes, is TEXT. So from the SQLite XOVER.DB file, WITHOUT the index built, we have a file size of 728,733,696 bytes. Take away the known TEXT data of 322,776,138 and we have 405,957,558 bytes. I am only inserting the first four DWORD values into the SQLite table, so expressed as text, it uses up to 10 characters each - or 40 characters. So that's 40 characters times 1,932,225 records and we get at most 77,289,000 bytes of "numeric" text. So the original SQLite DB file size of 728,733,696 minutes the known data sizes (322,776,138 bytes for text and at most 77,289,000 bytes for numeric values as text) leaves leaves 328,668,558 bytes. I'm not exactly sure what all that space is used for; I assume B-Tree data/nodes and the like. It does seem like quite a bit, though -- of the total file size, the data takes at most ~55% and the SQLite meta-data takes at least ~45% (in my case, at least.) Over 45% overhead? Actually, I would say 50% or more.. I know that most of the DWORD values are not large.. dwFlags is always 0, 1 or 3. dwLines is almost guaranteed to be 5 digits or less, and bytes is almost guaranteed to be 7 digits or less. The ID is usually 6 to 8 digits. // // THE END! // --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]