Can someone explain to me why, after populating a new table
in a new database file, with no indexes, that the "overhead"
is around 50%?

That's the question.  To explain my justification for the figure, I've
detailed it below.  Not a complaint; I'd just like to understand why,
and how I might possibly help the situation.

Thanks for any help!
// CHRIS


****************
*              *
* BACKGROUND   *
*              *
****************

I've written a program called Power-Grab that downloads binary
files from usenet newsgroups.  I keep all my data in basically
flat files of custom-serialized objects.  For the most part,
it's simple and suffices.

One problem I face is the sorting of headers (by Subject:) after
I scan a newsgroup.  I wrote my own implementation of QuickSort
that sorts objects that actually point to records on disk; so it's
basically a "disk-based" QuickSort, with some very simple caching
to help performance.

I've always wanted to find a good relatively fast database engine
that I could embed into my program without a lot of overhead or cost.
Offerings I found were either free and not useful, or useful and
expensive (e.g., CodeBase).  It looks like SQLite is exactly what I've
been waiting for (over 4 years now!)

I just started playing around with it, and I'm trying to determine
if I can really use it (performance-wise).

For Reference:

  System:   Windows 2000 Pro (With SP4), P4 @ 2.4 GHz, 1GB RAM
  Language: Visual C++ 7.0
  SQLite:   Version 2.8.13, using sqlite.dll from www.sqlite.org

****************
*              *
* THE PROBLEM  *
*              *
****************

I have a data file that represents the real data from one of the larger
newsgroups:

  Filename: alt.binaries.cd.image.xbox.XOV
  Size: 376,878,446
  Records: 1,932,225

The file is basically the raw output from an "XOVER" command to a news
server.  It consists of records like this:

  { int FLAGS, int ID, int LINES, int BYTES, char[] SUBJECT,
    char[] SENDER, char[] MSGID }

The goal is relatively simple:
  (1) Create a database file
  (2) Dump all records to the database file
  (3) Create an index

****************
*              *
* FIRST TRY    *
*              *
****************

  < Created Database File "XOVER.DB" >

  CREATE TABLE XOVER (ID INT, FLAGS INT, LINES INT, BYTES INT,
                      SUBJECT TEXT, SENDER TEXT, MSGID TEXT);
  PRAGMA synchronous = OFF;

  (repeat 1,932,225 times)
    INSERT INTO XOVER (ID, FLAGS, LINES, BYTES, SUBJECT, SENDER, MSGID)
      VALUES (%d,%d,%d,%d,'%s','%s','%s');

  CREATE INDEX IDX_SUBJ ON XOVER (SUBJECT);


****************
*              *
* RESULTS      *
*              *
****************

  First, reading the XOV data file and doing all the formatting, but NOT
  calling SQLite, took between 48 and 51 seconds (I tested 3 times).  So
  we can safely subtract that from the run-time to get an estimate of how
  long it takes SQLite to populate the table.

  INSERT 1,932,225 RECORDS

    TIME:  3 Hours, 39 Minutes, 24 Seconds
           (13,164 sec is about 146.78 INSERT/sec)

  CREATE INDEX IDX_SUBJ ON XOVER (SUBJECT)

    TIME:  10 Minutes, 24 Seconds

  SELECT * FROM XOVER;  (using a callback function that only does "++nRecords;")

    TIME:  4 Minutes, 31 Seconds

  The SQLite database file was a bit larger than I expected, after all
  was said and done:

 728,733,696 XOVER.DB (Before Index Built)
 980,514,816 XOVER.DB (After Index Built)

  Compared to the original file:

 376,878,446 alt.binaries.cd.image.xbox.XOV

  Not excited about the time to create the table;
  about 3.5 hours to add about 2 million records.


****************
*              *
* SECOND TRY   *
*              *
****************

  Same as the FIRST TRY, but wrapped in a transaction:

  < Created Database File "XOVER.DB" >

  CREATE TABLE XOVER (ID INT, FLAGS INT, LINES INT, BYTES INT,
                      SUBJECT TEXT, SENDER TEXT, MSGID TEXT);
  PRAGMA synchronous = OFF;

  BEGIN;
  (repeat 1,932,225 times)
    INSERT INTO XOVER (ID, FLAGS, LINES, BYTES, SUBJECT, SENDER, MSGID)
      VALUES (%d,%d,%d,%d,'%s','%s','%s');
  COMMIT;

  BEGIN;
  CREATE INDEX IDX_SUBJ ON XOVER (SUBJECT);
  COMMIT;


****************
*              *
* RESULTS      *
*              *
****************

  Great - much better performance!

  INSERT 1,932,225 RECORDS

    TIME:  3 Minutes, 49 Seconds  (229 Sec ==> 8,437 INSERT/sec)

  CREATE INDEX IDX_SUBJ ON XOVER (SUBJECT)

    TIME:  7 Minutes, 14 Seconds

  SELECT * FROM XOVER;  (using a callback function that only does "++nRecords;")

    TIME:  2 Minutes, 36 Seconds


****************
*              *
* CONCLUSION   *
*              *
****************

  As I said, the SQLite database file was a bit larger than I expected;
  can anyone shed some light on this?

  The original file:

 376,878,446 alt.binaries.cd.image.xbox.XOV

  is composed mainly of text.  I realize that SQLite stores everything
  as text (e.g., numbers are stored in ASCII text format).  But I didn't
  expect such a large increase in size.

  Some rough calculations..

  The original (XOV) file consists of 1,932,225 records of the following
  format:

 {
   DWORD   dwFlags          ; Flags
   DWORD   dwID             ; News Server Generated Message Number
   DWORD   dwLines          ; Number of Lines In Article
   DWORD   dwBytes          ; Number of Bytes in Article
   DWORD   dwSL_Subj (=J)   ; String Length Of Subject (No NULL Termination)
   DWORD   dwSL_Sender (=K) ; String Length Of Sender (No NULL Termination)
   DWORD   dwSL_MSGID (=L)  ; String Length Of Message-ID (No NULL Termination)
   CHAR[J] szSubject        ; Article Subject
   CHAR[K] szSender         ; Sender Name
   CHAR[L] szMSGID          ; Value of "Message-ID:" Header
 }

  The file starts with two DWORDs, which give the MIN(dwID) and MAX(dwID)
  of all the records in the file.  With a file size of 376,878,446, we
  take away 8 bytes for the file header and have 376,878,438 bytes of
  record data.  There are 1,932,225 records and, counting the 7 DWORDs in
  each record, there are 28 bytes of binary DWORD data per record.  That's
  1,932,225 * 28 = 54,102,300 bytes of DWORD data... and the rest,
  322,776,138 bytes, is TEXT.

  So from the SQLite XOVER.DB file, WITHOUT the index built, we have a file
  size of 728,733,696 bytes.  Take away the known TEXT data of 322,776,138
  and we have 405,957,558 bytes.  I am only inserting the first four DWORD
  values into the SQLite table, so expressed as text, each uses up to
  10 characters - or 40 characters total.  So that's 40 characters times
  1,932,225 records and we get at most 77,289,000 bytes of "numeric" text.

  So the original SQLite DB file size of 728,733,696 minus the known data
  sizes (322,776,138 bytes for text and at most 77,289,000 bytes for
  numeric values as text) leaves 328,668,558 bytes.  I'm not exactly sure
  what all that space is used for; I assume B-Tree data/nodes and the like.
  It does seem like quite a bit, though -- of the total file size, the
  data takes at most ~55% and the SQLite meta-data takes at least ~45%
  (in my case, at least.)

  Over 45% overhead?  Actually, I would say 50% or more.  I know that
  most of the DWORD values are not large: dwFlags is always 0, 1 or 3,
  dwLines is almost guaranteed to be 5 digits or less, dwBytes is almost
  guaranteed to be 7 digits or less, and the ID is usually 6 to 8 digits.

//
// THE END!
//

