Sorry for the delayed response. Was waiting for time to dig into this a little.
On Nov 18, 2007 2:05 AM, Wang Yun <[EMAIL PROTECTED]> wrote:

> I insert rfc txt files into a full text search table; 4119 txt files are
> 188MB in total. After insert, the database file is 443MB.
> The logic is below; it's not the real code.

I've attached a tcl script I used in experimenting with this. I'm doing this on a Redhat 9 box (kernel 2.4.22 with some additional patches), running on a local disk for both the input and the database. The version of rfc I'm using has 4756 files with 252M of data. I compiled tclsqlite3 with these flags:

  -O6 -DNDEBUG=1 -DHAVE_FDATASYNC=1 -DSQLITE_CORE=1
  -DSQLITE_ENABLE_FTS1=1 -DSQLITE_ENABLE_BROKEN_FTS1=1
  -DSQLITE_ENABLE_FTS2=1 -DSQLITE_ENABLE_BROKEN_FTS2=1
  -DSQLITE_ENABLE_FTS3=1 -DHAVE_DLOPEN=1

[I'm listing all this detail because various things can cause various differences. The most relevant flags are probably -O6 -DNDEBUG=1.]

> This cost 154 seconds. I use fts2, and my PC is an Intel 2.33GHz with 2
> CPUs. If I don't use fts and just insert into a normal table, it costs
> 11 seconds.

Your results are a bit worse than I'd expect. I get 6.2s for the simple-table case and about 45s for the fts3 case, not quite your 14x slow-down. fts2 and fts3 should be nearly identical for this kind of test.

Basic table, one transaction:
  2.190u 1.060s 0:06.20 52.4%  0+0k 0+0io 393pf+0w
  -rw-r--r--  1 shess  eng  239505408 Dec  4 11:15 dbs/baseline.db

Basic table, per-insert transactions:
  3.280u 1.890s 0:34.19 15.1%  0+0k 0+0io 393pf+0w
  -rw-r--r--  1 shess  eng  239505408 Dec  4 11:16 dbs/baseline.db

fts3, one transaction:
  40.820u 1.550s 0:44.44 95.3%  0+0k 0+0io 414pf+0w
  -rw-r--r--  1 shess  eng  307900416 Dec  4 11:17 dbs/baseline.db

fts3, per-insert transactions:
  61.300u 3.330s 1:43.04 62.7%  0+0k 0+0io 414pf+0w
  -rw-r--r--  1 shess  eng  348348416 Dec  4 11:21 dbs/baseline.db

I'm not seeing as much bloat as you describe; perhaps I'm using a bigger page size.

> I don't know when sqlite will update the full text index. After each
> insert?
fts2/3 update the index at the end of each transaction, before any insert where an explicit rowid is less than the maximum rowid seen in the current transaction, and before any query run against the table. In this case it should only be updating at the end of the transaction, plus as needed when the in-memory pending table fills up.

> How can I improve the performance?

The optimal case for fts2/3 is to do many inserts per transaction, letting the table select the rowid/docid. In that case it will collect the new data in memory and flush it to disk less frequently. For a test of this size it might also help to use a bigger page cache so that SQLite doesn't have to flush the journal file to disk as often. You could also define kPendingThreshold in the fts3.c (or fts2.c) source code to be larger (this doesn't seem to help me at all, though).

Without going in and profiling things, though, my guess is that the time in this test is dominated by tokenization. I'm basing this on the high CPU utilization and on the lack of impact from tweaking kPendingThreshold and pragma cache_size. Possibly there's also some cost from segment merges, though those should generally be fairly I/O-dominated.

-scott
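For what it's worth, the batching pattern above can be sketched with Python's sqlite3 module (the `docs` table name and sample documents are made up for illustration, and the script falls back to a plain table if the local SQLite build wasn't compiled with FTS3):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# A larger page cache can cut down on journal flushes during a big load.
conn.execute("PRAGMA cache_size = 10000")

try:
    # FTS3 virtual table; availability depends on the SQLite compile flags.
    conn.execute("CREATE VIRTUAL TABLE docs USING fts3(body)")
except sqlite3.OperationalError:
    # No FTS3 in this build; the batching pattern is the same either way.
    conn.execute("CREATE TABLE docs(body TEXT)")

# Stand-ins for the rfc text files.
documents = ["first sample document",
             "second sample document",
             "third sample document"]

# One transaction for the whole batch, letting SQLite assign the rowids,
# so fts can buffer the new data in memory and flush it less often.
with conn:
    conn.executemany("INSERT INTO docs(body) VALUES (?)",
                     [(d,) for d in documents])

print(conn.execute("SELECT count(*) FROM docs").fetchone()[0])  # prints 3
```

The key points are the single transaction around the whole batch and the absence of explicit rowids, which avoids the mid-transaction index flushes described above.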