>But what about: > >I am very interested to know if it would be possible to use an FTS indexing >module to store the inverted index only, but >not the document's text. This would safe disk space if the text to index is >stored on disk rather than inside the database.
This is possible with just minor modifications to fts2.c (below). I commented out the instructions responsible for inserting and updating the text body into the %_content table. As a side effect, the offsets() and snippet() functions stopped working, as they seem to rely on the presence of the full document text in the current implementation. Neverthelses, I ran FTS2 over a collection of source code files, and the results are astonishing: With the original fts2.c, the database figures are as follows: Number of documents: 10739 Files Total size of document text stored: 234 MB Total size of database: ===> 295 MB <=== Size of index within database: 61 MB Index / Text ratio: 26 Percent With the modified fts2.c (no text stored), the database size was obviously much smaller: Number of documents: 10739 Files Total size of document text stored: 234 MB Total size of database: ===> 61 MB <=== Index / Text ratio: 26 Percent I addition to the database size savings, I can think of a number of other benefits in separating text and reverted index storage: 1. Indexing docuements stored in another database would not need to duplicate storage. A small "FTS database" could be attached to the "Data database" if necessary, so the "data" database stays smaller without the index. Deleting the "FTS database" would leave the the data untouched. 2. Point 1 from above would allow to distribute CDs without FTS and let the user create a small FTS index on local storage to speed up searching. This way more data can be shipped on single CD volumes. 3. Indexing compressed text would become possible. The current implementation does not allow text compression because the FTS tables always store uncompressed. 4. Ease maintainance and consistency of data as long as FTS is experimental. If data and FTS are separated, only the FTS index must be rebuild if FTS changes, while the current implementation potentially requires to upgrade entire tables to yet unknown formats. 5. FTS could be removed from a database without touching the data: Only the FTS tables would have to be deleted. Concluding: Given the great database size savings possible by separating full text index from data storage, I wish that developers would consider adding such an option to the SQLite FTS interface. Finally, here are the changes I applied to fts2.c as proof of concept: /* insert into %_content (rowid, ...) values ([rowid], [pValues]) */ static int content_insert(fulltext_vtab *v, sqlite3_value *rowid, sqlite3_value **pValues){ sqlite3_stmt *s; int i; int rc = sql_get_statement(v, CONTENT_INSERT_STMT, &s); if( rc!=SQLITE_OK ) return rc; rc = sqlite3_bind_value(s, 1, rowid); if( rc!=SQLITE_OK ) return rc; /* for(i=0; i<v->nColumn; ++i){ rc = sqlite3_bind_value(s, 2+i, pValues[i]); if( rc!=SQLITE_OK ) return rc; } */ return sql_single_step_statement(v, CONTENT_INSERT_STMT, &s); } /* update %_content set col0 = pValues[0], col1 = pValues[1], ... * where rowid = [iRowid] */ static int content_update(fulltext_vtab *v, sqlite3_value **pValues, sqlite_int64 iRowid){ sqlite3_stmt *s; int i; int rc = sql_get_statement(v, CONTENT_UPDATE_STMT, &s); if( rc!=SQLITE_OK ) return rc; /* for(i=0; i<v->nColumn; ++i){ rc = sqlite3_bind_value(s, 1+i, pValues[i]); if( rc!=SQLITE_OK ) return rc; } */ rc = sqlite3_bind_int64(s, 1+v->nColumn, iRowid); if( rc!=SQLITE_OK ) return rc; return sql_single_step_statement(v, CONTENT_UPDATE_STMT, &s); } Ralf ----------------------------------------------------------------------------- To unsubscribe, send email to [EMAIL PROTECTED] -----------------------------------------------------------------------------