>But what about:
>
>I am very interested to know if it would be possible to use an FTS indexing 
>module to store the inverted index only, but
>not the document's text. This would safe disk space if the text to index is 
>stored on disk rather than inside the database.

This is possible with just minor modifications to fts2.c (below). I commented 
out the instructions responsible for inserting and updating the text body into 
the %_content table. As a side effect, the offsets() and snippet() functions 
stopped working, as they seem to rely on the presence of the full document text 
in the current implementation. Neverthelses, I ran FTS2 over a collection of 
source code files, and the results are astonishing:

With the original fts2.c, the database figures are as follows: 

Number of documents:                10739 Files
Total size of document text stored:   234 MB
Total size of database:          ===> 295 MB <=== 
Size of index within database:         61 MB
Index / Text ratio:                    26 Percent

With the modified fts2.c (no text stored), the database size was obviously much 
smaller:

Number of documents:                10739 Files
Total size of document text stored:   234 MB
Total size of database:           ===> 61 MB <=== 
Index / Text ratio:                    26 Percent

I addition to the database size savings, I can think of a number of other 
benefits in separating text and reverted index storage:

1. Indexing docuements stored in another database would not need to duplicate 
storage. A small "FTS database" could be attached to the "Data database" if 
necessary, so the "data" database stays smaller without the index.  Deleting 
the "FTS database" would leave the the data untouched.

2. Point 1 from above would allow to distribute CDs without FTS and let the 
user create a small FTS index on local storage to speed up searching. This way 
more data can be shipped on single CD volumes.

3. Indexing compressed text would become possible. The current implementation 
does not allow text compression because the FTS tables always store 
uncompressed.

4. Ease maintainance and consistency of data as long as FTS is experimental. If 
data and FTS are separated, only the FTS index must be rebuild if FTS changes, 
while the current implementation potentially requires to upgrade entire tables 
to yet unknown formats.

5. FTS could be removed from a database without touching the data: Only the FTS 
tables would have to be deleted.

Concluding: Given the great database size savings possible by separating full 
text index from data storage, I wish that developers would consider adding such 
an option to the SQLite FTS interface.

Finally, here are the changes I applied to fts2.c as proof of concept:

/* insert into %_content (rowid, ...) values ([rowid], [pValues]) */
static int content_insert(fulltext_vtab *v, sqlite3_value *rowid,
                          sqlite3_value **pValues){
  sqlite3_stmt *s;
  int i;
  int rc = sql_get_statement(v, CONTENT_INSERT_STMT, &s);
  if( rc!=SQLITE_OK ) return rc;

  rc = sqlite3_bind_value(s, 1, rowid);
  if( rc!=SQLITE_OK ) return rc;

/*  for(i=0; i<v->nColumn; ++i){
    rc = sqlite3_bind_value(s, 2+i, pValues[i]);
    if( rc!=SQLITE_OK ) return rc;
  } */

  return sql_single_step_statement(v, CONTENT_INSERT_STMT, &s);
}

/* update %_content set col0 = pValues[0], col1 = pValues[1], ...
 *                  where rowid = [iRowid] */
static int content_update(fulltext_vtab *v, sqlite3_value **pValues,
                          sqlite_int64 iRowid){
  sqlite3_stmt *s;
  int i;
  int rc = sql_get_statement(v, CONTENT_UPDATE_STMT, &s);
  if( rc!=SQLITE_OK ) return rc;

/*  for(i=0; i<v->nColumn; ++i){
    rc = sqlite3_bind_value(s, 1+i, pValues[i]);
    if( rc!=SQLITE_OK ) return rc;
  } */

  rc = sqlite3_bind_int64(s, 1+v->nColumn, iRowid);
  if( rc!=SQLITE_OK ) return rc;

  return sql_single_step_statement(v, CONTENT_UPDATE_STMT, &s);
}

Ralf 


-----------------------------------------------------------------------------
To unsubscribe, send email to [EMAIL PROTECTED]
-----------------------------------------------------------------------------

Reply via email to