Re: [sqlite] FTS: index only, no text storage - Was: [sqlite] FTS: Custom Tokenizer / Stop Words

2007-03-14 Thread Ralf Junker
Scott Hess wrote:


>>I am optimistic that the proper implementation will use even less than 50%:
>
>Indeed :-). 

Glad to read this ;-)

>>I found that _not_ adding the original text turned out to be a great time
>>saver. This makes sense if we know that the original text is about 4 times
>>the size of the index. Storing lots of text by itself is already quite time
>>consuming even without creating a FTS index. So I do not expect really
>>bad slow downs by adding a docid->term index.
>
>Are you doing your inserts in the implied transactions sqlite provides
>for you if you didn't open an explicit transaction?  I'm found that
>when doing bulk inserts, the maintenance of the content table is a
>pretty small part of the overall time, perhaps 10%.

My timings vary: I have just measured the insertion speeds with and without 
storing the original text and was _very_ surprised by the results:

WITHtext storage: 1055 KB / sec
WITHOUT text storage: 4948 KB / sec

FTS without text storage performed almost 5 (five!) times faster than with text 
storage (running WinXP on a fairly recent system with a 5200 rotations per sec 
hard drive).

The testing scenario: There were no changes to the code except that I commented 
out the text bindings as described in my earlier message. The same documents 
were indexed (10739 files, 239959 KB size in total). Insertion took place in a 
single transaction, SYNCHRONOUS = OFF was used as the only tweak to the 
database. I ran all tests multiple times consecutively on an empty database to 
avoid OS file buffering interferences.

>>Snippets are of course nice to have out of the box as it is right now. But
>>even without storing the original text, snippets could be created by
>>
>>1. supplying the text through other means (additional parameter or
>>callback function), so that not FTS but the application would read
>>it from a disk file or decompress it from a database field.
>>
>>2. constructing token-only snippets from the document tokens and
>>offsets. This would of course exclude all non-word characters, but
>>would still return legible information.
>
>A use-case that was considered was indexing PDF data, in which case
>the per-document tokenization cost would probably be a couple seconds.
>If you ran a query which matched a couple thousand documents and
>proceeded to re-tokenize them for snippet generation, you'd be in deep
>trouble.  This is somewhat addressable by providing scoring mechanisms
>and using subselects (basically, have the subselect order by score,
>then cap the number of results, and have the main select ask for
>snippets).  A variant on that would be an index of a CD.  In that case
>it's pretty much essential that the index be able to efficiently
>answer questions without having to seek all over the disk.

Quite true.  But is this indeed a realistic scenario? It sounds a bit like the 
"select * from my-million-row-table" problem. Nothing wrong with this per se, 
but be aware of the consequences.

>Option 2 has some attraction, though, because you have the option of
>transparently segmenting the document into blocks and thus not having
>to re-tokenize the entire document to generate snippets.

Thanks!

>>>Being able to have an index without storing the original data was a
>>>weak goal when fts1 was being developed, but every time we visitted
>>>it, we found that the negatives of that approach were substantial
>>>enough to discourage us for a time.  [The "we" in that sentence means
>>>"me and the various people I run wacky ideas past."]  I'm keeping an
>>>eye out for interesting implementation strategies and the time to
>>>explore them, though.
>>
>>Maybe my arguments could influence the opinion of "we"? I would love
>>to see FTS without text storage, especially since I just lost a project to
>>another FTS product because duplicating data was unfortunately "out
>>of disk space".
>
>Feel free to drop me a description of the types of things you're doing
>out-of-band, maybe something will gel.  No promises!  Most of the
>current use-cases are pretty clear - since the data is already going
>to be in the database, letting fts2 store it is no big deal.  I can
>imagine pretty broad classes of problems which could come up when
>indexing data which is not in the database, so one of the challanges
>is to narrow down which problems are real, and which are figments.

I conclude from your remarks that the offsets() problem is not predominant and 
could be solved even without storing full text in the database. If so, snippets 
could be created as well from those offsets. I realize that this will 
commplicate the FTS2 implementation, so please excuse if I am arguing from a 
user's perspective.

For users, I can see the following benefits in separating FTS index and 
original text:

* Space savings when indexing external documents not stored in the database.

* Possibility to add FTS to text stored in compressed format in the database.

* Possibility to mix FTS text rows with 

Re: [sqlite] FTS: index only, no text storage - Was: [sqlite] FTS: Custom Tokenizer / Stop Words

2007-03-13 Thread Ralf Junker
Hello Scott,

I was hoping that you would read my message, many thanks for your reply!

>UPDATE and DELETE need to have the previous document text, because the
>docids are embedded in the index, and there is no docid->term index
>(or, put another way, the previous document text _is_ the docid->term
>index).

This is very understandable given the present design.

>Keeping track of that information would probably double the
>size of the index.

With your estimate, the SQLite full text index (without document storage) would 
still take up only 50% of the documents' size. In my opinion, this is still a 
very good ratio, even if some specialized full text search engines apparently 
get away with less than 30%. I think you have done an enourmous job on FTS2!

I am optimistic that the proper implementation will use even less than 50%: My 
modifications are completely rudimentary and not at all optimized - the column 
to store the document text still exists. The only difference is that it is not 
used - it stores a null value which could be saved. In fact, the entire FTS 
table (the one without the suffixes) would not be needed and cut down storage 
space.

>A thing I've considered doing is to keep deletions
>as a special index to the side,

Would this open the door to "insert only, but no-modify and no-delete" indexes? 
I am sure users would like pay this cost for the benefit of even smaller FTS 
indexes!

>which would allow older data to be
>deleted during segment merges.  Unfortunately, I suspect that this
>would slow things down by introducing another bit of data which needs
>to be considered during merges.

I found that _not_ adding the original text turned out to be a great time 
saver. This makes sense if we know that the original text is about 4 times the 
size of the index. Storing lots of text by itself is already quite time 
consuming even without creating a FTS index. So I do not expect really bad slow 
downs by adding a docid->term index.

>Of course, there's no way the current system could generate snippets
>without the original text, because doclists don't record the set of
>adjacent terms.  That information could be recorded, but it's doubtful
>that doing so would be an improvement on simply storing the original
>text in the first place.  The current system _does_ have everything
>needed to generate the offsets to hits even without the original text,
>so the client application could generate snippets, though the code is
>not currently in place to expose this information.

Snippets are of course nice to have out of the box as it is right now. But even 
without storing the original text, snippets could be created by

1. supplying the text through other means (additional parameter or callback 
function), so that not FTS but the application would read it from a disk file 
or decompress it from a database field.

2. constructing token-only snippets from the document tokens and offsets. This 
would of course exclude all non-word characters, but would still return legible 
information.

>Being able to have an index without storing the original data was a
>weak goal when fts1 was being developed, but every time we visitted
>it, we found that the negatives of that approach were substantial
>enough to discourage us for a time.  [The "we" in that sentence means
>"me and the various people I run wacky ideas past."]  I'm keeping an
>eye out for interesting implementation strategies and the time to
>explore them, though.

Maybe my arguments could influence the opinion of "we"? I would love to see FTS 
without text storage, especially since I just lost a project to another FTS 
product because duplicating data was unfortunately "out of disk space".

All the best and keep up your good work,

Ralf  


-
To unsubscribe, send email to [EMAIL PROTECTED]
-



Re: [sqlite] FTS: index only, no text storage - Was: [sqlite] FTS: Custom Tokenizer / Stop Words

2007-03-13 Thread Ralf Junker
Ion Silvestru wrote:

>Just a question: did you eliminated stop-words in your tests?

No, I did not eliminate any stop-words. The two test runs were equal except for 
the small changes in FTS 2.

My stop words question was not intended for source code but for human language 
texts.

Ralf  


-
To unsubscribe, send email to [EMAIL PROTECTED]
-



Re: [sqlite] FTS: index only, no text storage - Was: [sqlite] FTS: Custom Tokenizer / Stop Words

2007-03-13 Thread Scott Hess

On 3/13/07, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:

Ion Silvestru <[EMAIL PROTECTED]> wrote:
> To Ralf:
> >As a side effect, the offsets() and snippet() functions stopped working,
> >as they seem to rely on the presence of the full document text in the
> >current implementation.
>
> Did you tested "phrase" searching on the index-only version, didn't this
> kind of search rely on offsets()?

Phrase searches do *not* use the full document text.
But UPDATE and DELETE do, ironically.  Or at least they
used to, unless Scott has changed that in FTS2.


Indeed, phrase searches should continue to work, because since we have
the terms from the query, we can look them up and compare their token
positions in the document (offsets being the character positions of
the tokens).

UPDATE and DELETE need to have the previous document text, because the
docids are embedded in the index, and there is no docid->term index
(or, put another way, the previous document text _is_ the docid->term
index).  Keeping track of that information would probably double the
size of the index.  A thing I've considered doing is to keep deletions
as a special index to the side, which would allow older data to be
deleted during segment merges.  Unfortunately, I suspect that this
would slow things down by introducing another bit of data which needs
to be considered during merges.

Of course, there's no way the current system could generate snippets
without the original text, because doclists don't record the set of
adjacent terms.  That information could be recorded, but it's doubtful
that doing so would be an improvement on simply storing the original
text in the first place.  The current system _does_ have everything
needed to generate the offsets to hits even without the original text,
so the client application could generate snippets, though the code is
not currently in place to expose this information.

Being able to have an index without storing the original data was a
weak goal when fts1 was being developed, but every time we visitted
it, we found that the negatives of that approach were substantial
enough to discourage us for a time.  [The "we" in that sentence means
"me and the various people I run wacky ideas past."]  I'm keeping an
eye out for interesting implementation strategies and the time to
explore them, though.

-scott

-
To unsubscribe, send email to [EMAIL PROTECTED]
-



Re: [sqlite] FTS: index only, no text storage - Was: [sqlite] FTS: Custom Tokenizer / Stop Words

2007-03-13 Thread drh
Ion Silvestru <[EMAIL PROTECTED]> wrote:
> To Ralf:
> 
> >As a side effect, the offsets() and snippet() functions stopped working, as 
> >they seem to rely on the presence of the full document text in the current 
> >implementation.
> 
> Did you tested "phrase" searching on the index-only version, didn't this
> kind of search rely on offsets()?
> 

Phrase searches do *not* use the full document text.
But UPDATE and DELETE do, ironically.  Or at least they
used to, unless Scott has changed that in FTS2.
--
D. Richard Hipp  <[EMAIL PROTECTED]>


-
To unsubscribe, send email to [EMAIL PROTECTED]
-



Re: [sqlite] FTS: index only, no text storage - Was: [sqlite] FTS: Custom Tokenizer / Stop Words

2007-03-13 Thread Ion Silvestru
To Ralf:

>As a side effect, the offsets() and snippet() functions stopped working, as 
>they seem to rely on the presence of the full document text in the current 
>implementation.

Did you tested "phrase" searching on the index-only version, didn't this
kind of search rely on offsets()?


-
To unsubscribe, send email to [EMAIL PROTECTED]
-



Re[2]: [sqlite] FTS: index only, no text storage - Was: [sqlite] FTS: Custom Tokenizer / Stop Words

2007-03-13 Thread Ion Silvestru

>Just a question: did you eliminated stop-words in your tests?

Sorry, you specified that you indexed source code files, so no
stop-words are applicable here.


-
To unsubscribe, send email to [EMAIL PROTECTED]
-



Re: [sqlite] FTS: index only, no text storage - Was: [sqlite] FTS: Custom Tokenizer / Stop Words

2007-03-13 Thread Ion Silvestru

Thank you.

Just a question: did you eliminated stop-words in your tests?

>Concluding: Given the great database size savings possible by separating full 
>text index from data storage, I wish that
>developers would consider adding such an option to the SQLite FTS interface.

If such an option will be added, I see a big future for using SQLite
as a simple, but powerful and easily customized (user tokenizers etc)
full-text search engine, and not only as a DB engine.

Currently we don't have many options for full-text desktop engine,
there are some, like DTSearch, Onix, Lucene, but these are
over-priced, can't be easily customized or too complex.



-
To unsubscribe, send email to [EMAIL PROTECTED]
-



[sqlite] FTS: index only, no text storage - Was: [sqlite] FTS: Custom Tokenizer / Stop Words

2007-03-13 Thread Ralf Junker

>But what about:
>
>I am very interested to know if it would be possible to use an FTS indexing 
>module to store the inverted index only, but
>not the document's text. This would safe disk space if the text to index is 
>stored on disk rather than inside the database.

This is possible with just minor modifications to fts2.c (below). I commented 
out the instructions responsible for inserting and updating the text body into 
the %_content table. As a side effect, the offsets() and snippet() functions 
stopped working, as they seem to rely on the presence of the full document text 
in the current implementation. Neverthelses, I ran FTS2 over a collection of 
source code files, and the results are astonishing:

With the original fts2.c, the database figures are as follows: 

Number of documents:10739 Files
Total size of document text stored:   234 MB
Total size of database:  ===> 295 MB <=== 
Size of index within database: 61 MB
Index / Text ratio:26 Percent

With the modified fts2.c (no text stored), the database size was obviously much 
smaller:

Number of documents:10739 Files
Total size of document text stored:   234 MB
Total size of database:   ===> 61 MB <=== 
Index / Text ratio:26 Percent

I addition to the database size savings, I can think of a number of other 
benefits in separating text and reverted index storage:

1. Indexing docuements stored in another database would not need to duplicate 
storage. A small "FTS database" could be attached to the "Data database" if 
necessary, so the "data" database stays smaller without the index.  Deleting 
the "FTS database" would leave the the data untouched.

2. Point 1 from above would allow to distribute CDs without FTS and let the 
user create a small FTS index on local storage to speed up searching. This way 
more data can be shipped on single CD volumes.

3. Indexing compressed text would become possible. The current implementation 
does not allow text compression because the FTS tables always store 
uncompressed.

4. Ease maintainance and consistency of data as long as FTS is experimental. If 
data and FTS are separated, only the FTS index must be rebuild if FTS changes, 
while the current implementation potentially requires to upgrade entire tables 
to yet unknown formats.

5. FTS could be removed from a database without touching the data: Only the FTS 
tables would have to be deleted.

Concluding: Given the great database size savings possible by separating full 
text index from data storage, I wish that developers would consider adding such 
an option to the SQLite FTS interface.

Finally, here are the changes I applied to fts2.c as proof of concept:

/* insert into %_content (rowid, ...) values ([rowid], [pValues]) */
static int content_insert(fulltext_vtab *v, sqlite3_value *rowid,
  sqlite3_value **pValues){
  sqlite3_stmt *s;
  int i;
  int rc = sql_get_statement(v, CONTENT_INSERT_STMT, );
  if( rc!=SQLITE_OK ) return rc;

  rc = sqlite3_bind_value(s, 1, rowid);
  if( rc!=SQLITE_OK ) return rc;

/*  for(i=0; inColumn; ++i){
rc = sqlite3_bind_value(s, 2+i, pValues[i]);
if( rc!=SQLITE_OK ) return rc;
  } */

  return sql_single_step_statement(v, CONTENT_INSERT_STMT, );
}

/* update %_content set col0 = pValues[0], col1 = pValues[1], ...
 *  where rowid = [iRowid] */
static int content_update(fulltext_vtab *v, sqlite3_value **pValues,
  sqlite_int64 iRowid){
  sqlite3_stmt *s;
  int i;
  int rc = sql_get_statement(v, CONTENT_UPDATE_STMT, );
  if( rc!=SQLITE_OK ) return rc;

/*  for(i=0; inColumn; ++i){
rc = sqlite3_bind_value(s, 1+i, pValues[i]);
if( rc!=SQLITE_OK ) return rc;
  } */

  rc = sqlite3_bind_int64(s, 1+v->nColumn, iRowid);
  if( rc!=SQLITE_OK ) return rc;

  return sql_single_step_statement(v, CONTENT_UPDATE_STMT, );
}

Ralf 


-
To unsubscribe, send email to [EMAIL PROTECTED]
-