Re: [sqlite] FTS: index only, no text storage - Was: [sqlite] FTS: Custom Tokenizer / Stop Words

Ralf Junker Tue, 13 Mar 2007 11:12:55 -0800

Hello Scott,

I was hoping that you would read my message, many thanks for your reply!


>UPDATE and DELETE need to have the previous document text, because the
>docids are embedded in the index, and there is no docid->term index
>(or, put another way, the previous document text _is_ the docid->term
>index).

This is very understandable given the present design.
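
To make the dependency concrete, here is a minimal sketch of a toy term->docid index. It is an illustration only, not FTS2's on-disk format: the class `ToyIndex`, its whitespace tokenizer, and all method names are hypothetical. It shows why a delete has to be handed the old text when no docid->term index exists.

```python
from collections import defaultdict


class ToyIndex:
    """Toy term -> docids index (hypothetical model, not FTS2 code)."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> set of docids

    @staticmethod
    def tokenize(text):
        # Assumption: a trivial lowercase whitespace tokenizer.
        return text.lower().split()

    def insert(self, docid, text):
        for term in self.tokenize(text):
            self.postings[term].add(docid)

    def delete(self, docid, old_text):
        # Without a docid -> term index, re-tokenizing the old text is the
        # only way to learn which posting lists mention this docid.
        for term in set(self.tokenize(old_text)):
            self.postings[term].discard(docid)
            if not self.postings[term]:
                del self.postings[term]

    def search(self, term):
        return sorted(self.postings.get(term.lower(), ()))


index = ToyIndex()
index.insert(1, "full text search")
index.insert(2, "text storage")
index.delete(1, "full text search")  # the old text must be supplied again
```

If the application can always supply the old text itself, the index does not need to keep its own copy for this purpose.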

>Keeping track of that information would probably double the
>size of the index.

With your estimate, the SQLite full text index (without document storage) would 
still take up only about 50% of the documents' size, since the index currently 
comes to roughly a quarter of the original text. In my opinion, this is still a 
very good ratio, even if some specialized full text search engines apparently 
get away with less than 30%. I think you have done an enormous job on FTS2!

I am optimistic that a proper implementation would use even less than 50%: My 
modifications are completely rudimentary and not at all optimized - the column 
to store the document text still exists. The only difference is that it is not 
used - it stores a null value, which could be saved as well. In fact, the entire 
FTS table (the one without the suffixes) would not be needed, which would cut 
down storage space further.

>A thing I've considered doing is to keep deletions
>as a special index to the side,

Would this open the door to "insert only, but no-modify and no-delete" indexes? 
I am sure users would gladly pay this cost for the benefit of even smaller FTS 
indexes!

>which would allow older data to be
>deleted during segment merges.  Unfortunately, I suspect that this
>would slow things down by introducing another bit of data which needs
>to be considered during merges.
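
As I understand the idea, it could be sketched roughly as follows. This is a toy model under my own assumptions, not FTS2's merge code: segments are plain dicts of term -> sorted docid list, and `merge_segments` and `deleted_docids` are hypothetical names.

```python
def merge_segments(older, newer, deleted_docids):
    """Merge two term -> docid-list segments, dropping deleted docids.

    Hypothetical model of a merge that consults a deletion side index.
    """
    merged = {}
    for term in sorted(set(older) | set(newer)):
        docids = set(older.get(term, ())) | set(newer.get(term, ()))
        # This membership test per docid is the extra work during merges.
        live = sorted(d for d in docids if d not in deleted_docids)
        if live:
            merged[term] = live
    return merged


older = {"index": [1, 2], "text": [1]}
newer = {"index": [3], "storage": [3]}
deleted = {1}  # side index: docids deleted since they were indexed

merged = merge_segments(older, newer, deleted)
# merged == {"index": [2, 3], "storage": [3]}
```

The old text is never consulted here; the cost is the additional lookup for every docid that passes through a merge.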

I found that _not_ adding the original text turned out to be a great time 
saver. This makes sense given that the original text is about 4 times the 
size of the index. Storing lots of text by itself is already quite time 
consuming, even without creating an FTS index. So I do not expect really bad 
slowdowns from adding a docid->term index.

>Of course, there's no way the current system could generate snippets
>without the original text, because doclists don't record the set of
>adjacent terms.  That information could be recorded, but it's doubtful
>that doing so would be an improvement on simply storing the original
>text in the first place.  The current system _does_ have everything
>needed to generate the offsets to hits even without the original text,
>so the client application could generate snippets, though the code is
>not currently in place to expose this information.

Snippets are of course nice to have out of the box, as they are right now. But 
even without storing the original text, snippets could be created by

1. supplying the text through other means (additional parameter or callback 
function), so that not FTS but the application would read it from a disk file 
or decompress it from a database field.

2. constructing token-only snippets from the document tokens and offsets. This 
would of course exclude all non-word characters, but would still return legible 
information.
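
Both options can be sketched in a few lines. This is only an illustration under my own assumptions: the function names are hypothetical, hits are given as (offset, length) pairs into the text for option 1 and as token positions for option 2, and no such interface is currently exposed by FTS.

```python
def snippet_from_text(text, hits, context=10):
    """Option 1: the application supplies the text; hits are (offset, length)."""
    parts = []
    for offset, length in hits:
        start = max(0, offset - context)
        end = min(len(text), offset + length + context)
        parts.append(
            text[start:offset]
            + "[" + text[offset:offset + length] + "]"
            + text[offset + length:end]
        )
    return " ... ".join(parts)


def snippet_from_tokens(tokens, hit_positions, window=2):
    """Option 2: token-only snippet; non-word characters are lost."""
    hits = set(hit_positions)
    keep = sorted({
        i
        for pos in hits
        for i in range(max(0, pos - window), min(len(tokens), pos + window + 1))
    })
    return " ".join(
        "[" + tokens[i] + "]" if i in hits else tokens[i] for i in keep
    )


tokens = ["sqlite", "stores", "the", "text", "fts", "indexes", "it"]
token_snippet = snippet_from_tokens(tokens, [3], window=1)
# token_snippet == "the [text] fts"
```

In option 1 the text could just as well come from a disk file or a decompressed database field; the snippet code only needs the offsets.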

>Being able to have an index without storing the original data was a
>weak goal when fts1 was being developed, but every time we visited
>it, we found that the negatives of that approach were substantial
>enough to discourage us for a time.  [The "we" in that sentence means
>"me and the various people I run wacky ideas past."]  I'm keeping an
>eye out for interesting implementation strategies and the time to
>explore them, though.

Maybe my arguments could influence the opinion of "we"? I would love to see FTS 
without text storage, especially since I just lost a project to another FTS 
product because duplicating the data was, unfortunately, "out of disk space".

All the best and keep up your good work,

Ralf  

