Scott Hess wrote:

>>I am optimistic that the proper implementation will use even less than 50%:
>
>Indeed :-). 

Glad to read this ;-)

>>I found that _not_ adding the original text turned out to be a great time
>>saver. This makes sense if we know that the original text is about 4 times
>>the size of the index. Storing lots of text by itself is already quite time
>>consuming even without creating a FTS index. So I do not expect really
>>bad slow downs by adding a docid->term index.
>
>Are you doing your inserts in the implied transactions sqlite provides
>for you if you didn't open an explicit transaction?  I've found that
>when doing bulk inserts, the maintenance of the content table is a
>pretty small part of the overall time, perhaps 10%.

My timings differ from that: I have just measured the insertion speeds with and
without storing the original text and was _very_ surprised by the results:

WITH    text storage: 1055 KB / sec
WITHOUT text storage: 4948 KB / sec

FTS without text storage performed almost 5 (five!) times faster than with text 
storage (running WinXP on a fairly recent system with a 5200 rpm hard drive).

The testing scenario: there were no changes to the code except that I commented
out the text bindings as described in my earlier message. The same documents
were indexed (10739 files, 239959 KB in total). Insertion took place in a
single transaction, and PRAGMA synchronous = OFF was the only tweak to the
database. I ran all tests multiple times consecutively on an empty database to
avoid interference from OS file buffering.
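
For completeness, the application-side bulk load looked roughly like this in
both runs (table and column names are made up for illustration; the only
difference between the runs was the commented-out text binding mentioned
above):

    PRAGMA synchronous = OFF;

    -- fts2 table with a single text column
    CREATE VIRTUAL TABLE docs USING fts2(content);

    BEGIN;
    -- one INSERT per file, 10739 in total
    INSERT INTO docs(rowid, content) VALUES(1, '...contents of file 1...');
    INSERT INTO docs(rowid, content) VALUES(2, '...contents of file 2...');
    -- ...
    COMMIT;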

>>Snippets are of course nice to have out of the box as it is right now. But
>>even without storing the original text, snippets could be created by
>>
>>1. supplying the text through other means (additional parameter or
>>callback function), so that not FTS but the application would read
>>it from a disk file or decompress it from a database field.
>>
>>2. constructing token-only snippets from the document tokens and
>>offsets. This would of course exclude all non-word characters, but
>>would still return legible information.
>
>A use-case that was considered was indexing PDF data, in which case
>the per-document tokenization cost would probably be a couple seconds.
>If you ran a query which matched a couple thousand documents and
>proceeded to re-tokenize them for snippet generation, you'd be in deep
>trouble.  This is somewhat addressable by providing scoring mechanisms
>and using subselects (basically, have the subselect order by score,
>then cap the number of results, and have the main select ask for
>snippets).  A variant on that would be an index of a CD.  In that case
>it's pretty much essential that the index be able to efficiently
>answer questions without having to seek all over the disk.

Quite true.  But is this indeed a realistic scenario? It sounds a bit like the 
"select * from my-million-row-table" problem. Nothing wrong with this per se, 
but be aware of the consequences.
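
For my own understanding, I picture the subselect approach you describe roughly
like this (score() is a hypothetical ranking function here, since fts2 does not
provide one out of the box; table and query are made up for illustration):

    -- inner select: rank the matches and cap their number;
    -- outer select: ask for snippets only for that small result set
    SELECT rowid, snippet(docs)
      FROM docs
     WHERE docs MATCH 'sqlite'
       AND rowid IN (SELECT rowid
                       FROM docs
                      WHERE docs MATCH 'sqlite'
                      ORDER BY score(docs) DESC  -- hypothetical scoring function
                      LIMIT 20);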

>Option 2 has some attraction, though, because you have the option of
>transparently segmenting the document into blocks and thus not having
>to re-tokenize the entire document to generate snippets.

Thanks!

>>>Being able to have an index without storing the original data was a
>>weak goal when fts1 was being developed, but every time we visited
>>>it, we found that the negatives of that approach were substantial
>>>enough to discourage us for a time.  [The "we" in that sentence means
>>>"me and the various people I run wacky ideas past."]  I'm keeping an
>>>eye out for interesting implementation strategies and the time to
>>>explore them, though.
>>
>>Maybe my arguments could influence the opinion of "we"? I would love
>>to see FTS without text storage, especially since I just lost a project to
>>another FTS product because duplicating data was unfortunately "out
>>of disk space".
>
>Feel free to drop me a description of the types of things you're doing
>out-of-band, maybe something will gel.  No promises!  Most of the
>current use-cases are pretty clear - since the data is already going
>to be in the database, letting fts2 store it is no big deal.  I can
>imagine pretty broad classes of problems which could come up when
>indexing data which is not in the database, so one of the challenges
>is to narrow down which problems are real, and which are figments.

I conclude from your remarks that the offsets() problem is not the decisive
obstacle and could be solved even without storing the full text in the
database. If so, snippets could be created from those offsets as well. I
realize that this would complicate the FTS2 implementation, so please excuse me
if I am arguing purely from a user's perspective.
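
Just to illustrate what I mean by building snippets from offsets: as far as I
understand the fts documentation, offsets() already reports, for every matched
term, the column, the query term and the byte position of the hit (the table
name below is again made up for illustration):

    -- offsets() returns four space-separated integers per matched term:
    --   column number, query term number, byte offset, match length in bytes
    SELECT rowid, offsets(docs)
      FROM docs
     WHERE docs MATCH 'fulltext index';

    -- a result row might look like:  0 0 17 8  0 1 31 5
    -- which should be enough to assemble a token-only snippet around each hit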

For users, I can see the following benefits in separating FTS index and 
original text:

* Space savings when indexing external documents not stored in the database.

* Possibility to add FTS to text stored in compressed format in the database.

* Possibility to mix FTS text columns with numeric or blob columns in a single 
table. The current implementation does not allow INTEGERs or BLOBs in FTS virtual 
tables.

* FTS indexes could easily be deleted without touching the real data.

* FTS indexes could be maintained outside the main data database, for example 
in an attached database (see the sketch after this list).

* If there were an FTS API, it could be used to add full text search to other 
VIRTUAL TABLEs, for example to provide FTS for *.dbf databases, etc.
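
To make the attached-database idea concrete, here is a rough sketch of what I
have in mind; it assumes a purely hypothetical fts2 variant that builds its
index from the supplied text but does not store a copy of it:

    -- the main database holds the real data (possibly compressed)
    CREATE TABLE main.docs(id INTEGER PRIMARY KEY, body TEXT);

    -- hypothetical: an index-only FTS table living in a separate file,
    -- sharing rowids with main.docs but keeping no copy of the text
    ATTACH DATABASE 'fts_index.db' AS ftsidx;
    CREATE VIRTUAL TABLE ftsidx.docs_fts USING fts2(body);

    -- the index file could then be dropped or rebuilt at any time
    -- without touching main.docs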

That's my list for the moment. But many ideas around a new technology emerge 
only after it is available (take the laser, or even SQLite, as an example). So 
if you can see at least some benefit in index-data separation, I would be glad 
if you could pursue this idea further. I might not be of great help in this 
right now, but would be willing to learn ;-)

Regards,

Ralf 

