On 3/13/07, Ralf Junker <[EMAIL PROTECTED]> wrote:
> Scott Hess wrote:
> >Keeping track of that information would probably double the
> >size of the index.
> With your estimate, the SQLite full text index (without document storage)
> would still take up only 50% of the documents' size. In my opinion, this is
> still a very good ratio, even if some specialized full text search engines
> apparently get away with less than 30%.
It's a pretty raw estimate, though, and assumes that we only bother to
store the set of terms per document, without retaining ordering.
Since it takes X space to store the terms with their list of docids,
it should take approximately X space to store the docids with their
list of terms, suitably encoded. Actually, a bit of a win could be
had because the reverse mapping need not retain the positions, and the
termids should encode more tightly than docids, since there are fewer
of them.
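
To make that concrete, here's a rough sketch in plain Python (not the actual
fts2 encoding; the function names are invented) of the delta-plus-varint style
of packing I have in mind for a sorted termid list:

def encode_varint(n):
    """Encode a non-negative integer as a variable-length byte string."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)   # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def encode_termid_list(termids):
    """Delta-encode a sorted termid list, then varint-encode each delta."""
    out = bytearray()
    prev = 0
    for tid in sorted(termids):
        out.extend(encode_varint(tid - prev))
        prev = tid
    return bytes(out)

# Five termids fit in a handful of bytes; small ids and small deltas both
# help, which is why termids should pack tighter than docids.
print(len(encode_termid_list([3, 17, 18, 250, 1000])))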
> I am optimistic that the proper implementation will use even less than 50%:
Indeed :-). I've mostly been trying to stay on the positive side of
the 80/20 rule, and avoid some of the complications that come along
with some of the bigger dedicated systems.
> My modifications are completely rudimentary and not at all optimized - the
> column to store the document text still exists. The only difference is that
> it is not used - it stores a null value, which could be saved. In fact, the
> entire FTS table (the one without the suffixes) would not be needed, which
> would cut down storage space.
The overhead of the table should be pretty minimal, and is providing
the useful function of docid assignment. A complete implementation
might be able to fold docid assignment into the internal data
structures, but that might be counterproductive (it's surely useful to
have docids assigned in the exact same way as sqlite assigns rowids).
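
For what it's worth, here's a tiny sketch of the docid-assignment point,
assuming an SQLite build with fts2 compiled in (the table name is made up):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE docs USING fts2(content)")

# The new row's rowid is the docid, assigned just as SQLite assigns rowids.
cur = conn.execute("INSERT INTO docs(content) VALUES (?)", ("some text",))
print(cur.lastrowid)   # 1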
> >A thing I've considered doing is to keep deletions
> >as a special index to the side,
> Would this open the door to "insert only, but no-modify and no-delete"
> indexes? I am sure users would like to pay this cost for the benefit of
> even smaller FTS indexes!
You're 90% there for INSERT-only - the above is a notion for how one
could handle UPDATE and DELETE without storing the content data.
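
Roughly, the side-index notion looks like this (pure illustration in Python,
not fts2 code; all the names are invented): the main index stays insert-only,
a DELETE just records the docid, and segment merges drop the dead entries.

deleted_docids = set()            # the side index of deletions

def delete_document(docid):
    # A DELETE only records the docid; no stored content is needed.
    deleted_docids.add(docid)

def merge_doclists(doclists):
    """Merge per-segment doclists for one term, dropping deleted docids."""
    merged = []
    for doclist in doclists:
        for docid, positions in doclist:
            if docid not in deleted_docids:
                merged.append((docid, positions))
    merged.sort(key=lambda entry: entry[0])
    return merged

delete_document(7)
print(merge_doclists([[(3, [0]), (7, [2])], [(9, [1])]]))   # docid 7 is gone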
> >which would allow older data to be
> >deleted during segment merges. Unfortunately, I suspect that this
> >would slow things down by introducing another bit of data which needs
> >to be considered during merges.
> I found that _not_ adding the original text turned out to be a great time
> saver. This makes sense if we know that the original text is about 4 times
> the size of the index. Storing lots of text by itself is already quite time
> consuming, even without creating an FTS index. So I do not expect really
> bad slowdowns from adding a docid->term index.
Are you doing your inserts in the implied transactions sqlite provides
for you if you didn't open an explicit transaction? I've found that
when doing bulk inserts, the maintenance of the content table is a
pretty small part of the overall time, perhaps 10%.
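
For comparison, here's a bulk-load sketch with a single explicit transaction,
assuming an SQLite build with fts2 compiled in (the table name and the
documents list are stand-ins):

import sqlite3

conn = sqlite3.connect("docs.db", isolation_level=None)  # manage transactions by hand
conn.execute("CREATE VIRTUAL TABLE docs USING fts2(content)")

documents = ["first document text", "second document text"]  # stand-in corpus

conn.execute("BEGIN")
for text in documents:
    conn.execute("INSERT INTO docs(content) VALUES (?)", (text,))
conn.execute("COMMIT")

Without the explicit BEGIN/COMMIT, every INSERT commits its own implied
transaction, and the per-transaction sync tends to dominate the timing.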
> Snippets are of course nice to have out of the box, as they are right now.
> But even without storing the original text, snippets could be created by
> 1. supplying the text through other means (an additional parameter or a
>    callback function), so that the application, rather than FTS, would read
>    it from a disk file or decompress it from a database field.
> 2. constructing token-only snippets from the document tokens and
>    offsets. This would of course exclude all non-word characters, but
>    would still return legible information.
A use-case that was considered was indexing PDF data, in which case
the per-document tokenization cost would probably be a couple of seconds.
If you ran a query which matched a couple thousand documents and
proceeded to re-tokenize them for snippet generation, you'd be in deep
trouble. This is somewhat addressable by providing scoring mechanisms
and using subselects (basically, have the subselect order by score,
then cap the number of results, and have the main select ask for
snippets). A variant on that use-case would be an index of a CD's
contents. In that case it's pretty much essential that the index be
able to efficiently answer questions without seeking all over the disk.
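
Something along these lines is the subselect shape I have in mind. Note that
score() is purely hypothetical (fts2 has no scoring function today), and the
docs table is the made-up one from the bulk-load sketch above; snippet() does
exist in fts2.

import sqlite3

conn = sqlite3.connect("docs.db")

query = """
    SELECT rowid, snippet(docs) FROM docs
     WHERE rowid IN (SELECT rowid FROM docs
                      WHERE content MATCH :terms
                      ORDER BY score(docs) DESC
                      LIMIT 20)
       AND content MATCH :terms
"""
# Snippets are only generated for the 20 capped rows, not for every match.
rows = conn.execute(query, {"terms": "document"}).fetchall()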
Option 2 has some attraction, though, because you have the option of
transparently segmenting the document into blocks and thus not having
to re-tokenize the entire document to generate snippets. This extends
the b-tree characteristics a little further into the data. [Of
course, you'd really just store things in a form which made snippets
easy to generate without re-tokenizing at all, but that's beside the
point.]
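
A sketch of that blocked layout, with fts2 assumed compiled in and the table
names, block size, and read_block() callback all invented: each block is
indexed as its own row, a side table remembers which (doc_id, block_no) a
rowid covers, and snippet generation only revisits the block that matched.
In a content-free variant like yours, the stored body would be dropped after
indexing.

import sqlite3

BLOCK_SIZE = 4096   # characters of text per block; arbitrary choice

conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE blocks USING fts2(body)")
conn.execute("CREATE TABLE block_map(block_rowid INTEGER PRIMARY KEY,"
             " doc_id INTEGER, block_no INTEGER)")

def index_in_blocks(doc_id, text):
    """Index a document block by block."""
    for block_no, start in enumerate(range(0, len(text), BLOCK_SIZE)):
        cur = conn.execute("INSERT INTO blocks(body) VALUES (?)",
                           (text[start:start + BLOCK_SIZE],))
        conn.execute("INSERT INTO block_map VALUES (?, ?, ?)",
                     (cur.lastrowid, doc_id, block_no))

def snippet_source(block_rowid, read_block):
    """Map a hit back to (doc_id, block_no) and re-read just that block.

    read_block(doc_id, block_no) is a hypothetical application callback."""
    doc_id, block_no = conn.execute(
        "SELECT doc_id, block_no FROM block_map WHERE block_rowid = ?",
        (block_rowid,)).fetchone()
    return read_block(doc_id, block_no)

A real version would want to avoid splitting mid-phrase, but the point is
only that snippet work scales with the block, not the whole document.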
> >Being able to have an index without storing the original data was a
> >weak goal when fts1 was being developed, but every time we visited
> >it, we found that the negatives of that approach were substantial
> >enough to discourage us for a time. [The "we" in that sentence means
> >"me and the various people I run wacky ideas past."] I'm keeping an
> >eye out for interesting implementation strategies and the time to
> >explore them, though.
> Maybe my arguments could influence the opinion of "we"? I would love
> to see FTS without text storage, especially since I just lost a project to
> another FTS product because duplicating the data unfortunately meant
> running "out of disk space".
Feel free to drop me a description of the types of things you're doing
out-of-band, maybe something will gel. No promises! Most of the
current use-cases are pretty clear - since the data is already going
to be in the database, letting fts2 store it is no big deal. I can
imagine pretty broad classes of problems which could come up when
indexing data which is not in the database, so one of the challenges
is to narrow down which problems are real, and which are figments.
-scott