Sorry for the long delay - I was out of town, and on my return my
email box had seemingly grown without bound.  So I've been working my
LIFO way through, giving each email its five minutes of fame, and here
we are...

On Sat, Jul 5, 2008 at 9:30 PM, Stephen Woodbridge
<[EMAIL PROTECTED]> wrote:
> My particular use for FTS is not document text but addresses and it
> would be very useful if there were a way to analyze the FTS index to get
> statistics on the keys. I could then use this information to make a
> custom parser/stemmer that could eliminate stop words.
>
> For example, Rd, road, st, street, etc would be overly represented and
> not very discriminating, so these should/could be removed. Ideally this
> list should be generated by loading the data, then analyzing the
> index, then updating the stemmer to remove the new stop words, and
> again analyzing and adjusting if needed.
>
> Is this possible? How?
>
> If I had to code this, where would I start? I would like to get a
> list of the keys and a count of how many rows a given key is
> represented in. I assume a token that is represented multiple times
> in a document is represented by a list of offsets, so I can also get
> a count of the number of times it shows up in each document somehow.
> I think I have figured this much out by reading all the posts on FTS
> in the archive.

You're (maybe) in luck!  I recently added an optimize() function to
fts3, which merges all segments into one.  As part of this, I added
dump_terms() and dump_doclist() functions, only available in test
mode, so that I could write tests that make assertions about the
state of the fts3 index.  dump_terms() dumps the terms from the fts3
index, while dump_doclist() dumps the hits for a given term.  It's
probably not a perfect match for what you want to do, but you could
start with something like:

  SELECT dump_terms(t) FROM t LIMIT 1;

This will generate a space-separated list of the unique terms in t's
index.  That's not so useful if your tokenizer allows spaces in
terms, but this is an index-debugging function.  Given that list, you
can pass each term into:

  SELECT dump_doclist(t, ?) FROM t LIMIT 1;

to get the doclist.  Depending on your schema, the doclist will look
something like:

    [2 0[3] 1[2 9 14]] [12 1[1 4]]

Each inner [N N N] sequence gives the positions of hits within one
column; there can be one or more, and they should be in ascending
order.  The number before an inner [] is the column number within the
row; there can be one such group per column, and if there are none
the entry is a document deletion (you shouldn't see that case in the
dump_doclist() call described above).  The number just after each
outer [ is the docid for that doclist entry, and there can be zero or
more entries.  The example above has 1 hit for column 0 of docid 2, 3
hits for column 1 of docid 2, and 2 hits for column 1 of docid 12.
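
For stop-word analysis, a per-term document count mostly comes down
to counting the outer [] groups.  Here's a rough sketch in C of
parsing that dump format (untested, and the format is only as stable
as the test code, so treat it as illustrative):

  #include <stdio.h>

  /* Count documents and position hits in a dump_doclist() string such
  ** as "[2 0[3] 1[2 9 14]] [12 1[1 4]]".  Each top-level "[" opens
  ** one document; each number inside a nested "[" is one position
  ** hit. */
  static void doclist_stats(const char *z, int *pnDoc, int *pnHit){
    int depth = 0;      /* current [] nesting level */
    int inNum = 0;      /* mid-number flag so "14" counts once */
    *pnDoc = 0;
    *pnHit = 0;
    for(; *z; z++){
      if( *z=='[' ){
        if( ++depth==1 ) (*pnDoc)++;
        inNum = 0;
      }else if( *z==']' ){
        depth--;
        inNum = 0;
      }else if( *z>='0' && *z<='9' ){
        if( !inNum && depth==2 ) (*pnHit)++;
        inNum = 1;
      }else{
        inNum = 0;
      }
    }
  }

  int main(void){
    int nDoc, nHit;
    doclist_stats("[2 0[3] 1[2 9 14]] [12 1[1 4]]", &nDoc, &nHit);
    printf("%d docs, %d hits\n", nDoc, nHit);  /* "2 docs, 6 hits" */
    return 0;
  }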

So, you could probably either compile fts3.c with SQLITE_TEST and hack
something together in your code, or go in and hack your own function
using the dump_terms() and dump_doclist() implementations as examples
of how to walk the index.  It would be pretty quick and easy to write
something to analyze an fts3 index using tclsqlite3, I think.
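
For instance, something like the following would walk every term and
dump its doclist (untested, and it assumes an fts3 table named t and
a sqlite3 built with SQLITE_TEST and SQLITE_ENABLE_FTS3, since the
dump functions don't exist otherwise):

  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include "sqlite3.h"

  int main(int argc, char **argv){
    sqlite3 *db;
    sqlite3_stmt *pTerms, *pDoclist;
    char *zTerms, *zTerm;

    if( argc<2 || sqlite3_open(argv[1], &db)!=SQLITE_OK ){
      fprintf(stderr, "usage: %s DATABASE\n", argv[0]);
      return 1;
    }

    /* One row containing all terms, space-separated. */
    if( sqlite3_prepare_v2(db, "SELECT dump_terms(t) FROM t LIMIT 1",
                           -1, &pTerms, 0)!=SQLITE_OK ||
        sqlite3_prepare_v2(db, "SELECT dump_doclist(t, ?) FROM t LIMIT 1",
                           -1, &pDoclist, 0)!=SQLITE_OK ){
      fprintf(stderr, "error: %s (built with SQLITE_TEST?)\n",
              sqlite3_errmsg(db));
      return 1;
    }

    if( sqlite3_step(pTerms)==SQLITE_ROW ){
      zTerms = strdup((const char*)sqlite3_column_text(pTerms, 0));
      for(zTerm=strtok(zTerms, " "); zTerm; zTerm=strtok(0, " ")){
        sqlite3_bind_text(pDoclist, 1, zTerm, -1, SQLITE_TRANSIENT);
        if( sqlite3_step(pDoclist)==SQLITE_ROW ){
          /* Feed this through something like doclist_stats() above to
          ** get a per-term document count. */
          printf("%s: %s\n", zTerm,
                 (const char*)sqlite3_column_text(pDoclist, 0));
        }
        sqlite3_reset(pDoclist);
      }
      free(zTerms);
    }
    sqlite3_finalize(pTerms);
    sqlite3_finalize(pDoclist);
    sqlite3_close(db);
    return 0;
  }

From there it's a short step to sorting terms by document count to
find the over-represented ones like "rd" and "street".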

NOTE: These functions are for testing purposes, and are not intended
for production use.  They scan the entire index, and zero effort has
been spent on making them efficient in any way.  If I think of a
different/better way to write the tests, I might just remove them.
It's possible similar functionality might be exposed by a future
iteration of fts, but these functions are not an early version of
that.

-scott