Scott Hess wrote:
> Sorry for the long delay - I was out-of-town, and on return my email
> box had seemingly grown without bound.  So I've been working my LIFO
> way through giving each email its five minutes of fame, and here we
> are...

Scott,

The long delay was well worth it for the information below and the new
test functions. I understand that I should not expect these to stick
around, and that anything built on the low-level code and structure is
subject to change on any commit. That is good enough for now, and I can
use it for my initial analysis of the data. Luckily my data is
relatively static.

An alternative approach I thought of is to reparse all the documents
through the stemmer and add the output tokens and their counts directly
to a stats table. That approach might be more stable in the long term,
though probably slower than using the debug routines; I do not think
this is a time-critical step in the first place. Your debug routines do
provide some additional information that my approach would not.
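
Roughly what I have in mind, as an untested sketch in Python (the names
addresses.db, documents, address, and token_stats are all placeholders
for my actual schema, and the tokenizer is a stand-in for whatever
stemmer the FTS table is actually built with):

   import re, sqlite3
   from collections import Counter

   db = sqlite3.connect('addresses.db')
   db.execute('CREATE TABLE IF NOT EXISTS token_stats'
              ' (token TEXT PRIMARY KEY, doc_count INTEGER)')

   def tokenize(text):
       # Stand-in tokenizer; to be useful it must produce the same
       # tokens as the stemmer the FTS table actually uses.
       return set(re.findall(r'[a-z0-9]+', text.lower()))

   # Count each token once per document, since I care about how many
   # rows a token appears in, not total occurrences.
   counts = Counter()
   for (address,) in db.execute('SELECT address FROM documents'):
       counts.update(tokenize(address))

   db.executemany('INSERT OR REPLACE INTO token_stats VALUES (?, ?)',
                  counts.items())
   db.commit()

The stop-word candidates would then just be the tokens at the top of
token_stats by doc_count.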

I will definitely give your code a try.

Thanks,
   -Steve

> On Sat, Jul 5, 2008 at 9:30 PM, Stephen Woodbridge
> <[EMAIL PROTECTED]> wrote:
>> My particular use for FTS is not document text but addresses, and it
>> would be very useful if there were a way to analyze the FTS index to
>> get statistics on the keys. I could then use this information to make
>> a custom parser/stemmer that could eliminate stop words.
>>
>> For example, Rd, road, st, street, etc. would be overrepresented and
>> not very discriminating, so these should/could be removed. Ideally
>> this list would be generated by loading the data, then analyzing the
>> index, then updating the stemmer to remove the new stop words, and
>> analyzing and adjusting again if needed.
>>
>> Is this possible? How?
>>
>> If I had to code this, where would I start? I would like to get a
>> list of the keys and a count of how many rows each key appears in. I
>> assume a token that occurs multiple times in a document is
>> represented by a list of offsets, so I can somehow also get a count
>> of the number of times it shows up in each document. I think I have
>> figured this much out by reading all the posts on FTS in the archive.
> 
> You're (maybe) in luck!  I recently added an optimize() function to
> fts3, which merges all segments into one.  As part of this, I added
> dump_terms() and dump_doclist() functions, only available in test
> mode, so that I could add tests that make assertions about the state
> of the fts3 index.  dump_terms() can be used to dump the terms from
> the fts3 index, while dump_doclist() can be used to dump the hits.
> It's probably not a perfect match for what you want to do, but you
> could start with something like:
> 
>   SELECT dump_terms(t) FROM t LIMIT 1;
> 
> This will generate a space-separated list of unique terms from t's
> index.  Not so useful if your tokenizer allows spaces in words, but
> this is an index-debugging function.  Given those words, you can pass
> each term into:
> 
>   SELECT dump_doclist(t, ?) FROM t LIMIT 1;
> 
> to get the doclist.  Depending on your schema, the doclist will look
> something like:
> 
>     [2 0[3] 1[2 9 14]] [12 1[1 4]]
> 
> The inner [N N N] sequences are the positions of hits within a
> column; there can be one or more, and they should be in order.  The
> number before each inner [] is the column number within the row;
> there can be one entry per column, and if there are none the entry
> marks a deleted document (you shouldn't see that case in the
> dump_doclist() call described above).  The number right after the
> opening of the outer [] is the docid of that doclist, and there can
> be zero or more doclists.  The example above has 1 hit in column 0 of
> docid 2, 3 hits in column 1 of docid 2, and 2 hits in column 1 of
> docid 12.
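> 
> A quick sketch of how that textual format might be pulled apart (in
> Python; this parses exactly the notation above, which is just what
> the debug code happens to print today and could change):
> 
>   import re
> 
>   def parse_doclists(text):
>       # {docid: {column: [positions]}} from the dump format above.
>       docs = {}
>       outer = r'\[(\d+)((?:\s+\d+\[[^\]]*\])*)\]'
>       for docid, cols in re.findall(outer, text):
>           entry = {}
>           for col, pos in re.findall(r'(\d+)\[([^\]]*)\]', cols):
>               entry[int(col)] = [int(p) for p in pos.split()]
>           docs[int(docid)] = entry
>       return docs
> 
>   print(parse_doclists('[2 0[3] 1[2 9 14]] [12 1[1 4]]'))
>   # {2: {0: [3], 1: [2, 9, 14]}, 12: {1: [1, 4]}}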
> 
> So, you could probably either compile fts3.c with SQLITE_TEST and hack
> something together in your code, or go in and hack your own function
> using the dump_terms() and dump_doclist() implementations as examples
> of how to walk the index.  It would be pretty quick and easy to write
> something to analyze an fts3 index using tclsqlite3, I think.
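> 
> For instance, in Python rather than Tcl (the same idea; this assumes
> the sqlite3 module is linked against an fts3 build compiled with
> SQLITE_TEST so the two functions exist, and 'addresses.db' and the
> table name 't' are placeholders):
> 
>   import re, sqlite3
>   from collections import Counter
> 
>   db = sqlite3.connect('addresses.db')
>   terms = db.execute('SELECT dump_terms(t) FROM t LIMIT 1') \
>             .fetchone()[0].split()
> 
>   doc_freq = Counter()
>   for term in terms:
>       doclist = db.execute('SELECT dump_doclist(t, ?) FROM t LIMIT 1',
>                            (term,)).fetchone()[0]
>       # One outer [docid ...] group per document containing the term.
>       doc_freq[term] = len(re.findall(r'\[\d+(?:\s+\d+\[[^\]]*\])*\]',
>                                       doclist))
> 
>   # The most common terms are the stop-word candidates.
>   for term, n in doc_freq.most_common(20):
>       print(term, n)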
> 
> NOTE: These functions are for testing purposes, and are not intended
> for production use.  They scan the entire index, and zero effort has
> been spent on making them efficient in any way.  If I think of a
> different/better way to write the tests, I might just remove them.
> It's possible similar functionality might be exposed by a future
> iteration of fts, but these functions are not an early version of
> that.
> 
> -scott
