Sorry for the long delay - I was out of town, and on my return my email box had seemingly grown without bound. So I've been working my LIFO way through, giving each email its five minutes of fame, and here we are...
On Sat, Jul 5, 2008 at 9:30 PM, Stephen Woodbridge <[EMAIL PROTECTED]> wrote:
> My particular use for FTS is not document text but addresses, and it
> would be very useful if there were a way to analyze the FTS index to get
> statistics on the keys. I could then use this information to make a
> custom parser/stemmer that could eliminate stop words.
>
> For example, Rd, road, st, street, etc. would be overly represented and
> not very discriminating, so these should/could be removed. Ideally this
> list should be generated by loading the data, then analyzing the
> index, then updating the stemmer to remove the new stop words, and
> analyzing and adjusting again if needed.
>
> Is this possible? How?
>
> If I had to code this, where would I start? I would like to get a list of
> the keys and a count of how many rows a given key is represented in. I
> assume a token that is represented multiple times in a document is
> represented by a list of offsets, so I can also somehow get a count of
> the number of times it shows up in each document. I think I have figured
> this much out by reading all the posts on FTS in the archive.

You're (maybe) in luck! I recently added an optimize() function to fts3,
which merges all segments into one. As part of this, I added dump_terms()
and dump_doclist() functions, only available in test mode, so that I could
add tests that make assertions about the state of the fts3 index.
dump_terms() can be used to dump the terms from the fts3 index, while
dump_doclist() can be used to dump the hits. It's probably not a perfect
match for what you want to do, but you could probably start with something
like:

  SELECT dump_terms(t) FROM t LIMIT 1;

This will generate a space-separated list of the unique terms in t's
index. Not so useful if your tokenizer allows spaces in words, but this
is an index-debugging function. Given those words, you can pass each term
into:

  SELECT dump_doclist(t, ?) FROM t LIMIT 1;

to get the doclist.
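To automate the analysis you describe, the loop might look something like the sketch below. To be clear about the assumptions: this requires an fts3 build compiled with SQLITE_TEST (otherwise dump_terms()/dump_doclist() don't exist), term_doc_counts and stop_word_candidates are names I made up, and the "count documents per term" trick just counts bracket groups at nesting depth zero in the doclist string (format described further down).

```python
def term_doc_counts(conn, table):
    """For each term in the fts3 index of `table`, count how many
    documents it appears in.  Assumes dump_terms()/dump_doclist()
    are available, i.e. fts3 compiled with SQLITE_TEST."""
    (terms,) = conn.execute(
        "SELECT dump_terms({0}) FROM {0} LIMIT 1".format(table)).fetchone()
    counts = {}
    for term in terms.split():
        (doclist,) = conn.execute(
            "SELECT dump_doclist({0}, ?) FROM {0} LIMIT 1".format(table),
            (term,)).fetchone()
        # Each "[docid ...]" group at nesting depth zero is one document;
        # the inner "[...]" groups are position lists, so track depth.
        depth, ndocs = 0, 0
        for ch in doclist:
            if ch == '[':
                if depth == 0:
                    ndocs += 1
                depth += 1
            elif ch == ']':
                depth -= 1
        counts[term] = ndocs
    return counts


def stop_word_candidates(counts, total_rows, threshold=0.5):
    """Terms that occur in more than `threshold` of all rows are
    candidates for the custom stop-word list (Rd, street, etc.)."""
    return sorted(t for t, n in counts.items() if n > total_rows * threshold)
```

You'd pass in an ordinary sqlite3 connection, as long as the underlying SQLite library has the test functions compiled in; the threshold is something you'd tune by eyeballing the counts.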
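If you want more than document counts, you can break a doclist string down into structured data. The format is described below; assuming it holds, here's a rough parser sketch (parse_doclist is my name for it, not anything in fts3, and the format is inferred from this thread rather than any official spec):

```python
def parse_doclist(s):
    """Parse a dump_doclist() string such as
        "[2 0[3] 1[2 9 14]] [12 1[1 4]]"
    into {docid: {column: [positions]}}."""
    result = {}
    i, n = 0, len(s)
    while i < n:
        if s[i] != '[':          # skip whitespace between doclist entries
            i += 1
            continue
        i += 1                   # step past the outer '['
        j = i
        while s[j] not in ' ]':  # docid runs up to a space or the close
            j += 1
        docid = int(s[i:j])
        cols = {}
        i = j
        while s[i] != ']':       # column groups until the outer ']'
            if s[i] == ' ':
                i += 1
                continue
            j = s.index('[', i)  # column number, then '[positions]'
            col = int(s[i:j])
            k = s.index(']', j)
            cols[col] = [int(p) for p in s[j + 1:k].split()]
            i = k + 1
        result[docid] = cols     # empty cols would mean a delete entry
        i += 1
    return result
```

With that, per-document hit counts (the "how many times does it show in each document" part of the question) are just the lengths of the position lists.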
Depending on your schema, the doclist will look something like:

  [2 0[3] 1[2 9 14]] [12 1[1 4]]

The inner [N N N] sequences are the positions of hits within a column;
there can be one or more, and they should be in order. The number before
each inner [] is the column number within the row; there can be one such
group per column, and if there are none the entry is a document delete
(you shouldn't see that case in the dump_doclist() call described above).
The number just inside each outer [ is the docid for that doclist entry,
and there can be zero or more entries. The example above has 1 hit in
column 0 of docid 2, 3 hits in column 1 of docid 2, and 2 hits in column
1 of docid 12.

So, you could either compile fts3.c with SQLITE_TEST and hack something
together in your code, or write your own function using the dump_terms()
and dump_doclist() implementations as examples of how to walk the index.
It would be pretty quick and easy to write something to analyze an fts3
index using tclsqlite3, I think.

NOTE: These functions are for testing purposes, and are not intended for
production use. They scan the entire index, and zero effort has been
spent on making them efficient. If I think of a different/better way to
write the tests, I might just remove them. It's possible similar
functionality might be exposed by a future iteration of fts, but these
functions are not an early version of that.

-scott
_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users