Hello Scott et al,

I posted this to the list a while back but never got a response from anyone. It would be great if you could take 5 minutes to read and respond to this.

I have been reading through the fts3 code, but there is a lot of it and I'm not sure how it works at the big-picture level, let alone in any of its details. My assumption is that there are no stats at the moment, but I'm not opposed to writing some code that cycles through the indexes and tables to compute some stats for my needs, or even for the general case, if I had some pointers on:
1) where are the words indexed?
2) how can I cycle through the words in C to build some stats?

Pointers to existing code would be fine, and I can focus on reading that code.

So from reading through the various fts documents and posts and the code, I think I understand that there is a blob with a structure something like:

word [doc_id, offset, offset, ...], [doc_id, offset, ...], ...

Is this correct? Where is this stored?

Best regards,
-Stephen Woodbridge

-------- Original Message --------
Subject: [sqlite] FTS statistics and stemming
Date: Sat, 05 Jul 2008 23:30:55 -0500
From: Stephen Woodbridge <[EMAIL PROTECTED]>
Reply-To: General Discussion of SQLite Database <sqlite-users@sqlite.org>
To: General Discussion of SQLite Database <sqlite-users@sqlite.org>

Hi,

First let me say that FTS3 is really awesome. This is my first experience playing with FTS and it works very nicely with the PORTER stemming.

My particular use for FTS is not document text but addresses, and it would be very useful if there were a way to analyze the FTS index to get statistics on the keys. I could then use this information to make a custom parser/stemmer that could eliminate stop words. For example, Rd, road, st, street, etc. would be overly represented and not very discriminating, so these should/could be removed. Ideally this list should be generated by loading the data, then analyzing the index, then updating the stemmer to remove the new stop words, and then analyzing again and adjusting if needed.

Is this possible? How? If I had to code this, where would I start? I would like to get a list of the keys and a count of how many rows a given key is represented in. I assume a token that appears multiple times in a document is represented by a list of offsets, so I can also get a count of the number of times it shows up in each document somehow. I think I have figured this much out by reading all the posts on FTS in the archive.
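
As a rough illustration of the statistics described above, here is a minimal C sketch. It assumes the simplified "word [doc_id, offset, ...]" model from this message, held in memory, and does not read the real FTS3 index format. The names `Posting`, `TermEntry`, `TermStats`, `term_stats`, `is_stop_candidate` and `print_term_report` are all hypothetical and are not part of SQLite.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical model: one posting per (term, document) pair. */
typedef struct {
    long doc_id;         /* rowid of the document (address row) */
    const int *offsets;  /* token positions of the term in that document */
    size_t n_offsets;    /* how many times the term occurs in the document */
} Posting;

/* Hypothetical model: a term and every document it appears in. */
typedef struct {
    const char *term;
    const Posting *postings;
    size_t n_postings;
} TermEntry;

typedef struct {
    size_t documents;    /* number of rows that contain the term */
    size_t occurrences;  /* total occurrences across all rows */
} TermStats;

/* Walk the posting list once and accumulate both counts. */
TermStats term_stats(const TermEntry *e)
{
    TermStats s = {0, 0};
    assert(e != NULL);
    for (size_t i = 0; i < e->n_postings; i++) {
        s.documents += 1;
        s.occurrences += e->postings[i].n_offsets;
    }
    return s;
}

/* A term is a stop-word candidate when it appears in more than
 * max_fraction of all rows (the threshold is an assumption to tune). */
int is_stop_candidate(TermStats s, size_t total_docs, double max_fraction)
{
    if (total_docs == 0) {
        return 0;
    }
    return (double)s.documents / (double)total_docs > max_fraction;
}

/* Print one report line per term. */
void print_term_report(const TermEntry *e, size_t total_docs,
                       double max_fraction)
{
    TermStats s = term_stats(e);
    printf("%s: rows=%zu occurrences=%zu%s\n", e->term, s.documents,
           s.occurrences,
           is_stop_candidate(s, total_docs, max_fraction)
               ? " (stop-word candidate)" : "");
}
```

Calling `print_term_report` for each term after a load would give the list of keys with row counts. Terms such as "rd" or "st" that exceed the chosen fraction could then be fed back into the custom stemmer as stop words, and the analysis repeated.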

Thanks,
-Steve
_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users