[sqlite] [Fwd: FTS statistics and stemming]

Stephen Woodbridge Tue, 15 Jul 2008 19:48:34 -0700

Hello Scott et al,

I posted this to the list a while back but never got a response from 
anyone. It would be great if you could take 5 mins to read and respond 
to this. I have been reading through the fts3 code, but there is a lot 
of it and I'm not sure how it works at a big-picture level, let alone in any 
of its details. My assumption is that there are no stats at the moment, 
but I'm not opposed to writing some code that cycles through the indexes 
and tables to compute some stats for my needs or even the general case 
if I had some pointers on:


1) where the words are indexed
2) how I can cycle through the words in C to build some stats

Pointers to existing code would be fine, and I can focus on reading that code.

So from reading through the various fts documents and posts and the code 
I think I understand that there is a blob with a structure something like:

word [doc_id, offset, offset, ...], [doc_id, offset, ...], ...

Is this correct? Where is this stored?
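
To make the layout above concrete, here is a minimal in-memory sketch in
Python. It models only the logical shape (a word, then a list of document
ids, each with its token offsets). It is not the actual fts3 on-disk
encoding, and build_postings and the whitespace tokenizer are hypothetical
stand-ins chosen for illustration.

```python
from collections import defaultdict

def build_postings(docs):
    """Map each word to {doc_id: [token offsets]} (logical layout only)."""
    postings = defaultdict(dict)
    for doc_id, text in docs.items():
        # Assumption: lowercase + whitespace split stands in for the real tokenizer.
        for offset, word in enumerate(text.lower().split()):
            postings[word].setdefault(doc_id, []).append(offset)
    return dict(postings)

# Made-up sample rows, keyed by document id.
docs = {1: "12 Main St", 2: "40 Main Street", 3: "7 St James St"}
postings = build_postings(docs)
```

Here postings["st"] holds one entry per document containing "st", and each
entry lists the positions where the token occurs in that document.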

Best regards,
   -Stephen Woodbridge

-------- Original Message --------
Subject: [sqlite] FTS statistics and stemming
Date: Sat, 05 Jul 2008 23:30:55 -0500
From: Stephen Woodbridge <[EMAIL PROTECTED]>
Reply-To: General Discussion of SQLite Database <sqlite-users@sqlite.org>
To: General Discussion of SQLite Database <sqlite-users@sqlite.org>

Hi,

First let me say that FTS3 is really awesome. This is my first
experience playing with FTS and it works very nicely with the PORTER
stemming.

My particular use for FTS is not document text but addresses and it
would be very useful if there were a way to analyze the FTS index to get
statistics on the keys. I could then use this information to make a
custom parser/stemmer that could eliminate stop words.

For example, Rd, road, st, street, etc. would be over-represented and
not very discriminating, so these should/could be removed. Ideally this
list should be generated by loading the data, then analyzing the
index, then updating the stemmer to remove the new stop words, and again
analyzing and adjusting if needed.
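
The load / analyze / update / re-analyze loop described above could look
roughly like the following Python sketch. It works on plain strings rather
than on the FTS index itself; the function names, the whitespace tokenizer,
and the 50% threshold are all assumptions made for illustration.

```python
from collections import Counter

def document_frequency(rows, stop_words):
    """Count, for each token, how many rows contain it at least once."""
    df = Counter()
    for text in rows:
        # A set per row, so a token repeated within one row counts once.
        tokens = {t for t in text.lower().split() if t not in stop_words}
        df.update(tokens)
    return df

def refine_stop_words(rows, max_fraction=0.5, max_rounds=5):
    """Repeat: analyze, add over-represented tokens to the stop list, re-analyze."""
    stop_words = set()
    for _ in range(max_rounds):
        df = document_frequency(rows, stop_words)
        common = {t for t, n in df.items() if n / len(rows) > max_fraction}
        if not common:
            break  # nothing left above the threshold
        stop_words |= common
    return stop_words

# Made-up sample addresses.
rows = ["12 Main St", "40 Oak St", "7 Elm Rd", "9 Pine St", "3 Oak Rd"]
stop_words = refine_stop_words(rows)
```

With a fixed threshold the second pass only confirms that nothing else
qualifies; the loop matters more if the threshold is adjusted between
passes.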

Is this possible? How?

If I had to code this, where would I start? I would like to get a list of
the keys and a count of how many rows a given key is represented
in. I assume a token that appears multiple times in a document is
represented by a list of offsets, so I can also get a count of the
number of times it shows up in each document somehow. I think I have figured
this much out by reading all the posts on FTS in the archive.
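
As a sketch of the two counts described above, assuming the index can be
read into a word -> {doc_id: [offsets]} mapping (that mapping and the
term_stats name are hypothetical, not something fts3 is known to expose in
this form):

```python
def term_stats(postings):
    """Return {word: (rows_containing_word, total_occurrences)}."""
    stats = {}
    for word, doclist in postings.items():
        rows = len(doclist)  # one doclist entry per row containing the word
        total = sum(len(offsets) for offsets in doclist.values())
        stats[word] = (rows, total)
    return stats

# Hypothetical postings: word -> {doc_id: [token offsets]}
postings = {
    "st": {1: [2], 3: [1, 3]},
    "main": {1: [1], 2: [1]},
}
stats = term_stats(postings)
```

The per-document count of a token is just the length of its offset list for
that document, e.g. len(postings["st"][3]).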

Thanks,
    -Steve
_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users