Re: [sqlite] FTS statistics and stemming

Stephen Woodbridge Sat, 26 Jul 2008 09:37:14 -0700

Alexey Pechnikov wrote:
> Hello!
> 
> And how about stemming? Can I'm using ispell, myspell, hunspell? And trigrams?


Hello Alexey,

There are currently two stemmers available:

simple - which just breaks the input into tokens based on spaces and 
punctuation that is first converted to spaces.

porter - the porter stemmer which is based on English and collapses the 
extensions of many words to the base word, like: runs, runner, running 
to run (not sure if it works for this sequence but you get the idea).
http://snowball.tartarus.org/algorithms/english/stemmer.html

You can code additional stemmers and register them for use with the fts 
system. You might want to look at:
http://snowball.tartarus.org/
which has additional stemmers written in snowball for many languages and 
these have C code that can be adapted for use with SQLite fts.

I'm not familiar with the use of ispell, myspell, hunspell are all 
spelling checkers, I'm not sure if they have built-in stemmers that 
could be adapted. The idea behind the stemmer is to reduce variants of a 
word to its root, so that there is a higher probability of matching. For 
example if someone searches for the word "geocoder" it is reasonable to 
assume they would also be interested in documents with the word 
"geocode" or "geocoding".

Trigrams I think are more valuable for fuzzy match but I do not think 
they can be used for FTS because of the way it works. For example if you 
want to search for "fuzy" and hope to find "fuzzy" using trigrams, I 
assume you would do something like this:

fuzy => _fu, fuz, uzy, zy_

You would then search for all documents that had this set of trigrams, 
but a document with fuzzy in it would have:

fuzzy => _fu, fuz, uzz, zzy, zy_

So while 3 of the trigrams match, you would not get a match for "uzy" so 
the document would not get selected.

This is in fact where a spelling checker/correcter might be useful if it 
could check that fuzy was misspelled and correct it, but most spelling 
correction needs human interaction to validate the change and the 
interaction would need to be done at the application layer and probably 
not at the database layer. This also begs the question of searching for 
documents the have spelling errors in them.

One idea I had with respect to fuzzy search using FTS and trigrams that 
if it could search documents that had some percentage of the trigrams like:

MATCH 'fuzy' AT 75%  (3 of 4 trigrams)
MATCH 'fuzy' AT 50%  (2 of 4 trigrams)

then it would find all documents that had 3 of the 4 trigrams present 
and rank the results based on number of matched trigrams. Obviously for 
larger words or multiple words this would work better.

This would provide some for a fuzzy search capability given the existing 
FTS machinery and the cost of significantly more index entries and may 
be additional search time because of the additional entries. This would 
also require some additional post processing to filter out documents 
that were selected because trigrams from multiple unrelated words 
matched your selection set. I'm not convinced this would work in reality 
because bigram/trigram comparison works better for fuzzy match of words 
and the pollution of additional terms from other words in a document may 
not make it useful.

I have thought a lot about these issues and would appreciate any 
thoughts or ideas on how to implement any of these concepts or others 
for fuzzy searching and matching.

-Steve
_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

Re: [sqlite] FTS statistics and stemming

Reply via email to