Alexey Pechnikov wrote: > Hello! > > And how about stemming? Can I'm using ispell, myspell, hunspell? And trigrams?
Hello Alexey, There are currently two stemmers available: simple - which just breaks the input into tokens based on spaces and punctuation that is first converted to spaces. porter - the porter stemmer which is based on English and collapses the extensions of many words to the base word, like: runs, runner, running to run (not sure if it works for this sequence but you get the idea). http://snowball.tartarus.org/algorithms/english/stemmer.html You can code additional stemmers and register them for use with the fts system. You might want to look at: http://snowball.tartarus.org/ which has additional stemmers written in snowball for many languages and these have C code that can be adapted for use with SQLite fts. I'm not familiar with the use of ispell, myspell, hunspell are all spelling checkers, I'm not sure if they have built-in stemmers that could be adapted. The idea behind the stemmer is to reduce variants of a word to its root, so that there is a higher probability of matching. For example if someone searches for the word "geocoder" it is reasonable to assume they would also be interested in documents with the word "geocode" or "geocoding". Trigrams I think are more valuable for fuzzy match but I do not think they can be used for FTS because of the way it works. For example if you want to search for "fuzy" and hope to find "fuzzy" using trigrams, I assume you would do something like this: fuzy => _fu, fuz, uzy, zy_ You would then search for all documents that had this set of trigrams, but a document with fuzzy in it would have: fuzzy => _fu, fuz, uzz, zzy, zy_ So while 3 of the trigrams match, you would not get a match for "uzy" so the document would not get selected. This is in fact where a spelling checker/correcter might be useful if it could check that fuzy was misspelled and correct it, but most spelling correction needs human interaction to validate the change and the interaction would need to be done at the application layer and probably not at the database layer. This also begs the question of searching for documents the have spelling errors in them. One idea I had with respect to fuzzy search using FTS and trigrams that if it could search documents that had some percentage of the trigrams like: MATCH 'fuzy' AT 75% (3 of 4 trigrams) MATCH 'fuzy' AT 50% (2 of 4 trigrams) then it would find all documents that had 3 of the 4 trigrams present and rank the results based on number of matched trigrams. Obviously for larger words or multiple words this would work better. This would provide some for a fuzzy search capability given the existing FTS machinery and the cost of significantly more index entries and may be additional search time because of the additional entries. This would also require some additional post processing to filter out documents that were selected because trigrams from multiple unrelated words matched your selection set. I'm not convinced this would work in reality because bigram/trigram comparison works better for fuzzy match of words and the pollution of additional terms from other words in a document may not make it useful. I have thought a lot about these issues and would appreciate any thoughts or ideas on how to implement any of these concepts or others for fuzzy searching and matching. -Steve _______________________________________________ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users