[sqlite] contribution: fts3 porter stemmer enhancements to handle common european accents

James Berry Wed, 27 Jan 2010 19:54:13 -0800

I'd like to contribute for potential inclusion, or to help out others in the 
community, a small set of enhancements I've made to the porter tokenizer. This 
implementation shares most of its code with the current porter tokenizer, as 
the changes are really just in the tokenizer prior to the stemming operation. 
This small patch implements an additional tokenizer, which I am calling 
"porterPlus", for lack of further inspiration.


The code is based on several observations made while attempting to use the 
current porter tokenizer on a common English/UTF-8 dataset:

 - There are a limited number of accented characters common in English text.

 - If the accents simply weren't there, the words would be stemmed 
appropriately, but the porter stemmer gives up on a word when it sees any 
non-ASCII (multi-byte UTF-8) characters, leading to perceived failures in the 
search queries.

 - The porter stemmer, by its very nature, is not intended to work for 
non-English text, so we can write off the major part of the UTF-8 character 
set, while concentrating on major improvements to those characters involved in 
common European languages, particularly those that have been adopted into 
English usage.

 - Additionally, there are a number of punctuation characters commonly rendered 
in UTF-8 that are missed by the regular porter tokenizer (hyphens and 
typographic quotes are good examples).

This small patch does the following:

        - Defines a new tokenizer "porterPlus" which shares most of its code 
with the regular porter tokenizer

        - Identifies a small subset of UTF-8 characters for special handling. 
In the case of common accented varieties of regular ASCII characters, the 
accents are dropped, leaving the unaccented character only. For instance, sauté 
is converted to saute. The resultant word is passed as usual into the porter 
stemmer.

        - Also identifies a small subset of UTF-8 characters to treat as 
delimiters (hyphens, typographic quotes, etc.), as they would otherwise be 
treated as part of another token, leading to search failures.
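
For readers without the attachment, the two rules above can be sketched 
roughly as follows. This is not the code from the patch: the names pp_decode, 
pp_classify and pp_normalize are hypothetical, the fold table (Latin-1 
letters U+00C0..U+00FF) and the delimiter list (U+2010..U+2014 and the 
typographic quotes) are my assumptions about what a "small subset" might 
contain, and delimiters are rewritten to a space here, whereas a real 
tokenizer would simply end the current token.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical classes for a decoded code point. */
enum pp_class { PP_KEEP, PP_FOLD, PP_DELIM };

/* Decode one UTF-8 sequence of 1 to 3 bytes at s; n is the number of bytes
   left (n >= 1). Returns the bytes consumed. Malformed, overlong and 4-byte
   input yields U+FFFD with length 1 so the raw byte is passed through. */
static size_t pp_decode(const unsigned char *s, size_t n, unsigned *cp)
{
    if (s[0] < 0x80) { *cp = s[0]; return 1; }
    if ((s[0] & 0xE0) == 0xC0 && n >= 2 && (s[1] & 0xC0) == 0x80) {
        *cp = ((s[0] & 0x1Fu) << 6) | (s[1] & 0x3Fu);
        if (*cp >= 0x80) return 2;
    } else if ((s[0] & 0xF0) == 0xE0 && n >= 3 &&
               (s[1] & 0xC0) == 0x80 && (s[2] & 0xC0) == 0x80) {
        *cp = ((s[0] & 0x0Fu) << 12) | ((s[1] & 0x3Fu) << 6) | (s[2] & 0x3Fu);
        if (*cp >= 0x800) return 3;
    }
    *cp = 0xFFFD;
    return 1;
}

/* Assumed policy: fold accented Latin-1 letters to their base letter and
   treat Unicode hyphens, dashes and curly quotes as delimiters. */
static enum pp_class pp_classify(unsigned cp, char *ascii)
{
    /* One entry per code point U+00C0..U+00FF; '.' means "leave alone". */
    static const char fold[] =
        "AAAAAA.CEEEEIIII" ".NOOOOO.OUUUUY.."
        "aaaaaa.ceeeeiiii" ".nooooo.ouuuuy.y";

    if (cp >= 0xC0 && cp <= 0xFF && fold[cp - 0xC0] != '.') {
        *ascii = fold[cp - 0xC0];
        return PP_FOLD;
    }
    if (cp >= 0x2010 && cp <= 0x2014) return PP_DELIM;  /* hyphens, dashes */
    if (cp == 0x2018 || cp == 0x2019 ||                 /* single quotes */
        cp == 0x201C || cp == 0x201D)                   /* double quotes */
        return PP_DELIM;
    return PP_KEEP;
}

/* Write a normalized copy of the NUL-terminated string in to out (capacity
   cap, including the NUL). Folded letters become ASCII, delimiters become a
   single space, all other bytes are copied unchanged. Returns the length. */
static size_t pp_normalize(const char *in, char *out, size_t cap)
{
    const unsigned char *s = (const unsigned char *)in;
    size_t n = strlen(in), i = 0, o = 0;

    assert(cap > n);  /* the output never grows, so n + 1 bytes suffice */
    while (i < n) {
        unsigned cp;
        char a = 0;
        size_t len = pp_decode(s + i, n - i, &cp);

        switch (pp_classify(cp, &a)) {
        case PP_FOLD:  out[o++] = a;   break;
        case PP_DELIM: out[o++] = ' '; break;
        default:       memcpy(out + o, s + i, len); o += len; break;
        }
        i += len;
    }
    out[o] = '\0';
    return o;
}
```

With this sketch, passing the UTF-8 bytes of "sauté" to pp_normalize gives 
"saute", and a word pair joined by U+2010 comes out separated by a space, so 
each half can be tokenized and stemmed on its own.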

In our use so far, these small changes have meant that we now normalize away 
all of the important UTF-8 characters in our input text, which gives us 100% 
searchability of significant input tokens.

The patch (to the 3.6.22 amalgamation) is attached.

James

_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
