I'd like to contribute, for potential inclusion or simply to help out others in the community, a small set of enhancements I've made to the porter tokenizer. The implementation shares most of its code with the current porter tokenizer, since the changes are confined to the tokenizing step that runs before stemming. The patch adds one new tokenizer, which I am calling "porterPlus", for lack of further inspiration.
The code is based on several observations made while attempting to use the current porter tokenizer on a common English/UTF-8 dataset:

- There are a limited number of accented characters common in English text.
- If the accents simply weren't there, the words would be stemmed appropriately, but the porter stemmer gives up on a word when it sees any non-ASCII UTF-8 character, leading to perceived failures in the search queries.
- The porter stemmer, by its very nature, is not intended to work for non-English text, so we can write off the major part of the UTF-8 character set and concentrate on the characters used in common European languages, particularly those that have been adopted into English usage.
- Additionally, there are a number of punctuation characters commonly rendered in UTF-8 that are missed by the regular porter tokenizer (hyphen and typographic quotes are good examples).

This small patch does the following:

- Defines a new tokenizer, "porterPlus", which shares most of its code with the regular porter tokenizer.
- Identifies a small subset of UTF-8 characters for special handling. In the case of common accented varieties of regular ASCII characters, the accents are dropped, leaving the unaccented character only. For instance, sauté is converted to saute. The resultant word is passed as usual into the porter stemmer.
- Also identifies a small subset of UTF-8 characters to treat as delimiters (hyphen, typographic quotes, etc.), as they would otherwise be treated as part of another token, leading to search failures.

In our use so far, these small changes have meant that we now normalize away all of the important UTF-8 characters in our input text, which gives us 100% searchability of significant input tokens.

The patch (to the 3.6.22 amalgamation) is attached.

James
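For readers who want a feel for the approach without opening the patch, the two character-handling steps described above can be sketched roughly as follows. This is not the code from the attached patch: the function name fold_utf8, the table kFolds, and the particular characters listed are all assumptions made for illustration, and the real tokenizer may well do this inline while scanning for token boundaries. The sketch simply rewrites a UTF-8 string so that a handful of accented letters become their ASCII base letter and a handful of typographic punctuation marks become a space, which an ASCII-only tokenizer would then treat as a delimiter.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical table (not taken from the patch): a UTF-8 byte sequence
** and the single ASCII byte that replaces it. A replacement of " "
** turns the sequence into a token delimiter. */
struct fold { const char *utf8; const char *ascii; };

static const struct fold kFolds[] = {
  {"\xC3\xA0", "a"},      /* U+00E0 a with grave */
  {"\xC3\xA1", "a"},      /* U+00E1 a with acute */
  {"\xC3\xA2", "a"},      /* U+00E2 a with circumflex */
  {"\xC3\xA7", "c"},      /* U+00E7 c with cedilla */
  {"\xC3\xA8", "e"},      /* U+00E8 e with grave */
  {"\xC3\xA9", "e"},      /* U+00E9 e with acute */
  {"\xC3\xAA", "e"},      /* U+00EA e with circumflex */
  {"\xC3\xAF", "i"},      /* U+00EF i with diaeresis */
  {"\xC3\xB1", "n"},      /* U+00F1 n with tilde */
  {"\xC3\xB4", "o"},      /* U+00F4 o with circumflex */
  {"\xC3\xBC", "u"},      /* U+00FC u with diaeresis */
  {"\xE2\x80\x90", " "},  /* U+2010 hyphen */
  {"\xE2\x80\x98", " "},  /* U+2018 left single quote */
  {"\xE2\x80\x99", " "},  /* U+2019 right single quote */
  {"\xE2\x80\x9C", " "},  /* U+201C left double quote */
  {"\xE2\x80\x9D", " "},  /* U+201D right double quote */
};

/* Copy the NUL-terminated string `in` to `out`, replacing every
** sequence listed in kFolds with its one-byte ASCII substitute.
** Bytes that match nothing in the table are copied unchanged.
** Output is truncated to fit `outsz` and is always NUL-terminated.
** Returns the number of bytes written, not counting the NUL. */
size_t fold_utf8(const char *in, char *out, size_t outsz) {
  size_t n = 0;
  assert(in != NULL && out != NULL && outsz > 0);
  while (*in != '\0' && n + 1 < outsz) {
    size_t i;
    int matched = 0;
    for (i = 0; i < sizeof(kFolds) / sizeof(kFolds[0]); i++) {
      size_t len = strlen(kFolds[i].utf8);
      if (strncmp(in, kFolds[i].utf8, len) == 0) {
        out[n++] = kFolds[i].ascii[0];  /* replacements are one byte */
        in += len;
        matched = 1;
        break;
      }
    }
    if (!matched) out[n++] = *in++;
  }
  out[n] = '\0';
  return n;
}
```

With this table, passing the UTF-8 bytes for sauté to fold_utf8 gives back saute, which the stemmer can then handle as an ordinary ASCII word. A linear scan over the table is slow but easy to read; with a larger character set, a lookup keyed on the lead byte would probably be a better fit.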
_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users