Hi Horacio,

Thanks for all your help so far.

Reformatted excerpts from Horacio Sanson's message of 2011-05-04:
> After some hacking I got a Heliotrope server that works perfectly with
> Japanese text. All I did was follow your comments
> and applied the MeCab tokenizer to the message body and query strings
> before passing them to Whistlepig or, more specifically,
> to Heliotrope::Index.

Great!

> There is one problem I don't see how to handle... I do receive email
> in Japanese but also Chinese and Korean. I need a different
> tokenizer for each one and I have no idea how to handle this. Do email
> messages contain a language header that would allow me
> to identify the language and pass it to the corresponding tokenizer??

There's not a great way to do this in email. You can look at the
Content-Type header, which is sometimes present, and that will sometimes
give you a clue (a charset such as ISO-2022-JP suggests Japanese, for
example). But it's usually useless.

You can write some heuristics by hand, of course. Or you can try naive
Bayes, which performs pretty well on this type of task. It looks like
someone just started a Ruby project here: https://github.com/fela/rlid.
It seems to only have European languages so far, but you can probably
just dump in some CJK text and retrain.

As for your patches: I've applied a related patch to fix the encoding
issue with Query#parsed_query_s. Can you let me know if that works?

Rather than sticking MeCab directly in Heliotrope, I am going to make a
hook for users to plug in their own custom tokenization code like you're
doing.

-- 
William <wmorgan-...@masanjin.net>
_______________________________________________
Sup-devel mailing list
Sup-devel@rubyforge.org
http://rubyforge.org/mailman/listinfo/sup-devel
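
P.S. To make the "heuristics by hand" idea concrete, here is a rough
sketch that guesses the language of a UTF-8 string by counting characters
per Unicode script. The names guess_cjk_language, TOKENIZERS and tokenize
are made up for illustration; none of this is Heliotrope or Whistlepig
API, and the whitespace splitter only stands in for a real tokenizer such
as MeCab.

```ruby
# Hypothetical helper: guess the language from the Unicode scripts present.
# Assumes the text has already been decoded to UTF-8.
def guess_cjk_language(text)
  kana   = text.scan(/[\p{Hiragana}\p{Katakana}]/).size
  hangul = text.scan(/\p{Hangul}/).size
  han    = text.scan(/\p{Han}/).size

  return :other    if kana.zero? && hangul.zero? && han.zero?
  return :korean   if hangul > kana   # Hangul dominates
  return :japanese if kana > 0        # any kana strongly suggests Japanese
  :chinese                            # Han characters only
end

# Hypothetical dispatch table. Unknown languages fall back to a plain
# whitespace split; register real tokenizers per language as needed.
TOKENIZERS = Hash.new(->(s) { s.split })

def tokenize(text, tokenizers = TOKENIZERS)
  tokenizers[guess_cjk_language(text)].call(text)
end

guess_cjk_language("こんにちは")  # => :japanese
guess_cjk_language("한국어")      # => :korean
guess_cjk_language("中文")        # => :chinese
```

The obvious weakness is that a Japanese message written only in kanji
(a short subject line, say) comes out as :chinese, which is where a
trained classifier should do better than a hand-written rule.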