Great, let me know when you have the modifications so I can stress test them.
regards,
Horacio

On Thu, May 5, 2011 at 1:56 AM, William Morgan <wmorgan-...@masanjin.net> wrote:
> Hi Horacio,
>
> Thanks for all your help so far.
>
> Reformatted excerpts from Horacio Sanson's message of 2011-05-04:
>> After some hacking I got a Heliotrope server that works perfectly with
>> Japanese text. All I did was follow your comments
>> and apply the MeCab tokenizer to the message body and query strings
>> before passing them to Whistlepig, or more specifically
>> to Heliotrope::Index.
>
> Great!
>
>> There is one problem I don't see how to handle... I do receive email
>> in Japanese but also Chinese and Korean. I need a different
>> tokenizer for each one and I have no idea how to handle this. Do email
>> messages contain a language header that would allow me
>> to identify the language and pass it to the corresponding tokenizer?
>
> There's not a great way to do this in email. You can look at the
> content-type header, which is sometimes present, and that will
> sometimes give you a clue. But it's usually useless.
>
> You can write some heuristics by hand, of course. Or you can try naive
> Bayes, which performs pretty well on this type of task. It looks like
> someone just started a Ruby project here: https://github.com/fela/rlid.
> It seems to only have European languages so far, but you can probably
> just dump in some CJK text and retrain.
>
> As for your patches: I've applied a related patch to fix the encoding
> issue with Query#parsed_query_s. Can you let me know if that works?
>
> Rather than sticking MeCab directly in Heliotrope, I am going to make a
> hook for users to plug in their own custom tokenization code like you're
> doing.
> --
> William <wmorgan-...@masanjin.net>
> _______________________________________________
> Sup-devel mailing list
> Sup-devel@rubyforge.org
> http://rubyforge.org/mailman/listinfo/sup-devel
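
For reference, the hand-written heuristics suggested in the quoted message could start from something like the sketch below. It is only an illustration, not code from Heliotrope or rlid: the names guess_cjk_language, TOKENIZERS and tokenize are hypothetical, and the per-language tokenizers are whitespace-splitting placeholders standing in for real ones such as MeCab. It guesses the language from the Unicode scripts found in the text, so a short Japanese message written only in kanji would be misclassified as Chinese.

```ruby
# Guess the language from the Unicode scripts present in the text.
# Kana implies Japanese, Hangul implies Korean, and Han characters with
# neither of those are treated as Chinese. Anything else is :other.
def guess_cjk_language(text)
  return :ja if text =~ /[\p{Hiragana}\p{Katakana}]/
  return :ko if text =~ /\p{Hangul}/
  return :zh if text =~ /\p{Han}/
  :other
end

# Placeholder tokenizer (assumption): a real setup would call MeCab or a
# comparable tokenizer for each language instead of splitting on whitespace.
WHITESPACE_TOKENIZER = ->(s) { s.split }

# Hypothetical lookup table mapping a guessed language to its tokenizer.
TOKENIZERS = {
  ja: WHITESPACE_TOKENIZER,
  ko: WHITESPACE_TOKENIZER,
  zh: WHITESPACE_TOKENIZER,
  other: WHITESPACE_TOKENIZER
}

# Pick the tokenizer for the guessed language and apply it.
def tokenize(text)
  TOKENIZERS.fetch(guess_cjk_language(text)).call(text)
end
```

The same lookup could sit behind a user-supplied tokenization hook, with a trained classifier replacing guess_cjk_language if the script check proves too coarse.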