Forgot to mention that you need the MeCab Ruby binding. On Ubuntu 10.04 it is packaged in the distribution and can be installed with:
  sudo apt-get install libmecab-ruby1.8 libmecab-ruby1.9.1 mecab-ipadic-utf8

regards,
Horacio

On Wed, May 4, 2011 at 10:42 AM, Horacio Sanson <hsan...@gmail.com> wrote:
> ChaSen is the worst tokenizer; it is pretty old. The best one is MeCab,
> which is faster and comes from the same author as ChaSen. You can see
> all the major Japanese tokenizers in action at this URL:
> http://nomadscafe.jp/test/keitaiso/index.cgi. Just put some text in the
> box and press the button.
>
> After some hacking I got a Heliotrope server that works perfectly with
> Japanese text. All I did was follow your comments and apply the MeCab
> tokenizer to the message body and query strings before passing them to
> Whistlepig, or more specifically to Heliotrope::Index. (A sketch of the
> hook is at the bottom of this mail.)
>
> There is one problem I don't see how to handle: I receive email in
> Japanese, but also in Chinese and Korean. I need a different tokenizer
> for each one, and I have no idea how to handle this. Do email messages
> contain a language header that would allow me to identify the language
> and pass the text to the corresponding tokenizer? (A crude workaround is
> sketched at the bottom of this mail.)
>
> regards,
> Horacio
>
> On Wed, May 4, 2011 at 7:26 AM, William Morgan <wmorgan-...@masanjin.net> wrote:
>> Reformatted excerpts from Horacio Sanson's message of 2011-05-03:
>>> index = Index.new "index"           => #<Whistlepig::Index:0x00000002093f60>
>>> entry1 = Entry.new                  => #<Whistlepig::Entry:0x0000000207d328>
>>> entry1.add_string "body", "研究会"  => #<Whistlepig::Entry:0x0000000207d328>
>>> docid1 = index.add_entry entry1     => 1
>>> q1 = Query.new "body", "研究"       => body:"研究"
>>> results1 = index.search q1          => []
>>
>> The problem here is tokenization. Whistlepig only provides a very simple
>> tokenizer; namely, it looks for space-separated things [1]. So you have
>> to space-separate your tokens in both the indexing and querying stages,
>> e.g.:
>>
>>   entry1.add_string "body", "研 究 会"  => #<Whistlepig::Entry:0x90b873c>
>>   docid1 = index.add_entry entry1       => 1
>>   q1 = Query.new "body", "研 究"        => AND body:"研" body:"究"
>>   q1 = Query.new "body", "\"研 究\""    => PHRASE body:"研" body:"究"
>>   results1 = index.search q1            => [1]
>>
>> For Japanese, proper tokenization is tricky. You could simply
>> space-separate every character and deal with the spurious matches across
>> word boundaries. Or you could do it right by plugging in a proper
>> tokenizer, e.g. something like
>> http://www.chasen.org/~taku/software/TinySegmenter/.
>>
>> [1] It also strips any prefix or suffix characters that match [:punct:].
>> This is all pretty ad hoc and undocumented. Providing a simpler
>> whitespace-only tokenizer as an alternative is in the works.
>> --
>> William <wmorgan-...@masanjin.net>
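
P.S. For anyone who wants to try this, below is roughly the hook described
above. Take it as a sketch rather than the exact code: the helper name
tokenize_ja is mine, and it assumes Ruby 1.9, but the MeCab::Tagger calls
are the standard SWIG binding that the libmecab-ruby packages install.

  # encoding: utf-8
  require 'MeCab'   # SWIG binding from libmecab-ruby

  # One shared tagger; "-Owakati" makes MeCab return the input as
  # space-separated tokens, which is exactly what Whistlepig expects.
  JA_TAGGER = MeCab::Tagger.new("-Owakati")

  def tokenize_ja(text)
    # Under Ruby 1.9 the binding may hand back binary-tagged strings,
    # so re-tag the result as UTF-8 before passing it on.
    JA_TAGGER.parse(text).force_encoding("UTF-8").strip
  end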
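
Plugged into Whistlepig it looks like William's session above, except that
both the document and the query go through the same tokenizer, so the
terms actually line up:

  require 'whistlepig'
  include Whistlepig

  index = Index.new "index"

  entry = Entry.new
  entry.add_string "body", tokenize_ja("研究会")   # indexed as MeCab tokens
  docid = index.add_entry entry                    # => 1

  q = Query.new "body", tokenize_ja("研究")
  index.search q   # => [1], provided MeCab splits 研究会 into 研究 + 会;
                   # how compounds split depends on the dictionary (ipadic here)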
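
P.P.S. On the Chinese/Korean question: mail can carry a Content-Language
header (RFC 3282), but in practice almost nothing sets it, so the realistic
option is to guess from the characters themselves. A rough sketch, assuming
Ruby 1.9 (for the \p{...} script properties): any kana means Japanese,
hangul means Korean, and han with no kana is assumed to be Chinese. Chinese
and Korean fall back to William's per-character splitting until a real
segmenter is plugged in.

  # encoding: utf-8

  def guess_cjk_language(text)
    case text
    when /\p{Hiragana}|\p{Katakana}/ then :ja   # kana occurs only in Japanese
    when /\p{Hangul}/                then :ko
    when /\p{Han}/                   then :zh   # han, no kana: assume Chinese
    else :other
    end
  end

  def tokenize(text)
    case guess_cjk_language(text)
    when :ja      then tokenize_ja(text)          # MeCab, as above
    when :zh, :ko then text.chars.to_a.join(" ")  # naive per-character split
    else text                                     # whitespace languages as-is
    end
  end

This misfires on mixed-language messages, but it is enough to route each
message body and query string to a matching tokenizer.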