Good question about transliteration. The issue has to do with recall: for example, I can write 'Toyota' as 'トヨタ' or 'とよた' (Katakana and Hiragana respectively), so skipping the transliteration will miss results. You will find that the big search engines do the transliteration for you automatically. This issue gets even more complicated when you dig into orthographic variation, because Japanese orthography is highly variable (i.e. there is more than one way to write a 'word'), as is tokenization (i.e. there is more than one way to tokenize it); see:
http://www.cjk.org/cjk/reference/japvar.htm

I have used the Basis Technology software in the past; it is very good, but it is also very expensive.

François

On Mar 11, 2011, at 11:53 AM, Walter Underwood wrote:

> Why not index it as-is? Solr can handle Unicode.
>
> Transliterating hiragana to katakana is a very weird idea. I cannot imagine how that would help.
>
> You will need some sort of tokenization to find word boundaries. N-grams work OK for search, but are really ugly for highlighting.
>
> As far as I know, there are no good-quality free tokenizers for Japanese. Basis Technology sells Japanese support that works with Lucene and Solr.
>
> wunder
>
> On Mar 11, 2011, at 8:09 AM, François Schiettecatte wrote:
>
>> Tomás
>>
>> That won't really work: transliteration to Romaji works for individual terms only, so you would need to tokenize the Japanese prior to transliteration. I am not sure what tool you plan to use for transliteration; I have used ICU in the past, and from what I can tell it does not transliterate Kanji. Besides, transliterating Kanji is debatable for a variety of reasons.
>>
>> What I would suggest is that you transliterate Hiragana to Katakana, leave the Kanji alone, and index/search using n-grams. If you want 'proper' tokenization, I would recommend MeCab.
>>
>> I have looked into this for a client and there is no clear-cut solution.
>>
>> Cheers
>>
>> François
>>
>>
>> On Mar 11, 2011, at 10:29 AM, Tomás Fernández Löbbe wrote:
>>
>>> This question is probably not completely a Solr question, but it's related. I'm dealing with a Japanese Solr application in which I would like to be able to search in any of the Japanese alphabets, and the content can also be in any Japanese alphabet. I've been thinking of this solution: convert everything to roma-ji, at index time and at query time.
>>>
>>> For example:
>>>
>>> Indexing time:
>>> [Something in Hiragana] --> translate it to roma-ji --> index
>>>
>>> Searching time:
>>> [Something in Katakana] --> translate it to roma-ji --> search
>>> or
>>> [Something in Kanji] --> translate it to roma-ji --> search
>>>
>>> I don't have a deep understanding of Japanese, and that's my problem. Has anybody on the list tried something like this before? Did it work?
>>>
>>> Thanks,
>>>
>>> Tomás
>
> --
> Walter Underwood
> Venture ASM, Troop 14, Palo Alto
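P.S. To make the Hiragana-to-Katakana suggestion concrete, here is a minimal sketch (my own illustration, not code from any of the tools mentioned): standard Hiragana and Katakana occupy parallel Unicode ranges separated by a fixed offset of 0x60, so folding one script into the other is a per-character codepoint shift, after which character bigrams (a common n-gram unit for CJK indexing) match across scripts. The `hira_to_kata` and `bigrams` names are mine, chosen for the example.

```python
# Fold Hiragana into Katakana, then match with character bigrams.
# Standard Hiragana: U+3041..U+3096; Katakana counterparts sit +0x60 higher.
HIRAGANA_START, HIRAGANA_END = 0x3041, 0x3096
KATA_OFFSET = 0x60

def hira_to_kata(text: str) -> str:
    """Map each Hiragana character to its Katakana counterpart."""
    return "".join(
        chr(ord(ch) + KATA_OFFSET)
        if HIRAGANA_START <= ord(ch) <= HIRAGANA_END
        else ch
        for ch in text
    )

def bigrams(text: str) -> set:
    """Character bigrams, the n-gram unit commonly used for CJK indexing."""
    return {text[i:i + 2] for i in range(len(text) - 1)}

# 'Toyota' written two ways: Katakana in the index, Hiragana in the query.
indexed = "トヨタ"
query = "とよた"

# Without normalization the two scripts share no bigrams at all...
print(sorted(bigrams(indexed) & bigrams(query)))                 # []
# ...after folding the query's Hiragana to Katakana, they match exactly.
print(sorted(bigrams(indexed) & bigrams(hira_to_kata(query))))   # ['トヨ', 'ヨタ']
```

This only covers the plain kana range; small kana and voiced marks fall inside it, but half-width Katakana and Kanji are untouched, which is in line with the "leave the Kanji alone" advice above. In Solr itself the same effect would come from a char filter plus an n-gram tokenizer rather than hand-rolled code.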