"the issue has to do with recall, for example, I can write 'Toyota' as 'トヨタ' or 'とよた' (Katakana and Hiragana respectively); not doing the transliteration will miss results."

Exactly, that's my problem: searching in a different alphabet than the one in which a document was indexed. François, thank you for your help. Have you used the new ICU filters? Do they work OK? (I know they don't handle Kanji.)
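(As an aside, the Hiragana-to-Katakana folding that the ICU transform filter performs is just a fixed codepoint offset, since the two kana blocks are laid out in parallel in Unicode. A minimal sketch of the idea in plain Python — illustrative only, not the ICU implementation:)

```python
# Hiragana (U+3041..U+3096) and Katakana occupy parallel Unicode blocks;
# the matching Katakana codepoint is exactly 0x60 above the Hiragana one.
HIRA_FIRST, HIRA_LAST = 0x3041, 0x3096
OFFSET = 0x60

def hiragana_to_katakana(text: str) -> str:
    """Fold Hiragana characters to Katakana; leave everything else alone."""
    return ''.join(
        chr(ord(ch) + OFFSET) if HIRA_FIRST <= ord(ch) <= HIRA_LAST else ch
        for ch in text
    )

print(hiragana_to_katakana('とよた'))  # トヨタ
```

Kanji and Latin characters pass through untouched, which is exactly the "transliterate Hiragana to Katakana, leave the Kanji alone" behavior suggested below.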
Tomás

2011/3/11 François Schiettecatte <fschietteca...@gmail.com>

> Good question about transliteration. The issue has to do with recall: for example, I can write 'Toyota' as 'トヨタ' or 'とよた' (Katakana and Hiragana respectively); not doing the transliteration will miss results. You will find that the big search engines do the transliteration for you automatically. This issue gets even more complicated when you dig into orthographic variation, because Japanese orthography is very variable (i.e. there is more than one way to write a 'word'), as is tokenization (i.e. there is more than one way to tokenize it). See:
>
> http://www.cjk.org/cjk/reference/japvar.htm
>
> I have used the Basis Technology software in the past; it is very good, but it is also very expensive.
>
> François
>
> On Mar 11, 2011, at 11:53 AM, Walter Underwood wrote:
>
> > Why not index it as-is? Solr can handle Unicode.
> >
> > Transliterating hiragana to katakana is a very weird idea. I cannot imagine how that would help.
> >
> > You will need some sort of tokenization to find word boundaries. N-grams work OK for search, but are really ugly for highlighting.
> >
> > As far as I know, there are no good-quality free tokenizers for Japanese. Basis Technology sells Japanese support that works with Lucene and Solr.
> >
> > wunder
> >
> > On Mar 11, 2011, at 8:09 AM, François Schiettecatte wrote:
> >
> >> Tomás
> >>
> >> That won't really work: transliteration to Romaji works for individual terms only, so you would need to tokenize the Japanese prior to transliteration. I am not sure what tool you plan to use for transliteration; I have used ICU in the past and, from what I can tell, it does not transliterate Kanji. Besides, transliterating Kanji is debatable for a variety of reasons.
> >>
> >> What I would suggest is that you transliterate Hiragana to Katakana, leave the Kanji alone, and index/search using n-grams. If you want 'proper' tokenization I would recommend MeCab.
> >>
> >> I have looked into this for a client and there is no clear-cut solution.
> >>
> >> Cheers
> >>
> >> François
> >>
> >>
> >> On Mar 11, 2011, at 10:29 AM, Tomás Fernández Löbbe wrote:
> >>
> >>> This question is probably not completely a Solr question, but it's related to it. I'm dealing with a Japanese Solr application in which I would like to be able to search in any of the Japanese alphabets. The content can also be in any Japanese alphabet. I've been thinking of this solution: convert everything to roma-ji, at index time and query time.
> >>> For example:
> >>>
> >>> Indexing time:
> >>> [Something in Hiragana] --> translate it to roma-ji --> index
> >>>
> >>> Searching time:
> >>> [Something in Katakana] --> translate it to roma-ji --> search
> >>> or
> >>> [Something in Kanji] --> translate it to roma-ji --> search
> >>>
> >>> I don't have a deep understanding of Japanese, and that's my problem. Has anybody on the list tried something like this before? Did it work?
> >>>
> >>> Thanks,
> >>>
> >>> Tomás
> >>
> >
> > --
> > Walter Underwood
> > Venture ASM, Troop 14, Palo Alto
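(Putting the thread's recommendation together — fold Hiragana to Katakana, leave Kanji alone, then index and query with character bigrams — a Hiragana query and a Katakana document end up sharing terms. A self-contained toy sketch in Python; a real deployment would do this inside Solr's analysis chain, not in application code:)

```python
def normalize(text: str) -> str:
    """Fold Hiragana (U+3041..U+3096) to Katakana via the fixed 0x60 offset."""
    return ''.join(
        chr(ord(c) + 0x60) if 0x3041 <= ord(c) <= 0x3096 else c
        for c in text
    )

def bigrams(text: str) -> set:
    """Overlapping character bigrams, the usual fallback tokenization
    for CJK text with no explicit word boundaries."""
    return {text[i:i + 2] for i in range(len(text) - 1)} or {text}

def matches(query: str, document: str) -> bool:
    """Count it as a hit if any query bigram appears among the document's bigrams."""
    return bool(bigrams(normalize(query)) & bigrams(normalize(document)))

# A Hiragana query now finds a document indexed in Katakana.
print(matches('とよた', 'トヨタ'))  # True
```

Without the `normalize` step the two spellings share no bigrams and the query misses, which is the recall problem the thread starts from.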