Tomás,

That won't really work: transliteration to Romaji works for individual terms 
only, so you would need to tokenize the Japanese prior to transliteration. I am 
not sure what tool you plan to use for transliteration; I have used ICU in the 
past, and from what I can tell it does not transliterate Kanji. Besides, 
transliterating Kanji is debatable for a variety of reasons.
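
For example, with ICU4J (a rough, untested sketch; assumes the icu4j jar is on 
your classpath):

    import com.ibm.icu.text.Transliterator;

    public class RomajiDemo {
        public static void main(String[] args) {
            // Compound transform: romanize both kana scripts.
            Transliterator toLatin =
                Transliterator.getInstance("Hiragana-Latin; Katakana-Latin");

            // The kana come out romanized, but the Kanji (日本語) pass
            // through untouched -- no Japanese reading is applied to them.
            System.out.println(toLatin.transliterate("日本語のテキスト"));
        }
    }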

What I would suggest is that you transliterate Hiragana to Katakana, leave the 
Kanji alone, and index/search using n-grams (see the sketch below). If you want 
'proper' tokenization, I would recommend MeCab.
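
Something along these lines (again a rough ICU4J sketch, not production code; 
in Solr you would do the same in the analysis chain, e.g. the ICU transform 
filter from analysis-extras followed by an n-gram filter, applied identically 
at index and query time):

    import com.ibm.icu.text.Transliterator;
    import java.util.ArrayList;
    import java.util.List;

    public class KanaNgramDemo {
        // Fold Hiragana into Katakana so both kana scripts match the
        // same terms; Kanji passes through this transform untouched.
        static final Transliterator FOLD =
            Transliterator.getInstance("Hiragana-Katakana");

        // Character bigrams over the folded text. Assumes BMP text
        // (no surrogate pairs), which is fine for typical Japanese.
        static List<String> bigrams(String text) {
            String folded = FOLD.transliterate(text);
            List<String> grams = new ArrayList<String>();
            for (int i = 0; i < folded.length() - 1; i++) {
                grams.add(folded.substring(i, i + 2));
            }
            return grams;
        }

        public static void main(String[] args) {
            // A Hiragana query and Katakana content now produce
            // the same overlapping bigrams.
            System.out.println(bigrams("てきすと"));
            System.out.println(bigrams("テキスト"));
        }
    }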

I have looked into this for a client and there is no clear-cut solution.

Cheers

François


On Mar 11, 2011, at 10:29 AM, Tomás Fernández Löbbe wrote:

> This question is probably not completely a Solr question, but it's related to
> it. I'm dealing with a Japanese Solr application in which I would like to be
> able to search in any of the Japanese alphabets. The content can also be in
> any Japanese alphabet. I've been thinking of this solution: convert
> everything to roma-ji, at index time and at query time.
> For example:
> 
> Indexing time:
> [Something in Hiragana] --> translate it to roma-ji --> index
> 
> Searching time:
> [Something in Katakana] --> translate it to roma-ji --> search
> or
> [Something in Kanji] --> translate it to roma-ji --> search
> 
> I don't have a deep understanding of Japanese, and that's my problem. Has
> somebody on the list tried something like this before? Did it work?
> 
> 
> Thanks,
> 
> Tomás
