"the issue has to do with recall; for example, I can write 'Toyota' as 'トヨタ'
or 'とよた' (Katakana and Hiragana respectively), and not doing the
transliteration will miss results."
Exactly, that's my problem: searching in a different alphabet than the one
in which a document was indexed.
François, thank you for your help. Have you used the new ICU filters? Do
they work OK? (I know they don't handle Kanji.)
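[The recall problem above comes down to the two kana scripts occupying parallel Unicode blocks. A minimal stdlib-only sketch of the kana folding that an ICU Hiragana-Katakana transform performs for the basic kana, using the fixed 0x60 offset between the blocks:]

```python
def hiragana_to_katakana(text):
    # The Hiragana letters (U+3041-U+3096) and Katakana letters
    # (U+30A1-U+30F6) are parallel blocks offset by 0x60, so the
    # basic kana can be folded with a simple codepoint shift.
    return "".join(
        chr(ord(c) + 0x60) if "\u3041" <= c <= "\u3096" else c
        for c in text
    )

# Without normalization the two spellings of 'Toyota' don't match:
print("トヨタ" == "とよた")                        # False
# After folding Hiragana to Katakana they do:
print("トヨタ" == hiragana_to_katakana("とよた"))  # True
```

(This covers only the plain kana; a real transform also needs to handle prolonged-sound marks and other edge cases, which is why using ICU rather than hand-rolling it is preferable.)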

Tomás

2011/3/11 François Schiettecatte <fschietteca...@gmail.com>

> Good question about transliteration, the issue has to do with recall; for
> example, I can write 'Toyota' as 'トヨタ' or 'とよた' (Katakana and Hiragana
> respectively), and not doing the transliteration will miss results. You will
> find that the big search engines do the transliteration for you
> automatically. This issue gets even more complicated when you dig into
> orthographic variation, because Japanese orthography is very variable (i.e.
> there is more than one way to write a 'word'), as is tokenization (i.e. there
> is more than one way to tokenize it), see:
>
>        http://www.cjk.org/cjk/reference/japvar.htm
>
> I have used the Basis Technology software in the past, it is very good, but
> it is also very expensive.
>
> François
>
> On Mar 11, 2011, at 11:53 AM, Walter Underwood wrote:
>
> > Why not index it as-is? Solr can handle Unicode.
> >
> > Transliterating hiragana to katakana is a very weird idea. I cannot
> imagine how that would help.
> >
> > You will need some sort of tokenization to find word boundaries. N-grams
> work OK for search, but are really ugly for highlighting.
> >
> > As far as I know, there are no good-quality free tokenizers for Japanese.
> Basis Technology sells Japanese support that works with Lucene and Solr.
> >
> > wunder
> >
> > On Mar 11, 2011, at 8:09 AM, François Schiettecatte wrote:
> >
> >> Tomás
> >>
> >> That won't really work: transliteration to Romaji works for individual
> terms only, so you would need to tokenize the Japanese prior to
> transliteration. I am not sure what tool you plan to use for
> transliteration; I have used ICU in the past and, from what I can tell, it
> does not transliterate Kanji. Besides, transliterating Kanji is debatable
> for a variety of reasons.
> >>
> >> What I would suggest is that you transliterate Hiragana to Katakana,
> leave the Kanji alone, and index/search using ngrams. If you want 'proper'
> tokenization I would recommend Mecab.
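[The kana-folding-plus-ngrams approach suggested above can be sketched in a few lines; character bigrams are a common CJK indexing fallback when no dictionary tokenizer is available, and the bigram size here is just an illustrative choice:]

```python
def bigrams(text):
    # Overlapping character bigrams: index/search both documents and
    # queries through this, after folding Hiragana to Katakana, so
    # matches no longer depend on finding word boundaries.
    return [text[i:i + 2] for i in range(len(text) - 1)]

print(bigrams("トヨタ自動車"))  # ['トヨ', 'ヨタ', 'タ自', '自動', '動車']
```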
> >>
> >> I have looked into this for a client and there is no clear cut solution.
> >>
> >> Cheers
> >>
> >> François
> >>
> >>
> >> On Mar 11, 2011, at 10:29 AM, Tomás Fernández Löbbe wrote:
> >>
> >>> This question is probably not completely a Solr question, but it's
> related to
> >>> it. I'm dealing with a Japanese Solr application in which I would like
> to be
> >>> able to search in any of the Japanese alphabets. The content can also
> be in
> >>> any Japanese alphabet. I've been thinking of this solution: convert
> >>> everything to roma-ji, at index time and query time.
> >>> For example:
> >>>
> >>> Indexing time:
> >>> [Something in Hiragana] --> translate it to roma-ji --> index
> >>>
> >>> Searching time:
> >>> [Something in Katakana] --> translate it to roma-ji --> search
> >>> or
> >>> [Something in Kanji] --> translate it to roma-ji --> search
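[The index-time/query-time pipeline above can be sketched as below. The kana-to-romaji table here is a hypothetical toy mapping for illustration only; real transliteration needs the full kana set plus digraphs like 'きょ' -> 'kyo', which is what ICU or a dedicated library would supply, and Kanji is left untouched, which is exactly the gap discussed in this thread:]

```python
# Toy mapping, illustration only -- not a complete kana table.
KANA_TO_ROMAJI = {"と": "to", "よ": "yo", "た": "ta",
                  "ト": "to", "ヨ": "yo", "タ": "ta"}

def to_romaji(text):
    # Characters outside the table (e.g. Kanji) pass through unchanged.
    return "".join(KANA_TO_ROMAJI.get(c, c) for c in text)

# Both index-time and query-time forms collapse to the same key:
print(to_romaji("とよた"))  # 'toyota'
print(to_romaji("トヨタ"))  # 'toyota'
```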
> >>>
> >>> I don't have a deep understanding of Japanese, and that's my problem.
> Has
> >>> somebody on the list tried something like this before? Did it work?
> >>>
> >>>
> >>> Thanks,
> >>>
> >>> Tomás
> >>
> >
> > --
> > Walter Underwood
> > Venture ASM, Troop 14, Palo Alto
> >
> >
> >
>
>