Re: Best way to index without diacritics

Walter Underwood Wed, 13 Aug 2008 10:03:09 -0700

Stripping accents doesn't quite work. The correct conversion
is language-dependent. In German, o-dieresis should turn into
"oe", but in English, it should be "o" (as in "coöperate" or
"Mötley Crüe"). In Swedish, it should not be converted at all.


There are other character-to-string conversions: ae-ligature
to "ae", "ß" to "ss", and so on. Luckily, those are independent
of language.
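
As a rough illustration of the two points above, here is a minimal
Python sketch of language-dependent folding. The names (fold,
LANGUAGE_OVERRIDES, COMMON) and the tiny mapping tables are made up
for this example and are nowhere near complete. This is not how any
Solr or Lucene filter works, just one way the rules could be laid out.

```python
import unicodedata

# Hypothetical per-language rules, for illustration only.
# A character mapped to itself is deliberately left unconverted.
LANGUAGE_OVERRIDES = {
    "de": {"ö": "oe", "Ö": "Oe", "ä": "ae", "Ä": "Ae", "ü": "ue", "Ü": "Ue"},
    "sv": {"ö": "ö", "Ö": "Ö", "ä": "ä", "Ä": "Ä", "å": "å", "Å": "Å"},
    "en": {},  # no overrides: fall through to plain accent stripping
}

# Character-to-string conversions that do not depend on language.
# The uppercase mapping is an assumption made for this sketch.
COMMON = {"æ": "ae", "Æ": "AE", "ß": "ss"}


def strip_marks(ch):
    """Decompose one character and drop any combining marks."""
    decomposed = unicodedata.normalize("NFD", ch)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def fold(text, lang):
    """Fold text for indexing using the rules for the given language."""
    overrides = LANGUAGE_OVERRIDES.get(lang, {})
    out = []
    # Compose first so each accented letter is a single code point.
    for ch in unicodedata.normalize("NFC", text):
        if ch in overrides:
            out.append(overrides[ch])
        elif ch in COMMON:
            out.append(COMMON[ch])
        else:
            out.append(strip_marks(ch))
    return "".join(out)
```

With these tables, fold("schön", "de") and fold("Mötley Crüe", "en")
take different paths for the same letter, and fold("öl", "sv") leaves
the word alone.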

wunder

On 8/13/08 9:16 AM, "Steven A Rowe" <[EMAIL PROTECTED]> wrote:

> Hi Norberto,
> 
> https://issues.apache.org/jira/browse/LUCENE-1343
> 
> :)
> 
> Steve
> 
> On 08/13/2008 at 12:35 AM, Norberto Meijome wrote:
>> On Tue, 12 Aug 2008 11:44:42 -0400
>> "Steven A Rowe" <[EMAIL PROTECTED]> wrote:
>> 
>>> Solr is Unicode aware.  The ISOLatin1AccentFilterFactory
>>> handles diacritics for the ISO Latin-1 section of the Unicode
>>> character set.  UTF (do you mean UTF-8?) is a (set of)
>>> Unicode serialization(s), and once Solr has deserialized it,
>>> it is just Unicode characters (Java's in-memory UTF-16
>>> representation).
>>> 
>>> So as long as you're only concerned about removing
>>> diacritics from the set of Unicode characters that overlaps
>>> ISO Latin-1, and not about other Unicode characters, then
>>> ISOLatin1AccentFilterFactory should work for you.
>> 
>> hi,
>> do you know if anyone has implemented a similar filter using
>> icu and mapping (a lot more of) UTF-8 to ascii ?
>> 
>> B
>> 
>> _________________________
>> {Beto|Norberto|Numard} Meijome
