Accent transforms are language-specific, so an accent filter
should take an ISO language code as an argument.

Some examples:

* In French and English, a diaeresis is a hint to pronounce neighboring
vowels separately, as in coöp, naïve, or Noël.

* In German, ü transforms to ue.

* In Swedish, ö is a different letter than o, and should
not be transformed. The same is true for ø in Danish and
Norwegian.

* Then there is Motörhead and Mötley Crüe; see:
http://en.wikipedia.org/wiki/Heavy_metal_umlaut

* I don't know of an ISO language code for Tolkien's
Elvish, so we're out of luck for Manwë.
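
To make the examples above concrete, here is a rough sketch of a
language-aware folding function. It is not the Lucene filter and not what
Ultraseek did; the names fold_accents, SUBSTITUTIONS, and PRESERVED are
made up for illustration, and the rule tables are assumptions that cover
only the languages mentioned above. Unknown language codes fall back to
plain diacritic stripping.

```python
import unicodedata

# Hypothetical rule tables keyed by ISO 639-1 code (illustrative, not complete).
# Languages with standardized substitutions instead of plain stripping.
SUBSTITUTIONS = {
    "de": {"ä": "ae", "ö": "oe", "ü": "ue",
           "Ä": "Ae", "Ö": "Oe", "Ü": "Ue", "ß": "ss"},
}
# Characters that are letters in their own right and must not be touched.
PRESERVED = {
    "sv": set("åäöÅÄÖ"),
    "da": set("æøåÆØÅ"),
    "no": set("æøåÆØÅ"),
}


def strip_diacritics(ch):
    """Decompose a character and drop its combining marks."""
    decomposed = unicodedata.normalize("NFD", ch)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def fold_accents(text, lang):
    """Fold accents in text according to the rules for the given language."""
    # Compose first so that "u" + combining diaeresis matches the table keys.
    text = unicodedata.normalize("NFC", text)
    keep = PRESERVED.get(lang, set())
    subs = SUBSTITUTIONS.get(lang, {})
    out = []
    for ch in text:
        if ch in keep:
            out.append(ch)            # distinct letter, leave as is
        elif ch in subs:
            out.append(subs[ch])      # language-specific substitution
        else:
            out.append(strip_diacritics(ch))  # default: strip the mark
    return "".join(out)
```

With these tables, fold_accents("naïve", "fr") strips the diaeresis,
fold_accents("für", "de") expands it to "fuer", and
fold_accents("över", "sv") leaves the word alone. Note that ø has no
Unicode decomposition, so the default stripping step leaves it unchanged
in every language.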

Another approach would be to generate the accent-transformed
terms as synonyms at the same token position. Then you could
generate multiple options.
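
A minimal sketch of that synonym idea, under some assumptions: tokens are
plain strings, and "same token position" is represented by a position
increment of 0, in the style of Lucene's position increments. The
function names (expand_tokens, strip_diacritics, german_substitution) are
hypothetical and this is not any library's API.

```python
import unicodedata

# Illustrative German table; an assumption, not taken from any filter.
_GERMAN = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue",
                         "Ä": "Ae", "Ö": "Oe", "Ü": "Ue", "ß": "ss"})


def strip_diacritics(term):
    """Decompose the term and drop all combining marks."""
    decomposed = unicodedata.normalize("NFD", term)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def german_substitution(term):
    """Apply German-style substitutions (ü -> ue and so on)."""
    return unicodedata.normalize("NFC", term).translate(_GERMAN)


def expand_tokens(tokens, transforms):
    """Yield (term, position_increment) pairs.

    The original term advances the position by 1. Each distinct
    transformed variant follows with increment 0, so it sits at the
    same position as the original, like a synonym.
    """
    for term in tokens:
        yield term, 1
        seen = {term}
        for transform in transforms:
            variant = transform(term)
            if variant not in seen:
                seen.add(variant)
                yield variant, 0
```

For example, expanding ["Motörhead", "rocks"] with both transforms yields
Motörhead, Motorhead, and Motoerhead at the first position and rocks at
the second, so a query in any of the three spellings can match without
the index having to pick one language's rule.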

Obviously, we had to do this right for Ultraseek a few years ago.

wunder

On 9/27/07 9:13 AM, "Steven Rowe" <[EMAIL PROTECTED]> wrote:

> Maybe there should be an option on ISOLatin1TokenFilter to use German
> substitutions, in addition to the current behavior of simply stripping
> diacritics?
> 
> Does anyone know if there are other (Latin-1-utilizing) languages
> besides German with standardized diacritic substitutions that involve
> something other than just stripping the diacritics?
