Lars Kristan wrote:
> Anyway, maybe I did a mistake by mixing the two aspects right from the
> start. If we forget about the ë for a moment and think about Björk or
> Almodóvar. The most basic transliteration would be dropping of all accents
> and I did not find that in the http://oss.software.ibm.com/cgi-bin/icu/tr
> demo, the closest thing I got was Almodo<'>var.

You can write a short ICU 2.0 transliteration ID that decomposes the input
(NFD) and then removes accents. Mark knows the syntax better...

> I think people will expect that searching for Almodovar will find both
> forms. And that means people searching the web (ok, you can say those have
> time to repeat the search) as well as people working for example in a bank
> searching for an account.

This may be done better based on a locale-specific collator than based on
transliteration. With a collator, you can base a search on only primary
(letter-level) differences.

> Once simple transliteration is covered, adding some transcriptions as well
> would be a plus. Providing both Bjork and Bjoerk as entries in the index may
> not be neither always correct nor always complete, but - it's something,
> right?

You can do this in ICU with custom rules.

> To sum it up - I am was not thinking exact transcription or transliteration,
> with both source and target language defined. All I am saying is that
> something generic would be handy.

More generic than an almost regexp-style rules syntax and means to
concatenate arbitrary transliterator objects?

markus
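
For illustration, the decompose-then-remove-accents idea and the "both Bjork
and Bjoerk in the index" idea can be sketched without ICU at all. The Python
below is only a stand-in that uses the standard unicodedata module. The names
strip_accents, index_keys, matches and the TRANSCRIPTIONS table are
hypothetical and are not ICU API. In later ICU releases the equivalent
compound ID is usually written as "NFD; [:Nonspacing Mark:] Remove; NFC"; the
exact ICU 2.0 spelling may differ. Folding case and accents here only roughly
imitates a primary-strength comparison, and a real locale-specific collator
would do this better.

```python
import unicodedata

# Hypothetical stand-in for custom transliteration rules: a few
# German-style transcriptions. Lowercase only; not a complete rule set.
TRANSCRIPTIONS = {"ö": "oe", "ä": "ae", "ü": "ue"}


def strip_accents(text):
    """Decompose to NFD, drop nonspacing marks (category Mn), recompose."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", kept)


def index_keys(name):
    """Return the set of case-folded index entries for one name:
    the original form, the accent-stripped form, and a transcribed form."""
    name = unicodedata.normalize("NFC", name)
    transcribed = "".join(TRANSCRIPTIONS.get(ch, ch) for ch in name)
    return {
        name.casefold(),
        strip_accents(name).casefold(),
        strip_accents(transcribed).casefold(),
    }


def matches(query, name):
    """True if the accent-stripped, case-folded query hits any index entry."""
    return strip_accents(query).casefold() in index_keys(name)


if __name__ == "__main__":
    for query, name in [("Almodovar", "Almodóvar"),
                        ("Bjork", "Björk"),
                        ("Bjoerk", "Björk")]:
        print(query, name, matches(query, name))
```

Storing several keys per name trades index size for recall: a search for
either Bjork or Bjoerk finds Björk, at the cost of occasional false matches
for names where the transcription does not apply.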

