Re: transliterations (was Compelling Unicode demo)

Markus Scherer Mon, 19 Nov 2001 11:54:39 -0800

Lars Kristan wrote:

> Anyway, maybe I made a mistake by mixing the two aspects right from the
> start. If we forget about the ë for a moment and think about Björk or
> Almodóvar. The most basic transliteration would be dropping of all accents
> and I did not find that in the http://oss.software.ibm.com/cgi-bin/icu/tr
> demo, the closest thing I got was Almodo<'>var.



You can write a short ICU 2.0 transliteration ID that decomposes the input (NFD) and 
then removes accents. Mark knows the syntax better...
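
To illustrate the decompose-then-remove idea, here is a minimal sketch. It is not ICU and not the transliteration ID syntax; it uses Python's standard `unicodedata` module as a stand-in, and the function name `strip_accents` is made up for this example.

```python
import unicodedata


def strip_accents(text):
    """Decompose to NFD, drop nonspacing marks, then recompose to NFC."""
    decomposed = unicodedata.normalize("NFD", text)
    # General category "Mn" (nonspacing mark) covers the combining accents.
    kept = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", kept)


print(strip_accents("Björk"))      # Bjork
print(strip_accents("Almodóvar"))  # Almodovar
```

Letters that have no canonical decomposition, such as ø, pass through unchanged, so this only covers accents that Unicode represents as combining marks.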

> I think people will expect that searching for Almodovar will find both
> forms. And that means people searching the web (ok, you can say those have
> time to repeat the search) as well as people working for example in a bank
> searching for an account.


This is probably better handled with a locale-specific collator than with 
transliteration. With a collator, you can make the search consider only primary 
(letter-level) differences, so that differences in accents are ignored.
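
As a rough sketch of what a primary-level match does, the following approximates it without a real collator: it compares names by a key that ignores accents and case. This is an assumption-laden stand-in in plain Python, not ICU collation, and it has no locale tailoring (a real Swedish collator, for example, treats ö as a letter of its own). The names `primary_key` and `find` are hypothetical.

```python
import unicodedata


def primary_key(text):
    """Approximate a primary-strength key: base letters only, case folded."""
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() is nonzero for combining marks such as accents.
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return base.casefold()


def find(query, names):
    """Return the stored names whose primary key matches the query's."""
    wanted = primary_key(query)
    return [name for name in names if primary_key(name) == wanted]


names = ["Almodóvar", "Almodovar", "Björk", "Bjork", "Borg"]
print(find("almodovar", names))  # ['Almodóvar', 'Almodovar']
print(find("BJÖRK", names))      # ['Björk', 'Bjork']
```

Unlike transliterating the data, this leaves the stored names untouched and only changes how they are compared.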


> Once simple transliteration is covered, adding some transcriptions as well
> would be a plus. Providing both Bjork and Bjoerk as entries in the index may
> be neither always correct nor always complete, but - it's something,
> right?


You can do this in ICU with custom rules.
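
The sketch below shows the idea of producing both index entries from one name. It is plain Python, not ICU rule syntax, and the rule table `TRANSCRIPTIONS` and the function `index_keys` are hypothetical names with a deliberately tiny, German-style set of rules chosen for illustration.

```python
import unicodedata

# Hypothetical transcription rules; a real rule set would be larger
# and would depend on the source language.
TRANSCRIPTIONS = {
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
}


def strip_accents(text):
    """Decompose to NFD and drop nonspacing marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def index_keys(name):
    """Return the accent-stripped form plus the transcribed form of a name."""
    # Normalize to NFC first so the rules see precomposed letters like "ö".
    name = unicodedata.normalize("NFC", name)
    transcribed = "".join(TRANSCRIPTIONS.get(ch, ch) for ch in name)
    # Strip any accents the rules did not cover (e.g. "ó").
    return {strip_accents(name), strip_accents(transcribed)}


print(sorted(index_keys("Björk")))      # ['Bjoerk', 'Bjork']
print(sorted(index_keys("Almodóvar")))  # ['Almodovar']
```

Names without a matching rule yield a single entry, so the index only grows where a transcription actually differs from the accent-stripped form.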


> To sum it up - I was not thinking of exact transcription or transliteration,
> with both source and target language defined. All I am saying is that
> something generic would be handy.


More generic than an almost regexp-style rules syntax and a means to concatenate 
arbitrary transliterator objects?

markus
