Re: Looking for transcription or transliteration standards latin- >arabic

Marcin 'Qrczak' Kowalczyk Sat, 10 Jul 2004 01:34:00 -0700

On Fri, 09-07-2004, at 19:34 -0700, Asmus Freytag wrote:


> o-slash, can be analyzed as o and slash, even though that's not done 
> canonically in Unicode. Allowing users outside Scandinavia to perform 
> fuzzy  searches for words with this character is useful.
> 
> In this view of folding, Language-specific fuzzy searches would be tailored 
> (usually by being based on collation information, rather than on generic 
> diacritic folding).

In Polish, letters with diacritics ĄĆĘŁŃÓŚŹŻ are sorted after the
corresponding letters without. Omitting diacritics is an error, even
though text without them is generally readable. They are removed when
the given protocol requires or encourages ASCII (e.g. filenames to be
used in URLs, login names, variable names in programming languages,
ancient computer systems). There is no alternate spelling scheme like
German AE/OE/UE/SS.

Polish letters are never folded when sorting lexicographically. This
applies to Ł in the same way as to the other eight letters. Foreign
diacritics are always folded though, at least I don't remember seeing
any other case. I think Ó would be folded together with O in an
encyclopaedia if this is a foreign O with some accent, unrelated to
Polish Ó which is a separate letter (can you suggest some non-Polish
word starting with Ó which could be found in an encyclopaedia?).
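
To illustrate the sorting rule, here is a minimal Python sketch. It is
not a real collation implementation (no UCA levels, no handling of
foreign accents); it only ranks each letter by its position in the
Polish alphabet, so that every letter with a diacritic sorts right
after its base letter. The names POLISH_ALPHABET and polish_sort_key
are made up for this example.

```python
# Polish alphabet, with Q, V, X included for loanwords.
# Each letter with a diacritic directly follows its base letter.
POLISH_ALPHABET = "aąbcćdeęfghijklłmnńoópqrsśtuvwxyzźż"
RANK = {ch: i for i, ch in enumerate(POLISH_ALPHABET)}


def polish_sort_key(word):
    """Return a key that compares words letter by letter in Polish order.

    Characters outside the Polish alphabet are placed after all Polish
    letters, ordered by code point (an arbitrary choice for this sketch).
    """
    return [RANK.get(ch, len(POLISH_ALPHABET) + ord(ch)) for ch in word.lower()]


words = ["zupa", "żaba", "zebra", "ząb"]
by_code_point = sorted(words)
by_polish_order = sorted(words, key=polish_sort_key)
```

Sorting by code point puts "ząb" after "zupa", because ą has a higher
code point than any ASCII letter; with the key above it lands before
"zebra", since ą comes between a and b.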

But there are cases when I would prefer to fold Polish diacritics in
searches.

It's basically every case when you are not sure that all stored data is
using diacritics, for example in generic WWW searching. There are still
people who don't use diacritics in usenet and email, or in entries in
guest books and other "unprofessional" web content. There are even
sometimes people who insist that Polish letters *should not* be used in
usenet and email because some computer systems can't handle them.
Diacritics are rare on IRC (because the IRC protocol doesn't distinguish
between CP-1250, ISO-8859-2 and UTF-8) and with instant messengers
(because of laziness). This is why for searching archives of unknown
data it's generally better to fold them.

As far as I know, the default UCA folds these letters except Ł, and
standard Polish tailoring doesn't fold any Polish letter. While not
folding them in searching is technically correct and nobody would be
surprised that they are not folded, it's often more useful to fold them
and people would be pleasantly surprised if they don't have to repeat
the search with omitted diacritics.
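
A rough sketch of such folding in Python, using only the standard
unicodedata module. This is not the UCA; it relies on canonical
decomposition, which shows a similar exception: eight of the Polish
letters decompose into a base letter plus a combining mark, but Ł has no
decomposition and needs an explicit mapping. The name fold_diacritics
is an assumption for this example.

```python
import unicodedata

# Ł/ł have no canonical decomposition in Unicode, so map them by hand.
STROKE_MAP = str.maketrans("Łł", "Ll")


def fold_diacritics(text):
    """Strip diacritics: map Ł to L, decompose, drop combining marks."""
    decomposed = unicodedata.normalize("NFD", text.translate(STROKE_MAP))
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


folded_city = fold_diacritics("Łódź")
folded_word = fold_diacritics("JĘZYK")
```

A search that wants to catch text with omitted diacritics could apply
the same function to both the query and the stored data before
comparing.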

If one wants to find data containing a word, rather than collect
statistics about usage of a word with and without diacritics, it's very
rare that folding does any harm.

Hmm, it's not that simple. When I'm searching for JĘZYK (existing word),
I will be happy to find occurrences of JEZYK too (non-existing word,
must have had diacritics stripped), but it makes no sense to return
JEŻYK (another existing word). It's not just making the letters
equivalent.
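
One possible way to get that behaviour, as a sketch only: treat the
match as one-directional, so that a stored letter is accepted if it
equals the query letter or is the stripped form of it, but a stored
letter with a diacritic never matches a plain query letter. The
function names are invented for this example, and the rule says nothing
about what to do when the query itself was typed without diacritics.

```python
import unicodedata

# Ł/ł have no canonical decomposition, so they are mapped explicitly.
STROKE_MAP = str.maketrans("Łł", "Ll")


def fold(text):
    """Strip diacritics via NFD decomposition plus the Ł special case."""
    decomposed = unicodedata.normalize("NFD", text.translate(STROKE_MAP))
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def matches_allowing_stripped(query, stored):
    """True if `stored` is `query` with some or all diacritics removed.

    Each stored letter must equal the query letter or its folded form;
    diacritics that the query does not have are never accepted.
    """
    q = unicodedata.normalize("NFC", query).casefold()
    s = unicodedata.normalize("NFC", stored).casefold()
    if len(q) != len(s):
        return False
    return all(sc == qc or sc == fold(qc) for qc, sc in zip(q, s))


hit_exact = matches_allowing_stripped("JĘZYK", "JĘZYK")
hit_stripped = matches_allowing_stripped("JĘZYK", "JEZYK")
hit_other_word = matches_allowing_stripped("JĘZYK", "JEŻYK")
```

Here the query JĘZYK accepts JĘZYK and JEZYK, but rejects JEŻYK because
the stored Ż is neither the query's Z nor a stripped form of it.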

-- 
   __("<         Marcin Kowalczyk
   \__/       [EMAIL PROTECTED]
    ^^     http://qrnik.knm.org.pl/~qrczak/
