I missed Mark's change of subject, so I have just replied to Marcin's message under the old subject line:

----- Original Message -----
From: "Marcin 'Qrczak' Kowalczyk" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Saturday, July 10, 2004 01:02
Subject: Re: Looking for transcription or transliteration standards latin->arabic


> On Fri, 09-07-2004, at 19:34 -0700, Asmus Freytag wrote:
>
> > o-slash, can be analyzed as o and slash, even though that's not done
> > canonically in Unicode. Allowing users outside Scandinavia to perform
> > fuzzy searches for words with this character is useful.
> >
> > In this view of folding, Language-specific fuzzy searches would be tailored
> > (usually by being based on collation information, rather than on generic
> > diacritic folding).
>
> In Polish, letters with diacritics ĄĆĘŁŃÓŚŹŻ are sorted after the
> corresponding letters without. Omitting diacritics is an error, even
> though text without them is generally readable. They are removed when
> the given protocol requires or encourages ASCII (e.g. filenames to be
> used in URLs, login names, variable names in programming languages,
> ancient computer systems). There is no alternate spelling scheme like
> German AE/OE/UE/SS.
>
> Polish letters are never folded when sorting lexicographically. This
> applies to Ł in the same way as to other eight letters. Foreign
> diacritics are always folded though, at least I don't remember seeing
> any other case. I think Ó would be folded together with O in an
> encyclopaedia if this is a foreign O with some accent, unrelated to
> Polish Ó which is a separate letter (can you suggest some non-Polish
> word starting with Ó which could be found in an encyclopaedia?).
>
> But there are cases when I would prefer to fold Polish diacritics in
> searches.
>
> It's basically every case when you are not sure that all stored data is
> using diacritics, for example in generic WWW searching. There are still
> people who don't use diacritics in usenet and email, or in entries in
> guest books and other "unprofessional" web content. There are even
> sometimes people who insist that Polish letters *should not* be used in
> usenet and email because some computer systems can't handle them.
> Diacritics are rare on IRC (because the IRC protocol doesn't distinguish
> between CP-1250, ISO-8859-2 and UTF-8) and with instant messengers
> (because of laziness). This is why for searching archives of unknown
> data it's generally better to fold them.
>
> As far as I know, the default UCA folds these letters except Ł, and
> standard Polish tailoring doesn't fold any Polish letter. While not
> folding them in searching is technically correct and nobody would be
> surprised that they are not folded, it's often more useful to fold them
> and people would be pleasantly surprised if they don't have to repeat
> the search with omitted diacritics.
>
> If one wants to find data containing a word, rather than collect
> statistics about usage of a word with and without diacritics, it's very
> rare that folding does some harm.
>
> Hmm, it's not that simple. When I'm searching for JĘZYK (existing word),
> I will be happy to find occurrences of JEZYK too (non-existing word,
> must have had diacritics stripped), but it makes no sense to return
> JEŻYK (another existing word). It's not just making the letters
> equivalent.
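
The asymmetry in Marcin's last paragraph can be sketched in a few lines of Python. This is only an illustration, not how any particular search engine or the UCA does it. The function names `strip_diacritics` and `matches` and the explicit Ł mapping are my own assumptions. A stored letter is accepted if it equals the query letter or the query letter with its diacritic removed, so a plain stored E satisfies a query Ę, but a stored Ż does not satisfy a query Z.

```python
import unicodedata

# Assumption: Ł/ł has no canonical decomposition in Unicode, so it is
# mapped by hand; other such letters (e.g. o-slash) are not handled here.
EXTRA_FOLDS = str.maketrans({"Ł": "L", "ł": "l"})


def strip_diacritics(text: str) -> str:
    """Decompose canonically (NFD), drop combining marks, then fold Ł."""
    decomposed = unicodedata.normalize("NFD", text)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return base.translate(EXTRA_FOLDS)


def matches(query: str, stored: str) -> bool:
    """Asymmetric match: stored text may have lost diacritics that the
    query has, but must not carry diacritics that the query lacks."""
    # Compose first so that each letter is one code point on both sides.
    q = unicodedata.normalize("NFC", query)
    s = unicodedata.normalize("NFC", stored)
    if len(q) != len(s):
        return False
    return all(sc == qc or sc == strip_diacritics(qc) for qc, sc in zip(q, s))


print(matches("JĘZYK", "JEZYK"))  # stored word lost its ogonek
print(matches("JĘZYK", "JEŻYK"))  # a different existing word
```

Under these assumptions a query for JĘZYK accepts JĘZYK and JEZYK and rejects JEŻYK. Whether a query typed without diacritics should also find the accented forms is a separate policy decision that this sketch does not make.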





