----- Original Message -----
From: "Marcin 'Qrczak' Kowalczyk" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Saturday, July 10, 2004 01:02
Subject: Re: Looking for transcription or transliteration standards latin->arabic
> In a message of Fri, 09-07-2004, 19:34 -0700, Asmus Freytag wrote:
>
> > o-slash, can be analyzed as o and slash, even though that's not done
> > canonically in Unicode. Allowing users outside Scandinavia to perform
> > fuzzy searches for words with this character is useful.
> >
> > In this view of folding, language-specific fuzzy searches would be tailored
> > (usually by being based on collation information, rather than on generic
> > diacritic folding).
>
> In Polish, letters with diacritics ĄĆĘŁŃÓŚŹŻ are sorted after the
> corresponding letters without. Omitting diacritics is an error, even
> though text without them is generally readable. They are removed when
> the given protocol requires or encourages ASCII (e.g. filenames to be
> used in URLs, login names, variable names in programming languages,
> ancient computer systems). There is no alternate spelling scheme like
> German AE/OE/UE/SS.
>
> Polish letters are never folded when sorting lexicographically. This
> applies to Ł in the same way as to the other eight letters. Foreign
> diacritics are always folded though, at least I don't remember seeing
> any other case. I think Ó would be folded together with O in an
> encyclopaedia if this is a foreign O with some accent, unrelated to
> Polish Ó which is a separate letter (can you suggest some non-Polish
> word starting with Ó which could be found in an encyclopaedia?).
>
> But there are cases when I would prefer to fold Polish diacritics in
> searches.
>
> It's basically every case when you are not sure that all stored data is
> using diacritics, for example in generic WWW searching. There are still
> people who don't use diacritics in usenet and email, or in entries in
> guest books and other "unprofessional" web content. There are even
> sometimes people who insist that Polish letters *should not* be used in
> usenet and email because some computer systems can't handle them.
> Diacritics are rare on IRC (because the IRC protocol doesn't distinguish
> between CP-1250, ISO-8859-2 and UTF-8) and with instant messengers
> (because of laziness). This is why for searching archives of unknown
> data it's generally better to fold them.
>
> As far as I know, the default UCA folds these letters except Ł, and
> standard Polish tailoring doesn't fold any Polish letter. While not
> folding them in searching is technically correct and nobody would be
> surprised that they are not folded, it's often more useful to fold them
> and people would be pleasantly surprised if they don't have to repeat
> the search with omitted diacritics.
>
> If one wants to find data containing a word, rather than collect
> statistics about usage of a word with and without diacritics, it's very
> rare that folding does some harm.
>
> Hmm, it's not that simple. When I'm searching for JĘZYK (existing word),
> I will be happy to find occurrences of JEZYK too (non-existing word,
> must have had diacritics stripped), but it makes no sense to return
> JEŻYK (another existing word). It's not just making the letters
> equivalent.
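
The JĘZYK example at the end of the quoted message can be made concrete with a small sketch. This is only an illustration in Python, not anything UCA or any search engine prescribes; the names fold, folded_match and stripped_match are hypothetical. It assumes that stored text is either spelled correctly or has had all its diacritics stripped, and it maps Ł by hand because that letter has no canonical decomposition in Unicode.

```python
import unicodedata

# Ł/ł have no canonical decomposition, so NFD leaves them intact;
# they need an explicit mapping (an assumption of this sketch).
EXTRA_FOLDS = str.maketrans({"Ł": "L", "ł": "l"})


def fold(text: str) -> str:
    """Strip diacritics: decompose, drop combining marks, map Ł/ł by hand."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.translate(EXTRA_FOLDS)


def _prepare(word: str) -> str:
    """Bring a word to a comparable form: NFC, case-insensitive."""
    return unicodedata.normalize("NFC", word).casefold()


def folded_match(query: str, text: str) -> bool:
    """Symmetric folding: treat each letter and its base letter as equivalent."""
    return fold(_prepare(query)) == fold(_prepare(text))


def stripped_match(query: str, text: str) -> bool:
    """One-directional match: the text is either the query as typed,
    or the query with every diacritic stripped."""
    q, t = _prepare(query), _prepare(text)
    return t == q or t == fold(q)


if __name__ == "__main__":
    for candidate in ("JĘZYK", "JEZYK", "JEŻYK"):
        print(candidate,
              folded_match("JĘZYK", candidate),
              stripped_match("JĘZYK", candidate))
```

Symmetric folding accepts all three candidates, including JEŻYK, which is the unwanted hit described above. The one-directional variant accepts JĘZYK and JEZYK only. The price is that a query typed without diacritics no longer finds correctly spelled text, so a real search would probably need to combine both rules depending on whether the query itself contains diacritics.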
I missed Mark's change of subject, so I just replied to Marcin's message
under the old subject line.