These provide good examples. It would be interesting to see, of the people
on the [EMAIL PROTECTED] list, how many non-Poles would expect to find the
following orders:

Ab < Ąb < Ac
Eb < Ęb < Ec
Ob < Ób < Oc

Ce < Će < Cy
Ne < Ńe < Ny
Sa < Śa < Sy
Za < Źa < Zy
Za < Ża < Zy

and either (a) or (b):

a) La < Ła < Ly    // interleaved
b) La < Ly < Ła    // non-interleaved
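
To make the two candidate orders concrete, here is a minimal Python sketch. It is not a UCA implementation: the alphabet table and the helper names (separate_letter_key, interleaved_key) are assumptions made up for illustration. The "interleaved" key treats the diacritic as a tie-breaker after the base letters; the "separate letter" key treats each accented letter as its own letter that follows the base letter.

```python
# Polish alphabet with each accented letter placed right after its base letter.
POLISH_ALPHABET = "aąbcćdeęfghijklłmnńoóprsśtuwyzźż"
RANK = {ch: i for i, ch in enumerate(POLISH_ALPHABET)}

# Hand-written mapping from accented letters to base letters (illustrative only).
BASE = str.maketrans("ąćęłńóśźż", "acelnoszz")


def separate_letter_key(word):
    """Compare letter by letter, accented letters being distinct letters."""
    # Letters missing from the table (q, v, x, ...) sort after the Polish
    # ones here, which is a simplification.
    return [RANK.get(ch, len(RANK) + ord(ch)) for ch in word.lower()]


def interleaved_key(word):
    """Compare base letters first; use diacritics only to break ties."""
    w = word.lower()
    return (separate_letter_key(w.translate(BASE)), separate_letter_key(w))


words = ["Ac", "Ąb", "Ab", "Ly", "Ła", "La"]
interleaved = sorted(words, key=interleaved_key)
separate = sorted(words, key=separate_letter_key)
print(interleaved)  # ['Ab', 'Ąb', 'Ac', 'La', 'Ła', 'Ly']
print(separate)     # ['Ab', 'Ac', 'Ąb', 'La', 'Ly', 'Ła']
```

The first result corresponds to the orders listed above with option (a); the second is what a tailoring that never folds these letters would give.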

Mark

----- Original Message ----- 
From: "Marcin 'Qrczak' Kowalczyk" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Saturday, July 10, 2004 01:02
Subject: Re: Looking for transcription or transliteration standards latin->arabic


> In a message of Fri, 09-07-2004, at 19:34 -0700, Asmus Freytag wrote:
>
> > o-slash, can be analyzed as o and slash, even though that's not done
> > canonically in Unicode. Allowing users outside Scandinavia to perform
> > fuzzy searches for words with this character is useful.
> >
> > In this view of folding, language-specific fuzzy searches would be
> > tailored (usually by being based on collation information, rather
> > than on generic diacritic folding).
>
> In Polish, the letters with diacritics (ĄĆĘŁŃÓŚŹŻ) are sorted after the
> corresponding letters without. Omitting diacritics is an error, even
> though text without them is generally readable. They are removed when
> the given protocol requires or encourages ASCII (e.g. filenames to be
> used in URLs, login names, variable names in programming languages,
> ancient computer systems). There is no alternate spelling scheme like
> German AE/OE/UE/SS.
>
> Polish letters are never folded when sorting lexicographically. This
> applies to Ł in the same way as to the other eight letters. Foreign
> diacritics are always folded though; at least I don't remember seeing
> any other case. I think Ó would be folded together with O in an
> encyclopaedia if it is a foreign O with some accent, unrelated to the
> Polish Ó, which is a separate letter (can you suggest some non-Polish
> word starting with Ó which could be found in an encyclopaedia?).
>
> But there are cases when I would prefer to fold Polish diacritics in
> searches.
>
> It's basically every case when you are not sure that all stored data is
> using diacritics, for example in generic WWW searching. There are still
> people who don't use diacritics in usenet and email, or in entries in
> guest books and other "unprofessional" web content. There are even
> sometimes people who insist that Polish letters *should not* be used in
> usenet and email because some computer systems can't handle them.
> Diacritics are rare on IRC (because the IRC protocol doesn't distinguish
> between CP-1250, ISO-8859-2 and UTF-8) and with instant messengers
> (because of laziness). This is why for searching archives of unknown
> data it's generally better to fold them.
>
> As far as I know, the default UCA folds these letters except Ł, and the
> standard Polish tailoring doesn't fold any Polish letter. While not
> folding them in searching is technically correct and nobody would be
> surprised that they are not folded, it's often more useful to fold them
> and people would be pleasantly surprised if they don't have to repeat
> the search with omitted diacritics.
>
> If one wants to find data containing a word, rather than collect
> statistics about usage of a word with and without diacritics, it's very
> rare that folding does any harm.
>
> Hmm, it's not that simple. When I'm searching for JĘZYK (existing word),
> I will be happy to find occurrences of JEZYK too (non-existing word,
> must have had diacritics stripped), but it makes no sense to return
> JEŻYK (another existing word). It's not just making the letters
> equivalent.
>
> -- 
>    __("<         Marcin Kowalczyk
>    \__/       [EMAIL PROTECTED]
>     ^^     http://qrnik.knm.org.pl/~qrczak/
>
>
>
>
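
The JĘZYK / JEZYK / JEŻYK point in the quoted message can be sketched in Python. This is only one possible reading of it, under the assumption that a stored letter should match either the query letter itself or that letter with its diacritic stripped; the helper names fold and matches are hypothetical. Note that Ł has no canonical decomposition in Unicode, so NFD normalization alone does not strip it and it is mapped by hand.

```python
import unicodedata


def fold(text):
    """Strip diacritics from text for fuzzy matching (illustrative helper)."""
    # Ł/ł does not decompose under NFD, so handle it explicitly.
    text = text.replace("Ł", "L").replace("ł", "l")
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks (ogonek, acute, dot above, ...).
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def matches(query, stored):
    """True if stored is query with zero or more diacritics stripped."""
    query = unicodedata.normalize("NFC", query)
    stored = unicodedata.normalize("NFC", stored)
    if len(query) != len(stored):
        return False
    # Asymmetric test: a stored letter may lose the query's diacritic,
    # but may not carry a diacritic that the query does not have.
    return all(s == q or s == fold(q) for q, s in zip(query, stored))


print(matches("JĘZYK", "JEZYK"))        # True: diacritics were stripped
print(matches("JĘZYK", "JEŻYK"))        # False: a different word
print(fold("JĘZYK") == fold("JEŻYK"))   # True: symmetric folding conflates them
```

Comparing fold(query) with fold(stored) is the plain "make the letters equivalent" approach, and the last line shows where it returns the unwanted hit. Whether a query typed without diacritics should also find stored text that has them is a separate design choice that this sketch does not cover.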

