Doug Ewell <[EMAIL PROTECTED]> writes:
> > Wrong here: I have found occurrences of dotless lowercase i, used
> > instead of soft-dotted lowercase i, as base letters for diacritics
> > added above it (it was an acute accent...)
>
> Don't do that.

What? It is VALID UNICODE to have texts coded like this. The proposed change for soft-dotted/dotless letters used with diacritics is still not in the standard, and it only gives rendering hints so that both base letters get the same rendering, requiring an explicit dot when the soft dot must be kept with the diacritic.

> > There were two sequences which looked apparently identical when
> > rendered, and that were distinct after a case-folding compare check:
> >
> > (1) LATIN SMALL LETTER I, COMBINING ACUTE ACCENT
> > (2) LATIN SMALL LETTER DOTLESS I, COMBINING ACUTE ACCENT
> >
> > but were no longer distinct when converted to uppercase in a
> > locale-neutral environment not using the Turkic rules:
> >
> > (1') LATIN CAPITAL LETTER I, COMBINING ACUTE ACCENT
> > (2') LATIN CAPITAL LETTER I, COMBINING ACUTE ACCENT
>
> OK, so you want the default, locale-neutral case mapping tables to equate
> U+0069 with U+0131, right?

Yes. And I have good reasons for that, coming from the fact that the default locale-neutral mapping tables already equate their uppercase versions, U+0049 and U+0130, by returning U+0069 for both of them.

> This is close to being a spoofing problem, though. See TUS 4.0, page
> 141.

If you think this is a spoofing problem, then the existing locale-neutral full case mapping of U+0130 is bogus and should not be U+0069....

> > The string (2) may have been produced to avoid displaying the dot
> > with some fonts that don't apply the soft-dotted rule when there's
> > an additional diacritic above...
>
> Don't do that. That's misusing the standard. The font should be fixed
> instead.

For whatever reason, encoded texts exist before correct fonts are used to render them. So there do exist texts which use dotless lowercase i before a diacritic above, simply because the author of the text did not want it to be rendered with a superposed dot.

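The behaviour of sequences (1) and (2) can be checked with a minimal sketch, assuming Python 3, whose built-in `str.casefold()` and `str.upper()` apply default Unicode mappings with no locale tailoring. The variable names are only illustrative, and other libraries or Unicode versions may behave differently.

```python
# Sequences (1) and (2) from the message, written with explicit escapes.
seq1 = "\u0069\u0301"  # LATIN SMALL LETTER I + COMBINING ACUTE ACCENT
seq2 = "\u0131\u0301"  # LATIN SMALL LETTER DOTLESS I + COMBINING ACUTE ACCENT

# Default (locale-neutral) case folding keeps the two sequences apart.
distinct_after_folding = seq1.casefold() != seq2.casefold()

# Default uppercasing maps both base letters to U+0049, so they merge.
same_after_upper = seq1.upper() == seq2.upper()

print(distinct_after_folding, same_after_upper)
print([f"U+{ord(c):04X}" for c in seq2.upper()])
```

So a comparison based on case folding and a comparison based on uppercasing can disagree about whether the two strings are equivalent.
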
Texts that use the dotless i this way are clearly not Turkic (in Turkish or Azeri, the dot of the soft-dotted i should have been displayed with the diacritic above it, and the dotless i should have been used to avoid it explicitly).

But this is not the only reason. I can give other examples which also have an impact on security and on filesystems.

Suppose you have a database of user names or file names allowing internationalized names coded along the recommended Unicode principles, but these names are used in a way that makes it impossible to track the language in which they were entered (file names, user names or address fields in an entry form are such cases). Now provide a facility that identifies and avoids duplicate case equivalents, using full mappings. Because you can't track the language, you need to use the default locale-neutral full case mappings.

Now a Turkish user enters a name or address in an entry form, or creates files with a dotless lowercase i in the name, and later attempts to re-enter its case-equivalent (dotless) uppercase I. The system will not identify the two as case equivalents, so it will accept both as if they were distinct.

The Turkish user or the system then attempts to list the files or database table fields matching some regular expression like "i*" with the case-insensitive option, to count the number of occurrences of the names containing a (soft-)dotted i (or I). He will get all the names containing one of three codes, and not the fourth one.
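
The duplicate-detection scenario can be sketched as follows, again assuming Python 3 and its locale-neutral `str.casefold()`. The helper `find_case_collisions` and the sample names are hypothetical; they are not taken from any real system.

```python
def find_case_collisions(names):
    """Group names that share the same default case-folded key."""
    groups = {}
    for name in names:
        groups.setdefault(name.casefold(), []).append(name)
    # Keep only the keys under which more than one name was stored.
    return [group for group in groups.values() if len(group) > 1]

# "\u0131rmak" starts with LATIN SMALL LETTER DOTLESS I (U+0131).
stored = ["\u0131rmak", "IRMAK", "Doug", "DOUG"]
collisions = find_case_collisions(stored)

print(collisions)                      # only the Doug/DOUG pair is reported
print(stored[0].upper() == stored[1])  # yet uppercasing makes the first two equal
```

With a folding-based key, the dotless-i name and its uppercase form are stored as two separate entries, although the Turkish user regards them as the same name.
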

