"Philippe Verdy" <[EMAIL PROTECTED]> writes: [...] > This was later amended in an errata for XML 1.0 which now says that > the list of code points whose use is *discouraged* (but explicitly > *not* forbidden) for the "Char" production is now: [...]
Ugh, it's a mess...

IMHO Unicode is partially to blame, by introducing various kinds of
holes in code point numbering (non-characters, surrogates), by not
being clear when the unit of processing should be a code point and
when a combining character sequence, and earlier by pushing UTF-16
as the fundamental representation of text (which led to such horrible
descriptions as http://www.xml.com/axml/notes/Surrogates.html).

XML is just an example of a standard which must decide:

A. What is the unit of text processing? (code point? combining
   character sequence? something else? hopefully not a UTF-16
   code unit)

B. Which (sequences of) characters are valid when present in the raw
   source, i.e. what does UTF-n really mean?

C. Which (sequences of) characters can be formed by specifying a
   character number?

A programming language must do the same.

The language Kogut I'm designing and developing uses Unicode as its
string representation, but the details can still be changed. I want
to have rules which are "correct" as far as Unicode is concerned, and
which are simple enough to be practical (e.g. if a standard forced me
to make the conversion from code point number to actual character
contextual, or if it forced me to unconditionally unify precomposed
and decomposed characters, then I would quit and not support a broken
standard).

Internal text processing in a programming language can be more
permissive than an application of such processing like XML parsing:
if a particular character is valid in UTF-8 but XML disallows it,
everything is fine, it can be rejected at some stage. It must not be
more restrictive however, as that would make it impossible to
implement XML parsing in terms of string processing.

Regarding A, I see three choices:

1. A string is a sequence of code points.

2. A string is a sequence of combining character sequences.

3. A string is a sequence of code points, but it's encouraged
   to process it in groups of combining character sequences.
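
To make the difference between 1 and 3 concrete, here is a rough
sketch in Python (not Kogut). The string itself stays a plain sequence
of code points, and a helper groups it into combining character
sequences on demand. The helper names are made up for illustration,
and the rule "a base code point followed by any code points of general
category M*" is a simplifying assumption: it ignores ZWJ/ZWNJ and the
fuller grapheme cluster rules.

```python
import unicodedata


def is_combining(ch):
    # Assumption: a code point counts as "combining" when its general
    # category is a mark (Mn, Mc or Me). This is a simplification.
    return unicodedata.category(ch).startswith("M")


def combining_sequences(s):
    """Yield substrings made of one code point plus the marks following it."""
    current = ""
    for ch in s:
        if current and is_combining(ch):
            # Attach the mark to the sequence being built.
            current += ch
        else:
            # A non-mark (or a mark with nothing before it) starts a new group.
            if current:
                yield current
            current = ch
    if current:
        yield current


text = "e\u0301a"  # 'e', COMBINING ACUTE ACCENT, 'a'
print(len(text))                         # code points: 3
print(len(list(combining_sequences(text))))  # combining character sequences: 2
```

The point of the sketch is that the grouping lives in a separate
function layered over code points, which is choice 3; under choice 2
the grouped form would be the string's native element type.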

I'm afraid that anything other than a mixture of 1 and 3 is too
complicated to be widely used. Almost everybody represents strings
either as code points, or as even lower-level units like UTF-16 code
units. And while 2 is nice from the user's point of view, it's
a nightmare from the programmer's point of view:

- Unicode character properties (like general category, character
  name, digit value) are defined in terms of code points. Choosing
  2 would immediately require two-stage processing: a string is
  a sequence of sequences of code points.

- Unicode algorithms (like collation, case mapping, normalization)
  are specified in terms of code points.

- Data exchange formats (UTF-n) are always closer to code points
  than to combining character sequences.

- Code points have a finite domain, so you can make dictionaries
  indexed by code points; for combining character sequences we would
  be forced to write functions which *compute* the relevant property
  based on the structure of such a sequence.

I don't believe 2 is workable at all. The question is how to make
3 convenient enough to be used more often. Unfortunately it's much
harder than 1, unless strings used a completely different iteration
protocol than other sequences. I have no idea how to make
3 convenient.

Regarding B in the context of a programming language (not XML),
section 3.9 of the Unicode standard version 4.0 excludes only
surrogates: it does not exclude non-characters like U+FFFF. But
non-characters must be excluded somewhere, because otherwise U+FFFE
at the beginning would be mistaken for a BOM. I'm confused.

Regarding C, I'm confused too. Should a function which returns the
character with a given number accept surrogates? I guess not. Should
it accept non-characters? I don't know. I only know that it should
not accept values above 0x10FFFF.

-- 
   __("<         Marcin Kowalczyk
   \__/       [EMAIL PROTECTED]
    ^^     http://qrnik.knm.org.pl/~qrczak/

