2014-05-30 20:49 GMT+02:00 Asmus Freytag <[email protected]>:
> This might have been possible at the time these were added, but now it is
> probably not feasible. One of the reasons is that block names are exposed
> (for better or for worse) as character properties and as such are also
> exposed in regular expressions. While not recommended, it would be really
> bad if the expression with pseudo-code "IsInArabicPresentationFormB(x)"
> were to fail, because we split the block into three (with the middle one
> being the noncharacters).
>
If you think about pseudo-code testing for properties, then nothing forbids the test IsInArabicPresentationFormB(x) from checking two ranges instead of just one. Almost all character properties already cover multiple ranges of characters (including the most useful properties needed in many places in code), so updating this property so that it covers two ranges is not a major change.

But anyway, I have never seen the non-characters in the Arabic presentation forms used anywhere other than within legacy Arabic fonts, which use these code points to map... Arabic presentation forms. Granted, text documents do not need to encode these legacy forms in order to use these fonts (text renderers don't need them with modern OpenType fonts, but will still use them in legacy non-OpenType TTF fonts, as a tentative fallback to render these contextual forms). So basically there's no interchange of *text*, but the fonts using these code points are still interchanged.

I think it would be better to reassign these characters as compatibility characters (or even as PUA) rather than as non-characters. I see no rationale for keeping them illegal, when this just causes unnecessary complications for document validation. After all, most C0 and C1 controls also have no interchangeable semantic other than being "controls", which are always application- and protocol-dependent (not meant for encoding texts, except in legacy more or less "rich" encodings): e.g. for storing escape sequences, not standardized and fully dependent on the protocol or terminal type, or on various legacy standards that did not separate text from style; or for the many protocols that need them for special purposes, such as tagging content, switching code pages, changing colors and font styles, positioning on a screen or input form, adding formatting metadata, implementing out-of-band commands, starting/stopping records, pacing bandwidth use, starting/ending/redirecting/splitting/merging sessions, embedding non-text content such as bitmap images or structured data, changing transport protocol options such as compression schemes, exchanging encryption/decryption keys, adding checksums or error-correction data, marking redundant data copies, inserting resynchronization points for error recovery...

So these "non-characters" in the Arabic presentation forms are to be treated more or less like most C1 controls that have undefined behavior. As for saying that there's a need for a "prior agreement": the agreement may be made explicit by the fact that they are used in some old font formats (the same is true of old fonts using PUA assignments; the kind of agreement is basically the same, and in both cases fonts are not plain-text documents).

So the good question for us is only to be able to answer this: "is this document valid and conforming plain text?"

If:
* (1) your document contains
  - any of most of the C0 or C1 controls (except CR, LF, VT, FF, and NL from C1)
  - anything in the PUA
  - any non-characters
  - any unpaired surrogates
* or (2) your document does not validate under its encoding scheme,

then it is not plain text (to be interchangeable it also needs a recognized standard encoding, which in turn requires an agreement or a specification in the protocol or file format used to transport it).

Personally, I think that surrogates are also non-characters.
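To illustrate the point about property tests above: a block-membership test that skips the noncharacter run just checks two ranges instead of one, at essentially the same cost. (Note that the noncharacter run U+FDD0..U+FDEF actually sits in Arabic Presentation Forms-A, U+FB50..U+FDFF.) A minimal sketch, with a function name of my own invention:

```python
# Sketch: a block property covering the Arabic Presentation Forms-A
# block minus its noncharacter run is just a two-range test.
RANGES = [
    (0xFB50, 0xFDCF),  # block start up to the code point before U+FDD0
    (0xFDF0, 0xFDFF),  # rest of the block after the noncharacter run
]

def is_in_arabic_presentation_forms_a(cp: int) -> bool:
    """True if cp is in the block, excluding noncharacters U+FDD0..U+FDEF."""
    return any(lo <= cp <= hi for lo, hi in RANGES)

print(is_in_arabic_presentation_forms_a(0xFB50))  # True
print(is_in_arabic_presentation_forms_a(0xFDD0))  # False: noncharacter
```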
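The validity checklist above can be sketched as a small function. The code point classes and the allowed-control set follow this post's criteria (CR, LF, VT, FF, plus NEL from C1), not any official conformance clause; the function names are my own:

```python
# Sketch of the post's "is this valid plain text?" checklist.
ALLOWED_CONTROLS = {0x0A, 0x0B, 0x0C, 0x0D, 0x85}  # LF, VT, FF, CR, NEL

def is_noncharacter(cp: int) -> bool:
    # U+FDD0..U+FDEF, plus the last two code points of every plane
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE

def is_private_use(cp: int) -> bool:
    return (0xE000 <= cp <= 0xF8FF            # BMP PUA
            or 0xF0000 <= cp <= 0xFFFFD       # plane 15 PUA
            or 0x100000 <= cp <= 0x10FFFD)    # plane 16 PUA

def is_plain_text(s: str) -> bool:
    for ch in s:
        cp = ord(ch)
        if (cp < 0x20 or 0x80 <= cp <= 0x9F) and cp not in ALLOWED_CONTROLS:
            return False                       # forbidden C0/C1 control
        if 0xD800 <= cp <= 0xDFFF:             # surrogate code point
            return False
        if is_private_use(cp) or is_noncharacter(cp):
            return False
    return True

print(is_plain_text("Hello, world!\r\n"))  # True
print(is_plain_text("\x1b[31m"))           # False: ESC is a C0 control
```

Checking the encoding scheme itself (condition 2) happens one layer below this, when the byte stream is decoded.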
They are not assigned to any character, even if a pair of encoding forms use them internally to represent code units (not code points directly, which are first converted into two code units); this means that some documents are valid UTF-16 or UTF-32 documents even though they are not plain text under the current system (I don't like this situation, because UTF-16 and UTF-32 documents are supposed to be interchangeable, even if they are not all convertible to UTF-8).

But with the non-characters in the Arabic presentation forms, everything proceeds as if they were reserved for a possible future encoding that could use them internally for representing some text using sequences of code units containing or starting with them, or for some still mysterious encoding under a PUA agreement with an unspecified protocol (exactly the same situation as with most C1 controls), or as possible replacements for code units that could collide with the internal use of some standard controls in some protocols (e.g. to re-encode a NULL, or to delimit the end of a variable-length escape sequence, when all other C0 and C1 controls are already used in a terminal protocol). But even in this case, it will be difficult to consider documents containing them as "plain text".

----
Note: I am not discussing the 34 non-characters at positions U+xxFFFE and U+xxFFFF: keep them as non-characters; they are sufficient for all possible internal uses (in fact only U+FFFE and U+FFFF are needed: the first for determining the byte order in streams that accept either big-endian or little-endian ordering, the second to mark the end of a stream), and I've still never seen any application needing more non-characters from the Arabic presentation forms for such use.

The non-character U+FFFE can be used to detect the byte order in UTF-16 and UTF-32, but not the bit order within bytes (because U+7FFF and U+FF7F are not non-characters).
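The byte-order detection described above can be sketched as follows: a UTF-16 stream of unknown endianness starts with U+FEFF, so reading the first code unit as the noncharacter 0xFFFE (which can never legitimately begin a text) means the bytes are swapped. A minimal sketch, with a function name of my own:

```python
import struct

def detect_byte_order(data: bytes) -> str:
    """Classify a UTF-16 stream by its leading BOM code unit."""
    (unit,) = struct.unpack(">H", data[:2])  # read first unit as big-endian
    if unit == 0xFEFF:
        return "big-endian"
    if unit == 0xFFFE:                       # noncharacter: bytes are swapped
        return "little-endian"
    return "unknown"                         # no BOM: order must come from the protocol

print(detect_byte_order(b"\xfe\xff\x00A"))  # big-endian
print(detect_byte_order(b"\xff\xfeA\x00"))  # little-endian
```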
This is a problem in some protocols that can accept both without an explicit prior encoding of this order (they could need another non-character to help determine the bit order, in which case the encoding of the non-character U+1FFFE could be used: if the bit order is swapped in UTF-16, we get 0xDFFE as the second UTF-16 code unit, from which we can determine the bit order from the position of the clear bit with value 0x20000).

But unlike the current code unit 0xFEFF used to detect a swapped BOM (which is considered a valid character and in-band, stripped conditionally only in the leading position), U+1FFFE could be treated as a non-character and its presence always out-of-band, so once the bit and byte order has been detected, or changed with it within a stream of code units, it can always be stripped from the plain-text output of code points.

Maybe in the future we'll need more distinctive order marks for bits, bytes, and code units, but I am convinced that the 34 code points U+xFFFE and U+xFFFF will be far more than enough **without** ever needing to use the non-characters in the Arabic presentation forms block.
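For reference, the surrogate pair arithmetic behind the U+1FFFE example above can be checked directly: subtracting 0x10000 gives 0xFFFE, whose high ten bits yield the lead surrogate 0xD83F and whose low ten bits yield the trail surrogate 0xDFFE, so 0xDFFE is indeed the second code unit:

```python
import struct

# Encode the noncharacter U+1FFFE in big-endian UTF-16 and unpack
# its two 16-bit code units.
units = struct.unpack(">2H", "\U0001FFFE".encode("utf-16-be"))
print([hex(u) for u in units])  # ['0xd83f', '0xdffe']
```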
_______________________________________________
Unicode mailing list
[email protected]
http://unicode.org/mailman/listinfo/unicode

