Re: Corrigendum #9

Asmus Freytag Fri, 30 May 2014 11:51:28 -0700

On 5/30/2014 11:26 AM, Karl Williamson wrote:

I'm having a problem with this
http://www.unicode.org/versions/corrigendum9.html


You are not alone.

Some people now think it means that noncharacters are really no different from private-use characters, and should be treated very similarly if not identically.
It seems to me that they should be illegal in open interchange, or perhaps illegal in interchange without prior agreement.
Any system (process or group of related, cooperating processes) that uses noncharacters will want to not have any of the ones it uses present in its inputs. It will want to filter them out of those inputs, likely turning each into a REPLACEMENT CHARACTER. If it fails to do that, it leaves itself vulnerable to an attack by hackers, who can fool it into thinking the input data is different from what it really is.
Hence, a system that creates outputs containing noncharacters cannot be assured that any other system will accept those noncharacters.
Thus, I don't see how noncharacters can be considered to be valid in public interchange, given that the producers have to assume that the consumers will not accept them. Producers can assume that consumers will accept private-use characters, though they may not know their intent.


This is an important distinction.

One of the concerns was that people felt that they had to have "data pipeline" style implementations (tools) go and filter these out - even if there was no intent for the implementation to use them internally in any way. Making clear that the standard does not require filtering allows for cleaner implementations of such "pass-through" tools.
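
To make the distinction concrete, here is a minimal sketch in Python. The function names (is_noncharacter, sanitize_input, pass_through) are made up for illustration and are not taken from the standard or from any library. It assumes the usual set of 66 noncharacters: U+FDD0..U+FDEF plus the last two code points of each of the 17 planes.

```python
REPLACEMENT_CHARACTER = "\uFFFD"


def is_noncharacter(cp: int) -> bool:
    """Return True if cp is one of the 66 noncharacter code points."""
    # The contiguous range of 32 noncharacters.
    if 0xFDD0 <= cp <= 0xFDEF:
        return True
    # U+nFFFE and U+nFFFF at the end of each plane (planes 0 through 16).
    return 0 <= cp <= 0x10FFFF and (cp & 0xFFFE) == 0xFFFE


def sanitize_input(text: str) -> str:
    """For a system that uses noncharacters internally (hypothetical):
    replace any that arrive in input with U+FFFD."""
    return "".join(
        REPLACEMENT_CHARACTER if is_noncharacter(ord(ch)) else ch
        for ch in text
    )


def pass_through(text: str) -> str:
    """For a pipeline tool that assigns no meaning to noncharacters
    (hypothetical): hand the text on unchanged."""
    return text
```

A system of the first kind would call sanitize_input at its boundary; a tool of the second kind has no need to inspect the text at all.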


However, like you, I feel that the corrigendum went too far.

I think the text in 6.2 section 16.7 is good and does not need to be changed: "Noncharacters ... are forbidden for use in open interchange of Unicode text data"
Perhaps a bit better wording would be, "are forbidden for use in interchange of Unicode text data without prior agreement"
The only reason I can think of for your too-large (in my opinion) backing away from what TUS has said about noncharacters since their inception is to accommodate processes that conform to C7, "that purports to not modify the interpretation of a valid coded character sequence". But, I think there is a better way to do that than what Corrigendum #9 currently says.
I also am curious as to why the consecutive group of 32 noncharacters can't be split off into its own block instead of being part of an Arabic one. I'm unaware of any stability policy forbidding this. Another block is to be split, if I recall correctly, to accommodate the new Cherokee characters.

This might have been possible at the time these were added, but now it is probably not feasible. One of the reasons is that block names are exposed (for better or for worse) as character properties and as such are also exposed in regular expressions. While not recommended, it would be really bad if the expression with pseudo-code "IsInArabicPresentationFormB(x)" were to fail, because we split the block into three (with the middle one being the noncharacters).

It's the usual dance: is it better to prevent such breakage, or is it better to not pile up more "exceptions" like noncharacters being filed under Arabic Presentation Forms? The damage from the former is direct and immediate and eventually decays. The damage from the latter is subtle and cumulative over time.
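
The breakage from a block split can be sketched as follows, in Python. This is only an illustration under stated assumptions: the helper names (block_of, is_in_block) and the block names in the split table are invented here. The one fixed fact it relies on is that the 32 contiguous noncharacters U+FDD0..U+FDEF currently lie inside the block Arabic Presentation Forms-A (U+FB50..U+FDFF).

```python
from typing import Dict, Optional

# Current layout: one block covering U+FB50..U+FDFF.
CURRENT_BLOCKS: Dict[str, range] = {
    "Arabic Presentation Forms-A": range(0xFB50, 0xFE00),
}

# Hypothetical three-way split; these names are invented for illustration.
HYPOTHETICAL_BLOCKS: Dict[str, range] = {
    "Arabic Presentation Forms-A1": range(0xFB50, 0xFDD0),
    "Noncharacters": range(0xFDD0, 0xFDF0),
    "Arabic Presentation Forms-A2": range(0xFDF0, 0xFE00),
}


def block_of(cp: int, blocks: Dict[str, range]) -> Optional[str]:
    """Return the name of the block containing cp, or None."""
    for name, span in blocks.items():
        if cp in span:
            return name
    return None


def is_in_block(cp: int, name: str, blocks: Dict[str, range]) -> bool:
    """Stand-in for a block-property test as exposed in a regex engine."""
    return block_of(cp, blocks) == name
```

With CURRENT_BLOCKS, is_in_block(0xFDFA, "Arabic Presentation Forms-A", ...) holds for the assigned character U+FDFA. With HYPOTHETICAL_BLOCKS the same test fails, even though nothing about the character itself has changed.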


Tough choice.

A./

_______________________________________________
Unicode mailing list
[email protected]
http://unicode.org/mailman/listinfo/unicode

