Re: Corrigendum #9

Karl Williamson Wed, 11 Jun 2014 21:34:27 -0700

On 06/02/2014 09:48 AM, Markus Scherer wrote:

On Mon, Jun 2, 2014 at 8:27 AM, Doug Ewell <[email protected]> wrote:


    I suspect everyone can agree on the edge cases, that noncharacters are
    harmless in internal processing, but probably should not appear in
    random text shipped around on the web.


Right, in principle. However, it should be ok to include noncharacters
in CLDR data files for processing by CLDR implementations, and it should
be possible to edit and diff and version-control and web-view those
files etc.

It seems that trying to define "interchange" and "public" in ways that
satisfy everyone will not be successful.

The FAQ already gives some examples of where noncharacters might be
used, should be preserved, or could be stripped, starting with "Q: Are
noncharacters intended for interchange?
<http://www.unicode.org/faq/private_use.html#nonchar6>"

In my view, those Q/A pairs explain noncharacters quite well. If there
are further examples of where noncharacters might be used, should be
preserved, or could be stripped, and that would be particularly useful
to add to the examples already there, then we could add them.

markus

I was unaware of this FAQ. Having read it and re-read this entire
thread, I am still troubled.

I have something like a library that was written a long time ago (not
by me) assuming that noncharacters were illegal in open interchange.
Programs that use the library were guaranteed that they would not
receive noncharacters in their input. They thus are free to use any
noncharacter internally as they wish. Now that Corrigendum #9 has come
out, I'm getting requests to update the library to not reject
noncharacters. The library itself does not use noncharacters. If I (or
someone else) make the requested change, it may silently cause security
holes in those programs that were depending on it doing the rejection,
and that upgrade to use the new version. Some of these programs may have
been written many years ago. The original authors are now dead in some
instances, or have turned the code over to someone else, or haven't
thought about it in years. The current maintainers of those programs
may be unaware of this dependence, and hence may upgrade without
realizing the consequences. Further, the old versions of the library
will soon be unsupported, so there is pressure to upgrade to get bug
fixes and the promise of future support. This means there could be
security holes that a hacker who gets hold of the source can exploit.
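To make the dependence concrete, here is a minimal sketch in Python of
the kind of check described above. It is not the actual library's code:
the names is_noncharacter, check_input and allow_noncharacters are
hypothetical, and keeping rejection as the default is only one possible
way an upgraded library could preserve the old guarantee for callers
that never asked for the change.

```python
def is_noncharacter(cp: int) -> bool:
    """Return True for the 66 Unicode noncharacter code points."""
    # U+FDD0..U+FDEF: a contiguous block of 32 noncharacters.
    if 0xFDD0 <= cp <= 0xFDEF:
        return True
    # The last two code points of each of the 17 planes (U+nFFFE, U+nFFFF).
    return 0 <= cp <= 0x10FFFF and (cp & 0xFFFE) == 0xFFFE


def check_input(text: str, allow_noncharacters: bool = False) -> str:
    """Hypothetical entry point: reject noncharacters unless the caller opts in.

    Assumption for illustration: callers written before Corrigendum #9
    never pass allow_noncharacters, so they keep the old guarantee.
    """
    if not allow_noncharacters:
        for index, ch in enumerate(text):
            if is_noncharacter(ord(ch)):
                raise ValueError(
                    f"noncharacter U+{ord(ch):04X} at index {index}"
                )
    return text
```

A program that uses, say, U+FFFF as an internal marker is only safe as
long as the default path above is the one it gets after upgrading.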

I don't see anything in the FAQ that really addresses this situation. I
think there should be an answer that addresses code written before the
Corrigendum, and that goes into detail about the security issues. My
guess is that the UTC did not really consider the potential for security
holes when making this Corrigendum.

I agree that CLDR should be able to use noncharacters for internal
processing, and that they should be able to be stored in files and
edited. But I believe that version control systems and editors have
just as much right to use noncharacters for their internal purposes. I
disagree with the FAQ that seems to say if you write a utility you
should avoid using noncharacters in its implementation. It might be
that competitive pressure, or just that the particular implementations
don't need noncharacters, would cause some such utilities to accept
some or all noncharacters as inputs. But if I were writing such code,
I can see now how using noncharacters for my purposes would be quite
convenient. CLDR could be considered to be a utility, and its users
might want to use noncharacters for their purposes. Is CLDR constructed
so there is no potential for conflicts here? That is, does it reserve
certain noncharacters for its own use?

The FAQ talks about how various now-noncharacter code points were touted
as sentinel candidates in earlier Unicode versions, and that they are no
longer so. But it really should emphasize that old code may very well
want to continue to use them as sentinels. The answer "Well, the short
answer is no, that is not true—at least, not entirely true." is
misleading in this regard.

The FAQ mentions using 0x7FFFFFFF as a possible sentinel. I did not
realize that that was considered representable in any UTF. Likewise -1.
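For reference, a small Python sketch of the range question raised here.
Both values lie outside the code point range 0..0x10FFFF, so they can
only live in an integer variable and never in UTF-encoded text, whereas
a noncharacter such as U+FFFF is an ordinary encodable scalar value.
The helper names below are hypothetical.

```python
MAX_CODE_POINT = 0x10FFFF


def is_code_point(value: int) -> bool:
    """True if value lies in the Unicode code point range."""
    return 0 <= value <= MAX_CODE_POINT


def is_scalar_value(value: int) -> bool:
    """True if value is a code point that a UTF can encode (not a surrogate)."""
    return is_code_point(value) and not (0xD800 <= value <= 0xDFFF)


# Out-of-range integers: usable as in-memory sentinels, but no UTF form exists.
out_of_range = [value for value in (0x7FFFFFFF, -1) if not is_code_point(value)]

# A noncharacter, by contrast, has a well-defined UTF-8 encoding.
encoded_ffff = "\uffff".encode("utf-8")
```

Here out_of_range ends up holding both candidate sentinels, and
encoded_ffff is the three-byte sequence EF BF BF.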

