On 06/02/2014 09:48 AM, Markus Scherer wrote:
On Mon, Jun 2, 2014 at 8:27 AM, Doug Ewell <[email protected]> wrote:

    I suspect everyone can agree on the edge cases, that noncharacters are
    harmless in internal processing, but probably should not appear in
    random text shipped around on the web.


Right, in principle. However, it should be ok to include noncharacters
in CLDR data files for processing by CLDR implementations, and it should
be possible to edit and diff and version-control and web-view those
files etc.

It seems that trying to define "interchange" and "public" in ways that
satisfy everyone will not be successful.

The FAQ already gives some examples of where noncharacters might be
used, should be preserved, or could be stripped, starting with "Q: Are
noncharacters intended for interchange?
<http://www.unicode.org/faq/private_use.html#nonchar6>"

In my view, those Q/A pairs explain noncharacters quite well. If there
are further examples of where noncharacters might be used, should be
preserved, or could be stripped, and that would be particularly useful
to add to the examples already there, then we could add them.

markus



I was unaware of this FAQ. Having read it and re-read this entire thread, I am still troubled.

I have something like a library that was written a long time ago (not by me) on the assumption that noncharacters were illegal in open interchange. Programs that use the library were guaranteed that they would not receive noncharacters in their input, so they are free to use any noncharacter internally as they wish. Now that Corrigendum #9 has come out, I'm getting requests to update the library so that it no longer rejects noncharacters. The library itself does not use noncharacters. If I (or someone else) make the requested change, it may silently open security holes in programs that depended on that rejection and that upgrade to the new version. Some of these programs may have been written many years ago. The original authors are now dead in some instances, or have turned the code over to someone else, or haven't thought about it in years. The current maintainers of those programs may be unaware of this dependence, and hence may upgrade without realizing the consequences. Further, the old versions of the library will soon be unsupported, so there is pressure to upgrade to get bug fixes and the promise of future support. This means there could be security holes that a hacker who gets hold of the source can exploit.
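
To make concrete the kind of check at stake, here is a minimal sketch in Python. It is not the actual library; the names is_noncharacter, check_input, and allow_noncharacters are hypothetical. It assumes one possible way to honor such requests: keep rejection as the default and make acceptance an explicit opt-in, so that old callers keep the guarantee they relied on.

```python
def is_noncharacter(cp: int) -> bool:
    """Return True for the 66 Unicode noncharacter code points."""
    if not 0 <= cp <= 0x10FFFF:
        return False
    # U+FDD0..U+FDEF, plus the last two code points of each of the 17 planes
    # (U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, ..., U+10FFFE, U+10FFFF).
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def check_input(text: str, allow_noncharacters: bool = False) -> str:
    """Hypothetical entry point: reject noncharacters unless the caller opts in."""
    if not allow_noncharacters:
        for index, ch in enumerate(text):
            if is_noncharacter(ord(ch)):
                raise ValueError(
                    f"noncharacter U+{ord(ch):04X} at index {index}"
                )
    return text
```

With a default like this, a program that upgrades without changing its calls still never sees a noncharacter; only callers that pass allow_noncharacters=True take on the responsibility themselves.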

I don't see anything in the FAQ that really addresses this situation. I think there should be an answer that addresses code written before the Corrigendum, and that goes into detail about the security issues. My guess is that the UTC did not really consider the potential for security holes when making this Corrigendum.

I agree that CLDR should be able to use noncharacters for internal processing, and that they should be able to be stored in files and edited. But I believe that version control systems and editors have just as much right to use noncharacters for their internal purposes. I disagree with the FAQ, which seems to say that if you write a utility you should avoid using noncharacters in its implementation. It might be that competitive pressure, or just the fact that particular implementations don't need noncharacters, would cause some such utilities to accept some or all noncharacters as inputs. But if I were writing such code, I can now see how using noncharacters for my purposes would be quite convenient. CLDR could be considered to be a utility, and its users might want to use noncharacters for their purposes. Is CLDR constructed so there is no potential for conflicts here? That is, does it reserve certain noncharacters for its own use?

The FAQ talks about how various now-noncharacter code points were touted as sentinel candidates in earlier Unicode versions, and that they are no longer so. But it really should emphasize that old code may very well want to continue to use them as sentinels. The answer "Well, the short answer is no, that is not true—at least, not entirely true." is misleading in this regard.

The FAQ mentions using 0x7FFFFFFF as a possible sentinel. I did not realize that that was considered representable in any UTF. Likewise -1.
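
As a quick sanity check on that point, here is a small Python sketch (the helper name utf8_or_none is hypothetical). It relies only on the fact that code points run from 0 to 0x10FFFF, so values such as 0x7FFFFFFF and -1 can exist as integers in memory but have no encoded form, whereas a noncharacter such as U+FFFF does.

```python
def utf8_or_none(value: int):
    """Return the UTF-8 bytes for a code point, or None if there are none."""
    try:
        # chr() rejects anything outside 0..0x10FFFF; encode() rejects surrogates.
        return chr(value).encode("utf-8")
    except (ValueError, OverflowError):
        return None


# Out-of-range sentinels have no UTF-8 form at all.
print(utf8_or_none(0x7FFFFFFF))
print(utf8_or_none(-1))
# A noncharacter is still a code point and encodes like any other.
print(utf8_or_none(0xFFFF))
```

So a sentinel of that kind can only live in an integer variable inside a program; it can never arrive in, or leak out through, encoded text.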

_______________________________________________
Unicode mailing list
[email protected]
http://unicode.org/mailman/listinfo/unicode
