On Wed, Jun 11, 2014 at 9:29 PM, Karl Williamson <[email protected]> wrote:
> I have a something like a library that was written a long time ago (not by
> me) assuming that noncharacters were illegal in open interchange. Programs
> that use the library were guaranteed that they would not receive
> noncharacters in their input. They thus are free to use any noncharacter
> internally as they wish. Now that Corrigendum #9 has come out, I'm getting
> requests to update the library to not reject noncharacters. The library
> itself does not use noncharacters. If I (or someone else) makes the
> requested change, it may silently cause security holes in those programs
> that were depending on it doing the rejection, and who upgrade to use the
> new version.
>

If your library makes an explicit promise to remove noncharacters, then it should continue to do so. However, if your library is understood to pass through any strings, except for the advertised processing, then noncharacters should probably be preserved.

> I don't see anything in the FAQ that really addresses this situation. I
> think there should be an answer that addresses code written before the
> Corrigendum, and that goes into detail about the security issues. My guess
> is that the UTC did not really consider the potential for security holes
> when making this Corrigendum.
>

There is nothing really new in the corrigendum. The UTC felt that some implementers had misinterpreted inconsistent and misleading statements in and around the standard, and clarified the situation.

Any process that requires certain characters or sequences to not occur in the input must explicitly check for those, regardless of whether they are noncharacters, private use characters, unassigned code points, control codes, deprecated language tag characters, discouraged stateful formatting controls, stacks of hundreds of diacritics, or whatever.

In a sense, noncharacters are much like the old control codes.
Some terminals say "beep" when they see U+0007, or go into strange modes when they see U+001B; on Windows, when you read a text file that contains U+001A, it is interpreted as an end-of-file marker. If your process depended on those things not happening, then you would have to strip those control codes on input. But a pass-through-style library will be universally expected not to do anything special with them.

> I agree that CLDR should be able to use noncharacters for internal
> processing, and that they should be able to be stored in files and edited.
> But I believe that version control systems and editors have just as much
> right to use noncharacters for their internal purposes.

I disagree. If svn or git choked on noncharacters or control codes or private use characters or unassigned code points etc., I would complain. Likewise, I expect to be able to use plain text or programming editors (gedit, kate, vi, emacs, Visual Studio) to handle files with such characters just fine.

I do *not* necessarily expect Word, OpenOffice, or Google Docs to handle all of these.

> Is CLDR constructed so there is no potential for conflicts here? That is,
> does it reserve certain noncharacters for its own use?
>

I believe that CLDR only uses noncharacters for special purposes in collation. In CLDR data files, there are at most contraction mappings that start with noncharacters for purposes of building alphabetic-index tables. (And those noncharacters are \u-escaped in CLDR XML files since CLDR 24.) There is no mechanism to remove them from any input, but the worst thing that would happen is that you get a sequence of code points to sort interestingly.

> The FAQ mentions using 0x7FFFFFFF as a possible sentinel. I did not
> realize that that was considered representable in any UTF. Likewise -1.
>

No, and that's the point of using those.
Integer values that are not code points make for great sentinels in API functions, such as a next() iterator returning -1 when there is no next character.

markus
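
To make the two points above concrete, here is a minimal sketch in Python. It is only an illustration: the names (DONE, is_noncharacter, check_no_noncharacters, CodePointIterator) are hypothetical and are not taken from ICU, CLDR, or the library under discussion. It assumes the usual definition of the 66 noncharacters (U+FDD0..U+FDEF plus the last two code points of each of the 17 planes). The first part is the kind of explicit input check that a process has to do for itself if it reserves noncharacters internally; the second is a next() iterator that returns -1, which is safe as a sentinel precisely because it is not a code point.

```python
# Illustrative sketch only; all names here are hypothetical.

# -1 lies outside the code point range 0..0x10FFFF, so it can never
# collide with anything that appears in text.
DONE = -1


def is_noncharacter(cp: int) -> bool:
    """Return True if cp is one of the 66 Unicode noncharacters."""
    if 0xFDD0 <= cp <= 0xFDEF:
        return True
    # U+nFFFE and U+nFFFF for each plane n = 0..16.
    return 0 <= cp <= 0x10FFFF and (cp & 0xFFFE) == 0xFFFE


def check_no_noncharacters(text: str) -> str:
    """Explicit check for a process that uses noncharacters internally.

    Raises ValueError on the first noncharacter; otherwise returns the
    text unchanged.
    """
    for index, ch in enumerate(text):
        if is_noncharacter(ord(ch)):
            raise ValueError(
                f"noncharacter U+{ord(ch):04X} at index {index}"
            )
    return text


class CodePointIterator:
    """Iterates over code points; next() returns DONE at the end."""

    def __init__(self, text: str) -> None:
        self._text = text
        self._pos = 0

    def next(self) -> int:
        if self._pos >= len(self._text):
            return DONE
        cp = ord(self._text[self._pos])
        self._pos += 1
        return cp
```

Note that the iterator passes U+FFFF through like any other code point; only a caller that needs noncharacters to be absent would run check_no_noncharacters() on its input first.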
_______________________________________________
Unicode mailing list
[email protected]
http://unicode.org/mailman/listinfo/unicode

