Re: Corrigendum #9

CE Whitehead Tue, 24 Jun 2014 06:20:36 -0700


What Markus Scherer said sounds right to me as a recommendation (perhaps it 
should be stated in Corrigendum 9):
http://www.unicode.org/mail-arch/unicode-ml/y2014-m06/0148.html


From: Markus Scherer <markus.icu_at_gmail.com>
Date: Thu, 12 Jun 2014 01:37:49 -0700
> If your library makes an explicit promise to remove noncharacters, then it
> should continue to do so.
> However, if your library is understood to pass through any strings, except
> for the advertised processing, then noncharacters should probably be
> preserved. 
ME: Am I to understand from the above, with regard to 
www.unicode.org/L2/L2013/13015-nonchars.pdf (which rejects the "bold 
interpretation", though I don't think that's what Markus's email does), that
the "'bold interpretation' of internal exchange of noncharacters" may continue 
where deletion of a noncharacter is never a good idea and should not happen, 
so that unrecognized noncharacters should simply be silently ignored, 
with "all Unicode scalar values, including those corresponding to noncharacter 
code points and unassigned code points," thus "mapped to unique code unit 
sequences"; 
while, at the same time (albeit, as I understand things, only if the type of 
encoding is recognized), 
noncharacters may be replaced with the replacement character (U+FFFD)? 
In this latter case noncharacters are no longer mapped one-to-one, since every 
noncharacter will have been replaced with U+FFFD. So is that 
one-to-one mapping recommendation going to be changed or not?
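
To make the three behaviours in the question concrete, here is a minimal Python sketch. The function names (is_noncharacter, preserve, remove, replace) are my own and are not taken from Corrigendum 9 or from any library; the sketch only illustrates that pass-through keeps distinct noncharacters distinct, while replacement with U+FFFD collapses them.

```python
def is_noncharacter(cp: int) -> bool:
    """Return True for the 66 Unicode noncharacter code points."""
    # U+FDD0..U+FDEF, plus the last two code points of each of the 17 planes.
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def preserve(s: str) -> str:
    """Pass-through policy: noncharacters are left untouched."""
    return s


def remove(s: str) -> str:
    """Deletion policy: noncharacters are dropped from the string."""
    return "".join(ch for ch in s if not is_noncharacter(ord(ch)))


def replace(s: str) -> str:
    """Replacement policy: every noncharacter becomes U+FFFD."""
    return "".join("\uFFFD" if is_noncharacter(ord(ch)) else ch for ch in s)


a = "x\uFDD0y"
b = "x\uFFFFy"

# Pass-through keeps the two strings distinct, and each still round-trips
# through UTF-8 as its own unique code unit sequence.
still_distinct = preserve(a) != preserve(b)
round_trips = a.encode("utf-8").decode("utf-8") == a

# Replacement maps both strings to the same result, so it is not one-to-one.
collapsed = replace(a) == replace(b)
```

Under these assumptions, pass-through and replacement cannot both satisfy a one-to-one mapping, which is the tension the question above asks about.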


* * *
I also have a question on Peter's notes on TUS 6.0 rule C7 (which, if I 
understand correctly, followed the Unicode 4.0 correction; maybe I should 
have sent this question as a separate email):

http://www.unicode.org/mail-arch/unicode-ml/y2014-m06/0151.html
From: Peter Constable <petercon_at_microsoft.com>
Date: Fri, 13 Jun 2014 05:14:30 +0000
> TUS 6.0:
> C2 = TUS5.0, C2

"C7 When a process purports not to modify the interpretation of a valid coded 
character sequence, it shall make no change to that coded character sequence 
other than the possible replacement of character sequences by their 
canonical-equivalent sequences."

> Interestingly, the change to C7 does not permit non-characters to be replaced 
> or removed at all while claiming not to have left the interpretation intact.
ME: If two sequences are canonically equivalent except that one has 
noncharacters in it, are they still canonically equivalent? (Just a wild 
question; it would be nice to have an answer in the FAQ on noncharacters or 
somewhere; maybe I missed the answer and it was there.)
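
One way to probe this question is with Python's unicodedata module. The sketch below is not an authoritative answer; it only shows how one implementation behaves. The helper canonically_equivalent is a name I made up, and it compares NFD forms on the assumption that noncharacters carry no decomposition mapping and have combining class 0, so normalization leaves them in place.

```python
import unicodedata

NONCHAR = "\uFDD0"  # one of the 66 noncharacters

composed = "\u00E9" + NONCHAR      # precomposed e-acute, then a noncharacter
decomposed = "e\u0301" + NONCHAR   # e + combining acute, then the same noncharacter


def canonically_equivalent(a: str, b: str) -> bool:
    """Compare two strings by their canonical decompositions (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


# Same noncharacter in the same position: only the e-acute spelling differs.
same_with_nonchar = canonically_equivalent(composed, decomposed)

# One string has the noncharacter, the other does not.
same_without_nonchar = canonically_equivalent(composed, "\u00E9")

# Normalization itself leaves the noncharacter as it is.
nonchar_unchanged = unicodedata.normalize("NFC", NONCHAR) == NONCHAR
```

In this sketch two strings that contain the same noncharacter in the same place compare as equivalent, while a string with a noncharacter and one without it do not.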
* * *
Sentinels, Security

Regarding the sentinels: I am an outsider, but I assume that with Corrigendum 9 
U+FFFE will continue to be mentioned as having a generally (not always?) 
standard use throughout; 
it is currently mentioned in Chapter 16.7, and I assume it still will be, 
according to the information in the FAQ and elsewhere:
http://www.unicode.org/faq/private_use.html 
 "U+FFFE. The 16-bit unsigned hexadecimal value U+FFFE is not a Unicode 
character value, and should be taken as a signal that Unicode characters should 
be byte-swapped before interpretation. U+FFFE should only be interpreted as an 
incorrectly byte-swapped version of U+FEFF" 
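
The byte-swapping signal described in that passage can be sketched in Python as follows. The function names are hypothetical, and falling back to big-endian when no byte order mark is present is an assumption of this sketch, not something the FAQ text above states.

```python
def detect_utf16_byte_order(data: bytes) -> str:
    """Read the first 16-bit unit as big-endian and classify the stream."""
    if len(data) < 2:
        return "unknown"
    first = int.from_bytes(data[:2], "big")
    if first == 0xFEFF:
        return "big"       # byte order mark read correctly
    if first == 0xFFFE:
        return "little"    # U+FEFF seen byte-swapped: swap before interpreting
    return "unknown"


def decode_utf16(data: bytes) -> str:
    """Decode UTF-16 data, using a leading U+FEFF / 0xFFFE as the byte-order signal."""
    order = detect_utf16_byte_order(data)
    if order == "big":
        return data[2:].decode("utf-16-be")
    if order == "little":
        return data[2:].decode("utf-16-le")
    # Assumption of this sketch: no signal means big-endian.
    return data.decode("utf-16-be")


little = b"\xff\xfe" + "hi".encode("utf-16-le")
big = b"\xfe\xff" + "hi".encode("utf-16-be")
```

Here the value 0xFFFE is never treated as a character; it only tells the decoder that the remaining code units need their bytes swapped.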

Yes, I agree it would also be nice to have information about the security 
effects of any other sentinels, particularly U+FFFF and U+10FFFF 
-- but I envision most security effects would be caused by removing one of 
these without replacing it (is that right?)
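
As a hedged illustration of the kind of effect I have in mind, the Python sketch below shows a deliberately naive filter that checks a string before noncharacters are stripped from it. Everything here (the filter, the payload, the function names) is hypothetical and simplified; it is not taken from any real library.

```python
def is_noncharacter(cp: int) -> bool:
    """Return True for the 66 Unicode noncharacter code points."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def naive_filter_is_safe(s: str) -> bool:
    """A deliberately simplistic check that looks for a literal '<script'."""
    return "<script" not in s.lower()


def strip_nonchars(s: str) -> str:
    """Delete noncharacters without putting anything in their place."""
    return "".join(ch for ch in s if not is_noncharacter(ord(ch)))


def replace_nonchars(s: str) -> str:
    """Replace each noncharacter with U+FFFD."""
    return "".join("\uFFFD" if is_noncharacter(ord(ch)) else ch for ch in s)


payload = "<scr\uFFFFipt>alert(1)</scr\uFFFFipt>"

passes_before = naive_filter_is_safe(payload)                        # check runs first
unsafe_after_strip = not naive_filter_is_safe(strip_nonchars(payload))
safe_after_replace = naive_filter_is_safe(replace_nonchars(payload))
```

In this sketch deletion joins the text on either side of the noncharacter into something the earlier check would have rejected, while replacement with U+FFFD keeps the two halves apart.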

Hope these questions are helpful.
Best,

--C. E. Whitehead

_______________________________________________
Unicode mailing list
http://unicode.org/mailman/listinfo/unicode
