[AF:]
It is the wording in your posts that adds to the confusion.
My fundamental point is, has been, and continues to be that whenever people use the more general term "code point" where the more precise "scalar value" is meant, that in itself "adds to the confusion". If you make the presupposition <http://en.wikipedia.org/wiki/Presupposition> that your sequence of "code points" or "scalar values" contains no surrogate values, then, yes, this will be
[DE:] truly a distinction without a difference
but if you're using these terms without an explicitly stated presupposition, then one will assume that when you say "code point" you do (surprise, surprise) actually mean "code point", which /according to the official definitions/ includes "surrogate code points". I mentioned this a while ago in a question about ICU, and KenW replied that the real world contains bad data. I also think that this
[DE:] it is very unlikely that Twitter and others are storing and interchanging 
loose surrogates
is incorrect. I'm not sure whether the Twitter hack I linked to made use of /loose/ surrogates, but it was based on encoding and storing surrogates.
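
To make concrete what that "bad data" looks like, here's a minimal Python 3 sketch (the error text and the 'surrogatepass' handler are CPython specifics; the definitions are Unicode's). A str position holds a code point, surrogate code points included, which is exactly the gap between "code point" and "scalar value":

    # A lone high surrogate: a perfectly valid *code point* (U+D800),
    # but not a Unicode *scalar value*.
    lone = chr(0xD800)
    print(len(lone))                 # 1 -- stored without complaint

    # The encoding forms are defined over scalar values only, so any
    # attempt to serialize it fails:
    try:
        lone.encode('utf-8')
    except UnicodeEncodeError as err:
        print(err)                   # "... surrogates not allowed"

    # Real-world "bad data" can still be smuggled through on request:
    blob = lone.encode('utf-8', errors='surrogatepass')
    print(blob.hex())                # 'eda080'

That last line is, I'd say, the shape of the problem: surrogate values encoded and stored as if they were scalar values.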


[AF:]
[some paragraphs terminating in:]
Some people writing end user materials may have shown terminological muddle

Sorry to say, but that's apparently how Twitter misconstrued it. The alternative to accepting their interpretation of "code point" (which is rather un-crazy, though your email minimizes how widespread such interpretations, or "mis"construals, are online) is to say that Twitter has been /blatantly/ wrong for a long time in their official attempt to clarify the distinguishing feature of their product, after having had the product out for even longer.

From time to time I encounter products that appear to handle Unicode but whose string handling gets deeply confused once you enter or paste anything beyond the BMP; you can blame this on confusing "code point" with "code unit" instead, but if the first term didn't exist (because it shouldn't), there would be no confusion.
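
A minimal illustration of that confusion, again in Python 3 (the character is arbitrary; any code point beyond the BMP will do):

    s = '\U0001F600'                         # one code point beyond the BMP
    print(len(s))                            # 1 code point
    print(len(s.encode('utf-16-le')) // 2)   # 2 UTF-16 code units
    print(len(s.encode('utf-8')))            # 4 UTF-8 code units (bytes)

A product that reports the length as 2, or lets you split the string between those two halves, is counting code units while talking about code points.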


This qualification
[AF:] by those who have the requisite technical background
of this statement
[AF:] to insinuate that the definitions are widely confused
of course makes it true. As long as "high-surrogate code point" and "low-surrogate code point" aren't officially deprecated, confusion will persist. They should be deprecated, because, /as you say/:
[AF:] Once you add the UTF-prefix, you are, by force, speaking of code units.
So the high-low distinction for "surrogate" code points is misleading, and the "surrogate" attribute shouldn't attach to "code point" at all: as I wrote in a much earlier thread, and as people know, surrogates are UTF-16-specific.
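
The arithmetic makes the point: the high/low split exists solely to map a scalar value onto two 16-bit code units. A sketch (the helper name is mine):

    def utf16_surrogate_pair(scalar):
        # Map a supplementary-plane scalar value (U+10000..U+10FFFF)
        # onto a UTF-16 high/low surrogate pair of 16-bit code units.
        assert 0x10000 <= scalar <= 0x10FFFF
        v = scalar - 0x10000
        high = 0xD800 + (v >> 10)      # "high surrogate": top 10 bits
        low = 0xDC00 + (v & 0x3FF)     # "low surrogate": bottom 10 bits
        return high, low

    print([hex(u) for u in utf16_surrogate_pair(0x1F600)])
    # ['0xd83d', '0xde00'] -- meaningful only as UTF-16 code units

Outside of UTF-16 those two values have no job to do, so labeling them "code points" at all invites exactly the confusion above.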


Stephan
