On Mon, 19 Dec 2016 20:54:31 -0700
Doug Ewell wrote:
> There isn't much to be gained by collapsing the bad bytes to a single
> replacement character. However, doing so does remove the information
> about how many bytes were invalid and that may have value to a user
> in
On 2016/12/20 11:35, Tex Texin wrote:
Shawn,
Ok, but that begs the questions of what to do...
"All bets are off" is not instructive.
Well, it may be instructive in that it's difficult to get software to
decide what happened. A human may be in a better position to analyze the
error and the
On Mon, Dec 19, 2016 at 3:04 PM, Karl Williamson
wrote:
> It seems counterintuitive to me that the two byte sequence C0 80 should be
> replaced by 2 replacement characters under best practices, or that E0 80 80
> should also be replaced by 2. Each sequence was legal in
Shawn,
Ok, but that begs the questions of what to do...
"All bets are off" is not instructive.
How software behaves in the face of invalid bytes, what it does with them, what
it does about them, and how it continues (or not) still needs to be determined.
tex
-Original Message-
From:
I thought there was a corrigendum or other, comparatively recent addition to
the Standard that spelled out how replacement characters are supposed to be
substituted for invalid code unit sequences -- something about detecting
maximally long sequences. I'll look when I have a chance.
--Doug
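
To make the two options in this thread concrete, here is a minimal Python sketch. It is an illustration, not anything posted by the participants. It assumes CPython 3's built-in UTF-8 decoder with errors="replace", which emits one U+FFFD per invalid subsequence it detects. The function names replace_per_subpart and replace_collapsed are hypothetical. The collapsing variant simply merges adjacent U+FFFD in the decoded output, so it would also merge replacement characters that were validly encoded in the input.

```python
import re

REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def replace_per_subpart(data: bytes) -> str:
    """Decode UTF-8, letting the decoder emit one U+FFFD per invalid subsequence."""
    return data.decode("utf-8", errors="replace")


def replace_collapsed(data: bytes) -> str:
    """Decode UTF-8, then merge each run of adjacent U+FFFD into a single one.

    Caveat: a run of U+FFFD that was validly encoded in the input is merged too.
    """
    return re.sub(REPLACEMENT + "+", REPLACEMENT, replace_per_subpart(data))


if __name__ == "__main__":
    # Compare how many replacement characters each approach produces.
    for raw in (b"\xc0\x80", b"\xe0\x80\x80", b"a\xc0\x80b"):
        per_subpart = replace_per_subpart(raw).count(REPLACEMENT)
        collapsed = replace_collapsed(raw).count(REPLACEMENT)
        print(raw.hex(), per_subpart, collapsed)
```

The first function keeps a rough count of how much input was bad, which is the information Doug says may have value to a user. The second loses that count but yields one marker per damaged stretch.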
If there is a short sequence of invalid bytes presumed to be one character,
then one vs several replacement characters may not matter. But if it were a
longer sequence that might have been several invalidly coded characters, then
multiple replacement characters would give a more correct
On Mon, 19 Dec 2016 16:04:06 -0700
Karl Williamson wrote:
> What are the advantages to replacing them by multiple characters
Presumably it just provides more pain for those who code using UTF-8 as
opposed to UTF-16, just like the *former* requirements to be able to be
Karl Williamson wrote:
> It seems counterintuitive to me that the two byte sequence C0 80
> should be replaced by 2 replacement characters under best practices,
> or that E0 80 80 should also be replaced by 2. Each sequence was legal
> in early Unicode versions,
This is overstated at best.
It seems counterintuitive to me that the two byte sequence C0 80 should
be replaced by 2 replacement characters under best practices, or that E0
80 80 should also be replaced by 2. Each sequence was legal in early
Unicode versions, and it seems that it would be best to treat them as
each a