On 12/20/2016 10:33 AM, Markus Scherer wrote:
Yes. However, some of the discussion in this thread is due to details
that were not spelled out in the PRI. There is basically a 2a and a
2b, while the examples in PRI #121 work the same in both variants.
I wasn't intending to argue the case
On Tue, Dec 20, 2016 at 8:59 AM, Ken Whistler wrote:
> You found the resulting text in TUS 9.0, pp. 126-129. The origin of the
> text there about best practices for using U+FFFD was the discussion and
> resolution of PRI #121 in August, 2008:
>
>
Doug,
On 12/19/2016 6:08 PM, Doug Ewell wrote:
I thought there was a corrigendum or other, comparatively recent
addition to the Standard that spelled out how replacement characters
are supposed to be substituted for invalid code unit sequences --
something about detecting maximally long
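[The "maximally long" rule Doug recalls is the "maximal subpart" best practice: each maximal subpart of an ill-formed subsequence draws one U+FFFD. A quick, non-normative way to observe it, assuming a Python 3 interpreter, whose built-in "replace" error handler was aligned with this recommendation:

```python
# Sketch: observing maximal-subpart U+FFFD substitution via Python 3's
# built-in UTF-8 decoder (illustrative, not the normative definition).

# C0 can never begin a well-formed sequence, so C0 and the following 80
# are each their own maximal subpart -- two replacement characters:
print(b"\xc0\x80".decode("utf-8", "replace"))   # two U+FFFD

# E2 82 is a truncated prefix of the well-formed E2 82 AC (U+20AC), i.e.
# one maximal subpart -- a single replacement character:
print(b"\xe2\x82".decode("utf-8", "replace"))   # one U+FFFD
```
]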
On Mon, 19 Dec 2016 20:54:31 -0700
Doug Ewell wrote:
> There isn't much to be gained by collapsing the bad bytes to a single
> replacement character. However, doing so does remove the information
> about how many bytes were invalid and that may have value to a user
> in
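[Doug's point can be shown concretely. A sketch comparing per-maximal-subpart replacement with a hypothetical "one mark per bad run" policy (the `collapse_replacements` helper is my own illustration, not anything from the Standard):

```python
import re

def collapse_replacements(s: str) -> str:
    # Hypothetical alternative policy: collapse each run of U+FFFD
    # into a single replacement character.
    return re.sub("\ufffd+", "\ufffd", s)

# Three stray continuation bytes between "ab" and "cd":
per_subpart = b"ab\x80\x80\x80cd".decode("utf-8", "replace")
# per_subpart == "ab\ufffd\ufffd\ufffdcd" -- three bad bytes, three marks

collapsed = collapse_replacements(per_subpart)
# collapsed == "ab\ufffdcd" -- the count of bad bytes is now unrecoverable
```
]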
On 2016/12/20 11:35, Tex Texin wrote:
Shawn,
Ok, but that begs the question of what to do...
"All bets are off" is not instructive.
Well, it may be instructive in that it's difficult to get software to
decide what happened. A human may be in a better position to analyze the
error and the
On Mon, Dec 19, 2016 at 3:04 PM, Karl Williamson
wrote:
> It seems counterintuitive to me that the two byte sequence C0 80 should be
> replaced by 2 replacement characters under best practices, or that E0 80 80
> should also be replaced by 2. Each sequence was legal in
-
From: Shawn Steele [mailto:shawn.ste...@microsoft.com]
Sent: Monday, December 19, 2016 5:41 PM
To: Tex Texin; 'Doug Ewell'; 'Unicode Mailing List'
Cc: 'Karl Williamson'
Subject: RE: Best practices for replacing UTF-8 overlongs
IMO, bad bytes == corruption. At that point all bets are off.
Cc: 'Karl Williamson' <pub...@khwilliamson.com>
Subject: RE: Best practices for replacing UTF-8 overlongs
If there is a short sequence of invalid bytes presumed to be one character,
then one vs several replacement characters may not matter. But if it were a
longer sequence that might hav
of the document is suspect.
tex
-Original Message-
From: Unicode [mailto:unicode-boun...@unicode.org] On Behalf Of Shawn Steele
Sent: Monday, December 19, 2016 4:26 PM
To: Doug Ewell; Unicode Mailing List
Cc: Karl Williamson
Subject: RE: Best practices for replacing UTF-8 overlongs
IMO
On Mon, 19 Dec 2016 16:04:06 -0700
Karl Williamson wrote:
> What are the advantages to replacing them by multiple characters
Presumably it just provides more pain for those who code using UTF-8 as
opposed to UTF-16, just like the *former* requirements to be able to be
Karl Williamson wrote:
> It seems counterintuitive to me that the two byte sequence C0 80
> should be replaced by 2 replacement characters under best practices,
> or that E0 80 80 should also be replaced by 2. Each sequence was legal
> in early Unicode versions,
This is overstated at best.
It seems counterintuitive to me that the two byte sequence C0 80 should
be replaced by 2 replacement characters under best practices, or that E0
80 80 should also be replaced by 2. Each sequence was legal in early
Unicode versions, and it seems that it would be best to treat them as
each a
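[For comparison, here is what a decoder that follows the maximal-subpart recommendation actually produces for the two sequences in question (a sketch using Python 3, whose "replace" handler follows that recommendation; note it yields three marks for E0 80 80, since E0 80 is not a prefix of any well-formed sequence):

```python
# The two overlong sequences from the thread, run through a decoder
# following the maximal-subpart recommendation (illustrative only).

# C0 is not a prefix of any well-formed sequence, so C0 and 80 each
# become one U+FFFD:
assert b"\xc0\x80".decode("utf-8", "replace") == "\ufffd" * 2

# E0 must be followed by A0..BF, so E0 80 is likewise not a prefix of
# any well-formed sequence; E0, 80, 80 each become one U+FFFD:
assert b"\xe0\x80\x80".decode("utf-8", "replace") == "\ufffd" * 3
```
]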