Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-30 Thread Karl Williamson via Unicode
On 05/30/2017 02:30 PM, Doug Ewell via Unicode wrote:
L2/17-168 says: "For UTF-8, recommend evaluating maximal subsequences based on the original structural definition of UTF-8, without ever restricting trail bytes to less than 80..BF. For example: is a single maximal subsequence because C0

RE: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-30 Thread Doug Ewell via Unicode
L2/17-168 says: "For UTF-8, recommend evaluating maximal subsequences based on the original structural definition of UTF-8, without ever restricting trail bytes to less than 80..BF. For example: is a single maximal subsequence because C0 was originally a lead byte for two-byte sequences." When
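A minimal sketch of the structural reading quoted above, for illustration only: the lead byte alone fixes the expected length, and any byte in 80..BF is accepted as a trail byte. The function names are hypothetical, and the handling of lone trail bytes and of F8..FF (one U+FFFD each), as well as the cap at four-byte forms, are simplifying assumptions that the quoted text does not confirm.

```python
def structural_length(lead: int) -> int:
    """Length implied by the lead byte alone; 0 if it is not treated as a lead byte.

    Assumption: only 2-, 3- and 4-byte forms are modelled here.
    """
    if 0xC0 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF7:
        return 4
    return 0


def count_fffd_structural(data: bytes) -> int:
    """Count the U+FFFD a decoder would emit under the structural reading."""
    count = 0
    i = 0
    while i < len(data):
        byte = data[i]
        if byte < 0x80:  # ASCII is always well-formed
            i += 1
            continue
        expected = structural_length(byte)
        if expected == 0:  # lone trail byte or F8..FF (assumed: one U+FFFD each)
            count += 1
            i += 1
            continue
        # Take the lead byte plus up to expected-1 trail bytes in 80..BF.
        j = i + 1
        while j < i + expected and j < len(data) and 0x80 <= data[j] <= 0xBF:
            j += 1
        try:
            data[i:j].decode("utf-8")  # strict: accepts only well-formed UTF-8
        except UnicodeDecodeError:
            count += 1  # the whole subsequence becomes a single U+FFFD
        i = j
    return count
```

Under these assumptions `count_fffd_structural(b"\xc0\xaf")` returns 1, because C0 is read as a two-byte lead and AF as its trail byte. CPython 3's own decoder, called with `errors="replace"`, emits two U+FFFD for the same bytes.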

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-30 Thread Karl Williamson via Unicode
Under Best Practices, how many REPLACEMENT CHARACTERs should the sequence generate? 0, 1, 2, 3, 4 ? In practice, how many do parsers generate?
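One way to check what a particular parser does in practice is simply to ask it. The sketch below counts the U+FFFD that Python 3's built-in UTF-8 decoder emits. The helper name and the sample bytes are illustrative choices, not taken from the thread, since the byte sequence in the question did not survive in this archive; C0 AF is used because C0 is the lead byte discussed in L2/17-168.

```python
REPLACEMENT = "\ufffd"


def count_replacements(data: bytes) -> int:
    """Count the U+FFFD characters Python's UTF-8 decoder emits for data.

    Note: a well-formed, encoded U+FFFD in the input is counted as well.
    """
    return data.decode("utf-8", errors="replace").count(REPLACEMENT)


if __name__ == "__main__":
    # Sample inputs: an overlong two-byte form, a truncated three-byte
    # sequence, and plain ASCII.
    for sample in (b"\xc0\xaf", b"\xe2\x82", b"abc"):
        print(sample.hex(" "), "->", count_replacements(sample))
```

Other decoders can be probed the same way; the result describes only the implementation being tested, not the recommendation itself.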

RE: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-30 Thread Shawn Steele via Unicode
> Until TUS 3.1, it was legal for UTF-8 parsers to treat the sequence
> as U+002F.
Sort of, maybe. It was not legal for them to generate it though. So you could kind of infer that it was not a legal sequence.
-Shawn

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-30 Thread Doug Ewell via Unicode
That's not at all the same as saying it was a valid sequence. That's saying decoders were allowed to be lenient with invalid sequences. We're supposed to be comfortable with standards language here. Do we really not understand this distinction?
--
Doug Ewell | Thornton, CO, US | ewellic.org

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-30 Thread Martin J. Dürst via Unicode
Hello Markus, others,
On 2017/05/27 00:41, Markus Scherer wrote:
On Fri, May 26, 2017 at 3:28 AM, Martin J. Dürst wrote:
But there's plenty in the text that makes it absolutely clear that some things cannot be included. In particular, it says The term “maximal

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-30 Thread Martin J. Dürst via Unicode
Hello Karl, others,
On 2017/05/27 06:15, Karl Williamson via Unicode wrote:
On 05/26/2017 12:22 PM, Ken Whistler wrote:
On 5/26/2017 10:28 AM, Karl Williamson via Unicode wrote:
The link provided about the PRI doesn't lead to the comments.
PRI #121 (August, 2008) pre-dated the practice of

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-30 Thread Richard Wordingham via Unicode
On Fri, 26 May 2017 11:22:37 -0700 Ken Whistler via Unicode wrote:
> On 5/26/2017 10:28 AM, Karl Williamson via Unicode wrote:
> > The link provided about the PRI doesn't lead to the comments.
> >
>
> PRI #121 (August, 2008) pre-dated the practice of keeping all the
>

Re: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-30 Thread Richard Wordingham via Unicode
On Tue, 30 May 2017 16:38:45 -0600 Karl Williamson via Unicode wrote:
> Under Best Practices, how many REPLACEMENT CHARACTERs should the
> sequence generate? 0, 1, 2, 3, 4 ?
>
> In practice, how many do parsers generate?
See Markus Kuhn's test page

unihan-etl: create exports of UNIHAN db to csv, json and yaml

2017-05-30 Thread Tony Narlock via Unicode
I have created a tool in Python to extract and transform the UNIHAN database's information. It’s open source (MIT-licensed) and offers users customized outputs. It’s documented extensively at https://unihan-etl.git-pull.com. In addition, the project’s source code can be found at

RE: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-30 Thread Shawn Steele via Unicode
> I think nobody is debating that this is *one way* to do things, and that some
> code does it.
Except that they sort of are. The premise is that the "old language was wrong", and the "new language is right." The reason we know the old language was wrong was that there was a bug filed

Looking for 8-bit computer designers

2017-05-30 Thread Doug Ewell via Unicode
Not as OT as it might seem: If there are any engineers or designers on this list who worked on 8-bit and early 16-bit legacy computers (Apple II, Atari, Commodore, Tandy, etc.), and especially on character set design for these machines, please contact me privately at . Any desired degree of

RE: Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8

2017-05-30 Thread Shawn Steele via Unicode
> Which is to completely reverse the current recommendation in Unicode 9.0.
> While I agree that this might help you fending off a bug report, it would
> create chances for bug reports for Ruby, Python3, many if not all Web
> browsers,...
& Windows & .Net
Changing the behavior of the Windows

Re: unihan-etl: create exports of UNIHAN db to csv, json and yaml

2017-05-30 Thread Rebecca T via Unicode
Oh, thank god. I’ve wanted something like this for ages, but I’ve been too lazy to invest the time to create a serious tool — I’ve used a lot of messy one-time regular expressions. Will definitely be starring your repo!