On 05/30/2017 02:30 PM, Doug Ewell via Unicode wrote:
L2/17-168 says:
"For UTF-8, recommend evaluating maximal subsequences based on the
original structural definition of UTF-8, without ever restricting trail
bytes to less than 80..BF. For example: is a single maximal
subsequence because C0 was originally a lead byte for two-byte
sequences."
Under Best Practices, how many REPLACEMENT CHARACTERs should the
sequence generate? 0, 1, 2, 3, 4 ?
In practice, how many do parsers generate?
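For concreteness, here is how one widely deployed decoder, CPython 3 (which follows the current TUS best practice of one U+FFFD per maximal subpart), answers that kind of question. The exact byte sequence under discussion is omitted above, so the bytes below are only assumed illustrative examples:

```python
# Illustrative only: assumed example bytes, not necessarily the exact
# sequence discussed in the thread.

# C0 can never start a well-formed sequence (overlong lead byte), so
# Python 3 emits one U+FFFD per byte: two replacements here.
assert b"\xc0\xaf".decode("utf-8", errors="replace") == "\ufffd\ufffd"

# A truncated but otherwise valid 3-byte sequence (E4 B8 still needs one
# continuation byte) is a single maximal subpart: one replacement.
assert b"\xe4\xb8".decode("utf-8", errors="replace") == "\ufffd"
```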
> Until TUS 3.1, it was legal for UTF-8 parsers to treat the sequence
> as U+002F.
Sort of, maybe. It was not legal for them to generate it though. So you could
kind of infer that it was not a legal sequence.
-Shawn
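The overlong interpretation under discussion is plain bit arithmetic: under the original structural definition of UTF-8, a lenient decoder that simply combined the payload bits of C0 AF landed on U+002F. A minimal sketch (C0 AF is the classic overlong-slash example; the thread's exact bytes are elided above):

```python
# Structural 2-byte decode: 110xxxxx 10yyyyyy -> code point xxxxx yyyyyy
b1, b2 = 0xC0, 0xAF
cp = ((b1 & 0x1F) << 6) | (b2 & 0x3F)
assert cp == 0x2F  # U+002F SOLIDUS '/', the overlong "slash"

# Since TUS 3.1, lead bytes C0..C1 are flatly ill-formed, so a conformant
# decoder must not interpret this pair as '/'.
```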
That's not at all the same as saying it was a valid sequence. That's saying
decoders were allowed to be lenient with invalid sequences.
We're supposed to be comfortable with standards language here. Do we really not
understand this distinction?
--Doug Ewell | Thornton, CO, US | ewellic.org
Hello Markus, others,
On 2017/05/27 00:41, Markus Scherer wrote:
On Fri, May 26, 2017 at 3:28 AM, Martin J. Dürst
wrote:
But there's plenty in the text that makes it absolutely clear that some
things cannot be included. In particular, it says
The term “maximal
On Fri, 26 May 2017 11:22:37 -0700
Ken Whistler via Unicode wrote:
> On 5/26/2017 10:28 AM, Karl Williamson via Unicode wrote:
> > The link provided about the PRI doesn't lead to the comments.
> >
>
> PRI #121 (August, 2008) pre-dated the practice of keeping all the
>
On Tue, 30 May 2017 16:38:45 -0600
Karl Williamson via Unicode wrote:
> Under Best Practices, how many REPLACEMENT CHARACTERs should the
> sequence generate? 0, 1, 2, 3, 4 ?
>
> In practice, how many do parsers generate?
See Markus Kuhn's test page, the "UTF-8 decoder capability and stress test"
I have created a tool in Python to extract and transform the UNIHAN database's
information. It’s open source (MIT-licensed) and offers users customizable
outputs. It’s documented extensively at https://unihan-etl.git-pull.com. In
addition, the project’s source code can be found at
> I think nobody is debating that this is *one way* to do things, and that some
> code does it.
Except that they sort of are. The premise is that the "old language was
wrong", and the "new language is right." The reason we know the old language
was wrong was that there was a bug filed
Not as OT as it might seem:
If there are any engineers or designers on this list who worked on 8-bit
and early 16-bit legacy computers (Apple II, Atari, Commodore, Tandy,
etc.), and especially on character set design for these machines, please
contact me privately at . Any desired degree of
> Which is to completely reverse the current recommendation in Unicode 9.0.
> While I agree that this might help you fending off a bug report, it would
> create chances for bug reports for Ruby, Python3, many if not all Web
> browsers,...
& Windows & .NET
Changing the behavior of the Windows
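The disagreement above can be made concrete. For the bytes C0 80 (an assumed example, not necessarily the thread's exact sequence), the structural rule proposed in L2/17-168 treats C0 80 as one maximal subsequence, yielding a single U+FFFD, while the Unicode 9.0 recommendation followed by Ruby, Python 3, and the major browsers never lets C0 consume a trail byte, yielding two. A quick check against Python 3:

```python
# Assumed example bytes; not necessarily the thread's exact sequence.
data = b"\xc0\x80"

# Unicode 9.0 best practice (what CPython implements): C0 is never a
# valid lead byte, so each of the two bytes becomes its own U+FFFD.
assert data.decode("utf-8", errors="replace") == "\ufffd\ufffd"

# The structural rule proposed in L2/17-168 would instead treat C0 80 as
# one maximal subsequence (C0 was a 2-byte lead in original UTF-8),
# producing a single U+FFFD.
```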
Oh, thank god. I’ve wanted something like this for ages, but I’ve been too
lazy to invest the time to create a serious tool — I’ve used a lot of messy
one-time regular expressions. Will definitely be starring your repo!