IMO, encodings, particularly stateful ones such as this, may have
multiple ways to output the same, or similar, sequences. Which means that
pretty much any time an encoding transforms data, any previous security or
other validation-style checks are no longer valid and any
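That hazard can be sketched in a few lines of Python. Assumption: `lenient_decode` below is a toy stand-in for a hypothetical pre-Unicode-3.1 decoder that accepted the overlong two-byte form C0 AF as "/"; it is not any real library's behavior.

```python
# Toy sketch of the validate-then-transform hazard (not a real decoder).
def lenient_decode(data: bytes) -> str:
    """Hypothetical lenient decoder that maps the overlong form C0 AF to '/'."""
    return data.replace(b"\xc0\xaf", b"/").decode("utf-8", "replace")

payload = b"\xc0\xaf" + b"etc"        # overlong "/" followed by "etc"
# A byte-level security check runs before the transformation and passes:
assert not payload.startswith(b"/")   # "not an absolute path"
# ...but after decoding, the check's conclusion no longer holds:
assert lenient_decode(payload) == "/etc"
```

The check was valid for the bytes, not for the text the bytes become.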
I'm not opposed to a sub-block for "Modern Hieroglyphs"
I confess that even though I know nothing about Hieroglyphs, I find it
fascinating that such a thoroughly dead script might still be living in some
way, even if only a little bit.
-Shawn
-Original Message-
From:
> From the point of view of Unicode, it is simpler: If the character is in use
> or have had use, it should be included somehow.
That bar, to me, seems too low. Many things are only used briefly or in a
private context that doesn't really require encoding.
The hieroglyphs discussion is
I think you're overstating my concern :)
I meant that those things tend to be particular to a certain context and often
aren't interesting for interchange. A text editor might find it convenient to
place word boundaries in the middle of something another part of the system
thinks is a single
But... it's not actually discardable. The hypothetical "packet" architecture
(using the term "architecture" somewhat loosely) depends on the information
being tunneled in by this character. If it were actually discardable, then the
"noop" character wouldn't be required, as it would be discarded.
Assuming you were using any of those characters as "markup", how would you know
when they were intentionally in the string and not part of your marking system?
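A small sketch of that ambiguity, assuming (purely hypothetically) that U+200B ZERO WIDTH SPACE were pressed into service as the in-band marker:

```python
# If the marker can legitimately occur in input text, stripping is lossy.
MARK = "\u200b"  # hypothetical in-band marker (ZERO WIDTH SPACE)

def insert_marker(text: str, i: int) -> str:
    """Insert our marker at position i."""
    return text[:i] + MARK + text[i:]

original = "foo" + MARK + "bar"      # MARK was already in the data
marked = insert_marker(original, 1)
stripped = marked.replace(MARK, "")  # strip "our" markers when done
assert stripped == "foobar"          # the intentional MARK is gone too
assert stripped != original          # the original string is unrecoverable
```

Nothing distinguishes the marker you inserted from the one that was already there.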
-Original Message-
From: Unicode On Behalf Of Richard Wordingham via
Unicode
Sent: Saturday, June 22, 2019 4:17 PM
To:
+ the list. For some reason the list's reply header is confusing.
From: Shawn Steele
Sent: Saturday, June 22, 2019 4:55 PM
To: Sławomir Osipiuk
Subject: RE: Unicode "no-op" Character?
The original comment about putting it between the base character and the
combining diacritic seems peculiar.
I'm curious what you'd use it for?
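The peculiarity is easy to demonstrate: any character with combining class 0 inserted between a base character and its diacritic blocks canonical composition. A sketch using U+200B as the interloper:

```python
import unicodedata

# Normal case: base + combining acute composes to the precomposed character.
assert unicodedata.normalize("NFC", "e\u0301") == "\u00e9"

# With something inserted between base and diacritic, composition is blocked
# and the diacritic now applies to the inserted character instead.
interrupted = unicodedata.normalize("NFC", "e\u200b\u0301")
assert "\u00e9" not in interrupted
assert interrupted == "e\u200b\u0301"  # unchanged by NFC
```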
From: Unicode On Behalf Of Slawomir Osipiuk via
Unicode
Sent: Friday, June 21, 2019 5:14 PM
To: unicode@unicode.org
Subject: Unicode "no-op" Character?
Does Unicode include a character that does nothing at all? I'm talking about
something that can be used
>> If they are obsolete apps, they don’t use CLDR / ICU, as these are designed
>> for up-to-date and fully localized apps. So one hassle is off the table.
Windows uses CLDR/ICU. Obsolete apps run on Windows. That statement is a
little narrow-minded.
>> I didn’t look into these date
>> Keeping these applications outdated has no other benefit than providing a
>> handy lobbying tool against support of NNBSP.
I believe you’ll find that there are some French banks and other institutions
that depend on such obsolete applications (unfortunately).
Additionally, I believe you’ll
I've been lurking on this thread a little.
This discussion has gone “all over the place”; however, I’d like to point out
that part of the reason NBSP has been used for thousands separators is that
it exists in all of those legacy codepages that were mentioned, predating
Unicode.
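That difference is easy to see with Python's codecs: NBSP (U+00A0) round-trips through the common single-byte legacy codepages, while NARROW NO-BREAK SPACE (U+202F) has no mapping there.

```python
nbsp, nnbsp = "\u00a0", "\u202f"

# NBSP has a slot (0xA0) in Latin-1, cp1252, and friends.
assert nbsp.encode("latin-1") == b"\xa0"
assert nbsp.encode("cp1252") == b"\xa0"

# NNBSP cannot be encoded in those codepages.
try:
    nnbsp.encode("cp1252")
except UnicodeEncodeError:
    pass  # expected: U+202F has no cp1252 mapping
else:
    raise AssertionError("expected U+202F to be unmappable in cp1252")
```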
Whether
b 2018 21:38:19 +
Shawn Steele via Unicode <unicode@unicode.org> wrote:
> I realize "I'd've" isn't
> "right",
Where did that proscription come from? Is it perhaps a perversion of the
proscription of "I'd of"?
Richard.
For voice we certainly get clues about the speaker's intent from their tone.
That tone can change the meaning of the same written word quite a bit. No
video is needed for two different readings of the exact same words to carry
wildly different meanings.
Writers have always taken liberties
Subject: Re: Feedback on the proposal to change U+FFFD generation when decoding
ill-formed UTF-8
On 6/1/2017 10:41 AM, Shawn Steele via Unicode wrote:
I think that the (or a) key problem is that the current "best practice" is
treated as "SHOULD" in RFC parlance. W
Subject: Re: Feedback on the proposal to change U+FFFD generation when decoding
ill-formed UTF-8
On 1 Jun 2017, at 10:32, Henri Sivonen via Unicode <unicode@unicode.org> wrote:
>
> On Wed, May 31, 2017 at 10:42 PM, Shawn Steele via Unicode
> <unicode@unicode.org>
> And *that* is what the specification says. The whole problem here is that
> someone elevated
> one choice to the status of “best practice”, and it’s a choice that some of
> us don’t think *should*
> be considered best practice.
> Perhaps “best practice” should simply be altered to say that
> it’s more meaningful for whoever sees the output to see a single U+FFFD
> representing
> the illegally encoded NUL than it is to see two U+FFFDs, one for an invalid
> lead byte and
> then another for an “unexpected” trailing byte.
I disagree. It may be more meaningful for some applications
> For implementations that emit FFFD while handling text conversion and repair
> (ie, converting ill-formed
> UTF-8 to well-formed), it is best for interoperability if they get the same
> results, so that indices within the
> resulting strings are consistent across implementations for all the
> > In either case, the bad characters are garbage, so neither approach is
> > "better" - except that one or the other may be more conducive to the
> > requirements of the particular API/application.
> There's a potential issue with input methods that indirectly edit the backing
> store. For
> Until TUS 3.1, it was legal for UTF-8 parsers to treat the sequence
> as U+002F.
Sort of, maybe. It was not legal for them to generate it though. So you could
kind of infer that it was not a legal sequence.
-Shawn
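Modern strict decoders behave accordingly. In Python, for example, the overlong two-byte form of "/" (C0 AF) is rejected outright, and even with error replacement no U+002F ever comes out:

```python
# Strict decoding rejects the overlong "/" (ill-formed since Unicode 3.1).
try:
    b"\xc0\xaf".decode("utf-8")
except UnicodeDecodeError:
    pass  # expected: the sequence is ill-formed
else:
    raise AssertionError("overlong '/' must not decode")

# With error replacement, only U+FFFDs are produced, never U+002F.
repaired = b"\xc0\xaf".decode("utf-8", "replace")
assert "/" not in repaired
assert set(repaired) == {"\ufffd"}
```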
> Which is to completely reverse the current recommendation in Unicode 9.0.
> While I agree that this might help you fending off a bug report, it would
> create chances for bug reports for Ruby, Python3, many if not all Web
> browsers,...
& Windows & .Net
Changing the behavior of the Windows
> I think nobody is debating that this is *one way* to do things, and that some
> code does it.
Except that they sort of are. The premise is that the "old language was
wrong", and the "new language is right." The reason we know the old language
was wrong was that there was a bug filed
So basically this came about because code got bugged for not following the
"recommendation." To fix that, the recommendation will be changed. However
then that is going to lead to bugs for other existing code that does not follow
the new recommendation.
I totally get the forward/backward
> If the thread has made one thing clear is that there's no consensus in the
> wider community
> that one approach is obviously better. When it comes to ill-formed sequences,
> all bets are off.
> Simple as that.
> Adding a "recommendation" this late in the game is just bad standards policy.
I
+ the list, which somehow my reply seems to have lost.
> I may have missed something, but I think nobody actually proposed to change
> the recommendations into requirements
No thanks, that would be a breaking change for some implementations (like mine)
and force them to become non-complying or
> Faster ok, provided this does not break other uses, notably for random
> access within strings…
Either way, this is a “recommendation”. I don’t see how that can provide for
not-“breaking other uses.” If it’s internal, you can do what you will, so if
you need the 1:1 seeming parity, then
But why change a recommendation just because it “feels” better? As you said,
it’s just a recommendation, so if that really annoyed someone, they could do
something else (e.g. they could use a single FFFD).
If the recommendation is truly that meaningless or arbitrary, then we just get
into silly
to:unicode-boun...@unicode.org] On Behalf Of Richard
Wordingham via Unicode
Sent: Tuesday, May 16, 2017 10:58 AM
To: unicode@unicode.org
Subject: Re: Feedback on the proposal to change U+FFFD generation when decoding
ill-formed UTF-8
On Tue, 16 May 2017 17:30:01 +0000
Shawn Steele via Unicode
> Would you advocate replacing
> e0 80 80
> with
> U+FFFD U+FFFD U+FFFD (1)
> rather than
> U+FFFD (2)
> It’s pretty clear what the intent of the encoder was there, I’d say, and
> while we certainly don’t
> want to decode it as a NUL (that was the source of
>> Disagree. An over-long UTF-8 sequence is clearly a single error. Emitting
>> multiple errors there makes no sense.
>
> Changing a specification as fundamental as this is something that should not
> be undertaken lightly.
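For reference, here is what the thread's E0 80 80 example does in CPython, whose decoder follows the maximal-subpart practice; a decoder treating the whole overlong sequence as a single error would emit one U+FFFD instead.

```python
# E0 80 80: 0x80 is not a valid second byte after 0xE0, so the maximal
# subpart is just 0xE0; each of the three bytes becomes its own U+FFFD.
out = b"\xe0\x80\x80".decode("utf-8", "replace")
assert set(out) == {"\ufffd"}
assert len(out) == 3  # CPython's maximal-subpart behavior
```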
IMO, the only thing that can be agreed upon is that "something's bad