\uD808\uDF45 specifies a sequence of two codepoints.
That is simply incorrect.
In Java (and similar environments), \u means a char (a UTF-16 code
unit), not a code point. Here is the difference. If you are not used to
Java, string.replaceAll(x,y) uses Java's regex to replace the pattern x
Your example would have been better explained by just saying that in Java,
the regexp represented in source code as \\uD808\\uDF45 means matching
two successive 16-bit code units, and \\uD808 or \\uDF45 just matches
one.
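The code-unit versus code-point distinction above can be checked directly. The following is only a minimal sketch, assuming Java 8 or later; the class and method names are hypothetical and do not come from the thread. It shows that the string literal holds two chars but one code point, and that a regex spelled as the two escapes matches the whole pair. It deliberately makes no claim about how a lone surrogate in a pattern behaves.

```java
import java.util.regex.Pattern;

public class SurrogateDemo {
    // U+12345 spelled as two escapes: a high surrogate, then a low surrogate.
    static final String PAIR = "\uD808\uDF45";

    /** Number of UTF-16 code units (Java chars) in s. */
    static int codeUnits(String s) {
        return s.length();
    }

    /** Number of Unicode code points in s. */
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    /** True if the regex matches the entire string s. */
    static boolean matchesWhole(String regex, String s) {
        return Pattern.matches(regex, s);
    }

    public static void main(String[] args) {
        System.out.println(codeUnits(PAIR));                           // 2
        System.out.println(codePoints(PAIR));                          // 1
        System.out.println(Integer.toHexString(PAIR.codePointAt(0)));  // 12345
        // The regex is written with doubled backslashes, as in source code.
        System.out.println(matchesWhole("\\uD808\\uDF45", PAIR));      // true
        // Code point notation for the same character (Java 7 and later).
        System.out.println(matchesWhole("\\x{12345}", PAIR));          // true
    }
}
```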
The \\u regex notation (in source code, equivalent to \u in
string
It seems that the broadening of the term interchange in this
corrigendum to mean almost any type of processing imaginable, below,
is what caused the trouble. This is the decision that would need to be
reconsidered if the real intent of noncharacters is to be expressed.
I suspect everyone can
On Mon, Jun 2, 2014 at 8:27 AM, Doug Ewell d...@ewellic.org wrote:
I suspect everyone can agree on the edge cases, that noncharacters are
harmless in internal processing, but probably should not appear in
random text shipped around on the web.
Right, in principle. However, it should be ok to
The problem is where to draw the line. In today's world, what's an app? You
may have a cooperating system of apps, where it is perfectly reasonable
to interchange sentinel values (for example).
I agree with Markus; I think the FAQ is pretty clear. (And if not, that's
where we should make it
I think that is exactly what should be clarified. A cooperating system
of apps should likely use some other markup; however, if they want to use
to say “OK to insert ad here” (or whatever), that’s up to them.
I fear that the current wording says “Because you might have a cooperating
On Mon, Jun 2, 2014 at 6:21 PM, Shawn Steele shawn.ste...@microsoft.com
wrote:
The “problem” is now that previously these characters were illegal
The problem was that we were inconsistent in standard and related material
about just what the status was for these things.
Mark
Shawn Steele Shawn dot Steele at microsoft dot com wrote:
So I’d prefer to see text that better permitted interchange with other
components of an application’s internal system or partner system, yet
discouraged use for interchange with foreign apps.
If any wording is to be revised, while
On 6/2/2014 9:27 AM, Mark Davis ☕️ wrote:
On Mon, Jun 2, 2014 at 6:21 PM, Shawn Steele
shawn.ste...@microsoft.com wrote:
The “problem” is now that previously these characters were illegal
The problem was that we were inconsistent in standard and
On 6/2/2014 9:08 AM, Mark Davis ☕️ wrote:
The problem is where to draw the line. In today's world, what's an
app? You may have a cooperating system of apps, where it is
perfectly reasonable to interchange sentinel values (for example).
The way to draw the line is to insist on there being an
I agree with Markus; I think the FAQ is pretty clear. (And if not,
that's where we should make it clearer.)
But the formal wording of the standard should reflect that clarity, right?
I don't tend to read the FAQ :)
___
Unicode mailing list
I wrote, sort of:
Correct. Most people wouldn't consider a cooperating system like that
quite the same as true public interchange, like throwing this ���
into a message on a public mailing list.
Oh, look. My mail system converted those nice noncharacters into U+FFFD.
Was that compliant? Did
I disagree with that characterization, of course.
The recommendation for libraries and low-level tools to pass them through
rather than screw with them makes them usable. The recommendation to check
for noncharacters from unknown sources and fix them was good advice then,
and is good advice now.
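As an illustration of "check for noncharacters from unknown sources and fix them", here is a minimal sketch, assuming Java 8 or later. The class and method names are hypothetical, and replacing with U+FFFD is just one possible fix, not something the standard or this thread prescribes. It is meant for code at a trust boundary; libraries and low-level tools would pass the text through unchanged. The test relies on the fact that the 66 noncharacters are U+FDD0..U+FDEF plus the last two code points of each of the 17 planes.

```java
public class NoncharacterFilter {
    /** True for U+FDD0..U+FDEF and for any code point ending in FFFE or FFFF. */
    static boolean isNoncharacter(int cp) {
        return (cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE;
    }

    /** Returns a copy of s with each noncharacter replaced by U+FFFD. */
    static String replaceNoncharacters(String s) {
        StringBuilder out = new StringBuilder(s.length());
        // Iterate by code point so supplementary-plane noncharacters are seen whole.
        s.codePoints().forEach(cp -> out.appendCodePoint(isNoncharacter(cp) ? 0xFFFD : cp));
        return out.toString();
    }

    public static void main(String[] args) {
        String cleaned = replaceNoncharacters("a\uFDD0b\uFFFFc");
        // Print the code points of the result in hex, one per line.
        cleaned.codePoints().forEach(cp -> System.out.println(Integer.toHexString(cp)));
    }
}
```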
On 6/2/2014 9:38 AM, Shawn Steele wrote:
I agree with Markus; I think the FAQ is pretty clear. (And if not,
that's where we should make it clearer.)
But the formal wording of the standard should reflect that clarity, right?
I don't tend to read the FAQ :)
FAQs are useful, but they are not
To further my understanding, can someone provide examples of how these are used
in actual practice? I can't think of any offhand and the closest I get is like
the old escape characters to get a dot matrix printer to shift modes, or old
word processor internal formatting sequences.
Oh, look. My mail system converted those nice noncharacters into U+FFFD.
Was that compliant? Did I deserve what I got? Are those two different
questions?
I think I just got spaces.
On Mon, Jun 2, 2014 at 10:00 AM, Shawn Steele shawn.ste...@microsoft.com
wrote:
To further my understanding, can someone provide examples of how these are
used in actual practice?
CLDR collation data defines special contraction mappings that start with a
noncharacter, for
Hmm, I find that disconcerting. I’d prefer a real Unicode character with
special weights if that concept’s needed. And I guess that goes a long ways to
explaining the interchange problem since clearly the code editor’s going to
need these ☹
From: Markus Scherer [mailto:markus@gmail.com]
On Mon, 2 Jun 2014 10:17:04 -0700
Markus Scherer markus@gmail.com wrote:
CLDR collation data defines special contraction mappings that start
with a noncharacter, for
http://www.unicode.org/reports/tr35/tr35-collation.html#CJK_Index_Markers
In CLDR 23 and before (when we were still using
On Mon, 2 Jun 2014 11:29:09 +0200
Mark Davis ☕️ m...@macchiato.com wrote:
\uD808\uDF45 specifies a sequence of two codepoints.
That is simply incorrect.
The above is in the sample notation of UTS #18 Version 17 Section 1.1.
From what I can make out, the corresponding Java notation would
On Mon, Jun 2, 2014 at 8:48 AM, Markus Scherer markus@gmail.com wrote:
Right, in principle. However, it should be ok to include noncharacters in
CLDR data files for processing by CLDR implementations, and it should be
possible to edit and diff and version-control and web-view those files
On Mon, Jun 2, 2014 at 1:32 PM, David Starner prosfil...@gmail.com wrote:
I would especially discourage any web browser from handling
these; they're noncharacters used for unknown purposes that are
undisplayable and if used carelessly for their stated purpose, can
probably trigger serious
On 6/2/2014 2:53 PM, Markus Scherer wrote:
On Mon, Jun 2, 2014 at 1:32 PM, David Starner prosfil...@gmail.com
wrote:
I would especially discourage any web browser from handling
these; they're noncharacters used for unknown purposes that are
undisplayable
On Mon, Jun 2, 2014 at 2:53 PM, Markus Scherer markus@gmail.com wrote:
On Mon, Jun 2, 2014 at 1:32 PM, David Starner prosfil...@gmail.com wrote:
I would especially discourage any web browser from handling
these; they're noncharacters used for unknown purposes that are
undisplayable and if
I can't shake the suspicion that Corrigendum #9 is not actually solving a
general problem, but is a special favor to CLDR as being run by insiders, and
in the process muddying the waters for everyone else
I think we could generalize to other scenarios so it wasn’t necessarily an
insider
I would rather expect this: treat them as you like; there will never be any warranty
of interoperability, and everyone is allowed to use them as they want and even
change that use at any time. The behavior is not defined in TUS, and users cannot
expect that TUS will define this behavior.
There's no clear solution
reserved for CLDR would be wrong in TUS, you have reached a borderline
where you are no longer handling plain text (stream of scalar values
assigned to code points), but binary data via a binary interface outside
TUS (handling streams of collation elements, whose representation is not
even bound
I would like to point out to Asmus that this decision was reached
unanimously at the UTC by Adobe, Apple, Google, IBM, Microsoft, SAP, UC
Berkeley, and Yahoo!
One might disagree with the decision, but there were no special favors
involved.
Lisa
I can't shake the suspicion that
On Mon, 2 Jun 2014 15:09:21 -0700
David Starner prosfil...@gmail.com wrote:
So certain programs can't use noncharacters internally because some
people want to interchange them? That doesn't seem like what
noncharacters should be used for.
Much as I don't like their uninvited use, it is