Re: Unicode Regular Expressions, Surrogate Points and UTF-8

2014-06-02 Thread Mark Davis ☕️
\uD808\uDF45 specifies a sequence of two codepoints. That is simply incorrect. In Java (and similar environments), \u means a char (a UTF-16 code unit), not a code point. Here is the difference. If you are not used to Java, string.replaceAll(x,y) uses Java's regex to replace the pattern x

Re: Unicode Regular Expressions, Surrogate Points and UTF-8

2014-06-02 Thread Philippe Verdy
Your example would have been better explained by just saying that in Java, the regexp represented in source code as \\uD808\\uDF45 means matching two successive 16-bit code units, and \\uD808 or \\uDF45 just matches one. The \\u regexp notation (in source code, equivalent to \u in string
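
The code-unit versus code-point distinction in the two messages above can be sketched in Java as follows. This is an illustrative sketch rather than code from either message; the class name SurrogateDemo and its helper methods are hypothetical. It only inspects a string literal and makes no claim about how a particular regex engine treats the escaped form.

```java
final class SurrogateDemo {
    // The string literal below is the surrogate pair for U+12345:
    // two UTF-16 code units (Java chars) that together encode one code point.
    static final String SAMPLE = "\uD808\uDF45";

    // Number of UTF-16 code units (what String.length() reports).
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points in the whole string.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        System.out.println(codeUnits(SAMPLE));                            // 2
        System.out.println(codePoints(SAMPLE));                           // 1
        System.out.println(Integer.toHexString(SAMPLE.codePointAt(0)));   // 12345
        System.out.println(Character.isHighSurrogate(SAMPLE.charAt(0)));  // true
        System.out.println(Character.isLowSurrogate(SAMPLE.charAt(1)));   // true
    }
}
```

Counting with length() answers "how many chars", while codePointCount() answers "how many code points"; the two differ exactly when the text contains supplementary characters.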

Re: Corrigendum #9

2014-06-02 Thread Doug Ewell
It seems that the broadening of the term interchange in this corrigendum to mean almost any type of processing imaginable, below, is what caused the trouble. This is the decision that would need to be reconsidered if the real intent of noncharacters is to be expressed. I suspect everyone can

Re: Corrigendum #9

2014-06-02 Thread Markus Scherer
On Mon, Jun 2, 2014 at 8:27 AM, Doug Ewell d...@ewellic.org wrote: I suspect everyone can agree on the edge cases, that noncharacters are harmless in internal processing, but probably should not appear in random text shipped around on the web. Right, in principle. However, it should be ok to

Re: Corrigendum #9

2014-06-02 Thread Mark Davis ☕️
The problem is where to draw the line. In today's world, what's an app? You may have a cooperating system of apps, where it is perfectly reasonable to interchange sentinel values (for example). I agree with Markus; I think the FAQ is pretty clear. (And if not, that's where we should make it

RE: Corrigendum #9

2014-06-02 Thread Shawn Steele
That’s what I think is exactly what should be clarified. A cooperating system of apps should likely use some other markup, however if they want to use to say “OK to insert ad here” (or whatever), that’s up to them. I fear that the current wording says “Because you might have a cooperating

Re: Corrigendum #9

2014-06-02 Thread Mark Davis ☕️
On Mon, Jun 2, 2014 at 6:21 PM, Shawn Steele shawn.ste...@microsoft.com wrote: The “problem” is now that previously these characters were illegal The problem was that we were inconsistent in standard and related material about just what the status was for these things. Mark

RE: Corrigendum #9

2014-06-02 Thread Doug Ewell
Shawn Steele Shawn dot Steele at microsoft dot com wrote: So I’d prefer to see text that better permitted interchange with other components of an application’s internal system or partner system, yet discouraged use for interchange with foreign apps. If any wording is to be revised, while

Re: Corrigendum #9

2014-06-02 Thread Asmus Freytag
On 6/2/2014 9:27 AM, Mark Davis ☕️ wrote: On Mon, Jun 2, 2014 at 6:21 PM, Shawn Steele shawn.ste...@microsoft.com wrote: The “problem” is now that previously these characters were illegal The problem was that we were inconsistent in standard and

Re: Corrigendum #9

2014-06-02 Thread Asmus Freytag
On 6/2/2014 9:08 AM, Mark Davis ☕️ wrote: The problem is where to draw the line. In today's world, what's an app? You may have a cooperating system of apps, where it is perfectly reasonable to interchange sentinel values (for example). The way to draw the line is to insist on there being an

RE: Corrigendum #9

2014-06-02 Thread Shawn Steele
I agree with Markus; I think the FAQ is pretty clear. (And if not, that's where we should make it clearer.) But the formal wording of the standard should reflect that clarity, right? I don't tend to read the FAQ :) ___ Unicode mailing list

RE: Corrigendum #9

2014-06-02 Thread Doug Ewell
I wrote, sort of: Correct. Most people wouldn't consider a cooperating system like that quite the same as true public interchange, like throwing this ��� into a message on a public mailing list. Oh, look. My mail system converted those nice noncharacters into U+FFFD. Was that compliant? Did

Re: Corrigendum #9

2014-06-02 Thread Mark Davis ☕️
I disagree with that characterization, of course. The recommendation for libraries and low-level tools to pass them through rather than screw with them makes them usable. The recommendation to check for noncharacters from unknown sources and fix them was good advice then, and is good advice now.
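
The advice to check for noncharacters in text from unknown sources could look roughly like the Java sketch below. It is an assumption-laden illustration, not anything posted in the thread: the class name Noncharacters and both method names are hypothetical, and replacing with U+FFFD is just one possible way to "fix" such input. The ranges used are the 66 noncharacters defined by the standard: U+FDD0..U+FDEF plus the last two code points of each of the 17 planes.

```java
final class Noncharacters {
    private Noncharacters() {}

    // True for U+FDD0..U+FDEF and for every code point ending in FFFE or FFFF.
    static boolean isNoncharacter(int cp) {
        if (cp < 0 || cp > Character.MAX_CODE_POINT) {
            return false; // not a code point at all
        }
        return (cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE;
    }

    // Hypothetical policy for untrusted input: substitute U+FFFD for each noncharacter.
    // Trusted internal data would simply be passed through unchanged.
    static String replaceNoncharacters(String s) {
        StringBuilder out = new StringBuilder(s.length());
        s.codePoints().forEach(cp ->
                out.appendCodePoint(isNoncharacter(cp) ? 0xFFFD : cp));
        return out.toString();
    }
}
```

A library or low-level tool would call neither method and pass the text through untouched; only a component sitting at a boundary with unknown sources would apply something like replaceNoncharacters.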

Re: Corrigendum #9

2014-06-02 Thread Asmus Freytag
On 6/2/2014 9:38 AM, Shawn Steele wrote: I agree with Markus; I think the FAQ is pretty clear. (And if not, that's where we should make it clearer.) But the formal wording of the standard should reflect that clarity, right? I don't tend to read the FAQ :) FAQs are useful, but they are not

RE: Corrigendum #9

2014-06-02 Thread Shawn Steele
To further my understanding, can someone provide examples of how these are used in actual practice? I can't think of any offhand and the closest I get is like the old escape characters to get a dot matrix printer to shift modes, or old word processor internal formatting sequences.

RE: Corrigendum #9

2014-06-02 Thread Shawn Steele
Oh, look. My mail system converted those nice noncharacters into U+FFFD. Was that compliant? Did I deserve what I got? Are those two different questions? I think I just got spaces.

Re: Corrigendum #9

2014-06-02 Thread Markus Scherer
On Mon, Jun 2, 2014 at 10:00 AM, Shawn Steele shawn.ste...@microsoft.com wrote: To further my understanding, can someone provide examples of how these are used in actual practice? CLDR collation data defines special contraction mappings that start with a noncharacter, for

RE: Corrigendum #9

2014-06-02 Thread Shawn Steele
Hmm, I find that disconcerting. I’d prefer a real Unicode character with special weights if that concept’s needed. And I guess that goes a long ways to explaining the interchange problem since clearly the code editor’s going to need these ☹ From: Markus Scherer [mailto:markus@gmail.com]

Re: Corrigendum #9

2014-06-02 Thread Richard Wordingham
On Mon, 2 Jun 2014 10:17:04 -0700 Markus Scherer markus@gmail.com wrote: CLDR collation data defines special contraction mappings that start with a noncharacter, for http://www.unicode.org/reports/tr35/tr35-collation.html#CJK_Index_Markers In CLDR 23 and before (when we were still using

Re: Unicode Regular Expressions, Surrogate Points and UTF-8

2014-06-02 Thread Richard Wordingham
On Mon, 2 Jun 2014 11:29:09 +0200 Mark Davis ☕️ m...@macchiato.com wrote: \uD808\uDF45 specifies a sequence of two codepoints. That is simply incorrect. The above is in the sample notation of UTS #18 Version 17 Section 1.1. From what I can make out, the corresponding Java notation would

Re: Corrigendum #9

2014-06-02 Thread David Starner
On Mon, Jun 2, 2014 at 8:48 AM, Markus Scherer markus@gmail.com wrote: Right, in principle. However, it should be ok to include noncharacters in CLDR data files for processing by CLDR implementations, and it should be possible to edit and diff and version-control and web-view those files

Re: Corrigendum #9

2014-06-02 Thread Markus Scherer
On Mon, Jun 2, 2014 at 1:32 PM, David Starner prosfil...@gmail.com wrote: I would especially discourage any web browser from handling these; they're noncharacters used for unknown purposes that are undisplayable and if used carelessly for their stated purpose, can probably trigger serious

Re: Corrigendum #9

2014-06-02 Thread Asmus Freytag
On 6/2/2014 2:53 PM, Markus Scherer wrote: On Mon, Jun 2, 2014 at 1:32 PM, David Starner prosfil...@gmail.com wrote: I would especially discourage any web browser from handling these; they're noncharacters used for unknown purposes that are undisplayable

Re: Corrigendum #9

2014-06-02 Thread David Starner
On Mon, Jun 2, 2014 at 2:53 PM, Markus Scherer markus@gmail.com wrote: On Mon, Jun 2, 2014 at 1:32 PM, David Starner prosfil...@gmail.com wrote: I would especially discourage any web browser from handling these; they're noncharacters used for unknown purposes that are undisplayable and if

RE: Corrigendum #9

2014-06-02 Thread Shawn Steele
> I can't shake the suspicion that Corrigendum #9 is not actually solving a general problem, but is a special favor to CLDR as being run by insiders, and in the process muddying the waters for everyone else I think we could generalize to other scenarios so it wasn’t necessarily an insider

Re: Corrigendum #9

2014-06-02 Thread Philippe Verdy
I better expect: treat them as you like, there will never be any warranty of interoperability, everyone is allowed to use them as they want and even change it at any time. The behavior is not defined in TUS, and users cannot expect that TUS will define this behavior. There's no clear solution

Re: Corrigendum #9

2014-06-02 Thread Philippe Verdy
reserved for CLDR would be wrong in TUS, you have reached a borderline where you are no longer handling plain text (stream of scalar values assigned to code points), but binary data via a binary interface outside TUS (handling streams of collation elements, whose representation is not even bound

Re: Corrigendum #9

2014-06-02 Thread Lisa Moore
I would like to point out to Asmus that this decision was reached unanimously at the UTC by Adobe, Apple, Google, IBM, Microsoft, SAP, UC Berkeley, and Yahoo! One might disagree with the decision, but there were no special favors involved. Lisa I can't shake the suspicion that

Re: Corrigendum #9

2014-06-02 Thread Richard Wordingham
On Mon, 2 Jun 2014 15:09:21 -0700 David Starner prosfil...@gmail.com wrote: So certain programs can't use noncharacters internally because some people want to interchange them? That doesn't seem like what noncharacters should be used for. Much as I don't like their uninvited use, it is