Actually, my apologies for my instinctive and quite rude answer. I misunderstood the initial email, thinking Sean was proposing extra whitespace for clarifications.
I won't react as quickly in the future. Go on with your question, Sean, and I hope you'll get it right.

Best Regards

On Thu, Aug 4, 2016 at 11:19 PM, Andrea Giammarchi <[email protected]> wrote:

> I'm not a Unicode expert, but I couldn't stop thinking about the following
> comic after reading "I am trying to come up with a sensible sets of
> characters that are considered whitespace" https://xkcd.com/927/
>
> Apologies for bringing pretty much nothing to this discussion but I'm
> pretty sure there's much more to discuss in this ML than another whitespace
> set on top of 25 characters already.
>
> Thanks for your patience and your understanding.
>
> Have a great weekend everyone!
> Best Regards
>
> On Thu, Aug 4, 2016 at 10:28 PM, Leonardo Boiko <[email protected]>
> wrote:
>
>> I'm sorry; I thought that, when you wanted to separate identifiers, it
>> might be interesting to follow existing regexps definitions; this way your
>> syntax would play along with already-existing tools (e.g. you'd be making
>> it easy for someone to pipe your language into grep -P "\p{Whitespace}").
>>
>> But I was talking out of my depth; I've never worked with defining
>> Unicode identifiers, so I'm not really qualified to answer. I'm sure Davis
>> and the others can give better answers to your questions. Meanwhile, I see
>> that UAX #31 goes further into Unicode identifiers. It says that
>> Pattern_White_Space is stable (unlike Whitespace, perhaps?), and intended
>> for use in regexp-like "patterns" which mix literal characters, whitespace,
>> and syntax (special characters), where the latter two would e.g. require
>> quoting. For example, Perl has a "/x" flag which makes unquoted
>> Pattern_White_Space characters be ignored in regexpes (so that you can make
>> them less illegible).
>>
>> However, UAX #31 also gives a Default Identifier Syntax, which bounds
>> identifiers not by Whitespace but by their start characters, identified by
>> ID_Start, defined like this:
>>
>> > ID_Start characters are derived from the Unicode General_Category of
>> > uppercase letters, lowercase letters, titlecase letters, modifier letters,
>> > other letters, letter numbers, plus Other_ID_Start, minus Pattern_Syntax
>> > and Pattern_White_Space code points.
>>
>> So it makes reference only to Pattern_White_Space and not Whitespace. On
>> the other hand, I guess the listing above will exclude Whitespace
>> characters, since they don't count as any of letters, numbers, or
>> Other_ID_Start?
>>
>> None of that is guaranteed to be stable, though. UAX #31 includes a
>> separate definition for "Immutable identifiers", which are, and suggests
>> various compromises between them.
>>
>>
>> 2016-08-04 17:44 GMT-03:00 Sean Leonard <[email protected]>:
>>
>>> I read through TR18...it mainly says that <space> == \s ==
>>> \p{Whitespace} == property White_Space is true. Does it say anything else
>>> or more significant than that, that I'm missing?
>>>
>>> Sean
>>>
>>>
>>> On 8/4/2016 1:17 PM, Leonardo Boiko wrote:
>>>
>>> What Mark Davis said; also, depending on what you need, consider taking
>>> a look at the definitions used by Unicode regexpes, at
>>> http://unicode.org/reports/tr18/ .
>>>
>>> 2016-08-04 16:37 GMT-03:00 Sean Leonard <[email protected]>:
>>>
>>>> Hi Unicode Folks:
>>>>
>>>> I am trying to come up with a sensible sets of characters that are
>>>> considered whitespace or newlines in Unicode, and to understand the
>>>> relative stability policy with respect to them. (This is for a formal
>>>> syntax where the definition of "whitespace" matters, e.g., to separate
>>>> identifiers, and I want to be as conservative as possible.) Please let me
>>>> know if the stuff below is correct, or needs work.
>>>>
>>>> The following characters / sequences are considered line breaking
>>>> characters, per UAX #14 and Section 5.8 of UNICODE:
>>>>
>>>> CRLF CR LF FF VT NEL LS PS
>>>>
>>>> So, basically: U+000A-U+000D, U+0085, U+2028, U+2029, and the
>>>> combination U+000D U+000A (treated as one line break). These characters /
>>>> sequences are called "newlines".
>>>>
>>>> There will not be any additional code points that are assigned to be
>>>> line breaks. (Correct?)
>>>>
>>>> CRLF, CR, LF, and NEL are also considered "newline functions" or NLF.
>>>> These are distinguished from other codes (above) that also mean line
>>>> breaks, mainly because of historical and widespread use of them.
>>>>
>>>> There are several formatting characters that affect word wrapping and
>>>> line breaking, as discussed in those documents...but they are not line
>>>> breaking characters.
>>>>
>>>> ****
>>>>
>>>> The following characters are whitespaces: characters (code points) with
>>>> the property WSpace=Y (or White_Space). This is:
>>>>
>>>> newlines
>>>> U+0020 U+00A0 U+1680 U+2000-200A U+202F U+205F U+3000
>>>>
>>>> Assigned characters that are not listed above, can never be whitespace
>>>> (according to Unicode). However, the set is not closed, so unassigned code
>>>> points *could* be assigned to whitespace. It is (unlikely? very unlikely?
>>>> Pretty much never going to happen?) that additional code points will be
>>>> assigned to whitespace.
>>>>
>>>> ****
>>>>
>>>> There are some other characters that Unicode does not consider
>>>> whitespace, but deserve discussion:
>>>> U+180E MONGOLIAN VOWEL SEPARATOR:
>>>> <https://codeblog.jonskeet.uk/2014/12/01/when-is-an-identifier-not-an-identifier-attack-of-the-mongolian-vowel-separator/>
>>>> U+200B ZERO WIDTH SPACE
>>>> U+200C ZERO WIDTH NON-JOINER
>>>> U+200D ZERO WIDTH JOINER
>>>> U+200E LEFT-TO-RIGHT MARK*
>>>> U+200F RIGHT-TO-LEFT MARK*
>>>> U+2060 WORD JOINER
>>>> U+FEFF ZERO WIDTH NON-BREAKING SPACE
>>>>
>>>> *These appear in Pattern_White_Space, but Pattern_White_Space excludes
>>>> U+2000-200A characters, which are obviously spaces. This is confusing and I
>>>> would appreciate clarification *why* Pattern_White_Space is
>>>> significantly disjoint from White_Space.
>>>>
>>>> ********
>>>> The borderline characters above are not considered WSpace=Y, but
>>>> sometimes might have space-like properties. ZWSP and ZWNBSP are obviously
>>>> "space" characters, but they never generate whitespace. I suppose that
>>>> conversely LRM and RLM are obviously "not space" characters, but they
>>>> could generate whitespace under certain circumstances. Ditto for other
>>>> formatting characters in general (for which the class is much larger).
>>>>
>>>> Therefore I guess a Unicode definition of "whitespace" (or "space
>>>> characters") is: an assigned code point that *always* (is supposed to)
>>>> generates white space (empty space between graphemes).
>>>>
>>>> ********
>>>>
>>>> Are there other standards that Unicode people recommend, that have
>>>> addressed whether certain borderline characters are considered whitespace
>>>> vs. non-whitespace (e.g., possibly acceptable as an identifier or syntax
>>>> component)?
>>>>
>>>> Regards,
>>>>
>>>> Sean


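For anyone following the thread, the two sets under discussion can be compared directly. The sketch below is only an illustration: the code points are hard-coded rather than read from the Unicode Character Database, and the helper names (expand, fmt, split_lines) are made up for this example. The White_Space set is the list from Sean's message plus U+0009 (TAB), which that list leaves out but which, as far as I know, also has WSpace=Y and brings the total to the 25 characters mentioned above. The Pattern_White_Space set is the one listed in UAX #31. The line-break pattern follows the newline list in the message and tries CRLF first so that it counts as a single break.

```python
import re


def expand(*ranges):
    """Build a set of code points from inclusive (first, last) pairs."""
    return {cp for first, last in ranges for cp in range(first, last + 1)}


# White_Space (WSpace=Y): the newlines and spaces listed in the thread,
# plus U+0009 (TAB). Hard-coded here as an assumption, not read from the UCD.
WHITE_SPACE = expand(
    (0x0009, 0x000D), (0x0020, 0x0020), (0x0085, 0x0085), (0x00A0, 0x00A0),
    (0x1680, 0x1680), (0x2000, 0x200A), (0x2028, 0x2029), (0x202F, 0x202F),
    (0x205F, 0x205F), (0x3000, 0x3000),
)

# Pattern_White_Space as listed in UAX #31 (also hard-coded).
PATTERN_WHITE_SPACE = expand(
    (0x0009, 0x000D), (0x0020, 0x0020), (0x0085, 0x0085),
    (0x200E, 0x200F), (0x2028, 0x2029),
)

# Line breaks from the thread: CRLF, LF, VT, FF, CR, NEL, LS, PS.
# CRLF is the first alternative so it is consumed as one break.
NEWLINE = re.compile("\r\n|[\n\x0b\x0c\r\x85\u2028\u2029]")


def split_lines(text):
    """Split text on the line breaks above, dropping the break characters."""
    return NEWLINE.split(text)


def fmt(code_points):
    """Format code points as a sorted, space-separated U+XXXX list."""
    return " ".join(f"U+{cp:04X}" for cp in sorted(code_points))


if __name__ == "__main__":
    print("White_Space size:        ", len(WHITE_SPACE))
    print("Pattern_White_Space size:", len(PATTERN_WHITE_SPACE))
    print("Only in Pattern_White_Space:", fmt(PATTERN_WHITE_SPACE - WHITE_SPACE))
    print("Only in White_Space:        ", fmt(WHITE_SPACE - PATTERN_WHITE_SPACE))
    print(split_lines("one\r\ntwo\u2028three"))
```

Under these assumptions the two sets overlap on the ASCII controls, SPACE, NEL, LS and PS, and differ in exactly the way the question describes: the two directional marks appear only in Pattern_White_Space, and the wider spaces appear only in White_Space.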