I read through TR18...it mainly says that <space> == \s == \p{Whitespace} == property White_Space is true. Does it say anything else or more significant than that, that I'm missing?

Sean

On 8/4/2016 1:17 PM, Leonardo Boiko wrote:
What Mark Davis said; also, depending on what you need, consider taking a look at the definitions used by Unicode regexes, at http://unicode.org/reports/tr18/ .

2016-08-04 16:37 GMT-03:00 Sean Leonard <[email protected]>:

    Hi Unicode Folks:

    I am trying to come up with a sensible set of characters that are
    considered whitespace or newlines in Unicode, and to understand
    the relevant stability policy with respect to them. (This is for a
    formal syntax where the definition of "whitespace" matters, e.g.,
    to separate identifiers, and I want to be as conservative as
    possible.) Please let me know if the stuff below is correct, or
    needs work.

    The following characters / sequences are considered line breaking
    characters, per UAX #14 and Section 5.8 of the Unicode Standard:

    CRLF CR LF FF VT NEL LS PS

    So, basically: U+000A-U+000D, U+0085, U+2028, U+2029, and the
    combination U+000D U+000A (treated as one line break). These
    characters / sequences are called "newlines".
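
As a rough illustration of the newline set described above, here is a minimal Python sketch. The names NEWLINE, split_lines and count_breaks are made up for this example; the only thing taken from the message is the list of code points and the rule that CR LF counts as a single break.

```python
import re

# Line-break characters from the list above: LF, VT, FF, CR (U+000A-U+000D),
# NEL (U+0085), LS (U+2028), PS (U+2029).
# The CRLF alternative comes first so that U+000D U+000A is one break, not two.
NEWLINE = re.compile("\r\n|[\n\v\f\r\x85\u2028\u2029]")

def split_lines(text):
    """Split text at every newline in the set above; the breaks are dropped."""
    return NEWLINE.split(text)

def count_breaks(text):
    """Count line breaks, treating CRLF as a single break."""
    return len(NEWLINE.findall(text))
```

Note that Python's built-in str.splitlines() is not equivalent: it also splits on U+001C, U+001D and U+001E, so an explicit pattern like the one above is the more conservative choice for a formal syntax.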

    There will not be any additional code points that are assigned to
    be line breaks. (Correct?)

    CRLF, CR, LF, and NEL are also considered "newline functions" or
    NLF. These are distinguished from other codes (above) that also
    mean line breaks, mainly because of historical and widespread use
    of them.

    There are several formatting characters that affect word wrapping
    and line breaking, as discussed in those documents...but they are
    not line breaking characters.

    ****

    The following characters are whitespaces: characters (code points)
    with the property WSpace=Y (or White_Space). This is:

    newlines
    U+0020 U+00A0 U+1680 U+2000-200A U+202F U+205F U+3000

    Assigned characters that are not listed above can never be
    whitespace (according to Unicode). However, the set is not closed,
    so unassigned code points *could* be assigned to whitespace. It is
    (unlikely? very unlikely? Pretty much never going to happen?) that
    additional code points will be assigned to whitespace.
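
The whitespace list above can be written down as an explicit set, which is one way to stay conservative: nothing outside the set counts, whatever a runtime's own idea of "space" is. This is only a sketch; the names NEWLINES, SPACES, WHITE_SPACE and is_white_space are invented here. One assumption goes beyond the message: U+0009 (TAB) is added, because it has White_Space=Y in the Unicode Character Database although the list above does not mention it.

```python
# Newlines as listed in the message: U+000A-U+000D, U+0085, U+2028, U+2029.
NEWLINES = set(range(0x000A, 0x000E)) | {0x0085, 0x2028, 0x2029}

# Space characters as listed in the message.
SPACES = {0x0020, 0x00A0, 0x1680, 0x202F, 0x205F, 0x3000} | set(range(0x2000, 0x200B))

# Assumption beyond the message: U+0009 (TAB) also has White_Space=Y.
WHITE_SPACE = NEWLINES | SPACES | {0x0009}

def is_white_space(ch):
    """True if the single character ch is in the hard-coded set above."""
    return ord(ch) in WHITE_SPACE
```

With this set, U+3000 IDEOGRAPHIC SPACE is whitespace, while U+200B ZERO WIDTH SPACE and U+180E MONGOLIAN VOWEL SEPARATOR are not. If a future Unicode version did add a whitespace character, a hard-coded set like this would simply not recognize it.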

    ****

    There are some other characters that Unicode does not consider
    whitespace, but deserve discussion:
    U+180E MONGOLIAN VOWEL SEPARATOR:
    <https://codeblog.jonskeet.uk/2014/12/01/when-is-an-identifier-not-an-identifier-attack-of-the-mongolian-vowel-separator/>
    U+200B ZERO WIDTH SPACE
    U+200C ZERO WIDTH NON-JOINER
    U+200D ZERO WIDTH JOINER
    U+200E LEFT-TO-RIGHT MARK*
    U+200F RIGHT-TO-LEFT MARK*
    U+2060 WORD JOINER
    U+FEFF ZERO WIDTH NON-BREAKING SPACE

    *These appear in Pattern_White_Space, but Pattern_White_Space
    excludes U+2000-200A characters, which are obviously spaces. This
    is confusing and I would appreciate clarification /why/
    Pattern_White_Space is significantly disjoint from White_Space.

    ********
    The borderline characters above are not considered WSpace=Y, but
    sometimes might have space-like properties. ZWSP and ZWNBSP are
    obviously "space" characters, but they never generate whitespace.
    I suppose that conversely LRM and RLM are obviously "not space"
    characters, but they could generate whitespace under certain
    circumstances. Ditto for other formatting characters in general
    (for which the class is much larger).
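
One way to see how a given runtime classifies the borderline characters is to ask its copy of the Unicode Character Database. The sketch below uses Python's standard unicodedata module; the names BORDERLINE, describe and report are made up for this example. The results depend on the Unicode version bundled with the interpreter (see unicodedata.unidata_version). On recent Python versions, each of these code points should come back as general category Cf (format) with isspace() false.

```python
import unicodedata

# The borderline code points discussed in the message.
BORDERLINE = {
    0x180E: "MONGOLIAN VOWEL SEPARATOR",
    0x200B: "ZERO WIDTH SPACE",
    0x200C: "ZERO WIDTH NON-JOINER",
    0x200D: "ZERO WIDTH JOINER",
    0x200E: "LEFT-TO-RIGHT MARK",
    0x200F: "RIGHT-TO-LEFT MARK",
    0x2060: "WORD JOINER",
    0xFEFF: "ZERO WIDTH NO-BREAK SPACE",
}

def describe(cp):
    """Return (general category, str.isspace() result) for a code point."""
    ch = chr(cp)
    return unicodedata.category(ch), ch.isspace()

# Map each code point to its classification in this interpreter's database.
report = {cp: describe(cp) for cp in BORDERLINE}
```

Printing report alongside the names in BORDERLINE gives a quick check of whether a particular environment agrees with the WSpace=N classification described above.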

    Therefore I guess a Unicode definition of "whitespace" (or "space
    characters") is: an assigned code point that *always* (is supposed
    to) generates white space (empty space between graphemes).

    ********

    Are there other standards that Unicode people recommend, that have
    addressed whether certain borderline characters are considered
    whitespace vs. non-whitespace (e.g., possibly acceptable as an
    identifier or syntax component)?

    Regards,

    Sean


