Re: Whitespace characters in Unicode

Sean Leonard Fri, 05 Aug 2016 08:58:58 -0700

Here are specific questions (perhaps Mark Davis, but anyone really with experience, can respond):

As Mark said, there are 25 whitespace characters. (I forgot to include HT, so that makes 25 from my original post.)

What makes a character a "whitespace" in Unicode, e.g., why are ZWSP and ZWNBSP not "whitespace" even though they clearly say "SPACE" in them?
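As a rough illustration of the question, the sketch below prints the General_Category that Python's standard `unicodedata` module reports for a few of the characters named in this thread. It rests on assumptions: General_Category is a separate property from White_Space (which `unicodedata` does not expose directly), the results depend on the Unicode version bundled with the interpreter, and the `CANDIDATES` table and `general_categories` helper are names invented here.

```python
import unicodedata

# Illustrative selection of code points discussed in this thread.
CANDIDATES = {
    "SPACE": "\u0020",
    "NO-BREAK SPACE": "\u00a0",
    "ZERO WIDTH SPACE": "\u200b",
    "ZERO WIDTH NO-BREAK SPACE": "\ufeff",
}

def general_categories(chars):
    """Map each label to the General_Category reported by unicodedata."""
    return {label: unicodedata.category(ch) for label, ch in chars.items()}

categories = general_categories(CANDIDATES)
for label, cat in categories.items():
    # Zs = space separator, Cf = format character
    print(f"{label}: {cat}")
```

On a recent Python 3 the two ordinary spaces should come back as Zs and the two zero-width characters as Cf (format), which is at least consistent with them being treated as formatting controls rather than as spaces.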

What are "Unicode-y" ways to compute word boundaries? Related to the prior question: I suppose ZWSP is not "whitespace", but like whitespace, it separates words. I suppose that since it is not printable, it is "confusing", and therefore should be avoided in contexts where the printed representation of Unicode code points matters.

Why is Pattern_White_Space significantly disjoint from White_Space, namely, why does Pattern_White_Space include LRM and RLM (and notably LS and PS) yet omit U+1680 and the spaces in the U+2000 range?
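To make the overlap concrete, here is a minimal sketch that compares the two sets with plain Python set operations. The code point ranges are transcribed by hand from PropList.txt as I understand it, so treat them as an assumption and check them against the UCD version you target; `expand` and `fmt` are helper names made up for this example.

```python
def expand(*ranges):
    """Expand inclusive (first, last) code point ranges into a set."""
    return {cp for first, last in ranges for cp in range(first, last + 1)}

# Hand-transcribed from PropList.txt (assumption: verify against your UCD version).
WHITE_SPACE = expand(
    (0x0009, 0x000D), (0x0020, 0x0020), (0x0085, 0x0085), (0x00A0, 0x00A0),
    (0x1680, 0x1680), (0x2000, 0x200A), (0x2028, 0x2029), (0x202F, 0x202F),
    (0x205F, 0x205F), (0x3000, 0x3000),
)
PATTERN_WHITE_SPACE = expand(
    (0x0009, 0x000D), (0x0020, 0x0020), (0x0085, 0x0085),
    (0x200E, 0x200F), (0x2028, 0x2029),
)

def fmt(code_points):
    """Render a set of code points as sorted U+XXXX labels."""
    return " ".join(f"U+{cp:04X}" for cp in sorted(code_points))

both = WHITE_SPACE & PATTERN_WHITE_SPACE
only_pattern = PATTERN_WHITE_SPACE - WHITE_SPACE
only_white = WHITE_SPACE - PATTERN_WHITE_SPACE

print("in both:                 ", fmt(both))
print("Pattern_White_Space only:", fmt(only_pattern))
print("White_Space only:        ", fmt(only_white))
```

Under those assumptions, LS and PS (U+2028, U+2029) fall in both sets, only LRM and RLM (U+200E, U+200F) are unique to Pattern_White_Space, and the non-ASCII space separators are unique to White_Space.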

Any implementation experience from other standards authors/implementers who have run into problems with shifty whitespace definitions?


Regards,

Sean

On 8/4/2016 2:28 PM, Leonardo Boiko wrote:

I'm sorry; I thought that, when you wanted to separate identifiers, it might be interesting to follow existing regexp definitions; this way your syntax would play along with already-existing tools (e.g. you'd be making it easy for someone to pipe your language into grep -P "\p{Whitespace}").

But I was talking out of my depth; I've never worked with defining Unicode identifiers, so I'm not really qualified to answer. I'm sure Davis and the others can give better answers to your questions. Meanwhile, I see that UAX #31 goes further into Unicode identifiers. It says that Pattern_White_Space is stable (unlike Whitespace, perhaps?), and intended for use in regexp-like "patterns" which mix literal characters, whitespace, and syntax (special characters), where the latter two would e.g. require quoting. For example, Perl has a "/x" flag which makes unquoted Pattern_White_Space characters be ignored in regexps (so that you can make them less illegible).

However, UAX #31 also gives a Default Identifier Syntax, which bounds identifiers not by Whitespace but by their start characters, identified by ID_Start, defined like this:

> ID_Start characters are derived from the Unicode General_Category of uppercase letters, lowercase letters, titlecase letters, modifier letters, other letters, letter numbers, plus Other_ID_Start, minus Pattern_Syntax and Pattern_White_Space code points.

So it makes reference only to Pattern_White_Space and not Whitespace. On the other hand, I guess the listing above will exclude Whitespace characters, since they don't count as any of letters, numbers, or Other_ID_Start?
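That guess can be spot-checked with a small sketch: look up the General_Category of each White_Space code point and see whether any lands in the categories that feed ID_Start (Lu, Ll, Lt, Lm, Lo, Nl). This is only a partial check under stated assumptions: the White_Space list is hand-transcribed from PropList.txt, the categories come from whatever Unicode version Python's `unicodedata` ships, and Other_ID_Start is not consulted at all.

```python
import unicodedata

# General_Category values that the quoted definition feeds into ID_Start.
ID_START_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}

# Hand-transcribed White_Space code points (assumption: verify against PropList.txt).
WHITE_SPACE = (
    list(range(0x0009, 0x000E))
    + [0x0020, 0x0085, 0x00A0, 0x1680]
    + list(range(0x2000, 0x200B))
    + [0x2028, 0x2029, 0x202F, 0x205F, 0x3000]
)

def whitespace_categories():
    """Return {code point: General_Category} for every White_Space code point."""
    return {cp: unicodedata.category(chr(cp)) for cp in WHITE_SPACE}

cats = whitespace_categories()
overlap = {cp for cp, cat in cats.items() if cat in ID_START_CATEGORIES}

print("categories seen:", sorted(set(cats.values())))
print("White_Space code points in ID_Start categories:", overlap)
```

If the overlap comes back empty, the whitespace characters are excluded from ID_Start by category alone, before the explicit subtraction of Pattern_White_Space is even applied.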

None of that is guaranteed to be stable, though. UAX #31 includes a separate definition for "Immutable identifiers", which are, and suggests various compromises between them.

2016-08-04 17:44 GMT-03:00 Sean Leonard:


    I read through TR18...it mainly says that <space> == \s ==
    \p{Whitespace} == property White_Space is true. Does it say
    anything else or more significant than that, that I'm missing?

    Sean


    On 8/4/2016 1:17 PM, Leonardo Boiko wrote:

    What Mark Davis said; also, depending on what you need, consider
    taking a look at the definitions used by Unicode regexps, at
    http://unicode.org/reports/tr18/ .

    2016-08-04 16:37 GMT-03:00 Sean Leonard:

        Hi Unicode Folks:

        I am trying to come up with a sensible set of characters
        that are considered whitespace or newlines in Unicode, and to
        understand the relative stability policy with respect to
        them. (This is for a formal syntax where the definition of
        "whitespace" matters, e.g., to separate identifiers, and I
        want to be as conservative as possible.) Please let me know
        if the stuff below is correct, or needs work.

        The following characters / sequences are considered line
        breaking characters, per UAX #14 and Section 5.8 of UNICODE:

        CRLF CR LF FF VT NEL LS PS

        So, basically: U+000A-U+000D, U+0085, U+2028, U+2029, and the
        combination U+000D U+000A (treated as one line break). These
        characters / sequences are called "newlines".
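A minimal sketch of how the newline list above could be applied, assuming a plain regular-expression split is acceptable for the syntax in question. CRLF is listed first in the alternation so that the pair counts as one line break; `NEWLINE` and `split_lines` are names invented for this illustration.

```python
import re

# CRLF comes first so the two-character sequence is consumed as one break.
# The character class covers LF, VT, FF, CR, NEL, LS and PS.
NEWLINE = re.compile("\r\n|[\n\v\f\r\x85\u2028\u2029]")

def split_lines(text):
    """Split text at every newline sequence listed above."""
    return NEWLINE.split(text)

sample = "a\r\nb\rc\nd\x85e\u2028f\u2029g\vh\fi"
print(split_lines(sample))
```

Note that Python's built-in `str.splitlines()` recognizes a somewhat larger set of separators, so it is not a drop-in equivalent of this list.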

        There will not be any additional code points that are
        assigned to be line breaks. (Correct?)

        CRLF, CR, LF, and NEL are also considered "newline functions"
        or NLF. These are distinguished from other codes (above) that
        also mean line breaks, mainly because of historical and
        widespread use of them.

        There are several formatting characters that affect word
        wrapping and line breaking, as discussed in those
        documents...but they are not line breaking characters.

        ****

        The following characters are whitespaces: characters (code
        points) with the property WSpace=Y (or White_Space). This is:

        newlines
        U+0020 U+00A0 U+1680 U+2000-200A U+202F U+205F U+3000

        Assigned characters that are not listed above, can never be
        whitespace (according to Unicode). However, the set is not
        closed, so unassigned code points *could* be assigned to
        whitespace. It is (unlikely? very unlikely? Pretty much never
        going to happen?) that additional code points will be
        assigned to whitespace.

        ****

        There are some other characters that Unicode does not
        consider whitespace, but deserve discussion:
        U+180E MONGOLIAN VOWEL SEPARATOR:
        <https://codeblog.jonskeet.uk/2014/12/01/when-is-an-identifier-not-an-identifier-attack-of-the-mongolian-vowel-separator/>
        U+200B ZERO WIDTH SPACE
        U+200C ZERO WIDTH NON-JOINER
        U+200D ZERO WIDTH JOINER
        U+200E LEFT-TO-RIGHT MARK*
        U+200F RIGHT-TO-LEFT MARK*
        U+2060 WORD JOINER
        U+FEFF ZERO WIDTH NO-BREAK SPACE

        *These appear in Pattern_White_Space, but Pattern_White_Space
        excludes U+2000-200A characters, which are obviously spaces.
        This is confusing and I would appreciate clarification /why/
        Pattern_White_Space is significantly disjoint from White_Space.

        ********
        The borderline characters above are not considered WSpace=Y,
        but sometimes might have space-like properties. ZWSP and
        ZWNBSP are obviously "space" characters, but they never
        generate whitespace. I suppose that conversely LRM and RLM
        are obviously "not space" characters, but they could generate
        whitespace under certain circumstances. Ditto for other
        formatting characters in general (for which the class is much
        larger).

        Therefore I guess a Unicode definition of "whitespace" (or
        "space characters") is: an assigned code point that is
        *always* supposed to generate white space (empty space
        between graphemes).

        ********

        Are there other standards that Unicode people recommend, that
        have addressed whether certain borderline characters are
        considered whitespace vs. non-whitespace (e.g., possibly
        acceptable as an identifier or syntax component)?

        Regards,

        Sean
