Re: Whitespace characters in Unicode

Sean Leonard Sun, 07 Aug 2016 16:31:17 -0700

On 8/5/2016 10:07 AM, Markus Scherer wrote:

On Fri, Aug 5, 2016 at 8:52 AM, Sean Leonard wrote:
    What makes a character a "whitespace" in Unicode, e.g., why are
    ZWSP and ZWNBSP not "whitespace" even though they clearly say
    "SPACE" in them?
I think "white space" basically wants to have an advance width (occupy space) but no ink (all white, no black) :-)

Yes, that is the thought that I had as well: whitespace characters always generate blank space between graphemes, whether horizontal or vertical.


ZWSP and ZWNBSP affect word and line breaking but have no advance width.

I suppose that these are "SPACE" characters, but not "WHITE space" characters, since there is no white in them. :)

Note that character names can be misleading, plain wrong, or even just misspelled, but they cannot be changed. Best to read the documentation. The charts are a good start:
http://www.unicode.org/charts/PDF/U2000.pdf
http://www.unicode.org/charts/PDF/UFE70.pdf
In particular, don't build sets of Unicode characters just based on character name patterns. Use character properties as much as possible.
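
To illustrate the point about names versus properties, here is a minimal Python sketch. It uses the standard `unicodedata` module; the candidate list and the helper names (`describe`, `name_says_space`) are made up for this example. It also uses General_Category Zs as a rough stand-in, which is an assumption: Zs is not the same set as the White_Space property (White_Space also covers tab, line feed, LS, PS and others).

```python
import unicodedata

# Hypothetical sample: every one of these has "SPACE" in its character name.
CANDIDATES = ["\u0020", "\u00A0", "\u1680", "\u2003", "\u200B", "\uFEFF"]


def describe(ch):
    """Return (code point, name, general category, str.isspace()) for one character."""
    return (
        f"U+{ord(ch):04X}",
        unicodedata.name(ch),
        unicodedata.category(ch),
        ch.isspace(),
    )


def name_says_space(ch):
    """Name-pattern test: the approach the message above warns against."""
    return "SPACE" in unicodedata.name(ch, "")


# Selecting by name pattern picks up ZWSP and ZWNBSP as well.
by_name = [ch for ch in CANDIDATES if name_says_space(ch)]

# Selecting by a property (here General_Category Zs, as a stand-in) does not.
by_property = [ch for ch in CANDIDATES if unicodedata.category(ch) == "Zs"]

for ch in CANDIDATES:
    print(describe(ch))
```

In this sample, ZWSP (U+200B) and ZWNBSP (U+FEFF) pass the name test but carry General_Category Cf (format), so the property-based selection leaves them out.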
    What are "Unicode-y" ways to compute word boundaries?


http://www.unicode.org/reports/tr29/#Word_Boundaries

    Related to prior question--I suppose ZWSP is not "whitespace", but
    like whitespace, it separates words. I suppose that since it is
    not printable, it is "confusing", and therefore should be avoided
    in contexts where the printed representation of Unicode code
    points matters.


Depends on what you do.
Normal text needs ZWSP & ZWNBSP, for example for proper word wrapping and line breaking in a browser or text field/editor.
They are not allowed in identifiers, and removed from domain names (UTS #46).
    Why is Pattern_White_Space significantly disjoint from
    White_Space, namely, why does Pattern_White_Space include LTRM and
    RTLM (and notably LS and PS) yet omit the spaces U+1680 and in the
    U+2000 range?
We wanted a simple, immutable definition for rule and pattern strings that programmers write and maintain. We included LRM and RLM so that they can be used (and will be ignored) in rules, for example collation rule strings, to keep them moderately readable when they contain RTL characters. Typographic spaces are unnecessary in this context, and could be confusing.
In hindsight, LS and PS are probably mistakes. When we came up with Pattern_White_Space, we still liked the idea of unambiguous end-of-line controls, but in practice it looks like no one really uses them. Anyone who cares uses markup or rich-text formats. (Markup was not common when Unicode was "born".)
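
For reference, the overlap between the two properties can be sketched in Python. The code point lists below are copied by hand from PropList.txt and should be checked against the Unicode release you target (White_Space has changed over time; Pattern_White_Space is stated to be immutable). The helper `cps` and the constant names are invented for this sketch.

```python
def cps(*ranges):
    """Build a set of code points from inclusive (first, last) ranges."""
    out = set()
    for first, last in ranges:
        out.update(range(first, last + 1))
    return out


# White_Space, hand-copied from PropList.txt (assumption: a recent Unicode version).
WHITE_SPACE = cps(
    (0x0009, 0x000D), (0x0020, 0x0020), (0x0085, 0x0085), (0x00A0, 0x00A0),
    (0x1680, 0x1680), (0x2000, 0x200A), (0x2028, 0x2029), (0x202F, 0x202F),
    (0x205F, 0x205F), (0x3000, 0x3000),
)

# Pattern_White_Space, hand-copied from PropList.txt.
PATTERN_WHITE_SPACE = cps(
    (0x0009, 0x000D), (0x0020, 0x0020), (0x0085, 0x0085),
    (0x200E, 0x200F),  # LRM, RLM
    (0x2028, 0x2029),  # LS, PS
)


def fmt(code_points):
    """Format a set of code points as sorted U+XXXX strings."""
    return [f"U+{cp:04X}" for cp in sorted(code_points)]


only_pattern = PATTERN_WHITE_SPACE - WHITE_SPACE   # LRM and RLM
only_white = WHITE_SPACE - PATTERN_WHITE_SPACE     # NBSP, U+1680, typographic spaces
shared = WHITE_SPACE & PATTERN_WHITE_SPACE         # ASCII controls, space, NEL, LS, PS

print("Pattern_White_Space only:", fmt(only_pattern))
print("White_Space only:", fmt(only_white))
print("shared:", fmt(shared))
```

Under these lists, the only code points in Pattern_White_Space but not White_Space are LRM (U+200E) and RLM (U+200F), while LS and PS belong to both.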

I like the premise of LS and PS: one (well, two) unambiguous characters to rule them all. But the execution was lacking, to put it mildly. And there aren't two keys on a common keyboard to distinguish between line and paragraph separation.


On 8/6/2016 11:30 AM, Doug Ewell wrote:

Additionally, in UTF-8, either LS or PS actually takes more bytes than CR plus LF, so the "increased text size" argument also discouraged use of the new controls.

That is true, it takes 3 bytes. However, the original UTF-8 proposal encoded U+0080 - U+207F in two octets: https://en.wikipedia.org/wiki/UTF-8 :

|10xxxxxx|      |1xxxxxxx|

So, the space block /just barely makes it/. Was this intentional during the original design of UTF-8, or just a coincidence? I think it was more than a coincidence. It is regrettable that the space block was too high to work in the final version of UTF-8...maybe it should have gone below U+07FF.
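
The byte counts are easy to check with a short Python sketch. The arithmetic for the earlier proposal assumes the two-octet layout quoted above (6 + 7 payload bits, with values offset so the range starts at U+0080); the constant and function names are made up for this example.

```python
LS, PS = "\u2028", "\u2029"


def utf8_len(text):
    """Number of bytes the text occupies in (final) UTF-8."""
    return len(text.encode("utf-8"))


# Final UTF-8: LS and PS need three bytes each, CR LF needs two in total.
print("LS:", utf8_len(LS), "PS:", utf8_len(PS), "CR LF:", utf8_len("\r\n"))

# Earlier proposal, assuming the layout 10xxxxxx 1xxxxxxx quoted above:
# 6 + 7 = 13 payload bits, starting right after the one-octet range.
PAYLOAD_BITS = 6 + 7
PROPOSAL_FIRST = 0x80
PROPOSAL_LAST = PROPOSAL_FIRST + (1 << PAYLOAD_BITS) - 1
print(f"proposal two-octet range: U+{PROPOSAL_FIRST:04X}..U+{PROPOSAL_LAST:04X}")

# Final UTF-8 two-byte form (110xxxxx 10xxxxxx) carries 11 bits, so it ends at U+07FF.
UTF8_TWO_BYTE_LAST = (1 << 11) - 1
print(f"final UTF-8 two-byte limit: U+{UTF8_TWO_BYTE_LAST:04X}")
```

With those assumptions the proposal's two-octet range ends at U+207F, which covers U+2028 and U+2029, whereas the final two-byte form stops at U+07FF.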


(More motivation for my whitespace question in following message...)

Sean
