On 8/5/2016 10:07 AM, Markus Scherer wrote:
On Fri, Aug 5, 2016 at 8:52 AM, Sean Leonard <[email protected]> wrote:

    What makes a character a "whitespace" in Unicode, e.g., why are
    ZWSP and ZWNBSP not "whitespace" even though they clearly say
    "SPACE" in them?


I think "white space" basically wants to have an advance width (occupy space) but no ink (all white, no black) :-)

Yes, that is the thought that I had as well: whitespace characters always generate blank space between graphemes, whether horizontal or vertical.


ZWSP and ZWNBSP affect word and line breaking but have no advance width.

I suppose that these are "SPACE" characters, but not "WHITE space" characters, since there is no white in them. :)


Note that character names can be misleading, plain wrong, or even just misspelled, but they cannot be changed. Best to read the documentation. The charts are a good start:
http://www.unicode.org/charts/PDF/U2000.pdf
http://www.unicode.org/charts/PDF/UFE70.pdf

In particular, don't build sets of Unicode characters just based on character name patterns. Use character properties as much as possible.
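To illustrate the point about names versus properties, here is a minimal Python sketch. It assumes the standard `unicodedata` module, which exposes General_Category but not the White_Space binary property, so `Zs` (Space_Separator) is used here only as a stand-in for "a property-based test". The helper names are hypothetical.

```python
import unicodedata

# Two ordinary spaces (SPACE, EM SPACE) and the two zero-width characters
# discussed in this thread (ZWSP, ZWNBSP).
CANDIDATES = ["\u0020", "\u2003", "\u200B", "\uFEFF"]


def has_space_in_name(ch: str) -> bool:
    """Fragile approach: pattern-match on the character name."""
    return "SPACE" in unicodedata.name(ch, "")


def is_space_separator(ch: str) -> bool:
    """Property-based approach: General_Category is Zs (Space_Separator)."""
    return unicodedata.category(ch) == "Zs"


for ch in CANDIDATES:
    print(
        f"U+{ord(ch):04X} {unicodedata.name(ch)}: "
        f"name match={has_space_in_name(ch)}, "
        f"category={unicodedata.category(ch)}"
    )
```

All four names contain "SPACE", but only the first two have category Zs; ZWSP and ZWNBSP are format characters (Cf). Note that Zs is narrower than White_Space, which also covers TAB, the line-ending controls, LS and PS.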

    What are "Unicode-y" ways to compute word boundaries?


http://www.unicode.org/reports/tr29/#Word_Boundaries

    Related to prior question--I suppose ZWSP is not "whitespace", but
    like whitespace, it separates words. I suppose that since it is
    not printable, it is "confusing", and therefore should be avoided
    in contexts where the printed representation of Unicode code
    points matters.


Depends on what you do.

Normal text needs ZWSP & ZWNBSP, for example for proper word wrapping and line breaking in a browser or text field/editor.

They are not allowed in identifiers, and removed from domain names (UTS #46).
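A small Python sketch of why this matters in practice: code that tokenizes on whitespace alone does not see ZWSP as a separator, so break opportunities have to be handled explicitly. This is only an illustration, not an implementation of the UAX #14 or UAX #29 algorithms; the sample text and variable names are made up.

```python
import re

ZWSP = "\u200B"  # ZERO WIDTH SPACE

# Hypothetical sample text with ZWSP marking break opportunities.
text = f"line{ZWSP}breaking needs{ZWSP}hints"

# str.split() with no arguments splits on whitespace only; ZWSP is not
# whitespace to Python, so it stays inside the tokens.
whitespace_tokens = text.split()

# Treating ZWSP as an additional break opportunity, alongside whitespace.
break_tokens = re.split(rf"[\s{ZWSP}]+", text)

print(ZWSP.isspace())
print(whitespace_tokens)
print(break_tokens)
```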

    Why is Pattern_White_Space significantly disjoint from
    White_Space, namely, why does Pattern_White_Space include LRM and
    RLM (and notably LS and PS) yet omit the spaces U+1680 and in the
    U+2000 range?


We wanted a simple, immutable definition for rule and pattern strings that programmers write and maintain. We included LRM and RLM so that they can be used (and will be ignored) in rules, for example collation rule strings, to keep them moderately readable when they contain RTL characters. Typographic spaces are unnecessary in this context, and could be confusing.
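The difference between the two properties can be made concrete with a short Python sketch. The code point lists below are typed in by hand from my reading of PropList.txt in recent Unicode versions (White_Space is not immutable, so check the version you target); the helper names are hypothetical.

```python
def expand(ranges):
    """Expand inclusive (first, last) code point ranges into a set."""
    return {cp for first, last in ranges for cp in range(first, last + 1)}


# Assumed contents of White_Space (25 code points).
WHITE_SPACE = expand([
    (0x0009, 0x000D), (0x0020, 0x0020), (0x0085, 0x0085),
    (0x00A0, 0x00A0), (0x1680, 0x1680), (0x2000, 0x200A),
    (0x2028, 0x2029), (0x202F, 0x202F), (0x205F, 0x205F),
    (0x3000, 0x3000),
])

# Assumed contents of Pattern_White_Space (11 code points).
PATTERN_WHITE_SPACE = expand([
    (0x0009, 0x000D), (0x0020, 0x0020), (0x0085, 0x0085),
    (0x200E, 0x200F), (0x2028, 0x2029),
])


def fmt(code_points):
    """Render code points in U+XXXX notation, sorted."""
    return [f"U+{cp:04X}" for cp in sorted(code_points)]


only_pattern = PATTERN_WHITE_SPACE - WHITE_SPACE   # LRM and RLM
only_white = WHITE_SPACE - PATTERN_WHITE_SPACE     # typographic spaces
shared = WHITE_SPACE & PATTERN_WHITE_SPACE         # includes LS and PS

print("Pattern_White_Space only:", fmt(only_pattern))
print("White_Space only:", fmt(only_white))
print("Shared:", fmt(shared))
```

Under these assumptions, the only pattern-specific additions are U+200E and U+200F; LS and PS are in both sets, and everything White_Space adds beyond the shared core is a typographic space such as NBSP, U+1680, or the U+2000 block.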

In hindsight, LS and PS are probably mistakes. When we came up with Pattern_White_Space, we still liked the idea of unambiguous end-of-line controls, but in practice it looks like no one really uses them. Anyone who cares uses markup or rich-text formats. (Markup was not common when Unicode was "born".)

I like the premise of LS and PS: one (well, two) unambiguous characters to rule them all. But the execution was lacking, to put it mildly. And there aren't two keys on a common keyboard to distinguish between line and paragraph separation.

On 8/6/2016 11:30 AM, Doug Ewell wrote:
Additionally, in UTF-8, either LS or PS actually takes more bytes than CR plus LF, so the "increased text size" argument also discouraged use of the new controls.

That is true: LS and PS each take 3 bytes in UTF-8, versus 2 for CR plus LF. However, the original UTF-8 proposal encoded U+0080 - U+207F in two octets (see https://en.wikipedia.org/wiki/UTF-8):

    10xxxxxx 1xxxxxxx


So, the space block /just barely makes it/. Was this intentional during the original design of UTF-8, or just a coincidence? I think it was more than a coincidence. It is regrettable that the space block is too high to fit in two octets in the final version of UTF-8, where the two-octet range ends at U+07FF...maybe it should have been allocated at or below U+07FF.
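The byte counts and ranges above can be checked with a few lines of Python. The arithmetic for the original proposal assumes the 13 payload bits of the pattern quoted above are offset from U+0080, which is consistent with the stated range; the final UTF-8 two-octet form (110xxxxx 10xxxxxx) carries 11 payload bits. Constant names are hypothetical.

```python
CRLF = "\r\n"
LS = "\u2028"  # LINE SEPARATOR
PS = "\u2029"  # PARAGRAPH SEPARATOR


def utf8_len(s: str) -> int:
    """Number of bytes the string occupies in (final) UTF-8."""
    return len(s.encode("utf-8"))


# Original proposal: 6 + 7 = 13 payload bits, assumed to start at U+0080.
ORIGINAL_TWO_OCTET_FIRST = 0x0080
ORIGINAL_TWO_OCTET_LAST = ORIGINAL_TWO_OCTET_FIRST + 2**13 - 1

# Final UTF-8: 5 + 6 = 11 payload bits, no offset.
FINAL_TWO_OCTET_LAST = 2**11 - 1

print("CR LF:", utf8_len(CRLF), "bytes")
print("LS:", utf8_len(LS), "bytes; PS:", utf8_len(PS), "bytes")
print(f"original two-octet range ends at U+{ORIGINAL_TWO_OCTET_LAST:04X}")
print(f"final two-octet range ends at U+{FINAL_TWO_OCTET_LAST:04X}")

# Would LS and PS have fit in two octets under each scheme?
for name, ch in (("LS", LS), ("PS", PS)):
    print(name,
          "original:", ord(ch) <= ORIGINAL_TWO_OCTET_LAST,
          "final:", ord(ch) <= FINAL_TWO_OCTET_LAST)
```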

(More motivation for my whitespace question in following message...)

Sean
