On 8/5/2016 10:07 AM, Markus Scherer wrote:
On Fri, Aug 5, 2016 at 8:52 AM, Sean Leonard wrote:
What makes a character a "whitespace" in Unicode, e.g., why are
ZWSP and ZWNBSP not "whitespace" even though they clearly say
"SPACE" in them?
I think "white space" basically wants to have an advance width (occupy
space) but no ink (all white, no black) :-)
Yes, that is the thought that I had as well: whitespace characters
always generate blank space between graphemes, whether horizontal or
vertical.
ZWSP and ZWNBSP affect word and line breaking but have no advance width.
I suppose that these are "SPACE" characters, but not "WHITE space"
characters, since there is no white in them. :)
Note that character names can be misleading, plain wrong, or even just
misspelled, but they cannot be changed. Best to read the
documentation. The charts are a good start:
http://www.unicode.org/charts/PDF/U2000.pdf
http://www.unicode.org/charts/PDF/UFE70.pdf
In particular, don't build sets of Unicode characters just based on
character name patterns. Use character properties as much as possible.
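To make the name-versus-property point concrete, here is a minimal Python sketch. It assumes Python's bundled `unicodedata` tables, and it uses `str.isspace()` as a stand-in for White_Space. That is only an approximation (Python derives it from the general category and bidi class, not from PropList.txt), but the two agree on the eight code points chosen here.

```python
import unicodedata

# Code points mentioned in this thread, plus a few ordinary spaces.
CANDIDATES = [0x0020, 0x00A0, 0x1680, 0x2003, 0x200B, 0x2028, 0x2029, 0xFEFF]

# Naive approach: anything whose character name contains "SPACE".
by_name = {cp for cp in CANDIDATES if "SPACE" in unicodedata.name(chr(cp))}

# Property-based approach (str.isspace() approximates White_Space here).
by_property = {cp for cp in CANDIDATES if chr(cp).isspace()}

for cp in CANDIDATES:
    ch = chr(cp)
    print(f"U+{cp:04X} {unicodedata.name(ch):28} "
          f"gc={unicodedata.category(ch)} isspace={ch.isspace()}")

# Picked up by the name pattern but not by the property: ZWSP and ZWNBSP.
print(sorted(f"U+{cp:04X}" for cp in by_name - by_property))
# Missed by the name pattern but matched by the property: LS and PS.
print(sorted(f"U+{cp:04X}" for cp in by_property - by_name))
```

The name pattern goes wrong in both directions: it picks up the two zero-width format characters (general category Cf) and misses LINE SEPARATOR and PARAGRAPH SEPARATOR.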
What are "Unicode-y" ways to compute word boundaries?
http://www.unicode.org/reports/tr29/#Word_Boundaries
Related to prior question--I suppose ZWSP is not "whitespace", but
like whitespace, it separates words. I suppose that since it is
not printable, it is "confusing", and therefore should be avoided
in contexts where the printed representation of Unicode code
points matters.
Depends on what you do.
Normal text needs ZWSP & ZWNBSP, for example for proper word wrapping
and line breaking in a browser or text field/editor.
They are not allowed in identifiers, and are removed from domain names
(UTS #46).
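The identifier half of that statement can be checked quickly in Python. This is only an illustration using Python's own identifier rules (`str.isidentifier()`, which follows the XID_Start/XID_Continue properties); other languages may define identifiers differently, and the sample strings are made up.

```python
ZWSP = "\u200b"    # ZERO WIDTH SPACE
ZWNBSP = "\ufeff"  # ZERO WIDTH NO-BREAK SPACE

# Hypothetical sample strings; only the invisible character differs.
samples = {
    "plain": "wordbreak",
    "with ZWSP": "word" + ZWSP + "break",
    "with ZWNBSP": "word" + ZWNBSP + "break",
}

# True only for strings that Python accepts as identifiers.
results = {label: text.isidentifier() for label, text in samples.items()}

for label, ok in results.items():
    print(f"{label:12} isidentifier={ok}")
```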
Why does Pattern_White_Space differ so much from White_Space?
Namely, why does Pattern_White_Space include LRM and RLM (and
notably LS and PS) yet omit the spaces U+1680 and those in the
U+2000 range?
We wanted a simple, immutable definition for rule and pattern strings
that programmers write and maintain. We included LRM and RLM so that
they can be used (and will be ignored) in rules, for example collation
rule strings, to keep them moderately readable when they contain RTL
characters. Typographic spaces are unnecessary in this context, and
could be confusing.
In hindsight, LS and PS are probably mistakes. When we came up
with Pattern_White_Space, we still liked the idea of unambiguous
end-of-line controls, but in practice it looks like no one really uses
them. Anyone who cares uses markup or rich-text formats. (Markup was
not common when Unicode was "born".)
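For reference, the two properties compared in the question can be written out as sets and diffed. The code points below are transcribed by hand from PropList.txt as I understand it for recent Unicode versions, so treat them as an assumption and check them against the data file for the version you target. White_Space has changed over time; Pattern_White_Space is the immutable one.

```python
def expand(*ranges):
    """Turn inclusive (first, last) code point ranges into a set of ints."""
    return {cp for first, last in ranges for cp in range(first, last + 1)}

# Assumed contents of White_Space (verify against PropList.txt).
WHITE_SPACE = expand(
    (0x0009, 0x000D), (0x0020, 0x0020), (0x0085, 0x0085), (0x00A0, 0x00A0),
    (0x1680, 0x1680), (0x2000, 0x200A), (0x2028, 0x2029), (0x202F, 0x202F),
    (0x205F, 0x205F), (0x3000, 0x3000),
)

# Assumed contents of Pattern_White_Space (verify against PropList.txt).
PATTERN_WHITE_SPACE = expand(
    (0x0009, 0x000D), (0x0020, 0x0020), (0x0085, 0x0085),
    (0x200E, 0x200F),  # LRM, RLM
    (0x2028, 0x2029),  # LS, PS
)

only_pattern = PATTERN_WHITE_SPACE - WHITE_SPACE
only_white = WHITE_SPACE - PATTERN_WHITE_SPACE

print("only in Pattern_White_Space:", [f"U+{cp:04X}" for cp in sorted(only_pattern)])
print("only in White_Space:        ", [f"U+{cp:04X}" for cp in sorted(only_white)])
```

Under these assumptions the only additions in Pattern_White_Space are LRM and RLM; everything it leaves out is a typographic or no-break space.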
I like the premise of LS and PS: one (well, two) unambiguous characters
to rule them all. But the execution was lacking, to put it mildly. And
there aren't two keys on a common keyboard to distinguish between line
and paragraph separation.
On 8/6/2016 11:30 AM, Doug Ewell wrote:
Additionally, in UTF-8, either LS or PS actually takes more bytes than
CR plus LF, so the "increased text size" argument also discouraged use
of the new controls.
That is true: each takes 3 bytes. However, the original UTF-8 proposal
(FSS-UTF) encoded U+0080 - U+207F in two octets (see
https://en.wikipedia.org/wiki/UTF-8):
10xxxxxx 1xxxxxxx
So the block that holds the typographic spaces, LS, and PS (starting at
U+2000) /just barely makes it/. Was this intentional during the original
design of UTF-8, or just a coincidence? I think it was more than a
coincidence. It is regrettable that the block sits too high to get
two-octet encodings in the final version of UTF-8... maybe it should
have gone at or below U+07FF.
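The byte counts and the two-octet limits are easy to check. In this sketch the final UTF-8 sizes come from Python's own encoder; the limit for the original proposal is just arithmetic on the bit pattern quoted above, assuming (as the quoted range U+0080 - U+207F implies) that the two-octet range starts right after the one-octet range.

```python
LS = "\u2028"  # LINE SEPARATOR
PS = "\u2029"  # PARAGRAPH SEPARATOR
CRLF = "\r\n"

# Encoded sizes in today's UTF-8.
utf8_sizes = {
    name: len(text.encode("utf-8"))
    for name, text in [("LS", LS), ("PS", PS), ("CRLF", CRLF)]
}
print(utf8_sizes)

# Final UTF-8: 110xxxxx 10xxxxxx carries 5 + 6 = 11 payload bits.
UTF8_TWO_OCTET_MAX = (1 << 11) - 1

# Original proposal as quoted: 10xxxxxx 1xxxxxxx carries 6 + 7 = 13 payload
# bits, with the range assumed to begin at U+0080.
FSS_TWO_OCTET_MIN = 0x80
FSS_TWO_OCTET_MAX = FSS_TWO_OCTET_MIN + (1 << 13) - 1

print(f"final UTF-8 two-octet max:       U+{UTF8_TWO_OCTET_MAX:04X}")
print(f"original proposal two-octet max: U+{FSS_TWO_OCTET_MAX:04X}")
print("LS fits in two octets (original):", ord(LS) <= FSS_TWO_OCTET_MAX)
print("LS fits in two octets (final):   ", ord(LS) <= UTF8_TWO_OCTET_MAX)
```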
(More motivation for my whitespace question in following message...)
Sean