Re: Whitespace characters in Unicode

Markus Scherer Fri, 05 Aug 2016 12:42:30 -0700

On Fri, Aug 5, 2016 at 8:52 AM, Sean Leonard <[email protected]>
wrote:


> What makes a character a "whitespace" in Unicode, e.g., why are ZWSP and
> ZWNBSP not "whitespace" even though they clearly say "SPACE" in them?
>

I think "white space" basically wants to have an advance width (occupy
space) but no ink (all white, no black)  :-)

ZWSP and ZWNBSP affect word and line breaking but have no advance width.

Note that character names can be misleading, plain wrong, or even just
misspelled, but they cannot be changed. Best to read the documentation. The
charts are a good start:
http://www.unicode.org/charts/PDF/U2000.pdf
http://www.unicode.org/charts/PDF/UFE70.pdf

In particular, don't build sets of Unicode characters just based on
character name patterns. Use character properties as much as possible.
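
To make that concrete, here is a rough Python sketch comparing a name-based set with a property-based one. Python's standard library does not expose White_Space directly, so the WHITE_SPACE table below is my assumption: the code points copied by hand from PropList.txt in a recent Unicode version. In real code you would generate it from the data file or use a library such as ICU.

```python
import unicodedata

# White_Space code points, copied by hand from PropList.txt (assumption:
# a recent Unicode version; the stdlib does not expose this property).
WHITE_SPACE = frozenset(
    chr(cp)
    for lo, hi in [
        (0x0009, 0x000D), (0x0020, 0x0020), (0x0085, 0x0085),
        (0x00A0, 0x00A0), (0x1680, 0x1680), (0x2000, 0x200A),
        (0x2028, 0x2029), (0x202F, 0x202F), (0x205F, 0x205F),
        (0x3000, 0x3000),
    ]
    for cp in range(lo, hi + 1)
)

# The fragile approach: every BMP character whose name contains "SPACE".
named_space = frozenset(
    chr(cp)
    for cp in range(0x10000)
    if "SPACE" in unicodedata.name(chr(cp), "")
)

ZWSP, ZWNBSP, TAB = "\u200b", "\ufeff", "\t"

# ZWSP and ZWNBSP match the name pattern but do not have the property.
print(ZWSP in named_space, ZWSP in WHITE_SPACE)
print(ZWNBSP in named_space, ZWNBSP in WHITE_SPACE)

# TAB has the property but no character name at all, so the name
# pattern misses it.
print(TAB in named_space, TAB in WHITE_SPACE)
```

The two sets disagree in both directions, which is the reason to start from the property.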

> What are "Unicode-y" ways to compute word boundaries?
>

http://www.unicode.org/reports/tr29/#Word_Boundaries

> Related to prior question--I suppose ZWSP is not "whitespace", but like
> whitespace, it separates words. I suppose that since it is not printable,
> it is "confusing", and therefore should be avoided in contexts where the
> printed representation of Unicode code points matters.
>

Depends on what you do.

Normal text needs ZWSP & ZWNBSP, for example for proper word wrapping and
line breaking in a browser or text field/editor.

They are not allowed in identifiers, and they are removed from domain names
(UTS #46).
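
As a small illustration of the identifier half, Python's str.isidentifier() follows the XID_Start/XID_Continue properties, so it rejects both characters. This is only a sketch of one language's identifier rules; it does not demonstrate the UTS #46 mapping for domain names.

```python
ZWSP, ZWNBSP = "\u200b", "\ufeff"

# A plain ASCII name is a valid identifier.
plain_ok = "foobar".isidentifier()

# The same name with an invisible character inside is rejected,
# because neither character has the XID_Continue property.
with_zwsp_ok = ("foo" + ZWSP + "bar").isidentifier()
with_zwnbsp_ok = ("foo" + ZWNBSP + "bar").isidentifier()

print(plain_ok, with_zwsp_ok, with_zwnbsp_ok)
```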

> Why is Pattern_White_Space significantly disjoint from White_Space, namely,
> why does Pattern_White_Space include LTRM and RTLM (and notably LS and PS)
> yet omit the spaces U+1680 and in the U+2000 range?
>

We wanted a simple, immutable definition for rule and pattern strings that
programmers write and maintain. We included LRM and RLM so that they can be
used (and will be ignored) in rules, for example collation rule strings, to
keep them moderately readable when they contain RTL characters. Typographic
spaces are unnecessary in this context, and could be confusing.
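
A minimal sketch of what "will be ignored" can look like in a rule parser, assuming the Pattern_White_Space set is the 11 code points listed in PropList.txt (the set is immutable, so hardcoding it is reasonable). The function name is hypothetical, and a real parser would only skip these characters outside quoted literals.

```python
# Pattern_White_Space: U+0009..U+000D, U+0020, U+0085,
# U+200E (LRM), U+200F (RLM), U+2028 (LS), U+2029 (PS).
PATTERN_WHITE_SPACE = frozenset(
    "\t\n\x0b\x0c\r \x85\u200e\u200f\u2028\u2029"
)


def strip_pattern_white_space(rule: str) -> str:
    """Drop every Pattern_White_Space character from a rule string.

    Hypothetical helper: it ignores quoting, which real rule syntaxes
    have to respect.
    """
    return "".join(ch for ch in rule if ch not in PATTERN_WHITE_SPACE)


# LRM keeps the rule readable next to RTL letters and is then ignored.
rule = "& a \u200e< \u05d0 \u200e< b"
print(strip_pattern_white_space(rule))

# A typographic space such as EM SPACE (U+2003) is White_Space but not
# Pattern_White_Space, so it is not skipped.
print("\u2003" in PATTERN_WHITE_SPACE)
```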

In hindsight, LS and PS are probably mistakes. When we came up
with Pattern_White_Space, we still liked the idea of unambiguous
end-of-line controls, but in practice it looks like no one really uses
them. Anyone who cares uses markup or rich-text formats. (Markup was not
common when Unicode was "born".)

> Any implementation experience from other standards authors/implementers who
> have run into problems with shifty whitespace definitions?
>

Use properties, not character name patterns. If you have strong reasons not
to use a property as-is, then still use it, just with inclusion & exclusion
overrides.
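
For example, a property-plus-overrides test could be sketched like this in Python. The base property here is General_Category=Zs, only because the standard library exposes it. The override sets are hypothetical choices for an imagined tokenizer that wants ZWSP to separate words but wants the no-break spaces to hold them together.

```python
import unicodedata

# Hypothetical overrides on top of the base property.
INCLUDE = frozenset({"\u200b"})            # ZWSP: separates words
EXCLUDE = frozenset({"\u00a0", "\u202f"})  # NBSP, NNBSP: keep words together


def is_separator(ch: str) -> bool:
    """Base property (General_Category=Zs) with explicit overrides."""
    if ch in EXCLUDE:
        return False
    if ch in INCLUDE:
        return True
    return unicodedata.category(ch) == "Zs"


for ch in [" ", "\u2003", "\u200b", "\u00a0", "a"]:
    print(f"U+{ord(ch):04X}", is_separator(ch))
```

The override lists stay short and reviewable, and the base set still follows the property when the Unicode data is updated.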

Best regards,
markus
