On Fri, Aug 5, 2016 at 8:52 AM, Sean Leonard <[email protected]> wrote:
> What makes a character a "whitespace" in Unicode, e.g., why are ZWSP and
> ZWNBSP not "whitespace" even though they clearly say "SPACE" in them?

I think "white space" basically wants to have an advance width (occupy space) but no ink (all white, no black) :-)
ZWSP and ZWNBSP affect word and line breaking but have no advance width.

Note that character names can be misleading, plain wrong, or even just misspelled, but they cannot be changed. Best to read the documentation. The charts are a good start:
http://www.unicode.org/charts/PDF/U2000.pdf
http://www.unicode.org/charts/PDF/UFE70.pdf

In particular, don't build sets of Unicode characters just based on character name patterns. Use character properties as much as possible.

> What are "Unicode-y" ways to compute word boundaries?

http://www.unicode.org/reports/tr29/#Word_Boundaries

> Related to prior question--I suppose ZWSP is not "whitespace", but like
> whitespace, it separates words. I suppose that since it is not printable,
> it is "confusing", and therefore should be avoided in contexts where the
> printed representation of Unicode code points matters.

Depends on what you do. Normal text needs ZWSP & ZWNBSP, for example for proper word wrapping and line breaking in a browser or text field/editor. They are not allowed in identifiers, and removed from domain names (UTS #46).

> Why is Pattern_White_Space significantly disjoint from White_Space, namely,
> why does Pattern_White_Space include LTRM and RTLM (and notably LS and PS)
> yet omit the spaces U+1680 and in the U+2000 range?

We wanted a simple, immutable definition for rule and pattern strings that programmers write and maintain. We included LRM and RLM so that they can be used (and will be ignored) in rules, for example collation rule strings, to keep them moderately readable when they contain RTL characters. Typographic spaces are unnecessary in this context, and could be confusing.

In hindsight, LS and PS are probably mistakes. When we came up with Pattern_White_Space, we still liked the idea of unambiguous end-of-line controls, but in practice it looks like no one really uses them. Anyone who cares uses markup or rich-text formats. (Markup was not common when Unicode was "born".)

> Any implementation experience from other standards authors/implementers who
> have run into problems with shifty whitespace definitions?

Use properties, not character name patterns. If you have strong reasons not to use a property as-is, then still use it, just with inclusion & exclusion overrides.

Best regards,
markus
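
To make the "property plus inclusion & exclusion overrides" advice concrete, here is a minimal Python sketch. It rests on several assumptions: Python's standard library does not expose White_Space or Pattern_White_Space directly, so the two sets are copied by hand from PropList.txt of a recent Unicode version (a real implementation would generate them from the UCD or use a library such as ICU); the helper make_matcher and the is_separator override are hypothetical names invented for this illustration.

```python
# Code points hand-copied from PropList.txt (assumption: a recent UCD version).
# White_Space may differ between Unicode versions; Pattern_White_Space is immutable.
WHITE_SPACE = frozenset([
    *range(0x0009, 0x000E),   # TAB, LF, VT, FF, CR
    0x0020, 0x0085, 0x00A0, 0x1680,
    *range(0x2000, 0x200B),   # EN QUAD .. HAIR SPACE
    0x2028, 0x2029,           # LS, PS
    0x202F, 0x205F, 0x3000,
])

PATTERN_WHITE_SPACE = frozenset([
    *range(0x0009, 0x000E),
    0x0020, 0x0085,
    0x200E, 0x200F,           # LRM, RLM
    0x2028, 0x2029,           # LS, PS
])


def make_matcher(base, include="", exclude=""):
    """Return a predicate for `base` plus `include` minus `exclude`.

    Hypothetical helper: start from a property set and apply explicit
    overrides instead of building a set from character names.
    """
    members = frozenset(
        (set(base) | {ord(c) for c in include}) - {ord(c) for c in exclude}
    )

    def matches(ch):
        return ord(ch) in members

    return matches


is_white_space = make_matcher(WHITE_SPACE)
is_pattern_white_space = make_matcher(PATTERN_WHITE_SPACE)

# Hypothetical application-specific rule: treat ZWSP as a separator,
# but do not treat NO-BREAK SPACE as one.
is_separator = make_matcher(WHITE_SPACE, include="\u200b", exclude="\u00a0")
```

With these definitions, is_white_space("\u200b") and is_white_space("\ufeff") are both False despite the word "SPACE" in the character names, while is_pattern_white_space("\u200e") is True and is_pattern_white_space("\u1680") is False, matching the differences discussed above.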

