Re: Whitespace characters in Unicode

Sean Leonard Sun, 07 Aug 2016 17:23:20 -0700

On 8/5/2016 10:07 AM, Markus Scherer wrote:

On Fri, Aug 5, 2016 at 8:52 AM, Sean Leonard wrote:
    What makes a character a "whitespace" in Unicode, e.g., why are
    ZWSP and ZWNBSP not "whitespace" even though they clearly say
    "SPACE" in them?


    Any implementation experience from other standards
    authors/implementers who have run into problems with shifty
    whitespace definitions?
Use properties, not character name patterns. If you have strong reasons not to use a property as-is, then still use it, just with inclusion & exclusion overrides.

Short answer: I cannot use character properties, and cannot use exclusion overrides.

As I have posted publicly, I am proposing some experimental Unicode-friendly extensions to IETF ABNF (currently in https://tools.ietf.org/html/draft-seantek-abnf-more-core-rules-05 , going to change that around a bit). There is (some) renewed interest in that part of the work since RFCs will permit UTF-8 in certain places, and IETF protocols are supposed to march towards "Net-Unicode" per RFC 5198.

Being a BNF, ABNF does not have exclusion, only incremental alternatives. Character properties would require a runtime library, which significantly goes against the purpose of (A)BNF.
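
To illustrate what "only incremental alternatives" implies: any exclusion has to be worked out ahead of time and flattened into a plain list of ranges. The following is a minimal Python sketch of that precomputation, not anything taken from the draft; the helper names subtract_ranges and to_abnf are hypothetical.

```python
def subtract_ranges(base, excluded):
    """Return the parts of the inclusive range `base` not covered by `excluded`.

    `base` is a (low, high) tuple; `excluded` is a list of (low, high) tuples.
    """
    low, high = base
    result = []
    cursor = low  # first code point not yet accounted for
    for ex_low, ex_high in sorted(excluded):
        if ex_high < cursor or ex_low > high:
            continue  # this exclusion does not touch what remains
        if ex_low > cursor:
            result.append((cursor, ex_low - 1))
        cursor = max(cursor, ex_high + 1)
    if cursor <= high:
        result.append((cursor, high))
    return result


def to_abnf(ranges):
    """Format inclusive ranges as an ABNF alternation of %x terminals."""
    terms = []
    for low, high in ranges:
        terms.append(f"%x{low:X}" if low == high else f"%x{low:X}-{high:X}")
    return " / ".join(terms)


# All code points minus the surrogate block leaves the Unicode scalar values.
scalar_values = subtract_ranges((0x0, 0x10FFFF), [(0xD800, 0xDFFF)])
print(to_abnf(scalar_values))  # %x0-D7FF / %xE000-10FFFF
```

The grammar itself only ever sees the resulting alternation; the subtraction happens once, when the rule is written.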

The current proposed core rules have <UNICODE> (scalar values = doughnut hole for surrogates) and <BEYONDASCII> (scalar values without the ASCII range). While these are technically accurate, they will not be particularly useful for protocol designers as they are over-inclusive.

One of the rules I am working on is <UCHAR>, which is like <CHAR> except for Unicode. That eliminates the noncharacter code points (which, technically, are characters...that are defined as "not characters") as well as NULL, which is already eliminated by <CHAR>.
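
As a rough sketch of what such a rule could expand to, the Python below enumerates the ranges left after removing NULL, the surrogates, and the 66 noncharacters from the code space. This assumes <UCHAR> means exactly "scalar values minus NULL minus noncharacters"; the rule as eventually written in the draft may differ, and the function names are hypothetical.

```python
def noncharacter_code_points():
    """Return the 66 Unicode noncharacter code points in ascending order."""
    points = list(range(0xFDD0, 0xFDF0))  # U+FDD0..U+FDEF
    for plane in range(17):
        # The last two code points of every plane are noncharacters.
        points.append(plane * 0x10000 + 0xFFFE)
        points.append(plane * 0x10000 + 0xFFFF)
    return sorted(points)


def uchar_ranges():
    """Return inclusive ranges for the assumed <UCHAR> repertoire."""
    excluded = set(noncharacter_code_points())
    excluded.add(0x0)                        # NULL, as in <CHAR>
    excluded.update(range(0xD800, 0xE000))   # surrogates are not scalar values
    ranges = []
    start = None
    for cp in range(0x110000):
        if cp in excluded:
            if start is not None:
                ranges.append((start, cp - 1))
                start = None
        elif start is None:
            start = cp
    if start is not None:
        ranges.append((start, 0x10FFFF))
    return ranges


ranges = uchar_ranges()
# Print the ranges as an ABNF-style alternation, one term per line.
print("\n / ".join(f"%x{low:X}-{high:X}" for low, high in ranges))
```

Under that assumption the result is 19 ranges: three in the Basic Multilingual Plane and one for each of the 16 supplementary planes.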

I was going to avoid extending <VCHAR> (which is U+0021-U+007E, i.e., no spaces and no control characters) because it's a bit too complicated. However, there are actual protocols, including a protocol that I am working on, that define parts of the repertoire as "graphic symbols and spacing characters", and elsewhere, "graphic symbols" (i.e., no spaces and no control characters). So the space characters are relevant at a level beneath requiring a full Unicode runtime to get at the character properties.
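
One way to get a fixed list without a runtime dependency is to generate it once, offline, and paste the result into the grammar. The sketch below does that in Python using General_Category Zs (Space_Separator) as a stand-in for "spacing characters". That choice is an assumption: Zs is narrower than the White_Space property, which also covers tab, the line and paragraph separators, and similar controls. The output depends on the Unicode version bundled with the Python build, and the function name is hypothetical.

```python
import sys
import unicodedata


def space_separator_code_points():
    """Return every code point whose General_Category is Zs."""
    return [
        cp
        for cp in range(sys.maxunicode + 1)
        if unicodedata.category(chr(cp)) == "Zs"
    ]


spaces = space_separator_code_points()
for cp in spaces:
    # One ABNF-style terminal per line, with the character name as a comment.
    print(f"%x{cp:X} ; {unicodedata.name(chr(cp), '<unnamed>')}")
```

ZWSP (U+200B) and ZWNBSP (U+FEFF) do not appear in the output: both are format characters (Cf), whatever their names suggest.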

The newline issue is related but separate, and since IETF continues to use CRLF as the standard for interchange, I don't see a reason to touch it further.


Best regards,

Sean
