On 8/5/2016 10:07 AM, Markus Scherer wrote:
On Fri, Aug 5, 2016 at 8:52 AM, Sean Leonard wrote:

    What makes a character a "whitespace" in Unicode, e.g., why are
    ZWSP and ZWNBSP not "whitespace" even though they clearly say
    "SPACE" in them?


    Any implementation experience from other standards
    authors/implementers who have run into problems with shifty
    whitespace definitions?


Use properties, not character name patterns. If you have strong reasons not to use a property as-is, then still use it, just with inclusion & exclusion overrides.

Short answer: I cannot use character properties, and cannot use exclusion overrides.

As I have posted publicly, I am proposing some experimental Unicode-friendly extensions to IETF ABNF (currently in https://tools.ietf.org/html/draft-seantek-abnf-more-core-rules-05, though I am going to change that around a bit). There is (some) renewed interest in that part of the work, since RFCs will permit UTF-8 in certain places and IETF protocols are supposed to march towards "Net-Unicode" per RFC 5198.

Being a BNF, ABNF has no exclusion (set subtraction) operator, only incremental alternatives. Character properties would require a runtime library to look them up, which runs counter to the purpose of (A)BNF.

The current proposed core rules have <UNICODE> (all scalar values, i.e., the code space with a doughnut hole where the surrogates are) and <BEYONDASCII> (scalar values without the ASCII range). While these are technically accurate, they will not be particularly useful for protocol designers because they are over-inclusive.
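To make the two rules concrete, here is a minimal Python sketch. It assumes each rule is a plain list of inclusive code point ranges, which is roughly what ABNF alternatives of `%x` ranges can express. The names `UNICODE_RANGES`, `BEYONDASCII_RANGES`, `in_ranges`, and `count` are made up for illustration, and the exact ranges in the draft may differ.

```python
# Hypothetical range tables; the names are illustrative, not from the draft.
# Each rule is a list of inclusive (low, high) code point ranges, which is
# all that ABNF alternation can express (there is no subtraction operator).
UNICODE_RANGES = [(0x0000, 0xD7FF), (0xE000, 0x10FFFF)]      # scalar values
BEYONDASCII_RANGES = [(0x0080, 0xD7FF), (0xE000, 0x10FFFF)]  # minus ASCII


def in_ranges(cp, ranges):
    """Return True if code point cp falls inside any inclusive range."""
    return any(low <= cp <= high for low, high in ranges)


def count(ranges):
    """Return the number of code points covered by a list of ranges."""
    return sum(high - low + 1 for low, high in ranges)
```

The "doughnut hole" shows up as the gap between the two ranges: the surrogates are never subtracted, they are simply left out of the alternatives.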

One of the rules I am working on is <UCHAR>, which is the Unicode analogue of <CHAR>. It eliminates the noncharacter code points (which, technically, are characters...that are defined as "not characters") as well as NULL, which <CHAR> already eliminates.
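As a rough illustration of what such a rule has to enumerate, the following Python sketch derives the inclusive ranges left over once NULL, the surrogates, and the 66 noncharacters are removed. The function names are hypothetical, and the draft's actual <UCHAR> definition may differ from this reading of it.

```python
def is_noncharacter(cp):
    """True for the 66 Unicode noncharacter code points."""
    # U+FDD0..U+FDEF, plus the last two code points of each of the 17 planes.
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def is_uchar(cp):
    """Hypothetical <UCHAR>: a scalar value that is neither NULL nor a
    noncharacter (an assumption about the rule, not the draft's text)."""
    if not 0x0001 <= cp <= 0x10FFFF:
        return False
    if 0xD800 <= cp <= 0xDFFF:  # surrogates are not scalar values
        return False
    return not is_noncharacter(cp)


def uchar_ranges():
    """Collapse is_uchar into inclusive ranges, the only form that
    ABNF alternatives can express."""
    ranges, start = [], None
    # Iterate one past the end of the code space to close the final range.
    for cp in range(0x110000 + 1):
        ok = cp < 0x110000 and is_uchar(cp)
        if ok and start is None:
            start = cp
        elif not ok and start is not None:
            ranges.append((start, cp - 1))
            start = None
    return ranges
```

Because every plane ends in two noncharacters, the exclusions split the code space into many separate alternatives rather than one tidy range.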

I was going to avoid extending <VCHAR> (which is U+0021-U+007E, i.e., no spaces and no control characters) because it's a bit too complicated. However, there are actual protocols, including a protocol that I am working on, that define parts of the repertoire as "graphic symbols and spacing characters", and elsewhere, "graphic symbols" (i.e., no spaces and no control characters). So the space characters are relevant even at a level that cannot require a full Unicode runtime to get at the character properties.
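One way to picture the trade-off is to freeze the White_Space property into a fixed range table, which a grammar could carry without a property lookup at runtime. The Python sketch below does that and compares it with the name-pattern heuristic from the original question. The table is transcribed from the White_Space entries in PropList.txt as I understand them for Unicode 6.3 and later (after U+180E was dropped); it should be re-checked against whichever Unicode version is targeted. The function names are hypothetical.

```python
import unicodedata

# Code points with White_Space=Yes, frozen as inclusive ranges.
# Assumption: this matches PropList.txt for Unicode 6.3 and later.
WHITE_SPACE_RANGES = [
    (0x0009, 0x000D), (0x0020, 0x0020), (0x0085, 0x0085), (0x00A0, 0x00A0),
    (0x1680, 0x1680), (0x2000, 0x200A), (0x2028, 0x2029), (0x202F, 0x202F),
    (0x205F, 0x205F), (0x3000, 0x3000),
]


def is_white_space(cp):
    """Table lookup standing in for the White_Space property."""
    return any(low <= cp <= high for low, high in WHITE_SPACE_RANGES)


def name_says_space(cp):
    """The name-pattern heuristic: does the character name contain SPACE?"""
    # Unnamed code points (e.g. C0 controls) fall back to the empty string.
    return "SPACE" in unicodedata.name(chr(cp), "")
```

With this table, U+200B ZERO WIDTH SPACE and U+FEFF ZERO WIDTH NO-BREAK SPACE both match the name pattern but are not White_Space, while U+0009 (tab) is White_Space but has no name containing "SPACE".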

The newline issue is related but separate, and since IETF continues to use CRLF as the standard for interchange, I don't see a reason to touch it further.

Best regards,

Sean
