Rainer:

Good research.  I too did not realize there were control characters
outside of ASCII in Unicode.  The Unicode character range you referenced
had, for example, an alternative line separator.  How many line
separators does the world need, for God's sake?  :)

I agree with your conclusion that we need to support all Unicode/UTF.  I
also think that doing any kind of escaping is generally bad and should
be deferred until it is absolutely necessary (like maybe escaping line
breaks for storage).

I guess this means we can't have a line separator trailer unless we
escape all other line separators inside the message.  I really would
prefer no escaping.
I think alternatively, a UDP transport can define an optional/required
structured element for message length in octets, but it is tricky.
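
To make the trade-off concrete, here is a minimal sketch in Python.  The
helper names and the "length, space, payload" layout are my own
assumptions for illustration; nothing here is taken from syslog-protocol
or from any transport draft.  It only shows that a trailer breaks on an
unescaped embedded separator, while an octet count does not care what is
inside the message.

```python
# Hypothetical framing helpers, for illustration only.

def split_on_trailer(stream):
    """Naive trailer framing: every LF octet ends a message."""
    return [part for part in stream.split(b"\n") if part]

def frame_with_length(msg):
    """Prefix the UTF-8 payload with its length in octets and a space."""
    payload = msg.encode("utf-8")
    return str(len(payload)).encode("ascii") + b" " + payload

def read_framed(stream):
    """Read back messages written by frame_with_length()."""
    messages = []
    pos = 0
    while pos < len(stream):
        space = stream.index(b" ", pos)
        length = int(stream[pos:space].decode("ascii"))
        start = space + 1
        messages.append(stream[start:start + length].decode("utf-8"))
        pos = start + length
    return messages

# One logical message containing an LF and a U+2028 LINE SEPARATOR.
msg = "disk full\nsecond line\u2028third line"

# The trailer approach wrongly sees two messages.
print(len(split_on_trailer(msg.encode("utf-8") + b"\n")))

# The length approach returns the message unchanged, no escaping needed.
print(read_framed(frame_with_length(msg) + frame_with_length("ok")))
```

The length is counted in octets after UTF-8 encoding, not in
characters, which is part of why this gets tricky.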

Anton.

> -----Original Message-----
> From: Rainer Gerhards [mailto:[EMAIL PROTECTED]
> Sent: Friday, February 06, 2004 6:24 AM
> To: Harrington, David; Anton Okmianski; [EMAIL PROTECTED]
> Subject: RE: -international: trailer
>
>
> David,
>
> thanks for your wake-up call...
>
> > I believe we should move to UTF-8 to allow operators who
>
> UTF-8 is actually a MUST in syslog-protocol.
>
> I have to admit that I did not fully understand UNICODE until
> now... I always read RFC 2279 (UTF-8 encoding). It specifies (page 2):
>
> - Character values from 0000 0000 to 0000 007F (US-ASCII repertoire)
>   correspond to octets 00 to 7F (7 bit US-ASCII values). A direct
>   consequence is that a plain ASCII string is also a valid UTF-8
>   string.
>
> - US-ASCII values do not appear otherwise in a UTF-8 encoded
>   character stream.  This provides compatibility with file systems
>   or other software (e.g. the printf() function in C libraries) that
>   parse based on US-ASCII values but are transparent to other
>   values.
>
> So I thought that control characters (US-ASCII below 0x20 &
> 0x7f) are only present in this range. I assumed, however,
> that UNICODE as such does not provide control characters  but
> only printable characters.
>
> Obviously, I am wrong. I did some more research this morning
> and found that at least the 20xx block does contain control characters:
>
http://www.unicode.org/charts/PDF/U2000.pdf

For a sample, see 0x200C and the characters following it. So my basic
assumption "just exclude US-ASCII control chars and you are done" is
wrong.
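
For anyone who wants to check this without reading the chart, the
following is a rough sketch using Python's standard unicodedata module.
The set of categories I treat as "control-like" is my own assumption.
Strictly speaking, Unicode's control category (Cc) covers only U+0000
to U+001F and U+007F to U+009F; characters such as U+200C are format
characters (Cf), and U+2028/U+2029 are line and paragraph separators
(Zl/Zp).  For message framing they raise the same concern.

```python
import unicodedata

# General categories treated as control-like in this sketch:
# Cc = control, Cf = format, Zl = line sep., Zp = paragraph sep.
CONTROL_LIKE = {"Cc", "Cf", "Zl", "Zp"}

def control_like(start, end):
    """Return (code point, category, name) for control-like chars."""
    found = []
    for cp in range(start, end + 1):
        ch = chr(cp)
        cat = unicodedata.category(ch)
        if cat in CONTROL_LIKE:
            # Cc characters have no name, so supply a default.
            name = unicodedata.name(ch, "<unnamed>")
            found.append((cp, cat, name))
    return found

# The range covered by the U2000.pdf chart.
for cp, cat, name in control_like(0x2000, 0x206F):
    print(f"U+{cp:04X} {cat} {name}")

# Non-ASCII control characters also exist right above US-ASCII.
print(len(control_like(0x0080, 0x009F)))
```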

Having said this, I think we now have a bigger issue than I initially
thought. In the light of Unicode control characters, we are more or less
forced to allow any control characters inside the message part. If we
don't, we can't comply with the (well thought-out) IETF Unicode
requirement (RFC 2277/BCP 18) for new RFCs.

As I wrote in my initial message, this affects not only -protocol but
also -sign (though not too badly if -sign refers to -protocol for the
format description).

I think allowing all character values also affects most of the existing
syslog software, as much of it works with C strings, where 0x00 is the
terminating character. I don't say it can't be dealt with in new
implementations. I just would like to mention that this will probably
give us a slow start, because the initial effort will be much higher for
an implementor - lots of existing code could not be re-used.
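
The C string problem can be shown with a small sketch.  It is written in
Python, using the standard ctypes module to get a NUL-terminated C
buffer; the sample message text is made up for illustration.

```python
import ctypes

# U+0000 is a legal code point and encodes to a single 0x00 octet.
message = "login failed\u0000user=root"
payload = message.encode("utf-8")

# create_string_buffer() copies the octets and appends a terminator.
buf = ctypes.create_string_buffer(payload)

# .value stops at the first NUL, like strlen()-based C code would.
c_view = buf.value

# .raw exposes every octet; slice off the appended terminator.
full_view = buf.raw[:len(payload)]

print(len(payload), len(c_view), len(full_view))
```

Code that carries an explicit octet count alongside the buffer sees the
whole message; code that relies on the terminator silently drops
everything after the first 0x00.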

Anyhow, I don't see an alternative to allowing all control characters.

I have included some Unicode links in my summary on this issue at
http://www.syslog.cc/ietf/protocol/issue9.html - this may be helpful for
others who need to dig a little into the Unicode requirement.

What does the rest of the WG think?

Rainer



