[Syslog] MSG encoding and content (#3, #4, #5)

Rainer Gerhards Wed, 07 Dec 2005 06:31:27 -0800

Hi WG,

the topic of MSG encoding as well as its content (e.g. NUL and LF
characters) has not yet been resolved. Over the past days, I've talked to a lot
of my friends not on this list and I have also looked at various ways to
solve the issue. Be prepared, this is another long mail, but I think it
is appropriate as this is our top issue left open. It is complex and it
requires a good amount of thinking, theory and arguments. I am trying to
convey a proposal and the facts it builds on in this mail.


Let's first quickly review what has been discussed on list:

- current implementations sometimes use LF as a record delimiter
- some implementations use LF inside the MSG part
- some implementations include binary data in syslog messages
  and would like to continue to do so (but these seem to be few)
- there are at least some use cases where a syslogd cannot
  definitively detect the character encoding of a message
  (some of that might be related to the POSIX API, but there
  may be a work-around [I had no time yet to evaluate this
  in-depth]). It gets problematic if a message from a legacy
  sender is received (no encoding information) and transformed
  into a syslog-protocol message [I assume this is a valid use-case]
- previous discussion showed the need for Unicode. With Unicode, the 
  term "printable character" basically becomes useless, because there
  are so many non-printable characters in Unicode and new ones are
  potentially added constantly.
- previous consensus thus was that any valid UTF-8 string MUST
  be supported inside MSG (including NUL and LF)
- current discussion has shown that backwards-compatibility
  is not absolutely vital (but still desirable)
- it was suggested that an "encoding SD-ID" be defined which
  carries the character set definition
- as a side-note, Tom Petch has provided a very good digression
  on "character encoding" terminology which I have reproduced after
  my signature. I guess most people on this list already know the
  exact differences, but I still find it useful...


It is somewhat hard to find a good compromise. A compromise, from my point
of view, must allow the following:

- transforming existing messages into -protocol format should
  not intentionally be forbidden - transformation is a very
  important "feature" when it comes to deploying new technology
- new receivers should be able to understand the message content
  precisely "enough"
- I also find it advisable that newer receivers are capable
  of processing both old-style and new-style messages concurrently.
  While this is an implementation issue, it might be a hint for
  us that some subtleties in character encoding must be dealt
  with in any case.
- we should try NOT to include the myriad of possible encoding
  technologies, at least not promote this for needs other than
  backwards compatibility

To solve the encoding issue, an "encoding" SD-ID has been proposed that
describes the encoding of the MSG part (I do not use precise wording on
which encoding, simply because it is not relevant in this context - read
on...). This SD-ID would by its very nature be optional. I follow
Darren's reminder that truncation can always make SD-IDs (all or part)
disappear. As such, the encoding specification would not be guaranteed
to be received by the final destination. This contradicts the
intention of that SD-ID: its ultimate purpose was to enable the
receiver to use proper decoding for the MSG part.

Of course, this also raises the question if the SD-ID concept is good
enough. For obvious reasons it suffers from the lack of reliability. I
think this in general is acceptable. The only cure would be to bring
reliability and thus full-duplex communication to syslog. This is way
beyond our charter (if you like this, you should probably join NETCONF
and help on NETCONF notifications). We have addressed this concern by
moving all absolutely vital data to the header. If we allow multiple
encodings, the information about the encoding belongs into the header,
so we would have another header field. While this is a solution, I think
it is overengineered for what we actually need.

Let us keep in mind that our ultimate desire is to have as many messages
as possible use Unicode (CCS) and be UTF-8 encoded (CES), with
UTF-8 also being the transfer encoding (Tom: I hope I got it right ;)).
Any other encoding should only be supported for backward compatibility
either at the protocol level (transforming relays) or to leverage
existing APIs (POSIX et al). So we are accepting the fact that other
encodings need to be used, but we do not really like it (at least I
don't).

Assigning a header field for such a somewhat auxiliary feature would put
too much weight on it and may even promote its use.

So I am now back to the proposal with the Unicode BOM. Let's keep in
mind that we either a) know the character set [then we can convert to
Unicode] or b) we do not know it [then we can convey no information
about it, because else we would actually have case a)]. So a simple
indication whether or not MSG contains UTF-8 would be sufficient.

I hereby propose that we RECOMMEND to use UTF-8 in all cases where this
is possible. If UTF-8 is used, the MSG field MUST be prefixed by the
properly-encoded Unicode BOM (a 3-octet overhead). Any other encoding
MAY be used. In this case the MSG field MUST NOT start with the octet
values of the 3-octet UTF-8 encoded Unicode BOM. If necessary, a SP
MUST be inserted before this sequence. Such a recommendation is within
the expectation of a typical Unicode user/developer (at least I strongly
think so).
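
As a rough illustration of the BOM rule above, here is a minimal Python
sketch. The function names build_msg and parse_msg are hypothetical and
not part of any draft; tolerant decoding on the receiver side is also
just an assumption made for the example.

```python
import codecs

# The UTF-8 encoding of the Unicode BOM: the three octets EF BB BF.
UTF8_BOM = codecs.BOM_UTF8


def build_msg(text=None, raw=None):
    """Sender side (hypothetical helper): return the octets of the MSG field.

    Pass `text` when the content is known to be Unicode, or `raw` when the
    character set is unknown and the octets are passed through unchanged.
    """
    if text is not None:
        # Known character set: send UTF-8, prefixed by the BOM.
        return UTF8_BOM + text.encode("utf-8")
    if raw.startswith(UTF8_BOM):
        # Other encoding that happens to start with the BOM octets:
        # insert a SP so the receiver does not mistake it for UTF-8.
        return b" " + raw
    return raw


def parse_msg(msg):
    """Receiver side (hypothetical helper): return (is_utf8, payload)."""
    if msg.startswith(UTF8_BOM):
        # Assumption: malformed sequences are replaced rather than rejected.
        return True, msg[len(UTF8_BOM):].decode("utf-8", errors="replace")
    # No BOM: encoding unknown, hand the octets on untouched.
    return False, msg
```

A receiver written this way needs no SD-ID to decide how to decode MSG,
so truncation of the structured data does not affect it.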

The specification of other encodings, if there is an actual need for it,
should be left for a separate document. That document should specify how
to enhance syslog message content in a way inspired by MIME. I expect
such a document to make use of SD-IDs to accomplish its goal. That would
obviously again be subject to truncation. Here, I find this acceptable,
because

a) any -protocol compliant receiver would still be able to process the
message, at least in a basic way (thanks to the BOM)
b) specific minimum values for the maximum supported message size can be
required of compliant receivers supporting such a specification

That "encoding" document should also address the natural
language/culture information, which I think we should not move into
-protocol.

If we assume the encoding is solved, we still have not decided on NUL,
LF and other US-ASCII control characters. If we look at existing syslog
implementations, most of them use LF control characters as a kind of
framing (End of Record - EOR - markers). Other control characters are
simply escaped. Plain binary data is very seldom seen. NUL causes
confusion to many existing receivers.

We can now ask ourselves: what problem does it cause if a sender sends a
control character (e.g. BEL) and a relay transforms it to an escaped
form (e.g. '^07')? If we follow this route, we see that there is nothing
bad with it per se. It becomes a problem only if a digital signature of
the message is transmitted (in the way syslog-sign intends to do).
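
To make the point concrete, here is a small Python sketch of a relay
that escapes control characters. The escape format ('^' followed by two
hex digits) is only an assumption modelled on the '^07' example, and a
plain SHA-256 hash stands in for whatever signature scheme syslog-sign
ends up using.

```python
import hashlib


def escape_controls(msg: bytes) -> bytes:
    """Hypothetical relay behaviour: replace each octet below 32 by '^XX'."""
    out = bytearray()
    for octet in msg:
        if octet < 32:
            # e.g. BEL (0x07) becomes the three octets '^07'
            out += b"^%02X" % octet
        else:
            out.append(octet)
    return bytes(out)


original = b"disk failure\x07"        # as emitted by the sender
relayed = escape_controls(original)   # as forwarded by the relay

# The text is still perfectly usable for a human or a log parser ...
print(relayed)

# ... but a hash computed by the sender over the original octets no
# longer matches the octets the final receiver sees.
sender_hash = hashlib.sha256(original).hexdigest()
receiver_hash = hashlib.sha256(relayed).hexdigest()
print(sender_hash == receiver_hash)
```

A message without control characters passes through escape_controls
unchanged, so its hash would still verify.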

IMPORTANT FINDING: There is no problem with message transformation
EXCEPT when the messages are digitally signed.

IMPORTANT OBSERVATION: we do not yet have digital signatures in syslog.

CONCLUSION: we do not need to care!

As it looks, we are trying to solve a problem that does not yet even
exist. And this not-yet-existing problem is the only issue that is
causing us real grief here, especially if we look at backwards
compatibility. syslog-sign is still in draft state right now. It is free
to place further restrictions on whatever -protocol specifies. Of
course, it should not do this in an unexpected and unnecessary way. It
can be done quite non-intrusively, at least for the vast majority of
syslog data. Please read on, the simple solution will be below, but I
need to switch the topic back to syslog-protocol.

With all that said, I propose the following for the MSG field in
syslog-protocol (in regard to control characters):

MSG MAY contain any character including octets with values less than 32.
This is the US-ASCII control character range without DEL, which I
generally consider harmless. HOWEVER, it is RECOMMENDED that MSG does
NOT include any characters with octet values less than 32. This applies
to both UTF-8 encoded data as well as other data. If a syslog sender
uses octet values less than 32, it MUST expect that a receiver modifies
the message, which will lead to invalidation of any existing
digital signatures. If message transformation is not acceptable to the
sender, it MUST escape octet values less than 32 before sending the
message. All other Unicode control character sequences are not
message. All other Unicode control character sequences are not
considered extremely problematic, but are best avoided if no message
transformation is required. LF and NUL have no special meaning per se.
Most importantly, they do NOT indicate the end of the MSG field.
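
A sender that wants to follow the recommendation above could check its
MSG octets before sending. The following Python sketch is only an
illustration; the function name control_octet_offsets is hypothetical.
It works the same way for UTF-8 and for other data, because in UTF-8 all
octets of a multi-octet sequence have values of 128 or above.

```python
def control_octet_offsets(msg: bytes):
    """Return the offsets of all octets with values less than 32.

    DEL (127) is deliberately not reported, matching the range
    described in the proposal.
    """
    return [i for i, octet in enumerate(msg) if octet < 32]


# Non-ASCII text encoded as UTF-8 contains no octets below 32.
print(control_octet_offsets("Gr\u00fc\u00dfe".encode("utf-8")))

# LF and NUL are reported; whether to escape them is the sender's call.
print(control_octet_offsets(b"line1\nline2\x00"))
```

An empty result means a relay has no reason to touch the message on
account of control characters.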

I think this proposal

a) provides an easy way to properly encode all currently-existing syslog
MSG content
b) provides guidelines for new implementations
c) cautions against control character usage
d) levels the ground for syslog-sign

While allowing everything, it tells the implementor what is bad.
Syslog-sign could then use the hint provided here and restrict
to-be-signed messages not to include the US-ASCII control character
range without any transfer encoding (like base64).

I think this proposal provides a backwards-compatible and yet extensible
approach to useful MSG content formatting.

Please let me know any objections you might have and, if so, please
precisely describe the problem you are seeing. Examples, external
references, and/or lab test results would be appreciated in those cases.

Many thanks,
Rainer

Tom Petch's Digression on "character encoding" terminology:
####
Character Set is a set of characters (letters, numbers, symbols, glyphs
...).
Coded Character Set [CCS] gives each a (numeric) code, as in ISO 10646.
Character Encoding (Scheme/Syntax) [CES] specifies how the codes become
octets, as in UTF-8.
Transfer Encoding/Syntax specifies how the octets are put on the wire,
as in Base64.

MIME conflates CCS and CES to charset but keeps (Content) Transfer
Encoding distinct; they can be different in different parts of an e-mail.
####
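
The three layers in the digression can be seen side by side in a few
lines of Python. This is only an illustration using the EURO SIGN as an
arbitrary example character; it is not part of Tom's text.

```python
import base64

char = "\u20ac"                   # one character: EURO SIGN

code_point = ord(char)            # CCS: the numeric code, U+20AC
octets = char.encode("utf-8")     # CES: how the code becomes octets
wire = base64.b64encode(octets)   # transfer encoding: octets on the wire

print(hex(code_point))  # 0x20ac
print(octets)           # b'\xe2\x82\xac'
print(wire)             # b'4oKs'
```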
