Hi WG,

the topic of MSG encoding as well as its content (e.g. NUL and LF characters) has not yet been solved. Over the past days, I've talked to a lot of my friends not on this list and I have also looked at various ways to solve the issue. Be prepared, this is another long mail, but I think it is appropriate as this is our top issue left open. It is complex and it requires a good amount of thinking, theory and arguments. In this mail, I am trying to convey a proposal and the facts it builds on.

Let's first quickly review what has been discussed on list:

- current implementations sometimes use LF as a record delimiter
- some implementations use LF inside the MSG part
- some implementations include binary data in syslog messages and would like to continue to do so (but these seem to be few)
- there are at least some use cases where a syslogd can not definitely detect the character encoding of a message (some of that might be related to the POSIX API, but there may be a work-around [I had no time yet to evaluate this in-depth]). It gets problematic if a message from a legacy sender is received (no encoding information) and transformed into a syslog-protocol message [I assume this is a valid use-case]
- previous discussion showed the need for Unicode. With Unicode, the term "printable character" basically becomes useless, because there are so many non-printable characters in Unicode and new ones are potentially added constantly.
- previous consensus thus was that any valid UTF-8 string MUST be supported inside MSG (including NUL and LF)
- current discussion has shown that backwards-compatibility is not absolutely vital (but still desirable)
- it was suggested that an "encoding SD-ID" be defined which carries the character set definition
- as a side-note, Tom Petch has provided a very good digression on "character encoding" terminology which I have reproduced after my signature. I guess most people on this list already know the exact differences, but I still find it useful...

It is somewhat hard to find a good compromise.
A compromise, in my view, must allow the following:

- transforming existing messages into -protocol format should not intentionally be forbidden; transformation is a very important "feature" when it comes to deploying new technology
- new receivers should be able to understand the message content precisely "enough"
- I also find it advisable that newer receivers be capable of processing both old-style and new-style messages concurrently. While this is an implementation issue, it might be a hint for us that some subtleties in character encoding must be dealt with in any case.
- we should try NOT to include the myriad of possible encoding technologies, or at least not promote them for needs other than backwards compatibility

To solve the encoding issue, an "encoding" SD-ID has been proposed that describes the encoding of the MSG part (I do not use precise wording on which encoding, simply because it is not relevant in this context - read on...). This SD-ID would by its very nature be optional. I follow Darren's reminder that truncation can always make SD-IDs (all or part) disappear. As such, the encoding specification would not be guaranteed to be received by the final destination. This contradicts the intention of that SD-ID: its ultimate purpose was to enable the receiver to use proper decoding for the MSG part.

Of course, this also raises the question of whether the SD-ID concept is good enough. For obvious reasons it suffers from a lack of reliability. I think this is acceptable in general. The only cure would be to bring reliability and thus full-duplex communication to syslog. This is way beyond our charter (if you like this, you should probably join NETCONF and help on NETCONF notifications). We have addressed this concern by moving all absolutely vital data to the header. If we allow multiple encodings, the information about the encoding belongs in the header, so we would have another header field.
While this is a solution, I think it is overengineered for what we actually need. Let us keep in mind that our ultimate desire is to have as many messages as possible use Unicode (CCS) and be UTF-8 encoded (CES), with UTF-8 also being the transfer encoding (Tom: I hope I got it right ;)). Any other encoding should only be supported for backward compatibility, either at the protocol level (transforming relays) or to leverage existing APIs (POSIX et al). So we are accepting the fact that other encodings need to be used, but we do not really like it (at least I don't). Assigning a header field for such a somewhat auxiliary feature would put too much weight on it and may even promote its use.

So I am now back to the proposal with the Unicode BOM. Let's keep in mind that we either

a) know the character set [then we can convert to Unicode] or
b) do not know it [then we can convey no information about it, because else we would actually have case a)].

So a simple indication whether or not MSG contains UTF-8 would be sufficient.

I hereby propose that we RECOMMEND to use UTF-8 in all cases where this is possible. If UTF-8 is used, the MSG field MUST be prefixed by the properly-encoded Unicode BOM (a 3-octet overhead). Any other encoding MAY be used. In this case the MSG field MUST NOT start with the octet values of the 3-octet UTF-8 encoded Unicode BOM. If necessary, a SP MUST be inserted before this sequence. Such recommendations are within the expectations of a typical Unicode user/developer (at least I strongly think so).

The specification of other encodings, if there is an actual need for it, should be left for a separate document. That document should specify how to enhance syslog message content in a way inspired by MIME. I expect such a document to make use of SD-IDs to accomplish its goal. That would obviously again be subject to truncation.
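
The BOM rule proposed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not normative text: the function names are hypothetical, and decoding a damaged UTF-8 payload with replacement characters is a choice made for the sketch, not something the proposal specifies.

```python
import codecs

# The UTF-8 encoded Unicode BOM: the three octets EF BB BF.
BOM = codecs.BOM_UTF8


def encode_msg_utf8(text: str) -> bytes:
    """Sender side, UTF-8 case: MSG MUST be prefixed by the BOM."""
    return BOM + text.encode("utf-8")


def encode_msg_other(raw: bytes) -> bytes:
    """Sender side, any other (or unknown) encoding.

    MSG MUST NOT start with the BOM octets; if the raw data happens
    to start with them, a SP is inserted in front.
    """
    if raw.startswith(BOM):
        return b" " + raw
    return raw


def decode_msg(msg: bytes):
    """Receiver side: return (is_utf8, payload).

    With a leading BOM the payload is decoded as UTF-8 text.
    Assumption of this sketch: invalid or truncated sequences are
    replaced with U+FFFD instead of rejecting the message.
    Without a BOM the encoding is unknown, so the octets are
    returned untouched.
    """
    if msg.startswith(BOM):
        return True, msg[len(BOM):].decode("utf-8", errors="replace")
    return False, msg
```

A receiver that only knows this rule can still tell "definitely UTF-8" from "unknown encoding" by looking at the first three octets of MSG, which is all the information cases a) and b) can carry.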
I find that acceptable here, because

a) any -protocol compliant receiver would still be able to process the message, at least in a basic way (thanks to the BOM)
b) specific maximum minimum size restrictions can be placed on compliant receivers supporting such a specification

That "encoding" document should also address the natural language/culture information, which I think we should not move into -protocol.

If we assume the encoding is solved, we still have not decided on NUL, LF and other US-ASCII control characters. If we look at existing syslog implementations, most of them use LF control characters as a kind of framing (End of Record - EOR - markers). Other control characters are simply escaped. Plain binary data is very seldom seen. NUL causes confusion to many existing receivers.

We can now ask ourselves: what problem does it cause if a sender sends a control character (e.g. BEL) and a relay transforms it to an escaped form (e.g. '^07')? If we follow this route, we see that there is nothing bad with it per se. It becomes a problem only if a digital signature of the message is transmitted (in the way syslog-sign intends to do).

IMPORTANT FINDING: There is no problem with message transformation EXCEPT when the messages are digitally signed.

IMPORTANT OBSERVATION: we do not yet have digital signatures in syslog.

CONCLUSION: we do not need to care!

As it looks, we are trying to solve a problem that does not yet even exist. And this not-yet-existing problem is the only issue that is causing us real grief here, especially if we look at backwards compatibility. syslog-sign is still in draft state right now. It is free to place further restrictions on whatever -protocol specifies. Of course, it should not do this in an unexpected and unnecessary way. It can be done quite non-intrusively, at least for the vast majority of syslog data. Please read on, the simple solution will be below, but I need to switch the topic back to syslog-protocol.
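
To make the finding above concrete, here is a small Python sketch of a relay that escapes control characters, and of why that only hurts once signatures enter the picture. Two assumptions are mine: the escape format is a caret followed by two hex digits (the text only gives '^07' for BEL, and does not say whether the digits are hex or decimal), and a plain SHA-256 digest stands in for whatever syslog-sign will actually compute. The function name is hypothetical.

```python
import hashlib


def escape_controls(msg: bytes) -> bytes:
    """Replace every octet below 32 with '^' plus two hex digits.

    Assumed escape format; only the BEL case ('^07') is taken from
    the discussion. Octets of 32 and above are copied unchanged.
    """
    out = bytearray()
    for octet in msg:
        if octet < 32:
            out += b"^%02X" % octet
        else:
            out.append(octet)
    return bytes(out)


# A sender emits a message that ends in BEL; a relay escapes it.
sent = b"disk full\x07"
relayed = escape_controls(sent)

# The text is still perfectly readable after the transformation ...
readable = relayed == b"disk full^07"

# ... but a digest computed by the sender over the original octets
# no longer matches the octets the final receiver sees.
digest_matches = (
    hashlib.sha256(sent).digest() == hashlib.sha256(relayed).digest()
)
```

Here `readable` is true and `digest_matches` is false: the transformation is harmless for a human or a parser reading the message, and fatal only for a signature taken over the original octets. Note that this sketch does not escape the caret itself, so it is not reversible as written.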
With all that said, I propose the following for the MSG field in syslog-protocol (in regard to control characters):

MSG MAY contain any character including octets with values less than 32. This is the US-ASCII control character range without DEL, which I generally consider harmless. HOWEVER, it is RECOMMENDED that MSG does NOT include any characters with octet values less than 32. This applies to both UTF-8 encoded data as well as other data. If a syslog sender uses octet values less than 32, it MUST expect that a receiver modifies the message, which will invalidate any digital signatures that may exist. If message transformation is not acceptable to the sender, it MUST escape octet values less than 32 before sending the message. All other Unicode control character sequences are not considered extremely problematic, but are best avoided if the message must not be transformed.

LF and NUL have no special meaning per se. Most importantly, they do NOT indicate the end of the MSG field.

I think this proposal

a) provides an easy way to properly encode all currently-existing syslog MSG content
b) provides a guideline for new implementations
c) cautions against control character usage
d) levels the ground for syslog-sign

While allowing everything, it tells the implementor what is bad. Syslog-sign could then use the hint provided here and restrict to-be-signed messages not to include the US-ASCII control character range without any transfer encoding (like base64).

I think this proposal provides a backwards-compatible and yet extensible way to useful MSG content formatting. Please let me know any objections you might have and, if so, please precisely describe the problem you are seeing. Examples, external references, and/or lab test results would be appreciated in those cases.

Many thanks,
Rainer

Tom Petch's Digression on "character encoding" terminology:

####
Character Set is a set of characters (letters, numbers, symbols, glyphs ...)
Coded Character Set [CCS] gives each a (numeric) code, as in ISO 10646.
Character Encoding (Scheme/Syntax) [CES] specifies how the codes become octets as in UTF-8.
Transfer Encoding/Syntax specifies how the octets are put on the wire, as in Base64.
MIME conflates CCS and CES to charset but keeps (Content) Transfer Encoding distinct; they can be different in different parts of an e-mail.
####
