Tom, I agree there are some issues with truncation, but I think they are inherent. We have specified that a message is to be truncated at its end. In the text I proposed, I wanted to make sure that the truncated message still ends with a technically complete UTF-8 sequence. Based on Anton's comment, I have to admit I am unsure whether there is really a benefit in this. Anyhow, even if there is, I think we should not try to preserve the proper meaning. If the message is truncated, its end is in doubt. This also means that a few characters at the end might be interpreted wrongly because of truncated control characters. I think we should document this and live with it (but it was important to bring the issue up so that it can be documented).
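To make the point concrete, here is a minimal sketch of truncation that keeps the last UTF-8 sequence complete. It is not text from the draft: the function name truncate_utf8 and the max_len parameter are made up for illustration, and the input is assumed to be valid UTF-8. The example also shows the effect Tom describes, where a cut on a clean UTF-8 boundary still drops a combining mark.

```python
import unicodedata


def truncate_utf8(data: bytes, max_len: int) -> bytes:
    """Cut data to at most max_len octets without splitting a UTF-8 sequence.

    Hypothetical helper for illustration; assumes data is valid UTF-8
    and max_len is not negative.
    """
    if len(data) <= max_len:
        return data
    end = max_len
    # Continuation octets have the form 10xxxxxx. If the first octet that
    # would be dropped is a continuation octet, the cut would split a
    # sequence, so back up to the lead octet of that sequence.
    while end > 0 and (data[end] & 0xC0) == 0x80:
        end -= 1
    return data[:end]


# "e" followed by U+0301 COMBINING ACUTE ACCENT: one visible character,
# two code points, three octets (0x65, 0xCC, 0x81).
message = "cafe\u0301 ok".encode("utf-8")

# A limit of 5 octets would fall inside the accent's two-octet sequence,
# so the cut backs up to 4 octets. The result is valid UTF-8, but the
# accent is gone and the last character now reads differently.
shortened = truncate_utf8(message, 5).decode("utf-8")

# Nonzero combining class: U+0301 modifies the preceding base character.
accent_is_combining = unicodedata.combining("\u0301") != 0
```

The octet-level check is cheap because it only looks at the top two bits of each octet. Deciding whether a trailing combining mark belongs to the preceding character would need Unicode character data, which is more than a truncating relay can reasonably be expected to carry.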
Any comments?

Thanks,
Rainer

> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Tom Petch
> Sent: Tuesday, January 17, 2006 2:40 PM
> To: Darren Reed
> Cc: [EMAIL PROTECTED]
> Subject: Re: [Syslog] Sec 6.1: Truncation
>
> ----- Original Message -----
> From: "Darren Reed" <[EMAIL PROTECTED]>
> To: "Tom Petch" <[EMAIL PROTECTED]>
> Cc: <[EMAIL PROTECTED]>
> Sent: Monday, January 16, 2006 10:51 PM
> Subject: Re: [Syslog] Sec 6.1: Truncation
>
> > > Truncation of UTF-8 is actually slightly worse than has been described.
> > >
> > > It is possible to determine from the UTF-8 octets where one coded
> > > character ends and another begins. But because Unicode contains
> > > combining characters, with no limit on how many of these there can
> > > be, and these modify the meaning of previous or later coded characters,
> > > it is not possible to determine where one 'symbol' ends. So truncation
> > > at a UTF-8 boundary could subtly change the meaning of a message,
> > > even breach security. Not something we can guard against
> > > but should mention.
> >
> > The above seems a little confused to me. How can there be a problem
> > if a message is truncated on the boundary of a complex character?
> >
> > Darren
>
> I lack the precise terminology. Unicode includes base characters and
> modifying characters, such as diacritic marks, as well as characters
> that combine the two. Where the combination exists as a single code
> point, no problem. Where it does not, then what the user would see as
> a single character is actually sent as several code points, each
> separately encoded in UTF-8. It is fairly easy for a truncating relay
> to work out the boundary of the UTF-8 and so ensure that a complete
> UTF-8 encoding is truncated (or not). It is much harder, probably
> impossible, to work out where any modifying characters belong, whether
> they should be removed or left in. And the character 'o' with a
> diacritic mark is not the same as that character without that
> diacritic mark, so removing trailing modifying characters changes the
> meaning, which could be a security exposure.
>
> Tom Petch
>
> _______________________________________________
> Syslog mailing list
> [email protected]
> https://www1.ietf.org/mailman/listinfo/syslog
