Re: syslog Message Character Set

Alfonso De Gregorio Fri, 13 Oct 2000 22:24:07 -0700
On Fri, 13 Oct 2000, Chris Lonvick wrote:

Hi Everyone,

I apologize for the delay in replying; I have been rather busy.
 
> PROPOSED for ver01:
>    The syslog message has traditionally contained ASCII alphanumerics 
>    and symbols.  The code set most often used has been seven-bit 
>    ASCII in an eight-bit field.  These are the ASCII codes as defined 
>    in "USA Standard Code for Information Interchange" [2] using codes 
>    32 through 126.  No indication of the code set used within the 
>    message is required, nor is it expected.  Other codes and code 
>    sets MAY be used.  The selection of a code set and codes used in
>    a message should be made with thoughts of the intended receiver.
>    A message containing characters in a code set that cannot be
>    viewed by a receiver will yield no information of value to an
>    operator or administrator looking at it.
> 
> Please look this over and let me know if this will cover our discussion
> appropriately.  An alternative would be to have a RECOMMENDED code set
> and list of codes.

The draft assumes a fixed number of octets per character.
Unicode, for example, has three encoding forms, UTF-8, UTF-16, and
UTF-32, whose code units are 8, 16, and 32 bits long respectively.

Ambiguities can arise if a client starts logging in an encoding whose
code units are not eight bits long while the log message analyzer reads
the messages assuming eight-bit characters (e.g. a login name 8 bytes
long can be read as eight UTF-8, four UTF-16, or two UTF-32 code units).
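
A minimal sketch of this ambiguity in Python. The 8-byte payload is a
made-up example, chosen only because it happens to be valid in all three
encodings (little-endian byte order is assumed for UTF-16 and UTF-32):

```python
# Hypothetical 8-byte field: the string "ab" encoded as UTF-32 (little-endian).
payload = "ab".encode("utf-32-le")
assert len(payload) == 8

# The same octets yield a different number of characters depending on
# which encoding the analyzer assumes.
as_utf8 = payload.decode("utf-8")       # every octet is one character
as_utf16 = payload.decode("utf-16-le")  # every two octets are one character
as_utf32 = payload.decode("utf-32-le")  # every four octets are one character

counts = {
    "utf-8": len(as_utf8),
    "utf-16": len(as_utf16),
    "utf-32": len(as_utf32),
}
print(counts)
```

Only the UTF-32 reading recovers the string the sender meant; the other
two decode without error but contain embedded NUL characters.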

The selection of a code set should be made with thoughts of the intended
receiver, but after an unforeseen event an administrator may switch to
a different log service, and this can lead to some confusion about the
character set in use.
Obviously, this does not represent an inherent protocol vulnerability, but
it is a potential source of ambiguity.

If we decide that an indication of the code set used is not required,
administrators should take some care when logging to a centralized
log service. Each log message should, in fact, be parsed with respect
to the client that generated it.
Different senders can use heterogeneous code sets, and the receiver must
not confuse them.
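
One way a central collector could do this, sketched in Python. The
table, the addresses, the default, and the function name are all
hypothetical; nothing in the draft defines such a mapping, and it would
have to be configured by the administrator:

```python
# Hypothetical per-sender configuration, maintained by the administrator.
SENDER_ENCODINGS = {
    "192.0.2.10": "utf-8",
    "192.0.2.20": "utf-16-le",
}

# Assumed fallback for senders that are not listed.
DEFAULT_ENCODING = "ascii"


def decode_message(sender: str, payload: bytes) -> str:
    """Decode a message using the code set configured for its sender."""
    encoding = SENDER_ENCODINGS.get(sender, DEFAULT_ENCODING)
    # Undecodable octets become U+FFFD instead of raising an error.
    return payload.decode(encoding, errors="replace")


print(decode_message("192.0.2.20", "root".encode("utf-16-le")))
```

The same octets decoded with the wrong entry would give a different
string, which is why the lookup is keyed on the sender and not guessed
from the message content.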

- The U+FFFD replacement character -
Obviously, when dealing with UTF-8, the log message analyzer should have a
'safe UTF-8 decoder' that rejects overlong UTF-8 sequences.
This is a perfectly solvable issue. However, a safe UTF-8 decoder will
replace each overlong sequence, i.e. one for which a shorter encoding
exists, with a replacement character (the default is U+FFFD).

The log message analyzer should be able to tell whether a U+FFFD character
was generated by the decoder or sent by the sender.
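
A rough sketch in Python of how an analyzer might keep track of this.
The function name is hypothetical. It relies on Python's strict UTF-8
codec, which rejects overlong sequences, and it inserts one U+FFFD per
byte range the codec reports as invalid, which may be more than one per
malformed sequence:

```python
def decode_tracking(payload: bytes) -> tuple[str, list[int]]:
    """Decode UTF-8 and record where the decoder itself inserted U+FFFD.

    Returns the decoded text and the indexes (into that text) of the
    replacement characters generated by the decoder. Any other U+FFFD
    in the text was sent by the sender.
    """
    parts: list[str] = []
    generated: list[int] = []
    pos = 0      # current offset into payload
    length = 0   # number of characters emitted so far
    while pos < len(payload):
        try:
            parts.append(payload[pos:].decode("utf-8"))
            break
        except UnicodeDecodeError as err:
            # err.start and err.end are relative to payload[pos:].
            good = payload[pos:pos + err.start].decode("utf-8")
            parts.append(good)
            length += len(good)
            generated.append(length)  # index of the inserted U+FFFD
            parts.append("\ufffd")
            length += 1
            pos += err.end
    return "".join(parts), generated


# The sender really sent U+FFFD, followed later by an overlong '/' (C0 AF).
payload = "a\ufffdb".encode("utf-8") + b"\xc0\xaf" + b"c"
text, generated = decode_tracking(payload)
print(repr(text), generated)
```

In this example the character at index 1 comes from the sender, while
the indexes listed in `generated` were produced by the decoder.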

ciao
alfonso

--
Alfonso De Gregorio,            [EMAIL PROTECTED]