Re: [Syslog] #5 - character encoding (was: Consensus?)

2005-12-01 Thread Tom Petch
Rainer

I think I detect an approach I do not agree with, in this and perhaps other
issues.

You seem to be saying that the (eg POSIX) syslogd must emit perfect syslog
messages and is responsible for anything that is wrong with them no matter what
it received from the application (I exaggerate slightly).

I would say that if the application passes incomprehensible garbage, something
criminal or illegal, then it is the application that is at fault; syslogd can
only be held responsible if it produces messages that are invalid for the parts
over which it has control, eg header syntax.

So if syslogd has no idea what the transfer encoding is because the rest of the
system does not tell it, then syslogd cannot be held responsible for the absence
of a field saying what the transfer encoding actually is.  Or put differently,
if our RFCs specify what the application, as well as syslogd, MUST or SHOULD do,
then that is ok with me.

What syslogd would be responsible for, IMO, would be allowing characters that
have a special meaning in the syntax (eg NUL is end of message) to appear
without being escaped (or otherwise encoded).  Whether we have such problems
depends on the resolution of other issues; I am not saying that we have them at
present.
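
A minimal sketch of the kind of escaping meant here, in Python. It assumes that
control octets (NUL included) are the ones with special meaning, and it uses a
hypothetical \xNN notation; neither choice comes from this thread or from any
draft, and the function name is made up for illustration.

```python
def escape_special_octets(msg: bytes) -> bytes:
    """Escape control octets so they cannot be mistaken for message syntax.

    Assumption: octets below 0x20 and the octet 0x7F count as special.
    The backslash (0x5C) is escaped as well so the result stays reversible.
    """
    out = bytearray()
    for octet in msg:
        if octet < 0x20 or octet in (0x5C, 0x7F):
            # Hypothetical notation: backslash, 'x', two hex digits.
            out += b"\\x%02x" % octet
        else:
            # Everything else, including octets above 127, passes through.
            out.append(octet)
    return bytes(out)
```

Octets above 127 are deliberately left alone: syslogd controls the syntax, but
it may not know the encoding of what the application handed it.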

Tom Petch

- Original Message -
From: Rainer Gerhards [EMAIL PROTECTED]
To: Chris Lonvick [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED]
Sent: Wednesday, November 30, 2005 2:48 PM
Subject: RE: [Syslog] #5 - character encoding (was: Consensus?)


Chris,

I fully agree - thanks ;)

Rainer

 -Original Message-
 From: Chris Lonvick [mailto:[EMAIL PROTECTED]
 Sent: Wednesday, November 30, 2005 2:39 PM
 To: Rainer Gerhards
 Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED]
 Subject: RE: [Syslog] #5 - character encoding (was: Consensus?)

 Hi Rainer,

 I believe that we are saying the same thing.  :)

 If there is no indicator of encoding or language then a
 receiver will not
 know what it is receiving - just like receivers don't know
 what they are
 receiving today.  They MAY make an assumption that it is something in
 US-ASCII (but may be disappointed).

 If there is an indicator of the encoding and language then
 the receiver
 will know exactly what it is.  Having an indicator should be
 RECOMMENDED
 but not REQUIRED for ease of migration.

 Is that what we're all saying?

 Thanks,
 Chris



 On Wed, 30 Nov 2005, Rainer Gerhards wrote:

  Chris,
 
  Let's use this email as an example.  :)  There is no
  indication that I'm
  using US-ASCII encoding or that I'm writing in English.
 
  I think there actually is. If I am right, the SMTP RFCs
 require mail text to be US-ASCII. Only via MIME and/or escape
 characters you can include 8-bit data. For example Müller and
 Möller might create some problems in some mailers (But I
 guess my Mail system will encode them with =hexval).
 Dropping messages with octets > 127 in the subject is a
 common spam protection setting...
 
  However, you're
  able to receive this and read it.  Similarly, you could write
  an email in
  German and send it to me.  I would still be able to receive
  it but I'd
  have a difficult time parsing the meaning.
 
  I'm suggesting that same approach for the transmission of
 the syslog
  content.  If I really wanted you to know what encoding and
  language I'm
  using in an email, I would specify a mime header.  syslog
  senders will
  continue to pump out whatever encoding and language they've
  been using
  and receivers will continue to do their best to parse them.
  If a vendor
  wants to get very specific about that, then they will have to
  use an SD-ID
  to identify the contents of the message.
 
  Here I agree with you. What I was saying is that IF the
 header says it is US-ASCII, only then we should assume it
 actually is. If there is no enc SD-ID, then we do not know
 what it is but can assume ... whatever we assume. Let me
 phrase it that way:
 
  If the message contains
 
  [enc=us-ascii lang=en]
 
  then the receiver can honestly expect it to be US-ASCII.
 But if it does not contain any enc, the receiver does not
 know exactly and may assume anything it finds useful (may be
 ASCII, may not).
 
  Does this clarify? I somehow have the impression we mean
 the same thing and I simply do not manage to convey what I
 intend to ;)
 
  Rainer
 
 
  Mit Aufrichtigkeit,
  Chris
 
 
 
 
  On Wed, 30 Nov 2005, Rainer Gerhards wrote:
 
  Andrew,
 
  Hi Rainer,
 
  Why don't we look at it from the other direction?  We could
  state that any
  encoding is acceptable - for ease-of-use/migration with
  existing syslog
  implementations.  It is RECOMMENDED that UTF-8 be used.
  When it is
  used, an SD-ID element will be REQUIRED.  e.g. -
  [enc=utf-8 lang=en]
 
  I like that idea too.
 
  So, if no SD-ID encoding element is specified, then we must
  assume US-ASCII
  and deal with it accordingly??
 
  I think not. If it is not present, we know that we do not
  know it. If
  it is US-ASCII, I would expect something like
 
  [enc

RE: [Syslog] #5 - character encoding (was: Consensus?)

2005-11-30 Thread Rainer Gerhards
Sheran, 

 Also want to clarify that you suggest that if the message is in ASCII,
 it will not require an SD-ID, but for all other encodings, an SD-ID will be
 required.

Unfortunately, we cannot do this. If we knew the encoding, we
could translate it to UTF-8, as syslog-protocol has required so far.
However, we often do not know which encoding it is. The reason is that
the POSIX syslog API does not tell us. So if we want to support POSIX
(which I think we must), we must allow a syslog sender to send messages
without telling the encoding - simply because it has no way to obtain
that knowledge.

A syslog sender embedded e.g. in a device probably does not have this
restriction. So it SHOULD encode in UTF-8. That will ensure the receiver
can understand it. If the sender has absolutely no idea of how to do
that, but knows the encoding, then (and only then) it SHOULD specify the
encoding.
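
The three cases above can be sketched roughly as follows in Python. The
function name and the returned "enc" label are hypothetical, and labeling
converted output as utf-8 is an assumption made for the sketch; the thread has
not settled whether or how UTF-8 itself gets labeled.

```python
from typing import Optional, Tuple


def prepare_msg(payload: bytes,
                source_encoding: Optional[str]) -> Tuple[bytes, Optional[str]]:
    """Return (octets to send, enc label or None if nothing can be claimed)."""
    if source_encoding is None:
        # POSIX-style case: nobody told us the encoding, so pass the octets
        # through and make no claim about them.
        return payload, None
    try:
        text = payload.decode(source_encoding)
    except LookupError:
        # We know the name of the encoding but cannot convert from it:
        # send the octets unchanged and label them.
        return payload, source_encoding
    except UnicodeDecodeError:
        # The claimed encoding does not fit the octets; treat it as unknown.
        return payload, None
    # Encoding known and convertible: send UTF-8.
    return text.encode("utf-8"), "utf-8"
```

For example, prepare_msg(b"M\xfcller", "latin-1") converts to UTF-8, while
prepare_msg(b"M\xfcller", None) returns the octets untouched with no label.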

Rainer

 
 Note most other encoding methods already imply the language used, for
 example, in Chinese, there are several encoding methods, Traditional
 Chinese used in Taiwan and Hong Kong is Big5, and simplified Chinese
 used in Mainland China is GBK, so if the message is in traditional
 Chinese char, it will be shown as [enc=Big5, lang=Traditional
 Chinese], a little bit redundant. The Big5 also includes all English
 char so it can be a mix of Chinese and English.  
 
 
 
 Regards,
  
 Sheran
 
 -Original Message-
 From: [EMAIL PROTECTED]
 [mailto:[EMAIL PROTECTED] On Behalf Of Chris Lonvick
 (clonvick)
 Sent: Tuesday, November 29, 2005 10:22 AM
 To: Rainer Gerhards
 Cc: [EMAIL PROTECTED]
 Subject: RE: [Syslog] #5 - character encoding (was: Consensus?)
 
 Hi Rainer,
 
 Why don't we look at it from the other direction?  We could state that
 any encoding is acceptable - for ease-of-use/migration with existing
 syslog implementations.  It is RECOMMENDED that UTF-8 be 
 used.  When it
 is used, an SD-ID element will be REQUIRED.  e.g. - [enc=utf-8
 lang=en]
 
 Thoughts?
 
 All:  Let's discuss this and close this issue.
 
 Thanks,
 Chris
 
 On Tue, 29 Nov 2005, Rainer Gerhards wrote:
 
  Chris & WG,
 
  #5 Character encoding in MSG: due to my proof-of-concept
implementation, I have raised the (ugly) question if we need
to allow encodings other than UTF-8. Please note that this
question arises from needs introduced by e.g. POSIX. So we
can't easily argue them away by wishful thinking ;)
 
  Not even discussed yet.
 
  I haven't reviewed that yet.  However, I'll note that allowing 
  different encoding can be accomplished in the future as long as we 
  establish a default encoding and a way to identify it in 
 our current 
  work.
 
  I have read a little in the mailing archive. Please note 
 that in 2000 
  it was consensus that the MSG part may contain encodings other than 
  US-ASCII. Follow this thread:
 
  http://www.syslog.cc/ietf/autoarc/msg00127.html
 
  This discussion led to RFC 3164 saying other encodings 
 MAY be used.
  While this was observed behaviour, we need still to be 
 aware that the 
  POSIX (and glibc) API places the restrictions on us that we 
 simply do 
  not know the character encoding used by the application. As 
 such, no 
  *nix syslogd can be programmed to be compliant to 
 syslog-protocol if 
  we demand UTF-8 exclusively.
 
  I propose that we RECOMMEND UTF-8 that MUST start with the Unicode 
  Byte Order Mark (BOM) if used. If the MSG part does not 
 start with the
 
  BOM, it may be any encoding just as in RFC 3164. I do not see any 
  alternative to this.
 
  Rainer
 
  ___
  Syslog mailing list
  Syslog@lists.ietf.org
  https://www1.ietf.org/mailman/listinfo/syslog
 
 


RE: [Syslog] #5 - character encoding (was: Consensus?)

2005-11-30 Thread Andrew Ross

Hi Rainer,

Why don't we look at it from the other direction?  We could state that any 
encoding is acceptable - for ease-of-use/migration with existing syslog 
implementations.  It is RECOMMENDED that UTF-8 be used.  When it is 
used, an SD-ID element will be REQUIRED.  e.g. - [enc=utf-8 lang=en]

I like that idea too.

So, if no SD-ID encoding element is specified, then we must assume US-ASCII
and deal with it accordingly??

Cheers

Andrew






RE: [Syslog] #5 - character encoding (was: Consensus?)

2005-11-30 Thread Chris Lonvick

Hi Sheran,

On Tue, 29 Nov 2005, Shyyunn Lin (sheranl) wrote:


Chris:

I think having SD-ID with [enc=utf-8 lang=English] may be a good
approach. If different languages use utf-8 encoding, then lang= can
distinguish it.


We _should_ be using language codes from RFC 3066.  That specifies ISO 639 
language tags.  639-1 has 2 character codes (en is English) and 639-2 
has 3 characters (eng is English).  RFC 3066 will likely be replaced by 
the works of the Language Tag Registry Update (ltru) Working Group.

  http://www.ietf.org/html.charters/ltru-charter.html
They have IDs in the works.  Until those become RFCs we should continue to 
reference RFC 3066.
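
As a rough illustration, the tag syntax of RFC 3066 (a primary subtag of 1 to 8
letters, followed by optional hyphen-separated subtags of 1 to 8 letters or
digits) can be checked with a short Python sketch. The function name is
hypothetical, and the check is purely syntactic: it does not consult ISO 639 or
any registry.

```python
import re

# RFC 3066 shape: 1-8 letters, then zero or more "-" + 1-8 letters/digits.
_RFC3066_TAG = re.compile(r"[A-Za-z]{1,8}(-[A-Za-z0-9]{1,8})*")


def looks_like_rfc3066_tag(tag: str) -> bool:
    """True if the string has the shape of an RFC 3066 language tag."""
    return _RFC3066_TAG.fullmatch(tag) is not None
```

Both "en" and "eng" pass, and "Traditional Chinese" fails because of the space.
Note that "English" also passes the syntax check even though it is not an
ISO 639 code, so a receiver that cares would still need a code list.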




Also want to clarify that you suggest that if the message is in ASCII,
it will not require an SD-ID, but for all other encodings, an SD-ID will be
required.


Yes - that's my suggestion.



Note most other encoding methods already imply the language used, for
example, in Chinese, there are several encoding methods, Traditional
Chinese used in Taiwan and Hong Kong is Big5, and simplified Chinese
used in Mainland China is GBK, so if the message is in traditional
Chinese char, it will be shown as [enc=Big5, lang=Traditional
Chinese], a little bit redundant. The Big5 also includes all English
char so it can be a mix of Chinese and English.


Good point.  As far as I can tell, Big5 is not recognized by any 
accredited standards developing organization.  It is recognized by the 
Ideographic Rapporteur Group (IRG) which reports to the Unicode 
consortium.  The recognized way to represent Chinese characters, 
traditional and simplified, is through ISO 639-2 with the subcodes to 
indicate traditional and simplified for the zh _language_.  The ID on 
Tags for Identifying Languages


  http://www.ietf.org/internet-drafts/draft-ietf-ltru-registry-14.txt

identifies simplified Chinese as zh-Hans and traditional Chinese as 
zh-Hant.  Additional subtags could identify a locale such as 
zh-Hant-TW for Taiwan Chinese in traditional script.  This is from the 
Initial Language Subtag Registry ID.


http://www.ietf.org/internet-drafts/draft-ietf-ltru-initial-06.txt

I think that we should specify encoding and language tags as 
straightforward as possible and let others augment syslog-protocol (in the 
future) with other encoding mechanisms.  We can RECOMMEND that encoding be 
in UTF-8 and language tags come from RFC 3066.  We can allow that other 
encoding and language identifications are acceptable.  In the worst case, 
a vendor will have the option of [EMAIL PROTECTED]something [EMAIL PROTECTED]piglatin].


Does this work for you?

Thanks,
Chris





Regards,

Sheran

-Original Message-
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of Chris Lonvick
(clonvick)
Sent: Tuesday, November 29, 2005 10:22 AM
To: Rainer Gerhards
Cc: [EMAIL PROTECTED]
Subject: RE: [Syslog] #5 - character encoding (was: Consensus?)

Hi Rainer,

Why don't we look at it from the other direction?  We could state that
any encoding is acceptable - for ease-of-use/migration with existing
syslog implementations.  It is RECOMMENDED that UTF-8 be used.  When it
is used, an SD-ID element will be REQUIRED.  e.g. - [enc=utf-8
lang=en]

Thoughts?

All:  Let's discuss this and close this issue.

Thanks,
Chris

On Tue, 29 Nov 2005, Rainer Gerhards wrote:


Chris & WG,


#5 Character encoding in MSG: due to my proof-of-concept
  implementation, I have raised the (ugly) question if we need
  to allow encodings other than UTF-8. Please note that this
  question arises from needs introduced by e.g. POSIX. So we
  can't easily argue them away by wishful thinking ;)

Not even discussed yet.


I haven't reviewed that yet.  However, I'll note that allowing
different encoding can be accomplished in the future as long as we
establish a default encoding and a way to identify it in our current
work.


I have read a little in the mailing archive. Please note that in 2000
it was consensus that the MSG part may contain encodings other than
US-ASCII. Follow this thread:

http://www.syslog.cc/ietf/autoarc/msg00127.html

This discussion led to RFC 3164 saying other encodings MAY be used.
While this was observed behaviour, we need still to be aware that the
POSIX (and glibc) API places the restrictions on us that we simply do
not know the character encoding used by the application. As such, no
*nix syslogd can be programmed to be compliant to syslog-protocol if
we demand UTF-8 exclusively.

I propose that we RECOMMEND UTF-8 that MUST start with the Unicode
Byte Order Mark (BOM) if used. If the MSG part does not start with the



BOM, it may be any encoding just as in RFC 3164. I do not see any
alternative to this.
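
A minimal Python sketch of how a receiver might apply the BOM proposal quoted
above. The function name is hypothetical, and replacing malformed sequences
instead of rejecting the message is an assumption of the sketch, not part of
the proposal.

```python
import codecs
from typing import Optional


def decode_msg_if_bom(msg: bytes) -> Optional[str]:
    """Return the text if MSG starts with the UTF-8 BOM, else None.

    None means "encoding unknown": without the BOM the MSG part may be in
    any encoding, just as in RFC 3164.
    """
    if msg.startswith(codecs.BOM_UTF8):
        body = msg[len(codecs.BOM_UTF8):]
        # Assumption: malformed UTF-8 is replaced rather than rejected.
        return body.decode("utf-8", errors="replace")
    return None
```

A message without the BOM is handed back as None, so the caller has to decide
what to do with the raw octets.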

Rainer


RE: [Syslog] #5 - character encoding (was: Consensus?)

2005-11-30 Thread Rainer Gerhards
Chris,

I agree to all but one point - only that one quoted here...


  Also want to clarify that you suggest that if the message 
 is in ASCII,
  it will not require an SD-ID, but for all other encodings, an 
 SD-ID will be
  required.
 
 Yes - that's my suggestion.

I am sorry, we cannot do this.  The whole issue is rooted in the POSIX
APIs. You need to look at why this is such a problem. On Windows, you
know what character encodings you are dealing with. On Unix, you
actually just get a bunch of octets - and nobody tells you what they are.
So the poor Unix syslogd actually has no idea of what it handles and
likewise does not know what to place in that field ;) If it knew it were
this or that encoding, I would be very tempted to request it to convert
to UTF-8. But the need behind this encoding is *NOT* to allow the
multitude of whatever currently is in existence but rather to provide a way
to let a syslogd that needs to emit a bunch of octets do that.
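
To make the "bunch of octets" problem concrete, here is a small Python sketch.
It is an illustration only, not the code offered in this message; the function
name and the list of candidate encodings are made up for the example.

```python
from typing import Dict, Iterable


def plausible_decodings(octets: bytes,
                        candidates: Iterable[str]) -> Dict[str, str]:
    """Map each candidate encoding that accepts the octets to the text it yields."""
    results = {}
    for name in candidates:
        try:
            results[name] = octets.decode(name)
        except UnicodeDecodeError:
            # This encoding rejects the octets; others may still accept them.
            pass
    return results


# The same six octets are acceptable in more than one encoding, and each
# encoding turns them into different text.
decoded = plausible_decodings(b"M\xfcller", ["utf-8", "latin-1", "cp437"])
```

Here the octets are not valid UTF-8, but both latin-1 and cp437 accept them and
disagree about the second character. Nothing in the octets themselves says
which reading the application intended.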

Does this clarify? I can provide code if that would be helpful...

Rainer



RE: [Syslog] #5 - character encoding (was: Consensus?)

2005-11-30 Thread Chris Lonvick

Hi Rainer,

I believe that we are saying the same thing.  :)

If there is no indicator of encoding or language then a receiver will not 
know what it is receiving - just like receivers don't know what they are 
receiving today.  They MAY make an assumption that it is something in 
US-ASCII (but may be disappointed).


If there is an indicator of the encoding and language then the receiver 
will know exactly what it is.  Having an indicator should be RECOMMENDED 
but not REQUIRED for ease of migration.


Is that what we're all saying?

Thanks,
Chris



On Wed, 30 Nov 2005, Rainer Gerhards wrote:


Chris,


Let's use this email as an example.  :)  There is no
indication that I'm
using US-ASCII encoding or that I'm writing in English.


I think there actually is. If I am right, the SMTP RFCs require mail text to be 
US-ASCII. Only via MIME and/or escape characters you can include 8-bit data. For example 
Müller and Möller might create some problems in some mailers (But I guess my Mail system 
will encode them with =hexval). Dropping messages with octets > 127 in the 
subject is a common spam protection setting...


However, you're
able to receive this and read it.  Similarly, you could write
an email in
German and send it to me.  I would still be able to receive
it but I'd
have a difficult time parsing the meaning.

I'm suggesting that same approach for the transmission of the syslog
content.  If I really wanted you to know what encoding and
language I'm
using in an email, I would specify a mime header.  syslog
senders will
continue to pump out whatever encoding and language they've
been using
and receivers will continue to do their best to parse them.
If a vendor
wants to get very specific about that, then they will have to
use an SD-ID
to identify the contents of the message.


Here I agree with you. What I was saying is that IF the header says it is US-ASCII, only 
then we should assume it actually is. If there is no enc SD-ID, then we do 
not know what it is but can assume ... whatever we assume. Let me phrase it that way:

If the message contains

[enc=us-ascii lang=en]

then the receiver can honestly expect it to be US-ASCII. But if it does not contain any 
enc, the receiver does not know exactly and may assume anything it finds useful 
(may be ASCII, may not).
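
The receiver behaviour described here could be sketched in Python as below.
The bracketed enc=/lang= notation is the one used in this thread, not a
finalized syntax; the function name is hypothetical, and falling back to raw
octets is just one choice a receiver might make.

```python
import re
from typing import Union

# Matches a bracketed element carrying an enc= parameter, as written in this
# thread, e.g. "[enc=us-ascii lang=en]".
_ENC = re.compile(r"\[[^\]]*\benc=([A-Za-z0-9._-]+)[^\]]*\]")


def interpret_msg(structured_data: str, msg: bytes) -> Union[str, bytes]:
    """Decode MSG only if the sender declared an encoding we can apply."""
    match = _ENC.search(structured_data)
    if match is None:
        # No enc present: we know that we do not know the encoding.
        return msg
    try:
        return msg.decode(match.group(1))
    except (LookupError, UnicodeDecodeError):
        # Unknown encoding name, or octets that contradict the declaration.
        return msg
```

With "[enc=us-ascii lang=en]" the octets come back as text; with no enc they
come back unchanged and the receiver may guess whatever it finds useful.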

Does this clarify? I somehow have the impression we mean the same thing and I 
simply do not manage to convey what I intend to ;)

Rainer



Mit Aufrichtigkeit,
Chris




On Wed, 30 Nov 2005, Rainer Gerhards wrote:


Andrew,


Hi Rainer,

Why don't we look at it from the other direction?  We could state that any
encoding is acceptable - for ease-of-use/migration with existing syslog
implementations.  It is RECOMMENDED that UTF-8 be used.  When it is
used, an SD-ID element will be REQUIRED.  e.g. - [enc=utf-8 lang=en]

I like that idea too.

So, if no SD-ID encoding element is specified, then we must
assume US-ASCII
and deal with it accordingly??


I think not. If it is not present, we know that we do not know it. If
it is US-ASCII, I would expect something like

[enc=us-ascii lang=en]

Of course, we could also say if it is non-present, we can assume
US-ASCII. But then we would need to introduce

[enc=unknown]

for the (common) case where we simply do not know it (again: think
POSIX). I find this somewhat confusing.

Rainer





RE: [Syslog] #5 - character encoding (was: Consensus?)

2005-11-30 Thread Rainer Gerhards
Chris,

I fully agree - thanks ;)

Rainer 

 -Original Message-
 From: Chris Lonvick [mailto:[EMAIL PROTECTED] 
 Sent: Wednesday, November 30, 2005 2:39 PM
 To: Rainer Gerhards
 Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED]
 Subject: RE: [Syslog] #5 - character encoding (was: Consensus?)
 
 Hi Rainer,
 
 I believe that we are saying the same thing.  :)
 
 If there is no indicator of encoding or language then a 
 receiver will not 
 know what it is receiving - just like receivers don't know 
 what they are 
 receiving today.  They MAY make an assumption that it is something in 
 US-ASCII (but may be disappointed).
 
 If there is an indicator of the encoding and language then 
 the receiver 
 will know exactly what it is.  Having an indicator should be 
 RECOMMENDED 
 but not REQUIRED for ease of migration.
 
 Is that what we're all saying?
 
 Thanks,
 Chris
 
 
 
 On Wed, 30 Nov 2005, Rainer Gerhards wrote:
 
  Chris,
 
  Let's use this email as an example.  :)  There is no
  indication that I'm
  using US-ASCII encoding or that I'm writing in English.
 
  I think there actually is. If I am right, the SMTP RFCs 
 require mail text to be US-ASCII. Only via MIME and/or escape 
 characters you can include 8-bit data. For example Müller and 
 Möller might create some problems in some mailers (But I 
 guess my Mail system will encode them with =hexval). 
 Dropping messages with octets > 127 in the subject is a 
 common spam protection setting...
 
  However, you're
  able to receive this and read it.  Similarly, you could write
  an email in
  German and send it to me.  I would still be able to receive
  it but I'd
  have a difficult time parsing the meaning.
 
  I'm suggesting that same approach for the transmission of 
 the syslog
  content.  If I really wanted you to know what encoding and
  language I'm
  using in an email, I would specify a mime header.  syslog
  senders will
  continue to pump out whatever encoding and language they've
  been using
  and receivers will continue to do their best to parse them.
  If a vendor
  wants to get very specific about that, then they will have to
  use an SD-ID
  to identify the contents of the message.
 
  Here I agree with you. What I was saying is that IF the 
 header says it is US-ASCII, only then we should assume it 
 actually is. If there is no enc SD-ID, then we do not know 
 what it is but can assume ... whatever we assume. Let me 
 phrase it that way:
 
  If the message contains
 
  [enc=us-ascii lang=en]
 
  then the receiver can honestly expect it to be US-ASCII. 
 But if it does not contain any enc, the receiver does not 
 know exactly and may assume anything it finds useful (may be 
 ASCII, may not).
 
  Does this clarify? I somehow have the impression we mean 
 the same thing and I simply do not manage to convey what I 
 intend to ;)
 
  Rainer
 
 
  Mit Aufrichtigkeit,
  Chris
 
 
 
 
  On Wed, 30 Nov 2005, Rainer Gerhards wrote:
 
  Andrew,
 
  Hi Rainer,
 
  Why don't we look at it from the other direction?  We could
  state that any
  encoding is acceptable - for ease-of-use/migration with
  existing syslog
  implementations.  It is RECOMMENDED that UTF-8 be used.
  When it is
  used, an SD-ID element will be REQUIRED.  e.g. -
  [enc=utf-8 lang=en]
 
  I like that idea too.
 
  So, if no SD-ID encoding element is specified, then we must
  assume US-ASCII
  and deal with it accordingly??
 
  I think not. If it is not present, we know that we do not
  know it. If
  it is US-ASCII, I would expect something like
 
  [enc=us-ascii lang=en]
 
  Of course, we could also say if it is non-present, we can assume
  US-ASCII. But then we would need to introduce
 
  [enc=unknown]
 
  for the (common) case where we simply do not know it (again: think
  POSIX). I find this somewhat confusing.
 
  Rainer
 
 
 
 



RE: [Syslog] #5 - character encoding (was: Consensus?)

2005-11-30 Thread Shyyunn Lin (sheranl)
Chris:

I agree with all your points. Recommend an encoding and standard lang
tag, and accept all other encoding and lang specification.

Regards,
 
Sheran

-Original Message-
From: Chris Lonvick (clonvick) 
Sent: Wednesday, November 30, 2005 5:06 AM
To: Shyyunn Lin (sheranl)
Cc: [EMAIL PROTECTED]
Subject: RE: [Syslog] #5 - character encoding (was: Consensus?)

Hi Sheran,

On Tue, 29 Nov 2005, Shyyunn Lin (sheranl) wrote:

 Chris:

 I think having SD-ID with [enc=utf-8 lang=English] may be a good 
 approach. If different languages use utf-8 encoding, then lang= can 
 distinguish it.

We _should_ be using language codes from RFC 3066.  That specifies ISO
639 language tags.  639-1 has 2 character codes (en is English) and
639-2 has 3 characters (eng is English).  RFC 3066 will likely be
replaced by the works of the Language Tag Registry Update (ltru) Working
Group.
   http://www.ietf.org/html.charters/ltru-charter.html
They have IDs in the works.  Until those become RFCs we should continue
to reference RFC 3066.


 Also want to clarify that you suggest that if the message is in ASCII,

 it will not require an SD-ID, but for all other encodings, an SD-ID will be

 required.

Yes - that's my suggestion.


 Note most other encoding methods already imply the language used, for 
 example, in Chinese, there are several encoding methods, Traditional 
 Chinese used in Taiwan and Hong Kong is Big5, and simplified Chinese 
 used in Mainland China is GBK, so if the message is in traditional 
 Chinese char, it will be shown as [enc=Big5, lang=Traditional 
 Chinese], a little bit redundant. The Big5 also includes all English 
 char so it can be a mix of Chinese and English.

Good point.  As far as I can tell, Big5 is not recognized by any
accredited standards developing organization.  It is recognized by the
Ideographic Rapporteur Group (IRG) which reports to the Unicode
consortium.  The recognized way to represent Chinese characters,
traditional and simplified, is through ISO 639-2 with the subcodes to
indicate traditional and simplified for the zh _language_.  The ID on
Tags for Identifying Languages

   http://www.ietf.org/internet-drafts/draft-ietf-ltru-registry-14.txt

identifies simplified Chinese as zh-Hans and traditional Chinese as
zh-Hant.  Additional subtags could identify a locale such as
zh-Hant-TW for Taiwan Chinese in traditional script.  This is from the
Initial Language Subtag Registry ID.

http://www.ietf.org/internet-drafts/draft-ietf-ltru-initial-06.txt

I think that we should specify encoding and language tags as
straightforward as possible and let others augment syslog-protocol (in
the future) with other encoding mechanisms.  We can RECOMMEND that encoding
be in UTF-8 and language tags come from RFC 3066.  We can allow that
other encoding and language identifications are acceptable.  In the
worst case, a vendor will have the option of [EMAIL PROTECTED]something
[EMAIL PROTECTED]piglatin].

Does this work for you?

Thanks,
Chris




 Regards,

 Sheran

 -Original Message-
 From: [EMAIL PROTECTED]
 [mailto:[EMAIL PROTECTED] On Behalf Of Chris Lonvick
 (clonvick)
 Sent: Tuesday, November 29, 2005 10:22 AM
 To: Rainer Gerhards
 Cc: [EMAIL PROTECTED]
 Subject: RE: [Syslog] #5 - character encoding (was: Consensus?)

 Hi Rainer,

 Why don't we look at it from the other direction?  We could state that

 any encoding is acceptable - for ease-of-use/migration with existing 
 syslog implementations.  It is RECOMMENDED that UTF-8 be used.  When 
 it is used, an SD-ID element will be REQUIRED.  e.g. - [enc=utf-8
 lang=en]

 Thoughts?

 All:  Let's discuss this and close this issue.

 Thanks,
 Chris

 On Tue, 29 Nov 2005, Rainer Gerhards wrote:

 Chris & WG,

 #5 Character encoding in MSG: due to my proof-of-concept
   implementation, I have raised the (ugly) question if we need
   to allow encodings other than UTF-8. Please note that this
   question arises from needs introduced by e.g. POSIX. So we
   can't easily argue them away by wishful thinking ;)

 Not even discussed yet.

 I haven't reviewed that yet.  However, I'll note that allowing 
 different encoding can be accomplished in the future as long as we 
 establish a default encoding and a way to identify it in our current

 work.

 I have read a little in the mailing archive. Please note that in 2000

 it was consensus that the MSG part may contain encodings other than 
 US-ASCII. Follow this thread:

 http://www.syslog.cc/ietf/autoarc/msg00127.html

 This discussion led to RFC 3164 saying other encodings MAY be used.
 While this was observed behaviour, we need still to be aware that the

 POSIX (and glibc) API places the restrictions on us that we simply do

 not know the character encoding used by the application. As such, no 
 *nix syslogd can be programmed to be compliant to syslog-protocol if 
 we demand UTF-8 exclusively.

 I propose that we RECOMMEND UTF-8 that MUST start with the Unicode 
 Byte Order Mark (BOM) if used. If the MSG

RE: [Syslog] #5 - character encoding (was: Consensus?)

2005-11-29 Thread Shyyunn Lin (sheranl)
Chris:

I think having SD-ID with [enc=utf-8 lang=English] may be a good
approach. If different languages use utf-8 encoding, then lang= can
distinguish it. 

Also want to clarify that you suggest that if the message is in ASCII,
it will not require an SD-ID, but for all other encodings, an SD-ID will be
required.

Note most other encoding methods already imply the language used, for
example, in Chinese, there are several encoding methods, Traditional
Chinese used in Taiwan and Hong Kong is Big5, and simplified Chinese
used in Mainland China is GBK, so if the message is in traditional
Chinese char, it will be shown as [enc=Big5, lang=Traditional
Chinese], a little bit redundant. The Big5 also includes all English
char so it can be a mix of Chinese and English.  



Regards,
 
Sheran

-Original Message-
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of Chris Lonvick
(clonvick)
Sent: Tuesday, November 29, 2005 10:22 AM
To: Rainer Gerhards
Cc: [EMAIL PROTECTED]
Subject: RE: [Syslog] #5 - character encoding (was: Consensus?)

Hi Rainer,

Why don't we look at it from the other direction?  We could state that
any encoding is acceptable - for ease-of-use/migration with existing
syslog implementations.  It is RECOMMENDED that UTF-8 be used.  When it
is used, an SD-ID element will be REQUIRED.  e.g. - [enc=utf-8
lang=en]

Thoughts?

All:  Let's discuss this and close this issue.

Thanks,
Chris

On Tue, 29 Nov 2005, Rainer Gerhards wrote:

 Chris & WG,

 #5 Character encoding in MSG: due to my proof-of-concept
   implementation, I have raised the (ugly) question if we need
   to allow encodings other than UTF-8. Please note that this
   question arises from needs introduced by e.g. POSIX. So we
   can't easily argue them away by wishful thinking ;)

 Not even discussed yet.

 I haven't reviewed that yet.  However, I'll note that allowing 
 different encoding can be accomplished in the future as long as we 
 establish a default encoding and a way to identify it in our current 
 work.

 I have read a little in the mailing archive. Please note that in 2000 
 it was consensus that the MSG part may contain encodings other than 
 US-ASCII. Follow this thread:

 http://www.syslog.cc/ietf/autoarc/msg00127.html

 This discussion led to RFC 3164 saying other encodings MAY be used.
 While this was observed behaviour, we need still to be aware that the 
 POSIX (and glibc) API places the restrictions on us that we simply do 
 not know the character encoding used by the application. As such, no 
 *nix syslogd can be programmed to be compliant to syslog-protocol if 
 we demand UTF-8 exclusively.

 I propose that we RECOMMEND UTF-8 that MUST start with the Unicode 
 Byte Order Mark (BOM) if used. If the MSG part does not start with the

 BOM, it may be any encoding just as in RFC 3164. I do not see any 
 alternative to this.

 Rainer
