Hi Philippe,

Thanks for the note.

The announcement here was purely informational. The subject is off-topic for this list,
so further comments really should be taken to [EMAIL PROTECTED]. Cross-posting
with this list is a Bad Idea, IMHO. I have not cross-posted this note, to prevent any
thread "over there" from escaping back to the Unicode list. I HAVE sent a response
to your message to you privately, with a copy to that list.

Thanks again for the comments.

Addison

Addison P. Phillips
Director, Globalization Architecture
webMethods | Delivering Global Business Visibility
http://www.webMethods.com
Chair, W3C Internationalization (I18N) Working Group
Chair, W3C-I18N-WG, Web Services Task Force
http://www.w3.org/International

Internationalization is an architecture. 
It is not a feature.

> -----Original Message-----
> From: Philippe Verdy [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, November 19, 2003 3:51 PM
> To: [EMAIL PROTECTED]
> Cc: [EMAIL PROTECTED]
> Subject: Re: Proposed Successor to RFC 3066 (language tags)
> 
> 
> From: Addison Phillips [wM]
> > Please note that there is a discussion list for this topic at:
> > [EMAIL PROTECTED]
> >
> > While Mark and I welcome your comments here or privately, off-list, you
> > can best be a part of the discussion by joining that list. Join the list
> > by sending a request email to:  [EMAIL PROTECTED]
> 
> I note that the language tags proposal includes the following EBNF
> productions for extensions that may be appended after the language code,
> script code, region code, and variant code:
> 
> extensions  = "-x" 1* ("-" key "=" value)
> key  = ALPHA *alphanum
> value  = 1* utf8uri
> alphanum  = (ALPHA / DIGIT)
> utf8uri  = (ALPHA / DIGIT / 1*4 ("%" 2 HEXDIG))
> 
> Under this new scheme, the following language tag may be valid:
> "sr-Latn-SP-2003-x-href=http%3A%2F%2Fwww%2Eiana%2Eorg%2F-version=1%2E0"
> which here would mean: {
>     language="sr"; // Serbian
>     script="Latn"; // Latin
>     region="SP"; // Serbia-Montenegro
>     variant="2003";
>     extensions="-x"; {
>         href="http://www.iana.org/";
>         version="1.0"
>     }
> }
> 
> However, the problem with that scheme is its new use of the characters "%"
> and "=". A lot of applications were not expecting anything in this field
> other than alphanumerics and "-", "_", or ".", so that the language tag
> could safely be used without specific escaping within URIs (for example
> in HTTP GET URLs) or as an option of a MIME type (I take a sample here,
> which may not correspond to an existing option of the "text/plain" MIME
> type):
> 
> Content-Type: text/plain; charset=UTF-8;
>   lang=sr-Latn-SP-2003-x-href=http%3A%2F%2Fwww%2Eiana%2Eorg%2F-version=1%2E0
> 
> This may break the compatibility of some parsers if such "extended language
> tags" are found there, as there are two "=" signs within the value of the
> "lang=" option.
> 
> For GET URLs, these extra "%" and "=" will need to be URL-encoded to get
> through correctly, as the following would become possible and prone to
> generate form data parsing errors:
> 
> http://www.anysite.domain/process-form.cgi?lang=sr-Latn-SP-2003-x-href=http%3A%2F%2Fwww%2Eiana%2Eorg%2F-version=1%2E0
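
To illustrate the GET URL concern, here is a minimal Python sketch that uses only the standard library's urllib.parse. The tag is the example quoted in this message; the variable names are illustrative and are not part of the draft. It suggests that a tag placed raw in a query string does not survive ordinary form decoding, whereas one extra round of URL-encoding preserves it.

```python
from urllib.parse import parse_qs, urlencode

# Example tag from the message (syntax per the draft's "-x" extensions).
TAG = "sr-Latn-SP-2003-x-href=http%3A%2F%2Fwww%2Eiana%2Eorg%2F-version=1%2E0"

# Raw tag in a query string: the decoder treats the tag's own "%XX"
# escapes as URL escapes and decodes them one level too early.
raw_query = "lang=" + TAG
raw_value = parse_qs(raw_query)["lang"][0]

# URL-encoded tag: "=" becomes "%3D" and "%" becomes "%25", so the
# decoder hands back the tag exactly as it was written.
safe_query = urlencode({"lang": TAG})
safe_value = parse_qs(safe_query)["lang"][0]

print(raw_value)          # no longer identical to TAG
print(safe_query)         # the doubly escaped form that goes on the wire
print(safe_value == TAG)  # the tag round-trips unchanged
```

The cost is visible in safe_query: every sender must remember the extra encoding step, which tags limited to alphanumerics and "-" never required.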

I think it's quite strange that these extensions do not use the existing
restricted character set for their encoding, instead of relying on "%" and "=".
Why not use "_" instead of "=" and "." instead of "%", like this:
"sr-Latn-SP-2003-x-href_http.3A.2F.2Fwww.2Eiana.2Eorg.2F-version_1.2E0"
(same meaning as the first example above).
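
A rough Python sketch of how this alternative could be produced and read back follows. The function names to_restricted and parse_restricted_extensions are hypothetical, invented for this illustration, and the sketch assumes that keys and values follow the quoted productions (alphanumerics plus two-digit hex escapes only), so that "." and "_" never occur literally.

```python
from urllib.parse import unquote

# Example tag in the draft's syntax, as quoted in this message.
DRAFT_TAG = "sr-Latn-SP-2003-x-href=http%3A%2F%2Fwww%2Eiana%2Eorg%2F-version=1%2E0"


def to_restricted(tag: str) -> str:
    """Rewrite a draft-style tag using "_" for "=" and "." for "%"."""
    return tag.replace("=", "_").replace("%", ".")


def parse_restricted_extensions(tag: str) -> dict:
    """Return the decoded key/value pairs that follow "-x-" in a restricted tag."""
    _, marker, ext = tag.partition("-x-")
    if not marker:
        return {}  # no extension part present
    pairs = {}
    for item in ext.split("-"):
        key, _, value = item.partition("_")
        # Under the stated assumption, "." only ever introduces a hex
        # escape, so mapping it back to "%" lets unquote() decode it.
        pairs[key] = unquote(value.replace(".", "%"))
    return pairs


restricted = to_restricted(DRAFT_TAG)
print(restricted)
print(parse_restricted_extensions(restricted))
```

The restricted form contains only alphanumerics, "-", "_", and ".", so it can sit in a URL or a MIME parameter without further escaping.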

But at least this draft offers a good starting point to indicate locales
more precisely.

I fully support the new reference to the ISO-15924 standard for the script
code, and the documentation of the legal values of variant codes (either a year
with a possible era, or a registered tag), as well as the clear indication that
language codes should be the shortest ISO-639 codes. (Is this true for the few
legacy languages that were previously coded with 3 letters and upgraded to
2-letter codes, before there was a policy not to do this anymore in the
future?)

Where this affects Unicode, I don't know, except possibly in its normative
data tables, which may contain future language code conditions, or in
language tags inserted in Unicode-encoded texts. Does Unicode need these
extensions?

