OFF-TOPIC: Proposed Successor to RFC 3066 (language tags)

Addison Phillips [wM] Wed, 19 Nov 2003 18:47:15 -0800

Hi Philippe,

Thanks for the note.


The announcement here was purely informational. The topic is off-topic for this list, so 
further comments really should be taken to [EMAIL PROTECTED]. Cross-posting 
to this list is a Bad Idea, IMHO. I have not cross-posted this note, to prevent any 
thread "over there" from escaping back to the Unicode list. I HAVE replied to your 
message privately, with a copy to that list.

Thanks again for the comments.

Addison

Addison P. Phillips
Director, Globalization Architecture
webMethods | Delivering Global Business Visibility
http://www.webMethods.com
Chair, W3C Internationalization (I18N) Working Group
Chair, W3C-I18N-WG, Web Services Task Force
http://www.w3.org/International

Internationalization is an architecture. 
It is not a feature.

> -----Original Message-----
> From: Philippe Verdy [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, November 19, 2003 3:51 PM
> To: [EMAIL PROTECTED]
> Cc: [EMAIL PROTECTED]
> Subject: Re: Proposed Successor to RFC 3066 (language tags)
> 
> 
> From: Addison Phillips [wM]
> > Please note that there is a discussion list for this topic at:
> > [EMAIL PROTECTED]
> >
> > While Mark and I welcome your comments here or privately, off-list, you can best be
> > a part of the discussion by joining that list. Join the list by sending a request email
> > to:  [EMAIL PROTECTED]
> 
> I note that the language tags proposal includes the following ABNF
> productions for extensions that may be appended after the language code,
> script code, region code, and variant code:
> 
> extensions  = "-x" 1* ("-" key "=" value)
> key  = ALPHA *alphanum
> value  = 1* utf8uri
> alphanum  = (ALPHA / DIGIT)
> utf8uri  = (ALPHA / DIGIT / 1*4 ("%" 2 HEXDIG))
> 
> Under this new scheme, the following language tag may be valid:
> "sr-Latn-SP-2003-x-href=http%3A%2F%2Fwww%2Eiana%2Eorg%2F-version=1%2E0"
> which here would mean: {
>     language="sr"; // Serbian
>     script="Latn"; // Latin
>     region="SP"; // Serbia-Montenegro
>     variant="2003";
>     extensions="-x"; {
>         href="http://www.iana.org/";
>         version="1.0"
>     }
> }
> 
> However the problem with that scheme is its new use of characters "%" and
> "=". There are a lot of applications that were not expecting anything
> in this field other than alphanum and "-" or "_" or ".", so that the language
> tag could safely be used without specific escaping within URIs (for example
> in HTTP GET URLs) or as options of a MIME type (I take a sample here, which
> may not correspond to an existing option of the "text/plain" MIME type):
> 
> Content-Type: text/plain; charset=UTF-8;
> lang=sr-Latn-SP-2003-x-href=http%3A%2F%2Fwww%2Eiana%2Eorg%2F-version=1%2E0
> 
> This may break the compatibility of some parsers if such "extended language
> tags" are found there, as there are two "=" signs within the value of the
> "lang=" option.
> 
> For GET URLs, these extra "%" and "=" will need to be URL-encoded to get
> through correctly, as the following would become possible and prone to
> generate form data parsing errors:
> 
> http://www.anysite.domain/process-form.cgi?lang=sr-Latn-SP-2003-x-href=http%3A%2F%2Fwww%2Eiana%2Eorg%2F-version=1%2E0
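
To make the GET-URL concern concrete, here is a minimal Python sketch using only the standard library's urllib.parse. The tag is the one from the example above; the variable names are illustrative. Python's parser happens to split each parameter on the first "=" only, so the extra "=" signs survive here, but the "%" escapes inside the tag are decoded one level too early unless the whole tag is URL-encoded first. Other form parsers may well behave differently.

```python
from urllib.parse import parse_qs, quote

# Extended language tag taken from the example in the message.
tag = "sr-Latn-SP-2003-x-href=http%3A%2F%2Fwww%2Eiana%2Eorg%2F-version=1%2E0"

# Naive use: the tag is pasted into the query string unescaped.
naive_query = "lang=" + tag
naive_lang = parse_qs(naive_query)["lang"][0]
# The parser percent-decodes the value, so the tag's own "%" escapes are lost.
print(naive_lang == tag)   # False

# Safe use: encode every reserved character, including "%" and "=".
safe_query = "lang=" + quote(tag, safe="")
safe_lang = parse_qs(safe_query)["lang"][0]
print(safe_lang == tag)    # True
```

In the naive case the receiving side sees "href=http://www.iana.org/" inside the tag, which no longer matches the utf8uri production quoted earlier.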

I think it's quite strange that these extensions have not used the existing
restricted character set to encode them, instead of relying on "%" and "=".
Why not use "_" instead of "=" and "." instead of "%", like this:
"sr-Latn-SP-2003-x-href_http.3A.2F.2Fwww.2Eiana.2Eorg.2F-version_1.2E0"
(same meaning as the first example above).
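
A minimal sketch of that substitution in Python, assuming (as the quoted grammar implies) that "_" and "." never occur elsewhere in the extension part, so the mapping is reversible. The function names are hypothetical and not part of the draft.

```python
# One-to-one character mappings between the draft's form and the suggested one.
TO_RESTRICTED = str.maketrans({"=": "_", "%": "."})
FROM_RESTRICTED = str.maketrans({"_": "=", ".": "%"})


def to_restricted(tag: str) -> str:
    """Rewrite the extension part using only alphanum, "-", "_" and "."."""
    head, sep, ext = tag.partition("-x-")
    return head + sep + ext.translate(TO_RESTRICTED)


def from_restricted(tag: str) -> str:
    """Inverse mapping: restore "=" and "%" in the extension part."""
    head, sep, ext = tag.partition("-x-")
    return head + sep + ext.translate(FROM_RESTRICTED)


draft_form = "sr-Latn-SP-2003-x-href=http%3A%2F%2Fwww%2Eiana%2Eorg%2F-version=1%2E0"
restricted = to_restricted(draft_form)
print(restricted)
# sr-Latn-SP-2003-x-href_http.3A.2F.2Fwww.2Eiana.2Eorg.2F-version_1.2E0
```

Only the part after "-x-" is touched, so a tag without extensions passes through unchanged.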

But at least this draft offers a good starting point to indicate locales
more precisely.

I fully support the new reference to the ISO-15924 standard for the script
code, and for documenting the legal values of variant codes (either a year
with possible era, or a registered tag), as well as clearly indicating that
language codes should be the shortest ISO-639 codes (is it true for a few
legacy languages which previously were coded with 3 letters and upgraded to
2-letter codes, until there was a policy not to do it anymore in the
future?)

Where this affects Unicode, I don't know, except in its possible normative
data tables which may contain future language code conditions, or in
language tags inserted in Unicode-encoded texts. Does Unicode need these
extensions?
