Hi Philippe,

Thanks for the note.
The announcement here was purely informational. This is off-topic to this
list and thus further comments really should be carried off to
[EMAIL PROTECTED]

Cross posting with this list is a Bad Idea, IMHO. I have not cross posted
this note to prevent any thread "over there" from escaping back to the
Unicode list. I HAVE posted a response to your message to you privately,
copy that list.

Thanks again for the comments.

Addison

Addison P. Phillips
Director, Globalization Architecture
webMethods | Delivering Global Business Visibility
http://www.webMethods.com
Chair, W3C Internationalization (I18N) Working Group
Chair, W3C-I18N-WG, Web Services Task Force
http://www.w3.org/International

Internationalization is an architecture.
It is not a feature.

> -----Original Message-----
> From: Philippe Verdy [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, November 19, 2003 3:51 PM
> To: [EMAIL PROTECTED]
> Cc: [EMAIL PROTECTED]
> Subject: Re: Proposed Successor to RFC 3066 (language tags)
>
>
> From: Addison Phillips [wM]
> > Please note that there is a discussion list for this topic at:
> > [EMAIL PROTECTED]
> >
> > While Mark and I welcome your comments here or privately, off-list, you
> > can best be a part of the discussion by joining that list.
> > Join the list by sending a request email
> > to: [EMAIL PROTECTED]
>
> I note that the language tags proposal includes the following EBNF
> productions for extensions that may be appended after the language code,
> script code, region code, variant code:
>
>     extensions = "-x" 1* ("-" key "=" value)
>     key        = ALPHA *alphanum
>     value      = 1* utf8uri
>     alphanum   = (ALPHA / DIGIT)
>     utf8uri    = (ALPHA / DIGIT / 1*4 ("%" 2 HEXDIG))
>
> Under this new scheme, the following language tag may be valid:
> "sr-Latn-SP-2003-x-href=http%3A%2F%2Fwww%2Eiana%2Eorg%2F-version=1%2E0"
> which here would mean: {
>     language="sr"; // Serbian
>     script="Latn"; // Latin
>     region="SP"; // Serbia-Montenegro
>     variant="2003";
>     extensions="-x"; {
>         href="http://www.iana.org/"
>         version="1.0"
>     }
> }
>
> However the problem with that scheme is its new use of characters "%" and
> "=". There are a lot of applications that were not expecting anything
> other than alphanum and "-" or "_" or "." in this field, so that the
> language tag could safely be used without specific escaping within URIs
> (for example in HTTP GET URLs) or as options of a MIME type (I take a
> sample here, which may not correspond to an existing option of the
> "text/plain" MIME type):
>
> Content-Type: text/plain; charset=UTF-8;
> lang=sr-Latn-SP-2003-x-href=http%3A%2F%2Fwww%2Eiana%2Eorg%2F-version=1%2E0
>
> This may break the compatibility of some parsers if such "extended language
> tags" are found there, as there are two "=" signs within the value of the
> "lang=" option.
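To make the parser concern above concrete, here is a minimal sketch in Python. It assumes a hypothetical naive parser that expects exactly one "=" per parameter; the function names and the HEADER_VALUE constant are illustrative only and are not taken from the draft or from any real library.

```python
# Hypothetical header value modeled on the sample in the quoted message.
HEADER_VALUE = (
    "text/plain; charset=UTF-8; "
    "lang=sr-Latn-SP-2003-x-href=http%3A%2F%2Fwww%2Eiana%2Eorg%2F-version=1%2E0"
)


def parse_params_naive(value):
    """Assumed naive parser: accept a parameter only if it has exactly one '='."""
    media_type, *raw_params = [part.strip() for part in value.split(";")]
    params, rejected = {}, []
    for raw in raw_params:
        pieces = raw.split("=")
        if len(pieces) == 2:
            params[pieces[0]] = pieces[1]
        else:
            # The extended tag carries extra '=' signs, so it ends up here.
            rejected.append(raw)
    return media_type, params, rejected


def parse_params_first_equals(value):
    """More tolerant variant: split each parameter on its first '=' only."""
    media_type, *raw_params = [part.strip() for part in value.split(";")]
    params = {}
    for raw in raw_params:
        name, _, param_value = raw.partition("=")
        params[name] = param_value
    return media_type, params


if __name__ == "__main__":
    print(parse_params_naive(HEADER_VALUE))
    print(parse_params_first_equals(HEADER_VALUE))
```

The naive variant keeps "charset" but rejects the "lang" parameter, while the tolerant variant keeps the whole tag. Whether deployed parsers behave like the first or the second is exactly the open question.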
>
> For GET URLs, these extra "%" and "=" will need to be URL-encoded to get
> through correctly, as the following would become possible and prone to
> generate form data parsing errors:
>
> http://www.anysite.domain/process-form.cgi?lang=sr-Latn-SP-2003-x-href=http%3A%2F%2Fwww%2Eiana%2Eorg%2F-version=1%2E0
>
> I think it's quite strange that these extensions have not used the existing
> restricted encoding set to encode them, instead of relying on "%" and "=".
> Why not use "_" instead of "=" and "." instead of "%", like this:
> "sr-Latn-SP-2003-x-href_http.3A.2F.2Fwww.2Eiana.2Eorg.2F-version_1.2E0"
> (same meaning as the first example above).
>
> But at least this draft offers a good starting point to indicate locales
> more precisely. I fully support the new reference to the ISO-15924 standard
> for the script code, and for documenting the legal values of variant codes
> (either a year with possible era, or a registered tag), as well as clearly
> indicating that language codes should be the shortest ISO-639 codes (is
> this true for a few legacy languages which previously were coded with 3
> letters and upgraded to 2-letter codes, until there was a policy not to do
> it anymore in the future?)
>
> Where this affects Unicode, I don't know, except in its possible normative
> data tables which may contain future language code conditions, or in
> Language tags inserted in the Unicode encoded texts. Does Unicode need
> these extensions?
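The alternative encoding suggested in the quoted message ("_" in place of "=" and "." in place of "%") can be sketched as a round trip in Python. This is only an illustration under stated assumptions: the function names are hypothetical, and the sketch additionally assumes that "-", "_" and "." occurring inside a value are themselves escaped so that the separators stay unambiguous, which the message does not spell out.

```python
import re


def encode_value(value: str) -> str:
    """Escape every UTF-8 byte that is not an ASCII letter or digit as '.XX'."""
    out = []
    for byte in value.encode("utf-8"):
        ch = chr(byte)
        if ch.isascii() and ch.isalnum():
            out.append(ch)
        else:
            out.append(f".{byte:02X}")
    return "".join(out)


def decode_value(encoded: str) -> str:
    """Reverse encode_value: turn each '.XX' back into a byte, then decode UTF-8."""
    raw = re.sub(
        r"\.([0-9A-Fa-f]{2})",
        lambda m: chr(int(m.group(1), 16)),
        encoded,
    )
    # Every character is now in the range 0-255, so latin-1 recovers the bytes.
    return raw.encode("latin-1").decode("utf-8")


def encode_extensions(pairs: dict) -> str:
    """Build the 'x-key_value-key_value' suffix from key/value pairs."""
    return "x-" + "-".join(f"{k}_{encode_value(v)}" for k, v in pairs.items())


def decode_extensions(suffix: str) -> dict:
    """Parse a suffix produced by encode_extensions back into a dict."""
    if not suffix.startswith("x-"):
        raise ValueError("extension suffix must start with 'x-'")
    result = {}
    for item in suffix[2:].split("-"):
        key, _, value = item.partition("_")
        result[key] = decode_value(value)
    return result


if __name__ == "__main__":
    suffix = encode_extensions({"href": "http://www.iana.org/", "version": "1.0"})
    print("sr-Latn-SP-2003-" + suffix)
    print(decode_extensions(suffix))
```

With these assumptions the resulting tag contains only letters, digits, "-", "_" and ".", so it would need no further escaping in a query string or a header parameter.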