> -----Original Message-----
> From: Birte Glimm [SMTP:[EMAIL PROTECTED]]
> Sent: Friday, January 05, 2001 12:15 PM
> To:   [EMAIL PROTECTED]
> Subject:      RE: off-topic: handling non-ascii characters in URLs
> 
> True,
> it's the browser that encodes the special characters, I think. I
> sometimes had problems with unencoded URLs in Netscape, but IE always
> translates them right.
> Birte Glimm
> 
        [Kitching Simon]
        The problem is that there are multiple different encoding schemes.
        If IE is "translating them right", then what rules exactly is it
        following?

        Characters are transmitted as bytes (ie a number from 0 to 255);
        in order for two communicating parties to interpret a particular
        code correctly, they need to agree on what encoding scheme to use -
        either in advance, or by the sending party indicating the encoding
        scheme. I can't find where in the specs it says how to define the
        encoding scheme for characters in URLs.

        As an example, a webserver might interpret URL data under any of
        these assumptions:
        * URLs are always 7-bit-ascii
        * URLs are always latin-1
        * URLs are always UTF-8
        Alternatively, there might be some way to declare the encoding of a
        URL when sending it to a webserver - but I can't see how.

        Note that the byte 0xE9 can mean different things (the sketch
        below demonstrates this):
        * in 7-bit-ascii, it is invalid
        * in latin-1 it is an e-accent
        * in ISO-8859-5 (cyrillic) it is a different letter entirely
        * in UTF-8, it is only valid as the first byte of a multi-byte
          character
        etc.
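        A small Java sketch of exactly this ambiguity - one byte, three
        readings (ISO-8859-5 support depends on the JRE, though most
        ship it):

            public class ByteDemo {
                public static void main(String[] args) throws Exception {
                    byte[] raw = { (byte) 0xE9 };

                    System.out.println(new String(raw, "ISO-8859-1")); // é (e-accent)
                    System.out.println(new String(raw, "ISO-8859-5")); // a Cyrillic letter
                    // In UTF-8 a lone 0xE9 is a truncated multi-byte
                    // sequence, so the decoder substitutes U+FFFD
                    // (the Unicode replacement character):
                    System.out.println(new String(raw, "UTF-8"));
                }
            }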

        In practice, it seems to me that latin-1 (ie ISO-8859-1) is being
        used, ie those of us who don't use any character outside latin-1
        don't see any problems. However, I can't see anywhere in the specs
        that says HTTP-compliant apps must use latin-1. And what happens if
        you want to use non-latin-1 characters in a URL, or in a form using
        method="GET"? Examples of languages using characters not in latin-1
        are turkish, hebrew, polish, chinese, ...

        Here is an interesting quote from RFC 2396
        (http://www.ietf.org/rfc/rfc2396.txt):

        "A URI is a sequence of characters from a very limited set, i.e.
        the letters of the basic Latin alphabet, digits, and a few special
        characters".

        This tends to imply that all non-ascii characters *must* be
        transformed into a %xx form; that's fine (with the implication that
        data sent to a webserver via GET must also be encoded in this way),
        but the %xx is still an index into **some unknown character
        set**!!! How can the recipient (eg a webserver) know which
        character set it is an index into?

        Another quote from RFC 2396:

        "Internet protocols that transmit octet sequences intended to
        represent character sequences are expected to provide some way of
        identifying the charset used, if there might be more than one
        [RFC2277]. However, there is currently no provision within the
        generic URI syntax to accomplish this identification".

        This says clearly that it is the HTTP protocol's responsibility to
        find some way to define the character set used in URLs transmitted
        over HTTP - which leads back to the HTTP RFC, in which I could find
        no such way of defining the charset for URIs in the situation where
        a browser is sending a request to a web server.
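        For illustration, this is roughly what such a request looks like
        on the wire (a made-up example, not taken from any spec) - note
        that nothing in it identifies the charset of the %xx escapes in
        the request line:

            GET /search?name=caf%E9 HTTP/1.1
            Host: www.example.com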

        ????? 

        Perhaps someone out there working in Japanese/Chinese/similar can
        give some feedback on this? You must have to deal with this all
        the time...

        Cheers,

        Simon

> -----Original Message-----
> From: Kitching Simon [mailto:[EMAIL PROTECTED]]
> Sent: Friday, January 05, 2001 11:58 AM
> To: '[EMAIL PROTECTED]'
> Subject: off-topic: handling non-ascii characters in URLs
> 
> 
> Hi All,
> 
> While following a related thread (RE: a simple test to charset),
> a question occurred to me about charset encodings in URLs.
> This isn't really tomcat-related (more to do with HTTP standards)
> but I thought someone here might be able to offer an answer.
> 
> When a webserver sends content to a browser, it can indicate
> the character data format (ascii, latin-1, UTF8, etc) as an http
> header. However, how is the character encoding specified for data
> sent *by* a browser *to* a webserver (ie a GET or POST action)?
> 
> Andre Alves had an example where an e-accent character
> was part of the URL. I saw that IE4 replaced this character
> with %E9 when submitting a form using the GET method, but this
> really assumes that the receiving webserver is using latin-1.
> 
> There is this thing called an "entity-header" defined in the HTTP
> specs; in particular, the "Content-Type" entry can carry a charset
> parameter. This seems to cover POST requests well enough, as the
> POSTed data is in an entity-body, and so an entity-header can be
> used to define its encoding.
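> As a sketch of that entity-header mechanism in Java (HttpURLConnection
> is just one way to set the header; the charset chosen here is my own
> example, nothing mandated by the specs):
> 
>     import java.io.OutputStream;
>     import java.net.HttpURLConnection;
>     import java.net.URL;
>     import java.net.URLEncoder;
> 
>     public class PostDemo {
>         public static void main(String[] args) throws Exception {
>             URL url = new URL("http://www.example.com/submit");
>             HttpURLConnection conn = (HttpURLConnection) url.openConnection();
>             conn.setRequestMethod("POST");
>             conn.setDoOutput(true);
>             // The entity-header declares how the entity-body is encoded:
>             conn.setRequestProperty("Content-Type",
>                     "application/x-www-form-urlencoded; charset=ISO-8859-1");
> 
>             String body = "name=" + URLEncoder.encode("caf\u00E9", "ISO-8859-1");
>             OutputStream out = conn.getOutputStream();
>             out.write(body.getBytes("US-ASCII")); // %xx escapes are plain ascii
>             out.close();
>             System.out.println(conn.getResponseCode());
>         }
>     }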
> 
> But the URLs themselves cannot have their encoding specified by
> an entity-header, because they are not in an entity-body. So does
> this mean that all URLs should be restricted to ascii, and forms
> should not use the GET method unless their data content is guaranteed
> to be all-ascii? I remember seeing an article recently about domain
> names now becoming available in asian ideogram characters, which seems
> to indicate otherwise....
> 
> Any comments?
> 
> Cheers,
> 
> Simon
> 
