Hello Christian,

Isn't it the case that any character below chr(128) encodes to UTF-8 as the same single byte? That would mean that a slash and an ampersand stay as they are. OTOH, the encoding step only changes non-ASCII characters, supposing that the encoding is UTF-8, which is what's hardwired into absoluteURL.
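To make that concrete, here is a minimal sketch in Python 3 using the standard library's urllib.parse. It is not the absoluteURL or redirect() code, just an illustration, and the variable names are mine. It takes the escaped path from the Wikipedia URL quoted below, decodes it to a unicode string, and encodes it again:

```python
from urllib.parse import quote, unquote

# Characters below chr(128) encode to the same single byte in UTF-8 as in ASCII.
for ch in "/&?=#":
    assert ch.encode("utf-8") == ch.encode("ascii")

# Escaped path taken from the Wikipedia URL quoted in this thread.
escaped_path = "/wiki/%E3%83%A1%E3%82%A4%E3%83%B3%E3%83%9A%E3%83%BC%E3%82%B8"

# unquote() decodes the percent-escapes as UTF-8 and yields a unicode string.
decoded_path = unquote(escaped_path)

# quote() encodes to UTF-8 and escapes only what is outside its safe set;
# "/" is safe by default, so the path separators pass through untouched.
reencoded_path = quote(decoded_path)

print(decoded_path)
print(reencoded_path == escaped_path)
```

For this particular path the round trip gives back the original bytes, because the only ASCII characters in it are ones that never needed escaping in the first place.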
Monday, March 1, 2010, 4:40:30 PM, you wrote:

CT> On 03/01/2010 03:34 PM, Wichert Akkerman wrote:
>> On 3/1/10 15:09 , Christian Theune wrote:
>>> Hi,
>>>
>>> On 03/01/2010 02:28 PM, Martin Aspeli wrote:
>>>>
>>>> I'm with Wichert here.
>>>>
>>>> In most places, we tend to carry around unicode strings internally, and
>>>> only encode on the boundaries, e.g. when the URL is "rendered". I don't
>>>> see why redirect() can't have a sensible and predictable policy for
>>>> unicode strings, making life easier for everyone.
>>>>
>>>> If we think that non-ASCII URLs are illegal, then maybe we should
>>>> validate for that and throw an error. However, I don't think that's the
>>>> case (anymore?). In that case, passing a unicode object to the function
>>>> seems entirely consistent with other places, e.g. when we pass unicode
>>>> to the page template engine or return unicode from a view, which the
>>>> publisher then encodes before it's pushed down to the client.
>>>
>>> I opened a question in another part of the thread, but haven't gotten an
>>> answer yet. In my understanding, a Unicode string is not able to
>>> represent the structural properties of a URL in http scheme properly,
>>> thus encoding back to ASCII is not possible.
>>>
>>> Can someone confirm or disprove this?
>>
>> I am not sure what you mean. On the wire you get a path component in a
>> HTTP get request which is UTF-8 encoded and escaped. For example
>> http://ja.wikipedia.org/wiki/%E3%83%A1%E3%82%A4%E3%83%B3%E3%83%9A%E3%83%BC%E3%82%B8
>>
>> , which is a Japanese string if you decode it back to unicode. That
>> encoding works fine in two directions, and all other properties used in
>> the http scheme such as query strings and fragments work normally. Can
>> you provide an example of something that might not work?

CT> The problem is that a URI has internal structure which looks to me like
CT> it can't be reconstructed properly if it was decoded into a "regular"
CT> unicode string.

CT> E.g. reserved characters are probably decoded into their regular symbols
CT> (e.g. a slash embedded in a path component or ampersands used in query
CT> arguments), so escaping needs to be done (manually) before encoding.

CT> Also, some parts of a URI can use other ways to encode symbols.
CT> Hostnames would like to be encoded to punycode whereas URIs don't even
CT> say what character set unicode characters should be encoded to. That
CT> would be up to the application (e.g. our publisher, so that's manageable).

CT> I have the feeling that roundtrip behaviour of URI -> unicode string ->
CT> URI won't be possible fully correctly and thus may be susceptible to
CT> interference from the outside.

CT> I still hope we can do better than doing nothing about it. I just think
CT> it's more complex than calling encode('something'). ;)

CT> Christian

--
Best regards,
 Adam GROSZER
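The round-trip concern Christian raises above (URI -> unicode string -> URI) can be sketched as follows. This is only an illustration in Python 3 with the standard library's urllib.parse, not anything from the publisher; the example.com URL and all variable names are made up for the example. One path segment contains an escaped slash and one query value contains an escaped ampersand:

```python
from urllib.parse import parse_qsl, quote, unquote, urlsplit

# Hypothetical URL: "a%2Fb" is ONE path segment, "x%26y" is ONE query value.
original = "http://example.com/files/a%2Fb?q=x%26y&lang=en"

# Decoding the whole URL into a "regular" string turns the escaped
# reserved characters into structural ones.
decoded = unquote(original)

segments_before = urlsplit(original).path.split("/")
segments_after = urlsplit(decoded).path.split("/")

# parse_qsl() splits on "&" first and only then unescapes each value.
query_before = parse_qsl(urlsplit(original).query)
query_after = parse_qsl(urlsplit(decoded).query)

print(segments_before, segments_after)
print(query_before, query_after)

# Keeping the parts separate and escaping each one individually
# (safe="" also escapes "/") preserves the structure.
rebuilt_path = "/".join(quote(seg, safe="") for seg in ["", "files", "a/b"])
print(rebuilt_path)
```

The structure survives only if the components are kept apart while decoded and each one is escaped on its own when the URL is reassembled, which is more than a single encode() call on the whole string.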
