On Fri, Aug 05, 2005 at 11:59:07AM +0200, Paweł Pałucha wrote:
> I'm sending another patch:
>
> - for uri.c - some characters added to second argument of
>   xmlURIEscapeStr() to be a little more RFC2396 compatible (so please
>   ignore previous patch)
> - for nanohttp.c - uri fragments are escaped while creating nanohttp
>   context, so they are properly escaped in HTTP request

  Give me a bit of time to read this. I want it fixed, but I want it
fixed for good :-)

> I know it isn't the best solution because some strange urls can be
> messed up with escaping/unescaping. But at least I can get

  The only time in libxml2 where we should unescape is when we take a
relative URI as a path, and this is a grey area anyway: at least on Unix,
paths don't have an absolute encoding; they are interpreted as sequences
of bytes expected to be in the user's locale default encoding. I don't
want to make libxml2 rely on the locale settings.

> 'http://alpha/~pawel/żółty żółw.xml' from my server, which is not
> possible with current libxml2 state.

  The problem is that this string taken in isolation doesn't mean much,
even if you think it is a URI. If it is embedded as a URI-Reference within
an XML document, then at least you know the encoding inherited from the
context document, and conversion to Unicode code points, then to proper
UTF-8, and then to an escaped URL is possible. Unfortunately, taken in
isolation (for example in the context of this mail without encoding
indication, or as a libxml2 xmlReadFile argument), this is just a sequence
of bytes, and you should never rely on this to work, because it *will*
break in general. See the best practice suggested:

  http://www.w3.org/TR/2004/CR-charmod-resid-20041122/#C060
  Encode to UTF-8 and then do byte by byte URI escaping

I.e. when trying to use such a URI:
  1/ you should not use it as is unless you have a clear encoding
     inferred from the context
  2/ if there is any risk that the encoding may be misunderstood, then
     convert to UTF-8 and URI escape, i.e. the first letter ż will be
     converted to two sequences %xy%zq and not a single one based on the
     byte value in the ISO Latin code.
The resulting ASCII sequence will be completely unambiguous and can't be
messed up by layers in the stack.

  Yes I18N is a scary mess ...

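  To make the "encode to UTF-8, then escape byte by byte" rule concrete,
here is a minimal sketch in plain C. It is not libxml2 code: the names
escape_utf8_path() and is_unescaped() are hypothetical, and the set of
characters left unescaped (the RFC 2396 "unreserved" set plus '/') is an
assumption chosen for illustration. The input is assumed to be already
in UTF-8.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical policy: keep RFC 2396 "unreserved" characters and '/',
 * percent-escape every other byte. */
static int is_unescaped(unsigned char c) {
    if ((c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
        (c >= '0' && c <= '9'))
        return 1;
    return c != '\0' && strchr("-_.!~*'()/", c) != NULL;
}

/* Percent-escape a string byte by byte. The caller is expected to pass
 * bytes that are already UTF-8; no encoding conversion happens here.
 * Returns a malloc()ed string the caller must free(), or NULL. */
char *escape_utf8_path(const char *utf8) {
    static const char hex[] = "0123456789ABCDEF";
    size_t len = strlen(utf8);
    char *out = malloc(3 * len + 1);   /* worst case: every byte -> %XX */
    char *p = out;

    if (out == NULL)
        return NULL;
    for (size_t i = 0; i < len; i++) {
        unsigned char c = (unsigned char)utf8[i];
        if (is_unescaped(c)) {
            *p++ = (char)c;
        } else {
            *p++ = '%';
            *p++ = hex[c >> 4];
            *p++ = hex[c & 0x0F];
        }
    }
    *p = '\0';
    return out;
}
```

  With this sketch, ż in UTF-8 (bytes 0xC5 0xBC) becomes %C5%BC, while
the same letter stored as ISO 8859-2 (single byte 0xBF) would become
%BF. The function cannot tell the two apart, which is exactly why the
encoding has to be known before escaping.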
Daniel

-- 
Daniel Veillard      | Red Hat Desktop team http://redhat.com/
[EMAIL PROTECTED]    | libxml GNOME XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/
