Re: [xml] spaces in uri, again

Daniel Veillard Fri, 05 Aug 2005 03:33:13 -0700

On Fri, Aug 05, 2005 at 11:59:07AM +0200, Paweł Pałucha wrote:
> 
> I'm sending another patch:
> 
> - for uri.c - some characters added to second argument of 
> xmlURIEscapeStr() to be a little more RFC2396 compatible (so please 
> ignore previous patch)
> - for nanohttp.c - uri fragments are escaped while creating nanohttp 
> context, so they are properly escaped in HTTP request


  Give me a bit of time to read this. I want it fixed, but I want it
fixed for good :-)

> I know it isn't the best solution because some strange urls can be 
> messed up with escaping/unescaping. But at least I can get 

  The only place in libxml2 where we should unescape is when we take
a relative URI as a path, and this is a grey area anyway: at least
on Unix, paths don't have an absolute encoding. They are interpreted
as sequences of bytes expected to be in the user's locale default encoding,
and I don't want to make libxml2 rely on the locale settings.

> 'http://alpha/~pawel/żółty żółw.xml' from my server, which is not 
> possible with current libxml2 state.

  The problem is that this string taken in isolation doesn't mean much,
even if you think it is a URI. If it is embedded as a URI-Reference
within an XML document, then at least you know the encoding inherited
from the containing document, and conversion to Unicode code points,
then to UTF-8, and then to a properly escaped URL is possible. Unfortunately,
taken in isolation (for example in the context of this mail, without encoding
indication, or as a libxml2 xmlReadFile argument), this is just a sequence
of bytes, and you should never rely on this to work, because it *will*
break in general. See the suggested best practice:
  http://www.w3.org/TR/2004/CR-charmod-resid-20041122/#C060
  Encode to UTF-8 and then do byte by byte URI escaping

I.e. when trying to use such a URI
  1/ you should not use it as is unless you have a clear encoding inferred
     from the context
  2/ if there is any risk that the encoding may be misunderstood, then
     convert to UTF-8 and URI escape, i.e. the first letter ż will
     be converted to two escape sequences %xy%zq and not a single one based
     on its byte value in the ISO Latin encoding. The resulting ASCII sequence
     will be completely unambiguous and can't be messed up by layers in the
     stack.
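
A minimal sketch of step 2/, written in Python purely for illustration (it
is not libxml2 code, and the helper name utf8_percent_escape is made up
here). It assumes only the path part of the URI is passed in, and that "/"
should be kept as a separator:

```python
import string

# RFC 3986 "unreserved" characters never need escaping.
UNRESERVED = frozenset((string.ascii_letters + string.digits + "-._~").encode("ascii"))


def utf8_percent_escape(text, keep=b"/"):
    """Encode text to UTF-8, then escape it byte by byte.

    Hypothetical helper for illustration; 'keep' lists extra bytes
    (here the path separator) that are passed through unescaped.
    """
    out = []
    for byte in text.encode("utf-8"):
        if byte in UNRESERVED or byte in keep:
            out.append(chr(byte))
        else:
            out.append("%%%02X" % byte)
    return "".join(out)


if __name__ == "__main__":
    print(utf8_percent_escape("/~pawel/żółty żółw.xml"))
```

With this, ż (U+017C) becomes the two escapes %C5%BC, whereas escaping its
single ISO-8859-2 byte would give %BF, which a receiver can only decode
correctly by guessing the encoding.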

Yes I18N is a scary mess ...

Daniel

-- 
Daniel Veillard      | Red Hat Desktop team http://redhat.com/
[EMAIL PROTECTED]  | libxml GNOME XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/
_______________________________________________
xml mailing list, project page  http://xmlsoft.org/
[email protected]
http://mail.gnome.org/mailman/listinfo/xml
