Re: g_filename_to_uri() issue in glib-win32
On Wed, May 23, 2012 at 06:48:31AM +0100, John Emmas wrote:

> But whatever that (second) character looked like, its decimal value would always be 246 (because the UTF-8 sequence C3 B6 translates to decimal 246). The URI translation of decimal 246 is %F6.

This is nonsense. Percent-encoding consists of % followed by *two* hexadecimal digits and encodes *bytes*; see http://tools.ietf.org/html/rfc3986#section-2.1

If things worked as you suggest, you would not be able to encode any codepoint larger than 255 and the entire thing would be pretty useless.

Yeti

___
gtk-devel-list mailing list
gtk-devel-list@gnome.org
http://mail.gnome.org/mailman/listinfo/gtk-devel-list
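To make the bytes-versus-codepoints point concrete, here is a small sketch. It uses Python's standard `urllib.parse.quote` purely for illustration; it is not GLib's code, and the variable names are made up for the example. It shows the same text escaped from its UTF-8 bytes and from its Latin-1 bytes, plus a codepoint above 255 that needs several escapes.

```python
from urllib.parse import quote

name = "G\u00f6ran"  # "Göran"; U+00F6 is o with diaeresis

# Default behaviour: encode the text as UTF-8, then escape each byte.
utf8_form = quote(name)

# Escaping the Latin-1 bytes instead gives the form expected in the question.
latin1_form = quote(name, encoding="latin-1")

# U+20AC (EURO SIGN) is above 255, so one %XX escape cannot hold it.
# Its three UTF-8 bytes are escaped one at a time.
euro_form = quote("\u20ac")

print(utf8_form)    # G%C3%B6ran
print(latin1_form)  # G%F6ran
print(euro_form)    # %E2%82%AC
```

Asking for `quote("\u20ac", encoding="latin-1")` raises `UnicodeEncodeError`, which is the "could not encode anything above 255" problem in practice.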
Re: g_filename_to_uri() issue in glib-win32
On Wed, 2012-05-23 at 06:48 +0100, John Emmas wrote:

> But whatever that (second) character looked like, its decimal value would always be 246 (because the UTF-8 sequence C3 B6 translates to decimal 246). The URI translation of decimal 246 is %F6.

U+00F6 is the Unicode codepoint, but URI percent-encoding never uses codepoints directly: each escape encodes only a single byte, and the range of Unicode codepoints is much larger than that (up to U+10FFFF). As Krzysztof already wrote, byte-wise encoding of UTF-8 strings is the generally recommended way to encode URIs.

See also the following links:
http://tools.ietf.org/html/rfc3987#section-6.4
http://www.w3.org/International/O-URL-code.html

Regards,
Jürg
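A minimal sketch of byte-wise encoding, written in Python for illustration only. The helper `percent_encode_utf8` and the constant `UNRESERVED` are hypothetical names, not GLib API, and the set of characters left unescaped is simply the "unreserved" set from RFC 3986 section 2.3; the exact set that g_filename_to_uri() leaves alone may well differ.

```python
# Unreserved characters per RFC 3986 section 2.3 (assumption: only these
# are left unescaped in this sketch).
UNRESERVED = frozenset(
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    b"abcdefghijklmnopqrstuvwxyz"
    b"0123456789-._~"
)


def percent_encode_utf8(text: str) -> str:
    """Hypothetical helper: escape the UTF-8 bytes of text one at a time."""
    out = []
    for byte in text.encode("utf-8"):
        if byte in UNRESERVED:
            out.append(chr(byte))        # plain ASCII, copied through
        else:
            out.append(f"%{byte:02X}")   # one escape per byte
    return "".join(out)


print(percent_encode_utf8("G\u00f6ran"))     # G%C3%B6ran
print(percent_encode_utf8(chr(0x10FFFF)))    # %F4%8F%BF%BF
```

Because the loop only ever sees bytes, the highest codepoint (U+10FFFF) is handled the same way as U+00F6; it just produces four escapes instead of two.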
Re: g_filename_to_uri() issue in glib-win32
On 23 May 2012, at 08:40, Jürg Billeter wrote:

> U+00F6 is the Unicode codepoint but URI percent encoding never directly uses codepoints as you can encode only a single byte at a time and the range of Unicode codepoints is much larger than that (up to U+10FFFF). As Krzysztof already wrote, byte-wise encoding of UTF-8 strings is the generally recommended way to encode URIs. See also the following links:
> http://tools.ietf.org/html/rfc3987#section-6.4
> http://www.w3.org/International/O-URL-code.html

Oops, sorry Jürg, I also meant to thank you for those links. Still a bit confused really... :-(

John
Re: g_filename_to_uri() issue in glib-win32
On 23 May 2012, at 10:05, John Emmas wrote:

> Still a bit confused really... :-(

Not any more. My confusion arose from the fact that the note for g_filename_to_uri() (i.e. the note inside gconvert.c) states that it's based on the requirements of RFC 2396:-

http://www.ietf.org/rfc/rfc2396.txt

which has no requirement for a URI string to be pre-converted into UTF-8. However, I've just been told by a colleague that RFC 2396 has been superseded by RFC 3986:-

http://tools.ietf.org/html/rfc3986

by which time the requirement had been introduced (see the very last paragraph of section 2). So yes - it looks like Glib is doing the right thing and the problem must be elsewhere.

Thanks for everyone's help with this.

John
g_filename_to_uri() issue in glib-win32
I'm using the Glib function g_filename_to_uri() in glib-win32 (version 2.24). According to the documentation I should pass in a file path in the encoding used by Glib (which on Windows is UTF-8). However, if I pass in a UTF-8 string, this function translates it byte-by-byte (as if it were plain ASCII), i.e. it doesn't recognise that the string is UTF-8.

So for example, if the input string is Göran (encoded as UTF-8) I get the wrong output (hopefully, you can see that the 'o' has an umlaut). g_filename_to_uri() encodes 6 characters and returns G%C3%B6ran instead of encoding just 5 characters to return the correct URI string G%F6ran.

I can work around the problem by filtering my string through g_locale_from_utf8() before sending it to g_filename_to_uri() but I think that g_filename_to_uri() should be doing that for itself (either that - or the documentation's wrong).

Can anyone confirm if this is a bug or intended behaviour? If it's a bug, is it fixed yet in the latest glib version? Thanks though for an otherwise great product.

John Emmas
Re: g_filename_to_uri() issue in glib-win32
2012/5/22 John Emmas john...@tiscali.co.uk:

> So for example, if the input string is Göran (encoded as UTF-8) I get the wrong output (hopefully, you can see that the 'o' has an umlaut). g_filename_to_uri encodes 6 characters and returns G%C3%B6ran instead of encoding just 5 characters to return the correct URI string G%F6ran.

What you get is a URI encoding of the UTF-8 bytes. I think this is the expected and correct behavior: there are multiple incompatible locale encodings and there's no way for this function to know what encoding you want to use for the URI. It would also fail if you had characters not representable in the locale encoding.

This is at most a documentation bug. It should be stated that this function converts the string byte-by-byte, and everything outside of the 0-127 range is converted to hex escapes. (I think this is the only sensible behavior for this function.)

Regards,
Krzysztof
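The "multiple incompatible locale encodings" problem can be seen from the receiving side. The sketch below uses Python's standard `urllib.parse.unquote` for illustration (not GLib), and the two legacy code pages are just examples chosen for the demonstration. The single escape %F6 names a byte, and what that byte means depends on which encoding the reader assumes.

```python
from urllib.parse import unquote

escaped = "G%F6ran"

# Read as Latin-1, byte 0xF6 is U+00F6 (o with diaeresis).
as_latin1 = unquote(escaped, encoding="latin-1")

# Read as Windows-1251 (Cyrillic), the same byte is U+0446 (Cyrillic tse).
as_cp1251 = unquote(escaped, encoding="cp1251")

# Read as UTF-8 (unquote's default), a lone 0xF6 is not valid, so it is
# replaced with U+FFFD.
as_utf8 = unquote(escaped)

# The UTF-8 based form decodes the same way for everyone.
round_trip = unquote("G%C3%B6ran")

print(as_latin1, as_cp1251, as_utf8, round_trip)
```

So G%F6ran only means "Göran" to a reader who happens to share the writer's code page, whereas G%C3%B6ran carries its own unambiguous interpretation once UTF-8 is the agreed convention.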
Re: g_filename_to_uri() issue in glib-win32
On 23 May 2012, at 00:22, Krzysztof Kosiński wrote:

> What you get is an URI encoding of the UTF-8 bytes. I think this is the expected and correct behavior: there are multiple incompatible locale encodings and there's no way for this function to know what encoding you want to use for the URI. It would also fail if you had characters not representable in the locale encoding.
>
> This is at most a documentation bug. It should be stated that this function converts the string byte-by-byte, and everything outside of the 0-127 range is converted to hex escapes.

Thanks for the prompt reply Krzysztof, I can see where you're coming from on this but there's another way to look at it. In my example (Göran) the UTF-8 byte sequence (for my particular code page) would have been:-

47 C3 B6 72 61 6E

This would get displayed as:-

G [ some codepage dependent character ] r a n

But whatever that (second) character looked like, its decimal value would always be 246 (because the UTF-8 sequence C3 B6 translates to decimal 246). The URI translation of decimal 246 is %F6. Therefore it should be possible to translate from UTF-8 [47 C3 B6 72 61 6E] into URI G%F6ran regardless of the user's code page. On my system this would say Göran whereas on someone else's system it might look different but that's not really relevant. The conversion itself is valid and shouldn't be affected by code pages. Code pages will only affect the displayed appearance.

Of course, this is only with my simple example. There might be other examples where my theory breaks down. I've only considered this particular case.

But if what you said was true Krzysztof, g_filename_to_utf8() would suffer from the same problem - but it doesn't. If (on Windows) you pass it a UTF-8 filename, it correctly recognises that the name is already UTF-8 and returns the original string (i.e. it doesn't attempt a new byte-by-byte conversion). So g_filename_to_uri() is misbehaving AFAICT.
John
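For reference, the arithmetic behind "C3 B6 translates to decimal 246" can be sketched as follows (Python, for illustration only; the variable names are made up for the example). It also shows why 246 looks like a byte value here: Latin-1 happens to map codepoints 0-255 onto identical byte values, which does not generalise to higher codepoints.

```python
lead, cont = 0xC3, 0xB6

# Two-byte UTF-8 layout: 110xxxxx 10yyyyyy -> codepoint xxxxxyyyyyy
code_point = ((lead & 0x1F) << 6) | (cont & 0x3F)
print(code_point, hex(code_point))   # 246 0xf6

# The standard decoder agrees: the two bytes are one character, U+00F6.
decoded = bytes([lead, cont]).decode("utf-8")

# In Latin-1 the codepoint 246 is stored as the single byte 0xF6, which is
# where the %F6 form comes from. It is a property of that one encoding.
latin1_bytes = decoded.encode("latin-1")

# A codepoint above 255 has no single-byte form at all in UTF-8.
euro_bytes = "\u20ac".encode("utf-8")
print(decoded, latin1_bytes, euro_bytes)
```

The codepoint (246) is the same on every system; what varies is the byte sequence chosen to represent it, and percent-encoding operates on that byte sequence.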