Re: g_filename_to_uri() issue in glib-win32

2012-05-23 Thread David Nečas
On Wed, May 23, 2012 at 06:48:31AM +0100, John Emmas wrote:
> But whatever that (second) character looked like, it's decimal value would
> always be 246 (because the UTF-8 sequence C3 B6 translates to decimal 246).
>
> The URI translation of decimal 246 is %F6.

This is nonsense.  Percent-encoding consists of % followed by *two*
hexadecimal digits and encodes *bytes*, see

http://tools.ietf.org/html/rfc3986#section-2.1

If things worked as you suggest you would not be able to encode any
codepoint larger than 255 and the entire thing would be pretty useless.
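
To make the byte-wise rule concrete, here is a minimal sketch in Python (my choice of language for illustration, nothing in GLib works this way internally as far as this thread shows). The helper name percent_encode is hypothetical; it escapes every byte outside the RFC 3986 unreserved set as % plus two hex digits.

```python
def percent_encode(data: bytes) -> str:
    """Percent-encode every byte outside the RFC 3986 unreserved set."""
    unreserved = (
        b"ABCDEFGHIJKLMNOPQRSTUVWXYZ"
        b"abcdefghijklmnopqrstuvwxyz"
        b"0123456789-._~"
    )
    # Iterating over bytes yields ints, so each escape covers exactly one byte.
    return "".join(
        chr(b) if b in unreserved else f"%{b:02X}"
        for b in data
    )

# U+20AC (EURO SIGN) is far above 255, yet each of its UTF-8 bytes fits in %XX.
euro_bytes = "\u20ac".encode("utf-8")
print(percent_encode(euro_bytes))  # %E2%82%AC
```

A codepoint-based scheme would have no two-digit escape for U+20AC at all, which is the point made above.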

Yeti

___
gtk-devel-list mailing list
gtk-devel-list@gnome.org
http://mail.gnome.org/mailman/listinfo/gtk-devel-list


Re: g_filename_to_uri() issue in glib-win32

2012-05-23 Thread Jürg Billeter
On Wed, 2012-05-23 at 06:48 +0100, John Emmas wrote:
> But whatever that (second) character looked like, it's decimal value
> would always be 246 (because the UTF-8 sequence C3 B6 translates to
> decimal 246).
>
> The URI translation of decimal 246 is %F6.

U+00F6 is the Unicode codepoint, but URI percent encoding never directly
uses codepoints: you can encode only a single byte at a time, and the
range of Unicode codepoints is much larger than that (up to U+10FFFF).
As Krzysztof already wrote, byte-wise encoding of UTF-8 strings is the
generally recommended way to encode URIs. See also the following links:

http://tools.ietf.org/html/rfc3987#section-6.4
http://www.w3.org/International/O-URL-code.html
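
The receiving side shows why this matters. The following sketch uses Python's standard urllib.parse.unquote purely as an illustration (it is an assumption that a typical URI consumer behaves like this; it is not GLib code). It decodes both candidate strings from this thread.

```python
from urllib.parse import unquote

# Escapes that came from UTF-8 bytes decode back to the original text.
from_utf8 = unquote("G%C3%B6ran")

# A lone %F6 is not a valid UTF-8 sequence; with unquote's default
# errors="replace" it becomes U+FFFD (the replacement character).
from_single_byte = unquote("G%F6ran")

# Only a consumer that assumes Latin-1 recovers the umlaut from %F6.
as_latin1 = unquote("G%F6ran", encoding="latin-1")

print(from_utf8 == "G\u00f6ran")         # True
print(from_single_byte == "G\ufffdran")  # True
print(as_latin1 == "G\u00f6ran")         # True
```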

Regards,
Jürg



Re: g_filename_to_uri() issue in glib-win32

2012-05-23 Thread John Emmas

On 23 May 2012, at 08:40, Jürg Billeter wrote:

> U+00F6 is the Unicode codepoint but URI percent encoding never directly
> uses codepoints as you can encode only a single byte at a time and the
> range of Unicode codepoints is much larger than that (up to U+10FFFF).
> As Krzysztof already wrote, byte-wise encoding of UTF-8 strings is the
> generally recommended way to encode URIs. See also the following links:
>
> http://tools.ietf.org/html/rfc3987#section-6.4
> http://www.w3.org/International/O-URL-code.html

Oops, sorry Jürg, I also meant to thank you for those links.  Still a bit 
confused really...  :-(

John


Re: g_filename_to_uri() issue in glib-win32

2012-05-23 Thread John Emmas
On 23 May 2012, at 10:05, John Emmas wrote:

> Still a bit confused really...  :-(

Not any more.

My confusion arose from the fact that the note for g_filename_to_uri() (i.e.
the comment inside gconvert.c) states that it's based on the requirements of
RFC 2396:-

http://www.ietf.org/rfc/rfc2396.txt

which has no requirement for a URI string to be pre-converted into UTF-8.  
However, I've just been told by a colleague that RFC 2396 has been superseded 
by RFC 3986:-

http://tools.ietf.org/html/rfc3986

by which time, the requirement had been introduced (see the very last paragraph 
of section 2).

So yes - it looks like GLib is doing the right thing and the problem must be
elsewhere.  Thanks for everyone's help with this.

John


g_filename_to_uri() issue in glib-win32

2012-05-22 Thread John Emmas
I'm using the GLib function g_filename_to_uri() in glib-win32 (version 2.24).
According to the documentation I should pass in a file path in the encoding
format used by GLib (which on Windows is UTF-8).  However, if I pass in a UTF-8
string, this function translates it byte-by-byte (as if it were plain
ASCII), i.e. it doesn't recognise that the string is UTF-8.

So for example, if the input string is Göran (encoded as UTF-8) I get the 
wrong output (hopefully, you can see that the 'o' has an umlaut).  
g_filename_to_uri encodes 6 characters and returns G%C3%B6ran instead of 
encoding just 5 characters to return the correct URI string G%F6ran.
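
For reference, the two candidate outputs correspond to two different byte encodings of the same name. The sketch below reproduces both with Python's standard urllib.parse.quote; Python is used only for illustration and the variable names are made up for this example.

```python
from urllib.parse import quote

name = "G\u00f6ran"  # "Göran", written with an escape to stay encoding-neutral

# quote() encodes str input as UTF-8 by default, then escapes byte by byte.
utf8_uri = quote(name)

# Forcing a single-byte encoding yields the one-escape form instead.
latin1_uri = quote(name, encoding="latin-1")

print(utf8_uri)    # G%C3%B6ran
print(latin1_uri)  # G%F6ran
```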

I can work around the problem by filtering my string through 
g_locale_from_utf8() before sending it to g_filename_to_uri() but I think that 
g_filename_to_uri() should be doing that for itself (either that - or the 
documentation's wrong).

Can anyone confirm if this is a bug or intended behaviour?  If it's a bug, is 
it fixed yet in the latest glib version?  Thanks though for an otherwise great 
product.

John Emmas


Re: g_filename_to_uri() issue in glib-win32

2012-05-22 Thread Krzysztof Kosiński
2012/5/22 John Emmas john...@tiscali.co.uk:
> So for example, if the input string is Göran (encoded as UTF-8) I get the
> wrong output (hopefully, you can see that the 'o' has an umlaut).
> g_filename_to_uri encodes 6 characters and returns G%C3%B6ran instead of
> encoding just 5 characters to return the correct URI string G%F6ran.

What you get is a URI encoding of the UTF-8 bytes. I think this is
the expected and correct behavior: there are multiple incompatible
locale encodings and there's no way for this function to know what
encoding you want to use for the URI. It would also fail if you had
characters not representable in the locale encoding.

This is at most a documentation bug. It should be stated that this
function converts the string byte-by-byte, and everything outside of
the 0-127 range is converted to hex escapes. (I think this is the only
sensible behavior for this function.)
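
A small sketch of the failure mode mentioned above, in Python for illustration only. Latin-1 stands in for "the locale encoding" here, which is an assumption; the actual Windows code page would vary by system.

```python
name = "Kosi\u0144ski"  # contains U+0144, which Latin-1 cannot represent

# Converting to the stand-in locale encoding first simply fails.
try:
    name.encode("latin-1")
    representable = True
except UnicodeEncodeError:
    representable = False

# Escaping the UTF-8 bytes always works: every byte above 127 gets an escape.
utf8_bytes = name.encode("utf-8")
escapes = [f"%{b:02X}" for b in utf8_bytes if b > 127]

print(representable)  # False
print(escapes)        # ['%C5', '%84']
```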

Regards, Krzysztof


Re: g_filename_to_uri() issue in glib-win32

2012-05-22 Thread John Emmas
On 23 May 2012, at 00:22, Krzysztof Kosiński wrote:

> What you get is an URI encoding of the UTF-8 bytes. I think this is
> the expected and correct behavior: there are multiple incompatible
> locale encodings and there's no way for this function to know what
> encoding you want to use for the URI. It would also fail if you had
> characters not representable in the locale encoding.
>
> This is at most a documentation bug. It should be stated that this
> function converts the string byte-by-byte, and everything outside of
> the 0-127 range is converted to hex escapes.

Thanks for the prompt reply Krzysztof,

I can see where you're coming from on this but there's another way to look at 
it.  In my example (Göran) the UTF-8 byte sequence (for my particular code 
page) would have been:-

47 C3 B6 72 61 6E

This would get displayed as:-

G [ some codepage dependent character ] r a n

But whatever that (second) character looked like, its decimal value would
always be 246 (because the UTF-8 sequence C3 B6 translates to decimal 246).

The URI translation of decimal 246 is %F6.

Therefore it should be possible to translate from UTF-8 [47 C3 B6 72 61 6E] 
into URI G%F6ran regardless of the user's code page.   On my system this 
would say Göran whereas on someone else's system it might look different but 
that's not really relevant.  The conversion itself is valid and shouldn't be 
affected by code pages.  Code pages will only affect the displayed appearance.
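
The numbers in this example can be checked with a short sketch (Python, used here only as a calculator; the variable names are illustrative). It separates the codepoint value from the byte values that make up the UTF-8 sequence.

```python
raw = bytes.fromhex("47 C3 B6 72 61 6E")

text = raw.decode("utf-8")    # the six bytes decode to five characters
codepoint = ord(text[1])      # value of the second character
byte_values = list(raw[1:3])  # the two bytes that carry that character

print(len(raw), len(text))  # 6 5
print(codepoint)            # 246
print(byte_values)          # [195, 182]
```

So 246 is the codepoint of the decoded character, while the octets actually present in the UTF-8 string are 195 and 182 (C3 and B6).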

Of course, this is only with my simple example.  There might be other examples 
where my theory breaks down.  I've only considered this particular case.  But 
if what you said was true Krzysztof, g_filename_to_utf8() would suffer from the 
same problem - but it doesn't.  If (on Windows) you pass it a UTF-8 filename, 
it correctly recognises that the name is already UTF-8 and returns the original 
string (i.e. it doesn't attempt a new byte-by-byte conversion).

So 'g_filename_to_uri()' is misbehaving AFAICT.

John