Re: [rfbproto] [PATCH] Specify UTF-8 for strings (v2)
On 2009-09-01 10:21, Pierre Ossman wrote:
> Steer things towards UTF-8, whilst also adding a notice that
> historically there has been a lot of different encodings in use.
>
> Signed-off-by: Pierre Ossman oss...@cendio.se

Yes, please.

Cheers,
Peter

--
Let Crystal Reports handle the reporting - Free Crystal Reports 2008 30-Day
trial. Simplify your report design, integration and deployment - and focus on
what you do best, core application coding. Discover what's new with
Crystal Reports now. http://p.sf.net/sfu/bobj-july
___
tigervnc-rfbproto mailing list
tigervnc-rfbproto@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/tigervnc-rfbproto
Re: [rfbproto] [PATCH] Specify UTF-8 for strings (v2)
On Tue, Sep 01, 2009 at 10:21:37AM +0200, Pierre Ossman wrote:
> Steer things towards UTF-8, whilst also adding a notice that
> historically there has been a lot of different encodings in use.

+1

> Signed-off-by: Pierre Ossman oss...@cendio.se
> ---
> Index: rfbproto.rst
> ===
> --- rfbproto.rst (revision 3887)
> +++ rfbproto.rst (working copy)
> @@ -201,6 +201,34 @@
>  security types do not clash. Please see the RealVNC website at
>  http://www.realvnc.com for details of how to contact them.
>
> +String Encodings
> +================
> +
> +The encoding used for strings in the protocol has historically often
> +been unspecified, or has changed between versions of the protocol. As a
> +result, there are a lot of implementations which use different,
> +incompatible encodings. Commonly those encodings have been ISO 8859-1
> +(also known as Latin-1) or Windows code pages.
> +
> +It is strongly recommended that new implementations use the UTF-8
> +encoding for these strings. This allows full unicode support, yet
> +retains good compatibility with older RFB implementations.
> +
> +New protocol additions that do not have a legacy problem should mandate
> +the UTF-8 encoding to provide full character support and to avoid any
> +issues with ambiguity.
> +
> +All clients and servers should be prepared to receive invalid UTF-8
> +sequences at all times. These can occur as a result of historical
> +ambiguity or because of bugs. Neither case should result in lost
> +protocol synchronization.
> +
> +Handling an invalid UTF-8 sequence is largely dependent on the role
> +that string plays. Modifying the string should only be done when the
> +string is only used in the user interface. It should be obvious in that
> +case that the string has been modified, e.g. by appending a notice to
> +the string.
> +
>  Protocol Messages
>  =================
>
> @@ -614,8 +642,12 @@
>  *name-length*  ``U8`` array  *name-string*
>  === === ===
>
> -where ``PIXEL_FORMAT`` is
> +The text encoding used for *name-string* is historically undefined but
> +it is strongly recommended to use UTF-8 (see `String Encodings`_ for
> +more details).
>
> +``PIXEL_FORMAT`` is defined as:
> +
>  === === ===
>  No. of bytes  Type  Description
>  === === ===
>
> --
> Pierre Ossman            OpenSource-based Thin Client Technology
> System Developer         Telephone: +46-13-21 46 00
> Cendio AB                Web: http://www.cendio.com

--
Adam Tkac, Red Hat, Inc.
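The handling rules in the patch above can be illustrated with a minimal sketch. It assumes the ServerInit layout from the published RFB specification (a big-endian U32 *name-length* followed by that many bytes of *name-string*). The function name `read_name_string` and the wording of `NOTICE` are hypothetical and not taken from the patch. Because the length prefix alone decides how many bytes are consumed, a badly encoded name cannot cause lost protocol synchronization, and since the name is only shown in the user interface, a modified string is marked with a visible notice.

```python
import io
import struct

# Hypothetical notice text; the patch only says the change should be obvious.
NOTICE = " [invalid UTF-8 replaced]"


def read_name_string(stream):
    """Read a U32 length-prefixed name-string and decode it for display."""
    header = stream.read(4)
    if len(header) != 4:
        raise EOFError("truncated name-length")
    (length,) = struct.unpack(">I", header)  # U32, big-endian
    raw = stream.read(length)
    if len(raw) != length:
        raise EOFError("truncated name-string")
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Display-only string: substitute U+FFFD and flag the modification.
        return raw.decode("utf-8", errors="replace") + NOTICE


if __name__ == "__main__":
    # A Latin-1 encoded name followed by the first bytes of the next message.
    wire = io.BytesIO(struct.pack(">I", 4) + b"caf\xe9" + b"next")
    print(read_name_string(wire))
    print(wire.read())  # the following message bytes are still intact
```

The same approach would not be appropriate for strings that are compared or passed on (user names, file names), where the patch advises against modifying the string.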
Re: [rfbproto] [PATCH] Specify UTF-8 for strings
On Mon, Aug 17, 2009 at 11:20:08AM +0200, Peter Rosin wrote:
> On 2009-08-17 10:59, Adam Tkac wrote:
>> On Mon, Aug 17, 2009 at 10:22:56AM +0200, Peter Rosin wrote:
>>> If it is so natural with UTF-8 and if it really is the only sane
>>> choice (I think it is), it's enough if our spec says (e.g.)
>>>
>>>   It is strongly recommended that all implementations use UTF-8
>>>   for all strings (except where explicitly stated otherwise) to
>>>   ensure interoperability. But be prepared that not all
>>>   implementations do, so fail gracefully if you receive something
>>>   else.
>>>
>>> instead of (e.g.)
>>>
>>>   All implementations MUST use UTF-8 for all strings (except where
>>>   explicitly stated otherwise). But not all implementations do, so
>>>   you SHOULD fail gracefully if you receive something else.
>>>
>>> I just don't see why the wording with MUST/SHOULD is so superior
>>> that it is worth rendering existing implementations incompatible
>>> with our spec.
>>
>> This is ok with me. I don't think there's any difference in practice.
>
> Oh, cool.
>
>>> Pierre previously asked if I had any alternative wording, so here
>>> is my suggestion:
>>>
>>> diff --git a/rfbproto.rst b/rfbproto.rst
>>> index 7852746..0252e4f 100644
>>> --- a/rfbproto.rst
>>> +++ b/rfbproto.rst
>>> @@ -201,6 +201,26 @@ that you contact RealVNC Ltd to make sure that your encodin
>>>  security types do not clash. Please see the RealVNC website at
>>>  http://www.realvnc.com for details of how to contact them.
>>>
>>> +String Encodings
>>> +================
>>> +
>>> +It is strongly recommended that strings in RFB are encoded using the
>>> +UTF-8 encoding. This allows full unicode support, yet retains good
>>> +compatibility with older RFB implementations.
>>> +
>>> +The encoding used for strings in the protocol has historically often
>>> +been unspecified, or has changed between versions of the protocol. As a
>>> +result, there are a lot of implementations which use different,
>>> +incompatible encodings. Commonly those encodings have been ISO 8859-1
>>> +(also known as Latin-1) or Windows code pages.
>>> +
>>> +Clients and servers are encouraged to send UTF-8 strings unless that
>>> +particular part of the protocol mandates another encoding. They should
>>> +however be prepared to receive invalid UTF-8 sequences at all times.
>>> +Such sequences should be handled gracefully by e.g. stripping the
>>> +invalid portions or trying to interpret the string using common
>>> +encodings such as ISO 8859-1 or Windows code page 1252.
>>> +
>>
>> Hm, it is easy to say "invalid portions" of a UTF-8 string, but it is
>> _very_ hard to create an algorithm which will determine whether a part
>> of a string is valid or invalid. If you are using UTF-8, users might
>> create strings with obscure characters. I think this kind of heuristic
>> should not be included in the protocol.
>
> The only thing I changed from the original patch (by Pierre) in the
> last three lines was to add "e.g.", so that implementors would have a
> choice of doing something else if they liked to.
>
> But is it really hard to determine UTF-8 validity? I think that is
> exactly one of the nice properties of UTF-8. Quoting from the UTF-8
> article on Wikipedia:
>
>   Because the starting and continuation bytes are distinct sets,
>   UTF-8 is self-synchronizing. Character boundaries are easily found
>   when searching either forwards or backwards. If bytes are lost due
>   to error or corruption, one can always locate the beginning of the
>   next character and thus limit the damage. Many multi-byte encodings
>   are much harder to resynchronize.
>
> Or are you talking about something else?
>
>> If an implementation sends strings in, for example, the ISO 8859-*
>> encoding it will end with crippled characters but we have to live
>> with it, there is probably no algorithm to solve this problem.
>
> You could have an option that says, if a string has errors according
> to UTF-8, treat it as ISO 8859-1 (substitute for your preferred
> encoding).

Yes, something like that sounds better to me. I have attached an improved
(I hope it is an improvement ;)) specification of strings.

Regards, Adam

--
Adam Tkac, Red Hat, Inc.

Index: rfbproto.rst
===
--- rfbproto.rst (revision 3871)
+++ rfbproto.rst (working copy)
@@ -201,6 +201,25 @@
 security types do not clash. Please see the RealVNC website at
 http://www.realvnc.com for details of how to contact them.

+String Encodings
+================
+
+The encoding used for strings in the protocol has historically often
+been unspecified, or has changed between versions of the protocol. As a
+result, there are a lot of implementations which use different,
+incompatible encodings. Commonly those encodings have been ISO 8859-1
+(also known as Latin-1) or Windows code pages.
+
+All new implementations should encode strings in UTF-8 unless the
+particular part of the protocol mandates another encoding. This allows
+full Unicode support, yet retains good compatibility with older
+RFB implementations.
+
+If a string has errors according to UTF-8, try to treat it
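On the question raised in this message of whether the invalid portions of a UTF-8 string can be located at all, a small sketch may help. It relies only on Python's built-in strict UTF-8 decoder, whose `UnicodeDecodeError` carries `start` and `end` offsets of the offending bytes. The function name `invalid_utf8_spans` is hypothetical. This is an illustration of the self-synchronizing property quoted above, not a proposal for spec text: after each bad run the scan simply resumes at the next byte.

```python
def invalid_utf8_spans(data):
    """Return (start, end) byte offsets of each invalid run in data."""
    spans = []
    pos = 0
    while pos < len(data):
        try:
            # Strict decode of the remainder; succeeds if nothing else is wrong.
            data[pos:].decode("utf-8")
            break
        except UnicodeDecodeError as err:
            # err.start and err.end are relative to the slice being decoded.
            spans.append((pos + err.start, pos + err.end))
            pos += err.end  # resume right after the invalid bytes
    return spans


if __name__ == "__main__":
    print(invalid_utf8_spans("übuñtü".encode("utf-8")))  # well-formed input
    print(invalid_utf8_spans(b"ab\xffcd"))               # one stray byte
```

Re-decoding the tail after every error is quadratic in the worst case, which is acceptable for short protocol strings but would be worth replacing with an incremental decoder for large inputs. Note that this only answers whether bytes are well-formed UTF-8. It cannot tell which legacy encoding the sender actually intended.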
Re: [rfbproto] [PATCH] Specify UTF-8 for strings
On 2009-08-17 12:47, Adam Tkac wrote:
> On Mon, Aug 17, 2009 at 11:20:08AM +0200, Peter Rosin wrote:
>> You could have an option that says, if a string has errors according
>> to UTF-8, treat it as ISO 8859-1 (substitute for your preferred
>> encoding).
>
> Yes, something like that sounds better for me. I attached improved (I
> hope it is an improvement ;)) specification of strings.

*snip*

> +All new implementations should encode strings in UTF-8 unless the

Sorry, but it's not an improvement if you reintroduce either of the
magic words SHOULD and MUST in this context. I'm obviously talking to
deaf ears.

And my suggested option was just some configuration option in some
implementation; I did not intend for that to go into the spec. I agree
with Peter Åstrand that the specific names of any fallback encodings
should probably be left out of the spec.

Cheers,
Peter
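The implementation-side option mentioned here (try UTF-8 first, otherwise use a locally configured legacy encoding) could look roughly like the following sketch. The function name `decode_with_fallback` and its `fallback` parameter are hypothetical. The default of ISO 8859-1 merely mirrors the example given in the thread and is a local configuration choice, not something the spec would name.

```python
def decode_with_fallback(raw, fallback="iso-8859-1"):
    """Decode raw as UTF-8; on failure use the configured legacy encoding."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # errors="replace" matters for code pages with undefined bytes
        # (for example cp1252); ISO 8859-1 itself accepts every byte value.
        return raw.decode(fallback, errors="replace")


if __name__ == "__main__":
    print(decode_with_fallback("café".encode("utf-8")))  # valid UTF-8 path
    print(decode_with_fallback(b"caf\xe9"))              # Latin-1 fallback path
```

A caveat of this heuristic: some legacy byte sequences happen to be valid UTF-8, in which case they are decoded as UTF-8 without any error being raised.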
Re: [rfbproto] [PATCH] Specify UTF-8 for strings
On 2009-08-17 11:45, Peter Åstrand wrote:
>>>> No we didn't, we agreed on that for the desktop name.
>>>
>>> Refresh my memory - which other strings are sent as ANSI CODE PAGE?
>>
>> Username and password in the VeNCrypt extension. There are some
>> strings in the gii extension. The tight file transfer extension sends
>> filenames. And I'm sure I'm forgetting at least some string, that was
>> just off the top of my head...
>
> But neither of those are in the protocol at this point. And the tight
> file transfer is heavily deprecated, even by TightVNC.

You better check that, as gii is in the spec. You continue to ignore
and look past everything that smells like a problem, which is not very
comforting. You seem too eager to add UTF-8 to the spec.

And I'd say they are all in the protocol, just not in our specification
of the protocol, so it's still a problem in my book. If we add this
language now, we will erect a barrier making it harder to add those
other protocol extensions to our spec. I think the burden should be on
those trying to do new things (which is to specify UTF-8 for all
strings) and not on those documenting existing extensions/behaviours.

Oh, and there are also strings in the SASL extension.

I think it would be worth more to have SASL, VeNCrypt and Tight file
transfer documented (even if deprecated) in the spec than to add some
language about UTF-8.

BTW, where is this heavy deprecation of tight file transfers documented?

>> Nailing to ASCII is worse than nailing to UTF-8. Both make our spec
>> incompatible with existing implementations. We have to allow for
>> implementations to do whatever non-UTF-8 thingy they have been doing,
>> but still recommend against it.
>
> I would say: Don't give people rope. If we start documenting that full
> UTF-8 internationalized strings are allowed in, say, ProtocolVersion,
> I'm sure we will soon see a server that presents itself as
> RFB 003.008übuñtü or something like that...

Come on, that's a stupid argument since ProtocolVersion is specified as
ASCII and to be 12 characters long. And to be in a very specific format.
By everybody.

>>> We are not. It's just that clients that relied on receiving the
>>> DesktopName in something else than UTF-8 were on their own and relied
>>> on unspecified protocol behaviour.
>>
>> Reversing that argument is so easy, Xvnc was on its own when it relied
>> on unspecified protocol behaviour...
>
> True, but you can't avoid the fact that the UTF-8 variant is,
> currently, transmitted over the wire.

True, but you can't avoid the fact that some implementations expect,
currently, that the data instead should have been transmitted using
CP-1252, or ISO 8859-1, or etc...

Today, when there is encoding disconnect, it is the fault of no one
(except the spec). If we now start to specify one thing as correct, we
will shift the balance and make everybody in the UTF-8 camp right. And
everybody else wrong. I don't think that's fair.

But I still think we can specify one thing as superior. By doing that
the alternative options are no longer wrong, just inferior.

>>> Strange language. We are not forbidding any clients. It's true that a
>>> few clients could theoretically start rendering the names
>>> incorrectly, but...
>>
>> But we seriously do not want to divide the RFB community (any
>> further), and we are doing that if we say MUST use UTF-8. With a MUST
>> in there, anything else is not acceptable, hence illegal by our spec.
>
> I see no problem with our document being somewhat stricter than the
> RealVNC one, when it comes to previously undefined things.

In that we differ.

Cheers,
Peter
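The point about ProtocolVersion being 12 ASCII characters in a fixed format can be made concrete with a short sketch. The exact layout assumed here, `RFB xxx.yyy` followed by a newline with zero-padded three-digit version numbers, comes from the published RFB specification rather than from this thread, and the function name `parse_protocol_version` is hypothetical. A strict parser of this kind rejects something like the `RFB 003.008übuñtü` example regardless of what the spec says about other strings.

```python
import re

# 12 bytes in total: "RFB ", three digits, ".", three digits, "\n".
# In a bytes pattern, \d matches ASCII digits only.
_VERSION_RE = re.compile(rb"RFB (\d{3})\.(\d{3})\n")


def parse_protocol_version(raw):
    """Return (major, minor) from a 12-byte ProtocolVersion message."""
    match = _VERSION_RE.fullmatch(raw)
    if match is None:
        raise ValueError("malformed ProtocolVersion: %r" % (raw,))
    return int(match.group(1)), int(match.group(2))


if __name__ == "__main__":
    print(parse_protocol_version(b"RFB 003.008\n"))
```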