Re: [rfbproto] [PATCH] Specify UTF-8 for strings (v2)

2009-09-02 Thread Peter Rosin
On 2009-09-01 10:21, Pierre Ossman wrote:
 Steer things towards UTF-8, whilst also adding a notice that
 historically there have been many different encodings in use.
 
 Signed-off-by: Pierre Ossman oss...@cendio.se

Yes, please.

Cheers,
Peter

--
Let Crystal Reports handle the reporting - Free Crystal Reports 2008 30-Day 
trial. Simplify your report design, integration and deployment - and focus on 
what you do best, core application coding. Discover what's new with 
Crystal Reports now.  http://p.sf.net/sfu/bobj-july
___
tigervnc-rfbproto mailing list
tigervnc-rfbproto@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/tigervnc-rfbproto


Re: [rfbproto] [PATCH] Specify UTF-8 for strings (v2)

2009-09-01 Thread Adam Tkac
On Tue, Sep 01, 2009 at 10:21:37AM +0200, Pierre Ossman wrote:
 Steer things towards UTF-8, whilst also adding a notice that
 historically there have been many different encodings in use.
 

+1

 Signed-off-by: Pierre Ossman oss...@cendio.se
 ---
 
 Index: rfbproto.rst
 ===================================================================
 --- rfbproto.rst  (revision 3887)
 +++ rfbproto.rst  (working copy)
 @@ -201,6 +201,34 @@
  security types do not clash. Please see the RealVNC website at
  http://www.realvnc.com for details of how to contact them.
  
 +String Encodings
 +================
 +
 +The encoding used for strings in the protocol has historically often
 +been unspecified, or has changed between versions of the protocol. As a
 +result, there are a lot of implementations which use different,
 +incompatible encodings. Commonly those encodings have been ISO 8859-1
 +(also known as Latin-1) or Windows code pages.
 +
 +It is strongly recommended that new implementations use the UTF-8
 +encoding for these strings. This allows full Unicode support, yet
 +retains good compatibility with older RFB implementations.
 +
 +New protocol additions that do not have a legacy problem should mandate
 +the UTF-8 encoding to provide full character support and to avoid any
 +issues with ambiguity.
 +
 +All clients and servers should be prepared to receive invalid UTF-8
 +sequences at all times. These can occur as a result of historical
 +ambiguity or because of bugs. Neither case should result in lost
 +protocol synchronization.
 +
 +Handling an invalid UTF-8 sequence is largely dependent on the role
 +that string plays. Modifying the string should only be done when the
 +string is only used in the user interface. It should be obvious in that
 +case that the string has been modified, e.g. by appending a notice to
 +the string.
 +
  Protocol Messages
  =================
  
 @@ -614,8 +642,12 @@
  *name-length*   ``U8`` array        *name-string*
  =============== =================== ===================================
  
 -where ``PIXEL_FORMAT`` is
 +The text encoding used for *name-string* is historically undefined but
 +it is strongly recommended to use UTF-8 (see `String Encodings`_ for
 +more details).
  
 +``PIXEL_FORMAT`` is defined as:
 +
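  =============== =================== ===================================
  No. of bytes    Type                Description
  =============== =================== ===================================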
 
 
 
 -- 
 Pierre Ossman            OpenSource-based Thin Client Technology
 System Developer         Telephone: +46-13-21 46 00
 Cendio AB                Web: http://www.cendio.com
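
The handling rules in the patch above (keep protocol synchronization, and only alter a string that is used in the user interface, making the alteration obvious) could be sketched roughly as follows. This is only an illustration, not part of the patch: the function names and the notice text are hypothetical, and the U32 length prefix is assumed from the ServerInit layout.

```python
import io
import struct

# Hypothetical marker text; the patch only says the change should be obvious.
NOTICE = " [invalid UTF-8]"


def read_name_string(stream):
    """Read a U32 length, then exactly that many bytes, whatever they contain.

    The byte count comes from the length field alone, so an invalid
    encoding can never shift the position of the next message.
    """
    header = stream.read(4)
    if len(header) != 4:
        raise EOFError("truncated name-length")
    (length,) = struct.unpack(">I", header)
    raw = stream.read(length)
    if len(raw) != length:
        raise EOFError("truncated name-string")
    return raw


def name_for_display(raw):
    """Decode for user-interface use only; flag the string if it was altered."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Replace undecodable bytes with U+FFFD and append a visible notice.
        return raw.decode("utf-8", errors="replace") + NOTICE
```

Strings that are not purely cosmetic (user names, file names) would keep the raw bytes instead of going through name_for_display.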





-- 
Adam Tkac, Red Hat, Inc.



Re: [rfbproto] [PATCH] Specify UTF-8 for strings

2009-08-17 Thread Adam Tkac
On Mon, Aug 17, 2009 at 11:20:08AM +0200, Peter Rosin wrote:
 On 2009-08-17 10:59, Adam Tkac wrote:
 On Mon, Aug 17, 2009 at 10:22:56AM +0200, Peter Rosin wrote:
 If it is so natural with UTF-8 and if it really is the only sane choice
 (I think it is), it's enough if our spec says (e.g.)

 It is strongly recommended that all implementations use
 UTF-8 for all strings (unless explicitly stated otherwise)
 to ensure interoperability. But be prepared that not all
 implementations do, so fail gracefully if you receive
 something else.

 instead of (e.g.)

 All implementations MUST use UTF-8 for all strings (unless
 explicitly stated otherwise). But not all implementations
 do, so you SHOULD fail gracefully if you receive something
 else.

 I just don't see why the wording with MUST/SHOULD is so superior
 that it is worth rendering existing implementations incompatible
 with our spec.
 This is ok with me. I don't think there's any difference in practice.
 Oh, cool. Pierre previously asked if I had any alternative wording,
 so here is my suggestion:

 diff --git a/rfbproto.rst b/rfbproto.rst
 index 7852746..0252e4f 100644
 --- a/rfbproto.rst
 +++ b/rfbproto.rst
 @@ -201,6 +201,26 @@ that you contact RealVNC Ltd to make sure that your encodin
   security types do not clash. Please see the RealVNC website at
   http://www.realvnc.com for details of how to contact them.

 +String Encodings
 +================
 +
 +It is strongly recommended that strings in RFB are encoded using the
 +UTF-8 encoding. This allows full Unicode support, yet retains good
 +compatibility with older RFB implementations.
 +
 +The encoding used for strings in the protocol has historically often
 +been unspecified, or has changed between versions of the protocol. As a
 +result, there are a lot of implementations which use different,
 +incompatible encodings. Commonly those encodings have been ISO 8859-1
 +(also known as Latin-1) or Windows code pages.
 +
 +Clients and servers are encouraged to send UTF-8 strings unless that
 +particular part of the protocol mandates another encoding. They should
 +however be prepared to receive invalid UTF-8 sequences at all times.
 +Such sequences should be handled gracefully by e.g. stripping the
 +invalid portions or trying to interpret the string using common
 +encodings such as ISO 8859-1 or Windows code page 1252.
 +

 Hm, it is easy to say "invalid portions of a UTF-8 string" but it is
 _very_ hard to create an algorithm which will determine if a part of
 a string is valid or invalid. If you are using UTF-8, users might create
 strings with obscure characters. I think this kind of heuristic
 should not be included in the protocol.

 The only thing I changed from the original patch (by Pierre) in the
 last three lines was to add "e.g.", so that implementors
 would have the choice of doing something else if they liked.

 But is it really hard to determine UTF-8 validity? I think that is
 exactly one of the nice properties of UTF-8. Quoting from the UTF-8
 article on Wikipedia:

   Because the starting and continuation bytes are distinct sets,
   UTF-8 is self-synchronizing. Character boundaries are easily
   found when searching either forwards or backwards. If bytes
   are lost due to error or corruption, one can always locate
   the beginning of the next character and thus limit the damage.
   Many multi-byte encodings are much harder to resynchronize.

 Or are you talking about something else?
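
 To make the validity question concrete, here is a minimal sketch, assuming Python's strict built-in UTF-8 decoder as the validity test. The helper names are hypothetical. The second function relies on the self-synchronizing property quoted above: continuation bytes always have the bit pattern 10xxxxxx.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the whole byte string decodes as strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def next_char_start(data: bytes, pos: int) -> int:
    """Return the first index at or after pos that is not a continuation byte.

    Continuation bytes match 10xxxxxx, so skipping them lands on the start
    of the next character (or on len(data) if there is none).
    """
    while pos < len(data) and (data[pos] & 0xC0) == 0x80:
        pos += 1
    return pos
```

 Note that this only tells you whether the bytes are well-formed UTF-8. It cannot tell you which legacy encoding the sender actually meant.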

 If an implementation sends strings in, for example, an ISO 8859-*
 encoding, it will end up with garbled characters, but we have to live
 with it; there is probably no algorithm to solve this problem.

 You could have an option that says, if a string has errors according
 to UTF-8, treat it as ISO 8859-1 (substitute for your preferred
 encoding).

Yes, something like that sounds better to me. I attached an improved (I
hope it is an improvement ;)) specification of strings.
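
As an illustration of the fallback option discussed above, a minimal sketch might look like this. The function name and the default fallback are assumptions made for the example, not something taken from any existing implementation.

```python
def decode_with_fallback(raw: bytes, fallback: str = "iso-8859-1") -> str:
    """Decode as UTF-8; if that fails, reinterpret the bytes in a fallback encoding.

    The fallback is a configuration choice (substitute your preferred
    encoding). ISO 8859-1 maps every byte to a character, so with the
    default this second step cannot fail.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode(fallback)
```

For example, decode_with_fallback(b"caf\xe9") yields "café", while valid UTF-8 input is returned unchanged. With other fallbacks such as "cp1252", a few byte values are undefined in Python's codec and the second decode can still raise.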

Regards, Adam

-- 
Adam Tkac, Red Hat, Inc.
Index: rfbproto.rst
===================================================================
--- rfbproto.rst	(revision 3871)
+++ rfbproto.rst	(working copy)
@@ -201,6 +201,25 @@
 security types do not clash. Please see the RealVNC website at
 http://www.realvnc.com for details of how to contact them.
 
+String Encodings
+================
+
+The encoding used for strings in the protocol has historically often
+been unspecified, or has changed between versions of the protocol. As a
+result, there are a lot of implementations which use different,
+incompatible encodings. Commonly those encodings have been ISO 8859-1
+(also known as Latin-1) or Windows code pages.
+
+All new implementations should encode strings in UTF-8 unless the
+particular part of the protocol mandates another encoding. This allows
+full Unicode support, yet retains good compatibility with older
+RFB implementations.
+
+If a string has errors according to UTF-8, try to treat it 

Re: [rfbproto] [PATCH] Specify UTF-8 for strings

2009-08-17 Thread Peter Rosin
On 2009-08-17 12:47, Adam Tkac wrote:
 On Mon, Aug 17, 2009 at 11:20:08AM +0200, Peter Rosin wrote:
 On 2009-08-17 10:59, Adam Tkac wrote:
 On Mon, Aug 17, 2009 at 10:22:56AM +0200, Peter Rosin wrote:
 If it is so natural with UTF-8 and if it really is the only sane choice
 (I think it is), it's enough if our spec says (e.g.)

 It is strongly recommended that all implementations use
 UTF-8 for all strings (unless explicitly stated otherwise)
 to ensure interoperability. But be prepared that not all
 implementations do, so fail gracefully if you receive
 something else.

 instead of (e.g.)

 All implementations MUST use UTF-8 for all strings (unless
 explicitly stated otherwise). But not all implementations
 do, so you SHOULD fail gracefully if you receive something
 else.

 I just don't see why the wording with MUST/SHOULD is so superior
 that it is worth rendering existing implementations incompatible
 with our spec.
 This is ok with me. I don't think there's any difference in practice.
 Oh, cool. Pierre previously asked if I had any alternative wording,
 so here is my suggestion:

 diff --git a/rfbproto.rst b/rfbproto.rst
 index 7852746..0252e4f 100644
 --- a/rfbproto.rst
 +++ b/rfbproto.rst
 @@ -201,6 +201,26 @@ that you contact RealVNC Ltd to make sure that your encodin
   security types do not clash. Please see the RealVNC website at
   http://www.realvnc.com for details of how to contact them.

 +String Encodings
 +================
 +
 +It is strongly recommended that strings in RFB are encoded using the
 +UTF-8 encoding. This allows full Unicode support, yet retains good
 +compatibility with older RFB implementations.
 +
 +The encoding used for strings in the protocol has historically often
 +been unspecified, or has changed between versions of the protocol. As a
 +result, there are a lot of implementations which use different,
 +incompatible encodings. Commonly those encodings have been ISO 8859-1
 +(also known as Latin-1) or Windows code pages.
 +
 +Clients and servers are encouraged to send UTF-8 strings unless that
 +particular part of the protocol mandates another encoding. They should
 +however be prepared to receive invalid UTF-8 sequences at all times.
 +Such sequences should be handled gracefully by e.g. stripping the
 +invalid portions or trying to interpret the string using common
 +encodings such as ISO 8859-1 or Windows code page 1252.
 +
 Hm, it is easy to say "invalid portions of a UTF-8 string" but it is
 _very_ hard to create an algorithm which will determine if a part of
 a string is valid or invalid. If you are using UTF-8, users might create
 strings with obscure characters. I think this kind of heuristic
 should not be included in the protocol.
 The only thing I changed from the original patch (by Pierre) in the
 last three lines was to add "e.g.", so that implementors
 would have the choice of doing something else if they liked.

 But is it really hard to determine UTF-8 validity? I think that is
 exactly one of the nice properties of UTF-8. Quoting from the UTF-8
 article on Wikipedia:

  Because the starting and continuation bytes are distinct sets,
  UTF-8 is self-synchronizing. Character boundaries are easily
  found when searching either forwards or backwards. If bytes
  are lost due to error or corruption, one can always locate
  the beginning of the next character and thus limit the damage.
  Many multi-byte encodings are much harder to resynchronize.

 Or are you talking about something else?

 If an implementation sends strings in, for example, an ISO 8859-*
 encoding, it will end up with garbled characters, but we have to live
 with it; there is probably no algorithm to solve this problem.
 You could have an option that says, if a string has errors according
 to UTF-8, treat it as ISO 8859-1 (substitute for your preferred
 encoding).
 
 Yes, something like that sounds better to me. I attached an improved (I
 hope it is an improvement ;)) specification of strings.

*snip*

 +All new implementations should encode strings in UTF-8 unless the

Sorry, but it's not an improvement if you reintroduce either of the
magic words SHOULD and MUST in this context. I'm obviously talking
to deaf ears.

And my suggested option was just some configuration option in some
implementation, I did not intend for that to go into the spec. I agree
with Peter Åstrand that the specific names of any fallback encodings
should probably be left out of the spec.

Cheers,
Peter


Re: [rfbproto] [PATCH] Specify UTF-8 for strings

2009-08-17 Thread Peter Rosin
On 2009-08-17 11:45, Peter Åstrand wrote:
 No we didn't, we agreed on that for the desktop name.

 Refresh my memory - which other strings are sent as ANSI CODE PAGE?

 Username and password in the VeNCrypt extension. There are some strings
 in the gii extension. The tight file transfer extension sends filenames.
 And I'm sure I'm forgetting at least some string, that was just off the
 top of my head...
 
 But none of those are in the protocol at this point. And the tight
 file transfer is heavily deprecated, even by TightVNC.

You better check that, as gii is in the spec. You continue to ignore and
look past everything that smells like a problem, which is not very
comforting. You seem too eager to add UTF-8 to the spec.

And I'd say they are all in the protocol, just not in our
specification of the protocol, so it's still a problem in my book.
If we add this language now, we will erect a barrier making it harder
to add those other protocol extensions to our spec. I think the burden
should be on those trying to do new things (which is to specify
UTF-8 for all strings) and not on those documenting existing
extensions/behaviours.

Oh, and there are also strings in the SASL extension.

I think it would be worth more to have SASL, VeNCrypt and Tight
file transfer documented (even if deprecated) in the spec than
to add some language about UTF-8.

BTW, where is this heavy deprecation of tight file transfers
documented?

 Nailing to ASCII is worse than nailing to UTF-8. Both make our spec
 incompatible with existing implementations. We have to allow for
 implementations to do whatever non-UTF-8 thingy they have been
 doing, but still recommend against it.
 
 I would say: Don't give people rope. If we start documenting that full
 UTF-8 internationalized strings are allowed in, say, ProtocolVersion, I'm
 sure we will soon see a server that presents itself as RFB 003.008übuñtü or
 something like that...

Come on, that's a stupid argument since ProtocolVersion is specified
as ASCII and to be 12 characters long. And to be in a very specific
format. By everybody.
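
To show how little room that fixed format leaves, here is a rough sketch of a strict check. It assumes the usual 12-byte layout "RFB xxx.yyy" followed by a newline, as I understand it from the RFB specification. The function name is hypothetical.

```python
import re

# "RFB ", three ASCII digits, ".", three ASCII digits, then "\n": 12 bytes.
# In a bytes pattern, \d matches ASCII digits only.
_PROTOCOL_VERSION = re.compile(rb"RFB (\d{3})\.(\d{3})\n")


def parse_protocol_version(message: bytes):
    """Return (major, minor) or raise ValueError for anything off-format."""
    if len(message) != 12:
        raise ValueError("ProtocolVersion must be exactly 12 bytes")
    match = _PROTOCOL_VERSION.fullmatch(message)
    if match is None:
        raise ValueError("malformed ProtocolVersion")
    return int(match.group(1)), int(match.group(2))
```

A message such as "RFB 003.008übuñtü" fails the length check before any encoding question can even come up.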

 We are not. It's just that clients that relied on receiving the
 DesktopName in something other than UTF-8 were on their own and
 relied on unspecified protocol behaviour.

 Reversing that argument is so easy: Xvnc was on its own when it relied
 on unspecified protocol behaviour...
 
 True, but you can't avoid the fact that the UTF-8 variant is, currently, 
 transmitted over the wire.

True, but you can't avoid the fact that some implementations expect,
currently, that the data instead should have been transmitted using
CP-1252, or ISO 8859-1, or etc...

Today, when there is an encoding disconnect, it is the fault of no one
(except the spec). If we now start to specify one thing as correct,
we will shift the balance and make everybody in the UTF-8 camp
right. And everybody else wrong. I don't think that's fair. But
I still think we can specify one thing as superior. By doing that
the alternative options are no longer wrong, just inferior.

 Strange language. We are not forbidding any clients. It's true that a
 few clients could theoretically start rendering the names
 incorrectly, but...

 But we seriously do not want to divide the RFB community (any further),
 and we are doing that if we say MUST use UTF-8. With a MUST in there,
 anything else is not acceptable, hence illegal by our spec.
 
 I see no problem with our document being somewhat stricter than the
 RealVNC one, when it comes to previously undefined things.

In that we differ.

Cheers,
Peter
