Re: [protobuf] Re: protobuf not handling special characters between Java server and C++ client

2011-01-26 Thread Evan Jones

On Jan 26, 2011, at 3:43 , Hitesh Jethwani wrote:
Can we encode the protobuf data in ISO-8859-1 from the server end  
itself?


Yes. In this case, you need to use the protocol buffer bytes type  
instead of the protocol buffer string type, since you want to  
exchange ISO-8859-1 bytes from program to program (bytes), not Unicode  
text (string).


On the Java side, you'll need to use  
ByteString.copyFrom(myStringObject, "ISO-8859-1") to make a ByteString  
out of a Java string.
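
A minimal sketch of the encoding step, using only the JDK so it runs without the protobuf jar. The class name Latin1Demo and the helper toLatin1 are made up for illustration. With protobuf on the classpath, the resulting array could be wrapped in a ByteString and stored in a field declared as bytes in the .proto file.

```java
import java.nio.charset.StandardCharsets;

class Latin1Demo {
    // Hypothetical helper: encode text as ISO-8859-1, one byte per character.
    // Characters outside U+0000..U+00FF cannot be represented and are
    // replaced with '?' by String.getBytes(Charset).
    static byte[] toLatin1(String text) {
        return text.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String text = "\u00C9cole"; // "École", written with an escape to avoid source-encoding issues
        byte[] latin1 = toLatin1(text);
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        System.out.println("ISO-8859-1 length: " + latin1.length);
        System.out.println("UTF-8 length: " + utf8.length);
        System.out.printf("first ISO-8859-1 byte: 0x%02X%n", latin1[0] & 0xFF);
    }
}
```

The same text is one byte shorter in ISO-8859-1 than in UTF-8 because 'É' takes one byte instead of two, which is why the two ends must agree on the encoding.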


Hope this helps,

Evan

--
http://evanjones.ca/

--
You received this message because you are subscribed to the Google Groups Protocol 
Buffers group.
To post to this group, send email to protobuf@googlegroups.com.
To unsubscribe from this group, send email to 
protobuf+unsubscr...@googlegroups.com.
For more options, visit this group at 
http://groups.google.com/group/protobuf?hl=en.



Re: [protobuf] Re: protobuf not handling special characters between Java server and C++ client

2011-01-25 Thread Kenton Varda
On Tue, Jan 25, 2011 at 8:57 PM, Hitesh Jethwani hjethwa...@gmail.com wrote:

  if on the stream writer, I add something like:
  writer.write(new String(msg.getBytes(), UTF8).getBytes()) instead of
  simply writer.write(msg.getBytes()), I see the characters as expected
  on the C++ client. However this I believe messes up with the protobuf
  headers, so on C++ I receive only a partial file upto the entry that
  contains one such character.

 Still not sure on the above though.


The reason this appears to work is that String.getBytes() encodes using the
platform's default charset, which is evidently ISO-8859-1 on your system.  This
encoding represents each character as exactly one byte, and can only represent
character codes U+0000 through U+00FF.  Since you are decoding the bytes as
UTF-8 and then encoding them as ISO-8859-1, and since the character 'É' happens
to be within the ISO-8859-1 range, you effectively converted this character
into a single byte.  On the C++ side, the protobuf library does not verify that
the parsed bytes are actually valid UTF-8 (except in debug mode); it just passes
them through.  So the string you see there includes the 'É' character as one
byte.

However, you end up getting a parser error because the length of the string
(in bytes) ends up being different from the length given in the encoded
message. The length was originally computed with 'É' represented as two
bytes, but now it is only one byte, so the length is wrong.
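
The mismatch can be reproduced with a hand-built message, as a rough sketch. The class and method names here are hypothetical, the wire bytes are written out by hand rather than produced by the protobuf library, and the charsets are passed explicitly instead of relying on the platform default.

```java
import java.nio.charset.StandardCharsets;

class LengthMismatchDemo {
    // Hand-built wire bytes for a message whose field 1 (a string) is "É":
    // tag 0x0A (field 1, length-delimited), length 2, then UTF-8 bytes C3 89.
    static byte[] encodedMessage() {
        return new byte[] {0x0A, 0x02, (byte) 0xC3, (byte) 0x89};
    }

    // Reproduces the transformation from the question with explicit charsets:
    // decode the whole message as UTF-8 text, then re-encode as ISO-8859-1.
    static byte[] mangle(byte[] wire) {
        return new String(wire, StandardCharsets.UTF_8)
                .getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] mangled = mangle(encodedMessage());
        int declared = mangled[1];          // length prefix, left unchanged
        int available = mangled.length - 2; // payload bytes that actually remain
        System.out.println("declared=" + declared + " available=" + available);
    }
}
```

The length prefix still says 2, but only one payload byte (0xC9) follows it, so a parser reading this message runs out of data.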

In general, decoding arbitrary bytes (like a protobuf) as if they were UTF-8
will lose information, so converting bytes -> UTF-8 -> bytes will corrupt
the bytes.
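
As an illustration of that loss, the sketch below pushes a small encoded message through a UTF-8 decode and re-encode. The class and method names are invented for this example; the three input bytes are the encoding of a message with varint field 1 set to 150.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Utf8RoundTripDemo {
    // Decode arbitrary bytes as UTF-8 text, then encode the text back to UTF-8.
    static byte[] roundTrip(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8)
                .getBytes(StandardCharsets.UTF_8);
    }

    // True only if the bytes survive the round trip unchanged.
    static boolean isLossless(byte[] raw) {
        return Arrays.equals(raw, roundTrip(raw));
    }

    public static void main(String[] args) {
        // 0x96 is a UTF-8 continuation byte with no lead byte, so the decoder
        // replaces it with U+FFFD, which re-encodes as EF BF BD.
        byte[] raw = {0x08, (byte) 0x96, 0x01};
        byte[] after = roundTrip(raw);
        System.out.println("before: " + raw.length + " bytes, after: " + after.length + " bytes");
        System.out.println("lossless: " + isLossless(raw));
    }
}
```

The original byte 0x96 cannot be recovered from the replacement character, so the encoded message should only ever be written to the stream as raw bytes.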
