Re: string vs. bytes

2009-05-12 Thread dan . schmidt . valle

I am having a very similar problem. Just installed the 2.0.3 version
and now all my serialisations complain.

libprotobuf ERROR ./google/protobuf/wire_format_inl.h:138] Encountered
string containing invalid UTF-8 data while parsing protocol buffer.
Strings must contain only UTF-8; use the 'bytes' type for raw bytes.

Now, C++ doesn't have a byte type. Just signed or unsigned chars, and
string is an array of those. So, what does it need? Would I be better
off serialising to a stream like the CodedStream?

I am very confused on the issue. I have the horrible feeling now that
I'm losing efficiency because serialising to string might mean that
I'm losing my raw data.

Otherwise, the word ERROR in the output might be a bit too
strong.

If anybody can clarify, I'd be very grateful.

Dan

On May 10, 5:59 pm, Henner Zeller h.zel...@acm.org wrote:
 On Sun, May 10, 2009 at 6:08 AM, edan edan...@gmail.com wrote:
  I have some fields that may contain non-UTF8 data.
  I understand that I just need to change their type from string to bytes
  and it should just work, transparently.

 Yes. They're the same on the wire.

  I have a few fields that probably will only contain ASCII i.e. legal UTF8,
  but I'm not 100% sure.
  I am tempted to just turn them all to bytes.
  But this begs the question - what is the string type useful for, and why
  shouldn't I just always use bytes to be sure, all the time, and not bother
  with string at all?
  Does string add anything besides validation that only valid UTF8 is
  passing over the wire?  Is there really a big benefit to this behavior?  Or
  is there some other advantage that I'll miss out on by changing all my
  strings to bytes?

 If you use the C++ API there is not much difference, since both types
 are represented as std::string in the API. It makes a big difference
 for the Java API (and Python?), which have a native type for Unicode
 text. In Java, if you deal with a protocol buffer 'string' type, the
 generated API will return a java.lang.String, while otherwise it will
 return a ByteString. A ByteString can hold arbitrary bytes, while the
 native Java String holds only Unicode text (decoded from the UTF-8 on
 the wire). So while 'ByteString' is more flexible, 'String' is more
 convenient to deal with within Java code because all string
 manipulation libraries can handle it.

 So the benefit is a more convenient API in the generated Java code.
 It also serves as documentation: if you use 'string' you emphasize that a
 field only contains readable text, while 'bytes' might contain any
 binary blob.

 -h
--~--~-~--~~~---~--~~
You received this message because you are subscribed to the Google Groups 
Protocol Buffers group.
To post to this group, send email to protobuf@googlegroups.com
To unsubscribe from this group, send email to 
protobuf+unsubscr...@googlegroups.com
For more options, visit this group at 
http://groups.google.com/group/protobuf?hl=en
-~--~~~~--~~--~--~---



Re: string vs. bytes

2009-05-12 Thread Henner Zeller

Hi,
On Tue, May 12, 2009 at 6:47 AM,  dan.schmidt.va...@gmail.com wrote:

 I am having a very similar problem. Just installed the 2.0.3 version
 and now all my serialisations complain.

 libprotobuf ERROR ./google/protobuf/wire_format_inl.h:138] Encountered
 string containing invalid UTF-8 data while parsing protocol buffer.
 Strings must contain only UTF-8; use the 'bytes' type for raw bytes.

 Now, C++ doesn't have a byte type. Just signed or unsigned chars, and
 string is an array of those.

The Protocol Buffers 'bytes' type translates into std::string in C++. And an
array of chars is an array of bytes, so you're fine.




Re: string vs. bytes

2009-05-12 Thread dan . schmidt . valle

Thanks very much for the answers guys. Most illustrative. The error
messages did in fact disappear with that simple change in all my proto
files.

Still, now that this error has shown in the code I have, I keep
wondering whether the fact that I'm serialising to string is
inefficient. What would be the case for using serialisation to a
stream then?

Thanks again for the help.

Dan

On May 12, 5:26 pm, Kenton Varda ken...@google.com wrote:
 Protocol Buffers has a bytes type.  That's what it's talking about.  Just
 change string to bytes in your .proto file.  (They work exactly the same
 in C++ but are different in Java and Python.)




Re: string vs. bytes

2009-05-12 Thread Kenton Varda

The serialized message is just an array of bytes. We use std::string as an
efficient container for these bytes, but it is still just storing bytes.
std::string, unlike Java's String, only contains bytes, not Unicode
characters. So, there is no performance penalty. In fact, serializing to a
string is typically much faster than serializing to an abstract stream,
especially with v2.1.0, since the code does not need to perform bounds
checks (since it pre-allocates a string that is guaranteed to be large
enough). The only case where you would not want to serialize to a string is
if your message is very big, since some memory allocators do not behave well
when allocating large contiguous blocks of memory. In this case, using
streams allows the message to be written one piece at a time.





Re: string vs. bytes

2009-05-10 Thread Henner Zeller

On Sun, May 10, 2009 at 6:08 AM, edan edan...@gmail.com wrote:
 I have some fields that may contain non-UTF8 data.
 I understand that I just need to change their type from string to bytes
 and it should just work, transparently.

Yes. They're the same on the wire.

 I have a few fields that probably will only contain ASCII i.e. legal UTF8,
 but I'm not 100% sure.
 I am tempted to just turn them all to bytes.
 But this begs the question - what is the string type useful for, and why
 shouldn't I just always use bytes to be sure, all the time, and not bother
 with string at all?
 Does string add anything besides validation that only valid UTF8 is
 passing over the wire?  Is there really a big benefit to this behavior?  Or
 is there some other advantage that I'll miss out on by changing all my
 strings to bytes?

If you use the C++ API there is not much difference, since both types
are represented as std::string in the API. It makes a big difference
for the Java API (and Python?), which have a native type for Unicode
text. In Java, if you deal with a protocol buffer 'string' type, the
generated API will return a java.lang.String, while otherwise it will
return a ByteString. A ByteString can hold arbitrary bytes, while the
native Java String holds only Unicode text (decoded from the UTF-8 on
the wire). So while 'ByteString' is more flexible, 'String' is more
convenient to deal with within Java code because all string
manipulation libraries can handle it.

So the benefit is a more convenient API in the generated Java code.
It also serves as documentation: if you use 'string' you emphasize that a
field only contains readable text, while 'bytes' might contain any
binary blob.

-h
