[Fwd: Re: Streaming]

2008-12-05 Thread Shane Green

Thanks very much Jon (see below).  You make good points and I like the
approach you describe.  I am still thinking, however, that there is
power in the ability for message instances to write themselves to, and
parse themselves from, a stream.

A message instance could be passed a stream object that chains back to
the network connection from which bytes are being received.  A
stop-flag-based parsing mechanism could be handed this buffer object and
would handle reading the stream and initializing the message's
properties, exiting when the serialization of that message instance
ended.  At that point, a new message instance could be created and the
process repeated.
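
(Purely as an illustration of the kind of object described above, here is a
minimal Python sketch of such a chaining buffer.  The Buffer name and its
behaviour are hypothetical, not part of the protocol-buffer API; the
connection can be anything with a recv() method, such as a socket.)

class Buffer(object):
    """Hypothetical chaining buffer: reads are served from bytes already
    written locally, falling back to the underlying connection."""

    def __init__(self, connection=None):
        self._connection = connection   # any object with recv(), e.g. a socket
        self._pending = b""

    def write(self, data):
        self._pending += data

    def read(self, n):
        # Pull from the connection until n bytes are available (or it closes).
        while self._connection is not None and len(self._pending) < n:
            chunk = self._connection.recv(4096)
            if not chunk:
                break
            self._pending += chunk
        data, self._pending = self._pending[:n], self._pending[n:]
        return data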

The type of message doing the parsing could vary from message to
message, even with the serializations being sent and received back to
back.  This mechanism would work regardless of the field types being
streamed.  A message type consisting solely of varint fields, whose
length is determined while reading the varint's value, would support
streaming no differently than any other message type.
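
(A minimal sketch of that self-delimiting property: reading a single
base-128 varint byte by byte, where the encoding itself tells the reader
when to stop.)

def read_varint(stream):
    # Read one base-128 varint from a file-like object; the high bit of
    # each byte says whether another byte follows.
    result, shift = 0, 0
    while True:
        byte = stream.read(1)
        if not byte:
            raise EOFError("stream ended mid-varint")
        b = ord(byte)
        result |= (b & 0x7F) << shift
        if not (b & 0x80):
            return result
        shift += 7

# e.g. read_varint(io.BytesIO(b"\xac\x02")) == 300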

The solution also appears to meet every requirement the original buffer
type supported.  Messages serialized to a buffer could just as easily be
initialized from that buffer as from the string contained by the buffer.

m1 = Message()
buffer = Buffer()
[...] (initialize instance vars)
m1.SerializeToBuffer(buffer)

m2 = Message()
m2.ParseFromBuffer(buffer)

Produces the same result as:

m2 = Message()
bytes = m1.SerializeToString()
m2.ParseFromString(bytes)

The string-based parse would ignore the stop bit, parsing the entire
string.  The buffer-based parse would stop when it reached the stop bit,
producing the same result.
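
(A sketch of that compatibility claim, under some assumptions:
example_pb2.Msg is a hypothetical compiled type with a single field, 999 is
just an illustrative "never used" field number, and encoder._VarintBytes is
an internal helper of the Python protobuf library rather than a public API.
The point is only that ParseFromString skips a trailing field whose number
it does not recognize.)

from google.protobuf.internal import encoder  # internal helper, not public API
import example_pb2   # hypothetical: message Msg { optional int32 value = 1; }

m1 = example_pb2.Msg()
m1.value = 42

STOP_FIELD_NUMBER = 999                        # illustrative "never used" number
stop_marker = encoder._VarintBytes((STOP_FIELD_NUMBER << 3) | 0) + b"\x00"

data = m1.SerializeToString() + stop_marker

m2 = example_pb2.Msg()
m2.ParseFromString(data)                       # the unknown field is skipped
assert m2.value == 42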

Handling of concatenated serializations is supported through repeated
calls to parse from buffer:

m1 = Message()
[...] (initialize instance vars)
m2 = Message()
[...] (initialize instance vars)

buffer = Buffer()
m1.SerializeToBuffer(buffer)
m2.SerializeToBuffer(buffer)

m3 = Message()
m3.ParseFromBuffer(buffer)
m3.ParseFromBuffer(buffer)

Would produce the same result as:

m3 = Message()
m3.ParseFromString(m1.SerializeToString() + m2.SerializeToString())

As long as the stop bit's key is generated from a field number that is
unused, and never will be used, I don't believe there are any
incompatibilities between buffer-based message marshalling and the
existing string-based code.  A very simple usage:

# Sending side
for message in messages:
  message.SerializeToBuffer(buffer)

# Receiving side
for msgtype in types:
  message = msgtype()
  message.ParseFromBuffer(buffer)

Unless I've overlooked something, it seems like stream-based marshalling
and unmarshalling is powerful, simple, and completely compatible with
all existing code.  But there is a very real chance I've overlooked
something...




- Shane


 Forwarded Message 
 From: Jon Skeet [EMAIL PROTECTED]
 To: Shane Green [EMAIL PROTECTED]
 Subject: Re: Streaming
 Date: Fri, 5 Dec 2008 08:19:41 +
 
 2008/12/5 Shane Green [EMAIL PROTECTED]
 Thanks Jon.  Those are good points.  I rather liked the self-delimiting
 nature of fields, and thought this method would bring that feature up to
 the message level, without breaking any of the existing capabilities.
 So my goal was a message which could truly be streamed; perhaps even
 sent without knowing its own size up front.  Perhaps I overlooked
 something?
 
 Currently the PB format requires that you know the size of each
 submessage before you send it. You don't need to know the size of the
 whole message, as it's assumed to be the entire size of the
 datastream. It's unfortunate that you do need to provide the whole
 message to the output stream though, unless you want to manually
 serialize the individual fields.
 
 My goal was slightly different - I wanted to be able to stream a
 sequence of messages. The most obvious use case (in my view) is a log.
 Write out a massive log file as a sequence of entries, and you can
 read it back in one at a time. It's not designed to help to stream a
 single huge message though.
  
 Would you mind if I resent my questions to the group?  I lack
 confidence and wanted to make sure I wasn't overlooking something
 ridiculous, but am thinking that the exchange would be informative.
 
 Absolutely. Feel free to quote anything I've written if you think it
 helps.
 
 Also, how are you serializing and parsing messages as if they are
 repeated fields of a container message?  Is there a fair bit of parsing
 or work being done outside the standard protocol-buffer APIs?
 
 There's not a lot of work, to be honest. On the parsing side the main
 difficulty is getting a type-safe delegate to read a message from the
 stream. The writing side is trivial. Have a look at the code:
 
 

Re: [Fwd: Re: Streaming]

2008-12-05 Thread Kenton Varda

It's quite easy to write a helper function that reads/writes delimited
messages (delimited by size or by end tag).
For example, here's one for writing a length-delimited message:

bool WriteMessage(const Message& message, ZeroCopyOutputStream* output) {
  CodedOutputStream coded_out(output);
  return coded_out.WriteVarint32(message.ByteSize()) &&
         message.SerializeWithCachedSizes(&coded_out);
}

and here's one for reading one message:

bool ReadMessage(ZeroCopyInputStream* input, Message* message) {
  CodedInputStream coded_in(input);
  uint32 size;
  if (!coded_in.ReadVarint32(&size)) return false;
  CodedInputStream::Limit limit = coded_in.PushLimit(size);
  if (!message->ParseFromCodedStream(&coded_in)) return false;
  if (!coded_in.ExpectAtEnd()) return false;
  coded_in.PopLimit(limit);
  return true;
}

(I haven't tested the above so it may contain minor errors.)
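
(For comparison, a rough Python analogue of the same length-delimited
framing.  _VarintBytes and _DecodeVarint32 are internal helpers of the
Python protobuf library rather than a supported public API, so treat this
as a sketch only, not the library's official streaming support.)

from google.protobuf.internal import encoder, decoder  # internal helpers

def write_message(message, out):
    # Prefix the serialized message with its length as a varint.
    body = message.SerializeToString()
    out.write(encoder._VarintBytes(len(body)))
    out.write(body)

def read_message(data, pos, message):
    # Parse one length-delimited message from `data` starting at `pos`;
    # returns the position just past the message that was read.
    size, pos = decoder._DecodeVarint32(data, pos)
    message.ParseFromString(data[pos:pos + size])
    return pos + size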

We could add these as methods of the Message class.  Note, though, that for
many applications, this kind of streaming is too simplistic.  For example,
the above will not allow you to efficiently seek to an arbitrary message in
the stream, since at the very least you have to read the sizes of all
messages before it to find it.  It's also not very robust in the face of
data corruption -- if any of the sizes are corrupted, the whole stream is
unreadable.  So, you may find you want to do something more complicated,
depending on your app.  But, anything more complicated is really beyond the
scope of the protocol buffer library.
