Re: [protobuf] New protobuf feature proposal: Generated classes for streaming / visitors

2011-02-08 Thread Kenton Varda
On Tue, Feb 8, 2011 at 11:23 AM, Evan Jones wrote:

> Sorry, just an example of why you might want a different protocol. If I've
> streamed 10e9 messages to disk, I don't want this stream to break if there
> is some weird corruption in the middle, so I want some protocol that can
> "resume" from corruption.
>

Ah, yes.  This isn't an appropriate protocol for enormous files.  It's
more targeted at network protocols.

Although, you might be able to build a decent seekable file protocol on top
of it, by choosing a random string to use as a sync point, then writing that
string every now and then...

  message FileStream {
    repeated string sync_point = 1;

    repeated Foo foo = 2;
    repeated Bar bar = 3;
    ...
  }

When writing, after every few messages, write a copy of sync_point.  Then,
you can seek to an arbitrary position in the file by looking for a nearby
copy of the sync point byte sequence, and starting to parse immediately
after that.  The sync point just needs to be a 128-bit (or so)
cryptographically random sequence, chosen differently for each file, so that
there's no chance that the bytes will appear in the file by accident.
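
A minimal sketch of the seek step described above, assuming the relevant part of the file is already in memory. ResumeOffset is a hypothetical helper written for illustration, not part of the protobuf library; it only locates the byte offset at which parsing could restart.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper (not a protobuf API): returns the offset just past the
// first copy of `sync` found at or after `from`, or std::string::npos if no
// copy exists in the remaining bytes.
std::size_t ResumeOffset(const std::string& data, const std::string& sync,
                         std::size_t from) {
  assert(!sync.empty());
  const std::size_t pos = data.find(sync, from);
  if (pos == std::string::npos) return std::string::npos;
  // Parsing restarts immediately after the sync bytes, where the next
  // field's tag is expected to begin.
  return pos + sync.size();
}
```

A reader that hits corruption could call this with the offset of the damaged region and hand the returned offset to a fresh parser. Everything between the corruption and the next sync point is lost, so how often the sync point is written bounds how much data one bad region can cost.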

-- 
You received this message because you are subscribed to the Google Groups 
"Protocol Buffers" group.
To post to this group, send email to protobuf@googlegroups.com.
To unsubscribe from this group, send email to 
protobuf+unsubscr...@googlegroups.com.
For more options, visit this group at 
http://groups.google.com/group/protobuf?hl=en.



Re: [protobuf] New protobuf feature proposal: Generated classes for streaming / visitors

2011-02-08 Thread Evan Jones

On Feb 8, 2011, at 13:34 , Kenton Varda wrote:
> I handle user messages by passing them as "bytes", embedded in my
> own outer message.


This is what I do as well, as does protobuf-socket-rpc:

http://code.google.com/p/protobuf-socket-rpc/source/browse/trunk/proto/rpc.proto


I guess I was thinking that if you already have to do some sort of  
"lookup" of the message type that is stored in that byte blob, then  
maybe you don't need the streaming extension. For example, you could  
just build a library that produces a sequence of byte strings, which  
the "user" of the library can then parse appropriately.


I see how you are using it though: it is a friendly wrapper around  
this simple "sequence of byte strings" model, that automatically  
parses that byte string using the tag and "schema message." This might  
be useful for some people.


> This is somewhat inefficient currently, as it will require an extra
> copy of all those bytes.  However, it seems likely that future
> improvements to protocol buffers will allow "bytes" fields to share
> memory with the original buffer, which will eliminate this concern.


Ah cool. I was considering changing my protocol to be two messages:  
the first one is the "descriptor" (eg. your CallRequest message), then  
the second would be the "body" of the request, which I would then  
parse based on the type passed in the CallRequest.
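
The "descriptor, then body" idea can be sketched as a lookup from a type name to a handler that knows how to parse the body bytes. This is only an illustration under assumptions: BodyDispatcher and its methods are hypothetical names, and the handlers here receive raw bytes rather than calling any real protobuf parsing function.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <utility>

// Hypothetical dispatcher: maps the type name carried in the "descriptor"
// message to a handler that parses the "body" bytes for that type.
class BodyDispatcher {
 public:
  using Handler = std::function<void(const std::string& body)>;

  // Registers (or replaces) the handler for `type_name`.
  void Register(const std::string& type_name, Handler handler) {
    assert(handler != nullptr);
    handlers_[type_name] = std::move(handler);
  }

  // Invokes the handler for `type_name` on `body`. Returns false, without
  // touching the body, if no handler is registered for that type.
  bool Dispatch(const std::string& type_name, const std::string& body) const {
    const auto it = handlers_.find(type_name);
    if (it == handlers_.end()) return false;
    it->second(body);
    return true;
  }

 private:
  std::map<std::string, Handler> handlers_;
};
```

In a real system each handler would parse the body into the matching generated message class; unknown types can be skipped or forwarded unparsed, which is the "generic message" case raised earlier in the thread.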



> Note that I expect people will generally only "stream" their top-level
> message.  Although the proposal allows for streaming sub-messages as
> well, I expect that people will normally want to parse them into message
> objects which are handled whole.  So, you only have to manually implement
> the top-level stream, and then you can invoke some reflective algorithm
> from there.


Right, but my concern is that I might want to use this streaming API  
to write messages into files. In this case, I might have a file  
containing the FooStream and another file containing the BarStream.  
I'll have to implement both these ::Writer interfaces, or hack the  
code generator to generate it for me. Although now that I think about  
this, the implementation of these two APIs will be relatively trivial...



>> features like being able to detect broken streams and "resume" in
>> the middle are useful.
>
> I'm not sure how this relates.  This seems like it should be handled
> at a lower layer, like in the InputStream -- if the connection is
> lost, it can re-establish and resume, without the parser ever
> knowing what happened.


Sorry, just an example of why you might want a different protocol. If  
I've streamed 10e9 messages to disk, I don't want this stream to break  
if there is some weird corruption in the middle, so I want some  
protocol that can "resume" from corruption.


Evan

--
http://evanjones.ca/




Re: [protobuf] New protobuf feature proposal: Generated classes for streaming / visitors

2011-02-08 Thread Kenton Varda
On Tue, Feb 8, 2011 at 5:47 AM, Evan Jones wrote:

> I read this proposal somewhat carefully, and thought about it for a couple
> days.


Thanks for the feedback!


> * It seems to me that this will solve the problem for people who know
> statically at compile time what types they need to handle from a stream, so
> they can define the "stream type" appropriately. Will users find themselves
> running into the case where they need to handle "generic" messages, and end
> up needing to "roll their own" stream support anyway?
>
> I ask this question because I built my own RPC system on top of protocol
> buffers, and in this domain it is useful to be able to pass "unknown"
> messages around, typically as unparsed byte strings. Hence, this streams
> proposal wouldn't be useful to me, so I'm just wondering: am I an anomaly
> here, or could it be that many applications will find themselves needing to
> handle "any" protocol buffer message in their streams?


In fact, a large part of my motivation for writing this was so that I can
use it in my own RPC implementation, Captain Proto.  Here's the Captain
Proto protocol, which already works in this "streaming" fashion:

http://code.google.com/p/capnproto/source/browse/proto/capnproto.proto

I handle user messages by passing them as "bytes", embedded in my own outer
message.  This is somewhat inefficient currently, as it will require an
extra copy of all those bytes.  However, it seems likely that future
improvements to protocol buffers will allow "bytes" fields to share memory
with the original buffer, which will eliminate this concern.


>  The Visitor class has two standard implementations:  "Writer" and
>> "Filler".  MyStream::Writer writes the visited fields to a
>> CodedOutputStream, using the same wire format as would be used to encode
>> MyStream as one big message.
>>
>
> Imagine I wanted a different protocol. Eg. I want something that checksums
> each message, or maybe compresses them, etc. Will I need to subclass
> MessageType::Visitor for each stream that I want to encode? Or will I need
> to change the code generator?


To do these things generically, we'd need to introduce some sort of
equivalent of Reflection for streams.  This certainly seems like it could be
a useful addition to the family, but I wanted to get the basic functionality
out there first and then see if this is needed.

Note that I expect people will generally only "stream" their top-level
message.  Although the proposal allows for streaming sub-messages as well, I
expect that people will normally want to parse them into message objects
which are handled whole.  So, you only have to manually implement the
top-level stream, and then you can invoke some reflective algorithm from
there.


> features like being able to detect broken streams and "resume" in the
> middle are useful.
>

I'm not sure how this relates.  This seems like it should be handled at a
lower layer, like in the InputStream -- if the connection is lost, it can
re-establish and resume, without the parser ever knowing what happened.




Re: [protobuf] New protobuf feature proposal: Generated classes for streaming / visitors

2011-02-08 Thread Evan Jones
I read this proposal somewhat carefully, and thought about it for a  
couple days. I think something like this might solve the problem that  
many people have with streams of messages. However, I was wondering a  
couple things about the design:



* It seems to me that this will solve the problem for people who know  
statically at compile time what types they need to handle from a  
stream, so they can define the "stream type" appropriately. Will users  
find themselves running into the case where they need to handle  
"generic" messages, and end up needing to "roll their own" stream  
support anyway?


I ask this question because I built my own RPC system on top of  
protocol buffers, and in this domain it is useful to be able to pass  
"unknown" messages around, typically as unparsed byte strings. Hence,  
this streams proposal wouldn't be useful to me, so I'm just wondering:  
am I an anomaly here, or could it be that many applications will find  
themselves needing to handle "any" protocol buffer message in their  
streams?



> The Visitor class has two standard implementations:  "Writer" and
> "Filler".  MyStream::Writer writes the visited fields to a
> CodedOutputStream, using the same wire format as would be used to
> encode MyStream as one big message.


Imagine I wanted a different protocol. Eg. I want something that  
checksums each message, or maybe compresses them, etc. Will I need to  
subclass MessageType::Visitor for each stream that I want to encode?  
Or will I need to change the code generator? Maybe this is an unusual  
enough need that the design doesn't need to be flexible enough to  
handle this, but it is worth thinking about a little, since features  
like being able to detect broken streams and "resume" in the middle  
are useful.


Thanks!

Evan

--
http://evanjones.ca/




[protobuf] New protobuf feature proposal: Generated classes for streaming / visitors

2011-02-01 Thread Kenton Varda
Hello open source protobuf users,

*Background*

Probably the biggest deficiency in the open source protocol buffers
libraries today is a lack of built-in support for handling streams of
messages.  True, it's not too hard for users to support it manually, by
prefixing each message with its size as described here:

  http://code.google.com/apis/protocolbuffers/docs/techniques.html#streaming

However, this is awkward, and typically requires users to reach into the
low-level CodedInputStream/CodedOutputStream classes and do a lot of work
manually.  Furthermore, many users want to handle streams
of heterogeneous message types.  We tell them to wrap their messages in an
outer type using the "union" pattern:

  http://code.google.com/apis/protocolbuffers/docs/techniques.html#union

But this is kind of ugly and has unnecessary overhead.
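
For reference, the manual size-prefix technique mentioned above looks roughly like the following. This is a sketch that assumes a base-128 varint size prefix (one common choice) and works on already-serialized message bytes; AppendVarint32, AppendFrame and SplitFrames are hypothetical helpers, written without the protobuf library so the example stands alone.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Appends `value` to `out` as a base-128 varint (low 7 bits first).
void AppendVarint32(std::uint32_t value, std::string* out) {
  while (value >= 0x80) {
    out->push_back(static_cast<char>((value & 0x7F) | 0x80));
    value >>= 7;
  }
  out->push_back(static_cast<char>(value));
}

// Appends one frame: a varint size followed by the serialized message bytes.
void AppendFrame(const std::string& payload, std::string* out) {
  assert(payload.size() <= 0xFFFFFFFFu);
  AppendVarint32(static_cast<std::uint32_t>(payload.size()), out);
  out->append(payload);
}

// Splits `stream` back into payloads. Returns false if a size prefix is
// malformed or the last frame is truncated; frames decoded before the
// failure remain in `frames`.
bool SplitFrames(const std::string& stream, std::vector<std::string>* frames) {
  std::size_t pos = 0;
  while (pos < stream.size()) {
    std::uint32_t size = 0;
    int shift = 0;
    while (true) {
      if (pos >= stream.size() || shift > 28) return false;
      const std::uint8_t byte = static_cast<std::uint8_t>(stream[pos++]);
      size |= static_cast<std::uint32_t>(byte & 0x7F) << shift;
      if ((byte & 0x80) == 0) break;
      shift += 7;
    }
    if (stream.size() - pos < size) return false;
    frames->push_back(stream.substr(pos, size));
    pos += size;
  }
  return true;
}
```

Note that nothing in a frame says which message type it holds, which is why heterogeneous streams end up needing the "union" wrapper or some other type tag.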

These problems never really came up in our internal usage, because inside
Google we have an RPC system and other utility code which builds on top of
protocol buffers and provides appropriate abstraction. While we'd like to
open source this code, a lot of it is large, somewhat messy, and highly
interdependent with unrelated parts of our environment, and no one has had
the time to rewrite it all cleanly (as we did with protocol buffers itself).

*Proposed solution:  Generated Visitors*

I've been wanting to fix this for some time now, but didn't really have a
good idea how.  CodedInputStream is annoyingly low-level, but I couldn't
think of much better an interface for reading a stream of messages off the
wire.

A couple weeks ago, though, I realized that I had been failing to consider
how new kinds of code generation could help this problem.  I was trying to
think of solutions that would go into the protobuf base library, not
solutions that were generated by the protocol compiler.

So then it became pretty clear:  A protobuf message definition can also be
interpreted as a definition for a streaming protocol.  Each field in the
message is a kind of item in the stream.

  // A stream of Foo and Bar messages, and also strings.
  message MyStream {
    option generate_visitors = true;  // enables generation of streaming classes
    repeated Foo foo = 1;
    repeated Bar bar = 2;
    repeated string baz = 3;
  }

All we need to do is generate code appropriate for treating MyStream as a
stream, rather than one big message.

My approach is to generate two interfaces, each with two provided
implementations.  The interfaces are "Visitor" and "Guide".
MyStream::Visitor looks like this:

  class MyStream::Visitor {
   public:
    virtual ~Visitor();

    virtual void VisitFoo(const Foo& foo);
    virtual void VisitBar(const Bar& bar);
    virtual void VisitBaz(const std::string& baz);
  };

The Visitor class has two standard implementations:  "Writer" and "Filler".
MyStream::Writer writes the visited fields to a CodedOutputStream, using
the same wire format as would be used to encode MyStream as one big message.
MyStream::Filler fills in a MyStream message object with the visited
values.

Meanwhile, Guides are objects that drive Visitors.

  class MyStream::Guide {
   public:
    virtual ~Guide();

    // Call the methods of the visitor on the Guide's data.
    virtual void Accept(MyStream::Visitor* visitor) = 0;

    // Just fill in a message object directly rather than use a visitor.
    virtual void Fill(MyStream* message) = 0;
  };

The two standard implementations of Guide are "Reader" and "Walker".
MyStream::Reader reads items from a CodedInputStream and passes them to the
visitor.  MyStream::Walker walks over a MyStream message object and passes
all the fields to the visitor.

To handle a stream of messages, simply attach a Reader to your own Visitor
implementation.  Your visitor's methods will then be called as each item is
parsed, kind of like "SAX" XML parsing, but type-safe.
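
To make the Visitor/Guide relationship concrete, here is a self-contained sketch that uses hand-written stand-ins for the generated classes. All of the types below (Foo, Bar, MyStream, Visitor, Walker, Counter) are simplified assumptions for illustration; the real classes would be emitted by the protocol compiler and may differ in detail.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hand-written stand-ins for the message types (assumptions, not real
// generated code).
struct Foo { int id; };
struct Bar { std::string name; };
struct MyStream {
  std::vector<Foo> foo;
  std::vector<Bar> bar;
  std::vector<std::string> baz;
};

// Stand-in for MyStream::Visitor: one callback per field, no-ops by default.
class Visitor {
 public:
  virtual ~Visitor() {}
  virtual void VisitFoo(const Foo&) {}
  virtual void VisitBar(const Bar&) {}
  virtual void VisitBaz(const std::string&) {}
};

// Stand-in for MyStream::Walker: a guide that walks an in-memory message
// and passes every field value to the visitor.
class Walker {
 public:
  explicit Walker(const MyStream* message) : message_(message) {
    assert(message != nullptr);
  }
  void Accept(Visitor* visitor) const {
    for (const Foo& f : message_->foo) visitor->VisitFoo(f);
    for (const Bar& b : message_->bar) visitor->VisitBar(b);
    for (const std::string& s : message_->baz) visitor->VisitBaz(s);
  }

 private:
  const MyStream* message_;
};

// A user-written visitor: overrides only the callbacks it cares about.
class Counter : public Visitor {
 public:
  void VisitFoo(const Foo& foo) override { foo_id_sum += foo.id; }
  void VisitBaz(const std::string&) override { ++baz_count; }
  int foo_id_sum = 0;
  int baz_count = 0;
};
```

The same Counter could be attached to a Reader-style guide instead of the Walker; the visitor does not need to know whether the items come from the wire or from a message already in memory.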

*Nonblocking I/O*

The "Reader" type declared above is based on blocking I/O, but many users
would prefer a non-blocking approach.  I'm less sure how to handle this, but
my thought was that we could provide a utility class like:

  class NonblockingHelper {
   public:
    template <typename MessageType>
    NonblockingHelper(typename MessageType::Visitor* visitor);

    // Push data into the buffer.  If the data completes any fields,
    // they will be passed to the underlying visitor.  Any left-over data
    // is remembered for the next call.
    void PushData(void* data, int size);
  };

With this, you can use whatever non-blocking I/O mechanism you want, and
just have to push the data into the NonblockingHelper, which will take care
of calling the Visitor as necessary.
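
The push-style buffering could work along these lines. This is a rough sketch, not the proposed class: PushParser is a hypothetical name, it reports each completed field to a plain callback instead of a generated Visitor, and it assumes every field is length-delimited (wire type 2), which holds for the three fields of MyStream above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <utility>

// Hypothetical push parser: buffers incoming bytes and reports each complete
// length-delimited field as (field number, payload bytes).
class PushParser {
 public:
  using Callback =
      std::function<void(std::uint32_t field, const std::string& payload)>;

  explicit PushParser(Callback callback) : callback_(std::move(callback)) {}

  // Appends `size` bytes, emits every field that is now complete, and keeps
  // any trailing partial field for the next call.
  void PushData(const void* data, std::size_t size) {
    buffer_.append(static_cast<const char*>(data), size);
    std::size_t pos = 0;  // start of the first field not yet emitted
    while (true) {
      std::size_t cursor = pos;
      std::uint32_t tag = 0;
      std::uint32_t length = 0;
      if (!ReadVarint(&cursor, &tag) || !ReadVarint(&cursor, &length)) break;
      assert((tag & 7) == 2);  // this sketch handles wire type 2 only
      if (buffer_.size() - cursor < length) break;  // payload incomplete
      callback_(tag >> 3, buffer_.substr(cursor, length));
      pos = cursor + length;
    }
    buffer_.erase(0, pos);
  }

 private:
  // Reads a varint at *cursor. Returns false if it is incomplete; a real
  // implementation would report an over-long varint as an error instead of
  // treating it as incomplete.
  bool ReadVarint(std::size_t* cursor, std::uint32_t* value) const {
    std::uint32_t result = 0;
    for (int shift = 0; shift <= 28; shift += 7) {
      if (*cursor >= buffer_.size()) return false;
      const std::uint8_t byte = static_cast<std::uint8_t>(buffer_[(*cursor)++]);
      result |= static_cast<std::uint32_t>(byte & 0x7F) << shift;
      if ((byte & 0x80) == 0) {
        *value = result;
        return true;
      }
    }
    return false;
  }

  Callback callback_;
  std::string buffer_;
};
```

The caller owns the I/O loop: whatever bytes a non-blocking read returns are passed to PushData, and fields are delivered as soon as their last byte arrives, regardless of how the data was split across reads.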

*C++ implementation*

I've written up a patch implementing this for C++ (not yet including the
nonblocking part):

  http://codereview.appspot.com/4077052

*Feedback*

What do you think?

I know I'm excited to use this in some of my own side projects (which is why
I spent my weekend working on it), but before adding this to the official
implementation we sh