Re: [protobuf] New protobuf feature proposal: Generated classes for streaming / visitors
On Tue, Feb 8, 2011 at 11:23 AM, Evan Jones wrote:

> Sorry, just an example of why you might want a different protocol. If I've streamed 10e9 messages to disk, I don't want this stream to break if there is some weird corruption in the middle, so I want some protocol that can "resume" from corruption.

Ah, yes. This isn't an appropriate protocol for enormous files. It's more targeted at network protocols.

Although, you might be able to build a decent seekable file protocol on top of it, by choosing a random string to use as a sync point, then writing that string every now and then...

  message FileStream {
    repeated string sync_point = 1;
    repeated Foo foo = 2;
    repeated Bar bar = 3;
    ...
  }

When writing, after every few messages, write a copy of sync_point. Then, you can seek to an arbitrary position in the file by looking for a nearby copy of the sync point byte sequence, and starting to parse immediately after that. The sync point just needs to be a 128-bit (or so) cryptographically random sequence, chosen differently for each file, so that there's no chance that the bytes will appear in the file by accident.

--
You received this message because you are subscribed to the Google Groups "Protocol Buffers" group.
To post to this group, send email to protobuf@googlegroups.com.
To unsubscribe from this group, send email to protobuf+unsubscr...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/protobuf?hl=en.
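The seek step described in this message could look roughly like the sketch below. It assumes the file contents are already in memory as a byte string. The function name FindNextSyncPoint is hypothetical, and the sketch covers only the search for the marker, not the protobuf parsing that would resume afterwards.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, not part of protobuf.
// Returns the offset of the first byte after the next copy of `sync_point`
// found at or after `start`, or std::string::npos if there is none.
// Each copy of the sync point is a complete field, so the byte that follows
// it is the start of the next field and parsing can resume there.
std::size_t FindNextSyncPoint(const std::string& data,
                              const std::string& sync_point,
                              std::size_t start) {
  if (sync_point.empty()) return std::string::npos;
  const std::size_t pos = data.find(sync_point, start);
  if (pos == std::string::npos) return std::string::npos;
  return pos + sync_point.size();
}
```

A reader that hits corruption at some offset would call this with that offset as `start` and continue from the returned position. Everything between the corruption and the next sync point is lost.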
Re: [protobuf] New protobuf feature proposal: Generated classes for streaming / visitors
On Feb 8, 2011, at 13:34, Kenton Varda wrote:

> I handle user messages by passing them as "bytes", embedded in my own outer message.

This is what I do as well, as does protobuf-socket-rpc:

http://code.google.com/p/protobuf-socket-rpc/source/browse/trunk/proto/rpc.proto

I guess I was thinking that if you already have to do some sort of "lookup" of the message type that is stored in that byte blob, then maybe you don't need the streaming extension. For example, you could just build a library that produces a sequence of byte strings, which the "user" of the library can then parse appropriately.

I see how you are using it though: it is a friendly wrapper around this simple "sequence of byte strings" model, that automatically parses that byte string using the tag and "schema message." This might be useful for some people.

> This is somewhat inefficient currently, as it will require an extra copy of all those bytes. However, it seems likely that future improvements to protocol buffers will allow "bytes" fields to share memory with the original buffer, which will eliminate this concern.

Ah cool. I was considering changing my protocol to be two messages: the first one is the "descriptor" (e.g. your CallRequest message), then the second would be the "body" of the request, which I would then parse based on the type passed in the CallRequest.

> Note that I expect people will generally only "stream" their top-level message. Although the proposal allows for streaming sub-messages as well, I expect that people will normally want to parse them into message objects which are handled whole. So, you only have to manually implement the top-level stream, and then you can invoke some reflective algorithm from there.

Right, but my concern is that I might want to use this streaming API to write messages into files. In this case, I might have a file containing the FooStream and another file containing the BarStream. I'll have to implement both these ::Writer interfaces, or hack the code generator to generate it for me. Although now that I think about this, the implementation of these two APIs will be relatively trivial...

>> features like being able to detect broken streams and "resume" in the middle are useful.
>
> I'm not sure how this relates. This seems like it should be handled at a lower layer, like in the InputStream -- if the connection is lost, it can re-establish and resume, without the parser ever knowing what happened.

Sorry, just an example of why you might want a different protocol. If I've streamed 10e9 messages to disk, I don't want this stream to break if there is some weird corruption in the middle, so I want some protocol that can "resume" from corruption.

Evan

--
http://evanjones.ca/
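The plain "sequence of byte strings" model mentioned in this message can be sketched without the protobuf library at all. The sketch below assumes each byte string is prefixed with its size as a base-128 varint. The names AppendFrame and ReadFrames are hypothetical, and the caller is left to parse each returned byte string as whatever message type it expects.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Appends `payload` to `out`, prefixed with its size as a base-128 varint
// (7 bits per byte, least significant group first, high bit set on every
// byte except the last).
void AppendFrame(const std::string& payload, std::string* out) {
  std::uint64_t size = payload.size();
  while (size >= 0x80) {
    out->push_back(static_cast<char>((size & 0x7F) | 0x80));
    size >>= 7;
  }
  out->push_back(static_cast<char>(size));
  out->append(payload);
}

// Splits `in` back into the byte strings written by AppendFrame.
// Returns false if the input ends in the middle of a frame or a size
// prefix is longer than a 64-bit varint allows.
bool ReadFrames(const std::string& in, std::vector<std::string>* frames) {
  std::size_t pos = 0;
  while (pos < in.size()) {
    std::uint64_t size = 0;
    int shift = 0;
    while (true) {
      if (pos >= in.size() || shift > 63) return false;
      const std::uint8_t byte = static_cast<std::uint8_t>(in[pos++]);
      size |= static_cast<std::uint64_t>(byte & 0x7F) << shift;
      if ((byte & 0x80) == 0) break;
      shift += 7;
    }
    if (size > in.size() - pos) return false;  // truncated payload
    frames->push_back(in.substr(pos, static_cast<std::size_t>(size)));
    pos += static_cast<std::size_t>(size);
  }
  return true;
}
```

This layer knows nothing about message types, which is the point of the comparison: the generated stream classes in the proposal add the tag lookup and parsing on top of a model like this one.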
Re: [protobuf] New protobuf feature proposal: Generated classes for streaming / visitors
On Tue, Feb 8, 2011 at 5:47 AM, Evan Jones wrote:

> I read this proposal somewhat carefully, and thought about it for a couple days.

Thanks for the feedback!

> * It seems to me that this will solve the problem for people who know statically at compile time what types they need to handle from a stream, so they can define the "stream type" appropriately. Will users find themselves running into the case where they need to handle "generic" messages, and end up needing to "roll their own" stream support anyway?
>
> I ask this question because I built my own RPC system on top of protocol buffers, and in this domain it is useful to be able to pass "unknown" messages around, typically as unparsed byte strings. Hence, this streams proposal wouldn't be useful to me, so I'm just wondering: am I an anomaly here, or could it be that many applications will find themselves needing to handle "any" protocol buffer message in their streams?

In fact, a large part of my motivation for writing this was so that I can use it in my own RPC implementation, Captain Proto. Here's the Captain Proto protocol, which already works in this "streaming" fashion:

http://code.google.com/p/capnproto/source/browse/proto/capnproto.proto

I handle user messages by passing them as "bytes", embedded in my own outer message. This is somewhat inefficient currently, as it will require an extra copy of all those bytes. However, it seems likely that future improvements to protocol buffers will allow "bytes" fields to share memory with the original buffer, which will eliminate this concern.

>> The Visitor class has two standard implementations: "Writer" and "Filler". MyStream::Writer writes the visited fields to a CodedOutputStream, using the same wire format as would be used to encode MyStream as one big message.
>
> Imagine I wanted a different protocol. E.g. I want something that checksums each message, or maybe compresses them, etc. Will I need to subclass MessageType::Visitor for each stream that I want to encode? Or will I need to change the code generator?

To do these things generically, we'd need to introduce some sort of equivalent of Reflection for streams. This certainly seems like it could be a useful addition to the family, but I wanted to get the basic functionality out there first and then see if this is needed.

Note that I expect people will generally only "stream" their top-level message. Although the proposal allows for streaming sub-messages as well, I expect that people will normally want to parse them into message objects which are handled whole. So, you only have to manually implement the top-level stream, and then you can invoke some reflective algorithm from there.

> features like being able to detect broken streams and "resume" in the middle are useful.

I'm not sure how this relates. This seems like it should be handled at a lower layer, like in the InputStream -- if the connection is lost, it can re-establish and resume, without the parser ever knowing what happened.
Re: [protobuf] New protobuf feature proposal: Generated classes for streaming / visitors
I read this proposal somewhat carefully, and thought about it for a couple days. I think something like this might solve the problem that many people have with streams of messages. However, I was wondering a couple things about the design:

* It seems to me that this will solve the problem for people who know statically at compile time what types they need to handle from a stream, so they can define the "stream type" appropriately. Will users find themselves running into the case where they need to handle "generic" messages, and end up needing to "roll their own" stream support anyway?

I ask this question because I built my own RPC system on top of protocol buffers, and in this domain it is useful to be able to pass "unknown" messages around, typically as unparsed byte strings. Hence, this streams proposal wouldn't be useful to me, so I'm just wondering: am I an anomaly here, or could it be that many applications will find themselves needing to handle "any" protocol buffer message in their streams?

> The Visitor class has two standard implementations: "Writer" and "Filler". MyStream::Writer writes the visited fields to a CodedOutputStream, using the same wire format as would be used to encode MyStream as one big message.

Imagine I wanted a different protocol. E.g. I want something that checksums each message, or maybe compresses them, etc. Will I need to subclass MessageType::Visitor for each stream that I want to encode? Or will I need to change the code generator?

Maybe this is an unusual enough need that the design doesn't need to be flexible enough to handle this, but it is worth thinking about a little, since features like being able to detect broken streams and "resume" in the middle are useful.

Thanks!

Evan

--
http://evanjones.ca/
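To make the checksum question concrete, a hand-written visitor subclass might look roughly like the sketch below. Everything in it is an assumption for illustration: Visitor is a hand-written stand-in for the generated MyStream::Visitor, reduced to the string field so it compiles without protobuf. ChecksummingWriter and ReadChecked are hypothetical names, the record layout (4-byte length, payload, 4-byte checksum) is made up, and FNV-1a stands in for a real checksum such as CRC-32.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Stand-in for the generated MyStream::Visitor (string field only).
class Visitor {
 public:
  virtual ~Visitor() {}
  virtual void VisitBaz(const std::string& baz) = 0;
};

// 32-bit FNV-1a hash, a placeholder for a real checksum.
std::uint32_t Fnv1a(const std::string& bytes) {
  std::uint32_t hash = 2166136261u;
  for (unsigned char c : bytes) {
    hash ^= c;
    hash *= 16777619u;
  }
  return hash;
}

// Little-endian 32-bit helpers for the made-up record layout.
void AppendU32(std::uint32_t v, std::string* out) {
  for (int i = 0; i < 4; ++i) {
    out->push_back(static_cast<char>((v >> (8 * i)) & 0xFF));
  }
}

std::uint32_t ReadU32(const std::string& in, std::size_t pos) {
  std::uint32_t v = 0;
  for (int i = 0; i < 4; ++i) {
    v |= static_cast<std::uint32_t>(static_cast<unsigned char>(in[pos + i]))
         << (8 * i);
  }
  return v;
}

// Writes every visited item as: length, payload, checksum of the payload.
class ChecksummingWriter : public Visitor {
 public:
  explicit ChecksummingWriter(std::string* out) : out_(out) {}
  void VisitBaz(const std::string& baz) override {
    AppendU32(static_cast<std::uint32_t>(baz.size()), out_);
    out_->append(baz);
    AppendU32(Fnv1a(baz), out_);
  }

 private:
  std::string* out_;
};

// Reads one record starting at *pos. On success stores the payload in *item,
// advances *pos past the record and returns true. Returns false, leaving
// *pos unchanged, if the record is truncated or its checksum does not match.
bool ReadChecked(const std::string& in, std::size_t* pos, std::string* item) {
  if (*pos > in.size() || in.size() - *pos < 8) return false;
  const std::uint32_t size = ReadU32(in, *pos);
  if (size > in.size() - *pos - 8) return false;
  const std::string payload = in.substr(*pos + 4, size);
  if (Fnv1a(payload) != ReadU32(in, *pos + 4 + size)) return false;
  *item = payload;
  *pos += 8 + size;
  return true;
}
```

With the real generated interface, a subclass like this would need one such method per field of every stream type, which is the repetition the question is about.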
[protobuf] New protobuf feature proposal: Generated classes for streaming / visitors
Hello open source protobuf users,

*Background*

Probably the biggest deficiency in the open source protocol buffers libraries today is a lack of built-in support for handling streams of messages. True, it's not too hard for users to support it manually, by prefixing each message with its size as described here:

http://code.google.com/apis/protocolbuffers/docs/techniques.html#streaming

However, this is awkward, and typically requires users to reach into the low-level CodedInputStream/CodedOutputStream classes and do a lot of work manually.

Furthermore, many users want to handle streams of heterogeneous message types. We tell them to wrap their messages in an outer type using the "union" pattern:

http://code.google.com/apis/protocolbuffers/docs/techniques.html#union

But this is kind of ugly and has unnecessary overhead.

These problems never really came up in our internal usage, because inside Google we have an RPC system and other utility code which builds on top of protocol buffers and provides appropriate abstraction. While we'd like to open source this code, a lot of it is large, somewhat messy, and highly interdependent with unrelated parts of our environment, and no one has had the time to rewrite it all cleanly (as we did with protocol buffers itself).

*Proposed solution: Generated Visitors*

I've been wanting to fix this for some time now, but didn't really have a good idea how. CodedInputStream is annoyingly low-level, but I couldn't think of much better an interface for reading a stream of messages off the wire.

A couple weeks ago, though, I realized that I had been failing to consider how new kinds of code generation could help this problem. I was trying to think of solutions that would go into the protobuf base library, not solutions that were generated by the protocol compiler.

So then it became pretty clear: A protobuf message definition can also be interpreted as a definition for a streaming protocol.
Each field in the message is a kind of item in the stream.

  // A stream of Foo and Bar messages, and also strings.
  message MyStream {
    option generate_visitors = true;  // enables generation of streaming classes
    repeated Foo foo = 1;
    repeated Bar bar = 2;
    repeated string baz = 3;
  }

All we need to do is generate code appropriate for treating MyStream as a stream, rather than one big message.

My approach is to generate two interfaces, each with two provided implementations. The interfaces are "Visitor" and "Guide". MyStream::Visitor looks like this:

  class MyStream::Visitor {
   public:
    virtual ~Visitor();

    virtual void VisitFoo(const Foo& foo);
    virtual void VisitBar(const Bar& bar);
    virtual void VisitBaz(const std::string& baz);
  };

The Visitor class has two standard implementations: "Writer" and "Filler". MyStream::Writer writes the visited fields to a CodedOutputStream, using the same wire format as would be used to encode MyStream as one big message. MyStream::Filler fills in a MyStream message object with the visited values.

Meanwhile, Guides are objects that drive Visitors.

  class MyStream::Guide {
   public:
    virtual ~Guide();

    // Call the methods of the visitor on the Guide's data.
    virtual void Accept(MyStream::Visitor* visitor) = 0;

    // Just fill in a message object directly rather than use a visitor.
    virtual void Fill(MyStream* message) = 0;
  };

The two standard implementations of Guide are "Reader" and "Walker". MyStream::Reader reads items from a CodedInputStream and passes them to the visitor. MyStream::Walker walks over a MyStream message object and passes all the fields to the visitor.

To handle a stream of messages, simply attach a Reader to your own Visitor implementation. Your visitor's methods will then be called as each item is parsed, kind of like "SAX" XML parsing, but type-safe.

*Nonblocking I/O*

The "Reader" type declared above is based on blocking I/O, but many users would prefer a non-blocking approach.
I'm less sure how to handle this, but my thought was that we could provide a utility class like:

  class NonblockingHelper {
   public:
    template <typename MessageType>
    NonblockingHelper(typename MessageType::Visitor* visitor);

    // Push data into the buffer. If the data completes any fields,
    // they will be passed to the underlying visitor. Any left-over data
    // is remembered for the next call.
    void PushData(void* data, int size);
  };

With this, you can use whatever non-blocking I/O mechanism you want, and just have to push the data into the NonblockingHelper, which will take care of calling the Visitor as necessary.

*C++ implementation*

I've written up a patch implementing this for C++ (not yet including the nonblocking part):

http://codereview.appspot.com/4077052

*Feedback*

What do you think?

I know I'm excited to use this in some of my own side projects (which is why I spent my weekend working on it), but before adding this to the official implementation we sh
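The push-style parsing described under *Nonblocking I/O* could behave roughly like the sketch below. It is not the proposed NonblockingHelper: RawVisitor, RecordingVisitor and PushParser are hypothetical stand-ins written without the protobuf library. The sketch handles only length-delimited fields (wire type 2), and its PushData takes a const pointer and returns a bool, unlike the signature in the proposal.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Stand-in for a generated visitor: receives each complete length-delimited
// field as (field number, raw payload bytes).
class RawVisitor {
 public:
  virtual ~RawVisitor() {}
  virtual void VisitField(int field_number, const std::string& payload) = 0;
};

// Example visitor that just records what it is given.
class RecordingVisitor : public RawVisitor {
 public:
  std::vector<std::pair<int, std::string> > seen;
  void VisitField(int field_number, const std::string& payload) override {
    seen.push_back(std::make_pair(field_number, payload));
  }
};

class PushParser {
 public:
  explicit PushParser(RawVisitor* visitor) : visitor_(visitor) {}

  // Buffers `size` bytes, passes every field that is now complete to the
  // visitor and keeps any left-over bytes for the next call. Returns false
  // if it meets a wire type other than 2, which this sketch does not handle.
  bool PushData(const void* data, int size) {
    buffer_.append(static_cast<const char*>(data),
                   static_cast<std::size_t>(size));
    std::size_t pos = 0;  // end of the last field delivered
    bool ok = true;
    while (true) {
      std::size_t cursor = pos;
      std::uint64_t tag = 0;
      std::uint64_t length = 0;
      if (!ReadVarint(&cursor, &tag)) break;  // tag not complete yet
      if ((tag & 7) != 2) { ok = false; break; }
      if (!ReadVarint(&cursor, &length)) break;     // length not complete yet
      if (length > buffer_.size() - cursor) break;  // payload not complete yet
      visitor_->VisitField(
          static_cast<int>(tag >> 3),
          buffer_.substr(cursor, static_cast<std::size_t>(length)));
      pos = cursor + static_cast<std::size_t>(length);
    }
    buffer_.erase(0, pos);
    return ok;
  }

 private:
  // Reads a base-128 varint at *cursor. Returns false if the buffer ends
  // before the varint does (an over-long varint is treated the same way).
  bool ReadVarint(std::size_t* cursor, std::uint64_t* value) const {
    int shift = 0;
    *value = 0;
    while (*cursor < buffer_.size() && shift <= 63) {
      const std::uint8_t byte = static_cast<std::uint8_t>(buffer_[(*cursor)++]);
      *value |= static_cast<std::uint64_t>(byte & 0x7F) << shift;
      if ((byte & 0x80) == 0) return true;
      shift += 7;
    }
    return false;
  }

  RawVisitor* visitor_;
  std::string buffer_;
};
```

The caller can feed it whatever a non-blocking read returns, in pieces of any size. A field split across two reads is delivered once, after the second piece arrives.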