Re: [protobuf] Arbitrary corruption of repeated fields

2010-01-27 Thread Kenton Varda
On Wed, Jan 27, 2010 at 9:05 PM, Michael Poole wrote:

> If you serialize the elements inside the Bag to the disk individually,
> you could prefix them with a synchronizing marker and length.  A marker
> would typically be a fixed-length pattern that is unlikely to appear in
> legitimate data.  Starting it with a zero byte is a good choice for
> Protocol Buffers data, but it should also contain some other (ideally
> uncommon) bytes for robustness.
>

I'd add that each framed message should also carry some sort of checksum
alongside the marker and length, e.g. a CRC.  Otherwise, you might not
detect corruption when it happens.  It's very easy for a corrupt message to
still appear to parse correctly.  In an environment where corruption is a
concern, you definitely want to verify all data to make sure you don't
accidentally start using garbage!
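
A minimal Java sketch of such a frame, using only the JDK.  The layout
(marker, length, CRC-32, payload), the marker bytes, and the names
ChecksummedFrame, frame() and unframe() are assumptions made for
illustration, not part of protocol buffers.  The payload would be one
serialized message, e.g. the result of toByteArray().

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

/** Sketch: frame = MARKER (4 bytes) | length (4 bytes) | CRC-32 of payload (4 bytes) | payload. */
final class ChecksummedFrame {
    // Hypothetical marker: a zero byte followed by arbitrary, hopefully uncommon bytes.
    static final byte[] MARKER = {0x00, (byte) 0xF7, (byte) 0xA5, (byte) 0x3C};
    static final int HEADER_SIZE = MARKER.length + 4 + 4;

    static int crc32(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length);
        return (int) crc.getValue();
    }

    /** Wraps one serialized message in a frame. */
    static byte[] frame(byte[] payload) {
        ByteBuffer buf = ByteBuffer.allocate(HEADER_SIZE + payload.length);
        buf.put(MARKER);
        buf.putInt(payload.length);
        buf.putInt(crc32(payload));
        buf.put(payload);
        return buf.array();
    }

    /** Returns the payload if the frame is intact, or null if any check fails. */
    static byte[] unframe(byte[] frame) {
        if (frame.length < HEADER_SIZE) return null;
        ByteBuffer buf = ByteBuffer.wrap(frame);
        for (byte b : MARKER) {
            if (buf.get() != b) return null;          // marker damaged
        }
        int length = buf.getInt();
        int expectedCrc = buf.getInt();
        if (length != buf.remaining()) return null;   // length damaged or frame truncated
        byte[] payload = new byte[length];
        buf.get(payload);
        return crc32(payload) == expectedCrc ? payload : null;
    }

    public static void main(String[] args) {
        byte[] framed = frame("hello".getBytes(StandardCharsets.UTF_8));
        System.out.println("intact frame accepted: " + (unframe(framed) != null));
        framed[HEADER_SIZE] ^= 0x01;                  // flip one bit in the payload
        System.out.println("corrupt frame accepted: " + (unframe(framed) != null));
    }
}
```

Only a payload that passes unframe() would then be handed to parseFrom(),
so garbage that happens to parse is rejected before it is used.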

-- 
You received this message because you are subscribed to the Google Groups 
"Protocol Buffers" group.
To post to this group, send email to proto...@googlegroups.com.
To unsubscribe from this group, send email to 
protobuf+unsubscr...@googlegroups.com.
For more options, visit this group at 
http://groups.google.com/group/protobuf?hl=en.



Re: [protobuf] Arbitrary corruption of repeated fields

2010-01-27 Thread Michael Poole
Stefan writes:

> What could I do to reduce the risk of losing the entire list due to
> arbitrary corruption? What if corruption only occurs at the end of the
> file, would it be simpler to recover all the elements up to the
> corruption point?

If you serialize the elements inside the Bag to the disk individually,
you could prefix them with a synchronizing marker and length.  A marker
would typically be a fixed-length pattern that is unlikely to appear in
legitimate data.  Starting it with a zero byte is a good choice for
Protocol Buffers data, but it should also contain some other (ideally
uncommon) bytes for robustness.

By reading the marker, length, message, and checking the next marker,
your program can be reasonably sure that the detected message boundaries
are correct.  Recovery then becomes a matter of looking for the next
synchronizing marker, and checking it the same way.
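
The read-and-resynchronize loop described above could be sketched in Java
roughly as follows.  The marker bytes, the 4-byte big-endian length, and
the names ResyncReader, writeRecord() and recover() are assumptions chosen
for illustration.  Note that this only validates message boundaries; a
damaged byte inside a payload is not detected by the marker check alone.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ResyncReader {
    // Hypothetical marker: a zero byte followed by arbitrary, hopefully uncommon bytes.
    static final byte[] MARKER = {0x00, (byte) 0xF7, (byte) 0xA5, (byte) 0x3C};
    static final int HEADER_SIZE = MARKER.length + 4;

    static boolean markerAt(byte[] data, int pos) {
        if (pos + MARKER.length > data.length) return false;
        for (int i = 0; i < MARKER.length; i++) {
            if (data[pos + i] != MARKER[i]) return false;
        }
        return true;
    }

    /** Writes MARKER | 4-byte big-endian length | payload. */
    static void writeRecord(ByteArrayOutputStream out, byte[] payload) {
        out.write(MARKER, 0, MARKER.length);
        out.write(ByteBuffer.allocate(4).putInt(payload.length).array(), 0, 4);
        out.write(payload, 0, payload.length);
    }

    /** Returns every payload whose boundaries check out; skips regions that do not. */
    static List<byte[]> recover(byte[] data) {
        List<byte[]> records = new ArrayList<>();
        int pos = 0;
        while (pos + HEADER_SIZE <= data.length) {
            if (!markerAt(data, pos)) {
                pos++;                                // scan forward for the next marker
                continue;
            }
            int length = ByteBuffer.wrap(data, pos + MARKER.length, 4).getInt();
            long end = (long) pos + HEADER_SIZE + length;
            // Trust the boundary only if the record ends at end-of-data or at another marker.
            boolean plausible = length >= 0 && end <= data.length
                    && (end == data.length || markerAt(data, (int) end));
            if (!plausible) {
                pos++;                                // false or damaged record: resynchronize
                continue;
            }
            records.add(Arrays.copyOfRange(data, pos + HEADER_SIZE, (int) end));
            pos = (int) end;
        }
        return records;
    }
}
```

Each recovered byte array would then be passed to parseFrom() on its own,
so one damaged record costs only that record.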

There is obviously a tradeoff between how much data you can lose with a
corrupted message and the per-message overhead.  If you were using the
particular example in your email, you might serialize a Bag that
contains several Items rather than serializing each Item individually.

Michael Poole




Re: [protobuf] Arbitrary corruption of repeated fields

2010-01-27 Thread Kenton Varda
Sorry, it is a non-goal of protocol buffers to provide message integrity --
this is left to a higher layer.  One byte of corruption in a protocol
message can very easily make it impossible to parse the remainder of the
message, or even make the rest of the message appear as parseable garbage.
Therefore, trying to design code which can work around corruption in the
message is fraught with peril, and no one has tried.

If you need to be able to recover from corruption without discarding the
whole file, the way to do it is by designing your file format to contain
multiple protocol buffers framed in some way that allows you to continue
reading the others if one is corrupted.  This isn't something protocol
buffers can provide, but it would make sense for someone to write a library
on top of protobufs that provides it.
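
As a rough illustration of the simplest such framing, here is a Java
sketch that stores each message behind a 4-byte length and, on reading,
keeps every record that precedes a truncated or implausible tail.  The
class and method names are hypothetical, and the payloads stand in for
individually serialized messages.  The Java protobuf library's
writeDelimitedTo() and parseDelimitedFrom() offer a similar (varint)
length prefix if your version includes them; neither they nor this
sketch can skip past damage in the middle of a file, which needs
markers or checksums on top.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

final class LengthPrefixedLog {
    /** Stores each record as a 4-byte big-endian length followed by the record bytes. */
    static byte[] write(List<byte[]> records) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (byte[] record : records) {
            out.write(ByteBuffer.allocate(4).putInt(record.length).array(), 0, 4);
            out.write(record, 0, record.length);
        }
        return out.toByteArray();
    }

    /** Reads records in order and stops at the first truncated or implausible one. */
    static List<byte[]> readUntilDamage(byte[] data) {
        List<byte[]> records = new ArrayList<>();
        ByteBuffer buf = ByteBuffer.wrap(data);
        while (buf.remaining() >= 4) {
            int length = buf.getInt();
            if (length < 0 || length > buf.remaining()) {
                break;                       // tail is damaged: keep what was read so far
            }
            byte[] record = new byte[length];
            buf.get(record);
            records.add(record);
        }
        return records;
    }
}
```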

On Wed, Jan 27, 2010 at 8:40 PM, Stefan wrote:

> Hello everybody,
>
> I have a small dilemma with regard to protocol buffers. I read the
> documentation but I still do not see a clear answer (I only use the
> Java version of protocol buffers). I hope I am not missing something
> really obvious here ...
>
> I have the following setup:
>
> message Item {
>   optional string name = 1;
>   optional string description = 2;
> }
>
> message Bag {
>   repeated Item item = 1;
> }
>
> In the code, a Bag (with a very large number of items) gets
> serialized to a file. Now, let's suppose the file gets corrupted at an
> arbitrary point in the middle. From my experiments, the entire content
> would be lost because the Bag can no longer be deserialized. To construct
> the Bag, I use the parseFrom() method and I get exceptions. I do not
> see anything in the documentation that would suggest mergeFrom() would
> have a different result either.
>
> I do not expect to be able to recover any individual corrupted items
> but it would be nice to be able to recover the rest of the list.
>
> What could I do to reduce the risk of losing the entire list due to
> arbitrary corruption? What if corruption only occurs at the end of the
> file, would it be simpler to recover all the elements up to the
> corruption point?
>
> Thanks for your help!
