On 11/29/2010 11:04 AM, David Jeske wrote:
> I don't follow how this would be possible with Avro. With no type
> information, how would you tell the difference between an array of ints,
> a bunch of enums, a binary chunk of data, or even just a string? Thrift
> and Protobufs have the types so understanding the structure would be
> trivial, it's only the meaning that would need to be re-derived.

Protobuf binary only has sizes, not types. Thrift's efficient encodings probably also just have sizes.
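
To make the point about sizes concrete, here is a minimal Python sketch of walking a protobuf-encoded message without its schema. It relies only on the published wire format, where each field key is a varint holding (field_number << 3) | wire_type. The function names and the sample bytes are illustrative assumptions, not part of any protobuf library.

```python
# Wire types as defined by the protobuf encoding; they describe size, not meaning.
WIRE_TYPES = {
    0: "varint",
    1: "64-bit",
    2: "length-delimited",
    5: "32-bit",
}


def read_varint(buf, pos):
    """Decode a base-128 varint starting at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not (byte & 0x80):
            return result, pos
        shift += 7


def scan_fields(buf):
    """List (field_number, wire_type_name) for each top-level field.

    Hypothetical helper for illustration: it skips over each value using
    only the size information that the wire type provides.
    """
    pos = 0
    fields = []
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        field_number, wire_type = key >> 3, key & 0x07
        if wire_type == 0:
            _, pos = read_varint(buf, pos)
        elif wire_type == 1:
            pos += 8
        elif wire_type == 2:
            length, pos = read_varint(buf, pos)
            pos += length
        elif wire_type == 5:
            pos += 4
        else:
            raise ValueError(f"unsupported wire type {wire_type}")
        fields.append((field_number, WIRE_TYPES[wire_type]))
    return fields


# Field 1 = varint 150, field 2 = the seven bytes of "testing".
sample = bytes.fromhex("089601120774657374696e67")
fields = scan_fields(sample)
```

A string, a byte array, a nested message and a packed repeated field all show up as "length-delimited" here, so the reader learns where values start and end but not what they are.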

If you have a file with 1M records of the same structure, it's usually not hard to find patterns. Strings stand out and punctuate things. Byte arrays are also often easy to identify, especially since they're length-prefixed. Fields that often have the same value (e.g., zero or one) also help punctuate. However, a record that contains only four random single-precision floating-point values could be hard to distinguish from an Avro "fixed" containing 16 random bytes. In my experience, structures like these are less common. In this case, Protobuf and Thrift would tell you that one had four four-byte values and the other had one 16-byte value, which would be helpful but not definitive. Also, if you have a table, you often have some idea what it contains.
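
The hard case above can be sketched in a few lines of Python. Avro writes a float as four little-endian bytes and a "fixed" as its raw bytes, with no framing around either, so the same 16 bytes decode cleanly under both readings. The helper names and sample values are assumptions made up for this illustration.

```python
import struct


def as_four_floats(chunk):
    """Interpret 16 bytes as four little-endian single-precision floats."""
    return struct.unpack("<4f", chunk)


def as_fixed16(chunk):
    """Interpret the same 16 bytes as an opaque fixed(16) value, shown as hex."""
    return chunk.hex()


# A record of four floats, encoded the way Avro would lay them out.
record = struct.pack("<4f", 1.5, -2.25, 3.0, 0.125)

floats_view = as_four_floats(record)
fixed_view = as_fixed16(record)
```

Both views succeed on the same input, and nothing in the bytes favors one over the other. With random values, only statistics across many records (or outside knowledge of the table) would hint at which reading is right.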

Doug
