question about completely untagged data...

David Jeske Sun, 28 Nov 2010 18:40:05 -0800

I have a storage project and am considering adding Thrift or Avro to it for
record packing, and I have a couple of questions.


Other than type-ids and field-ids, Avro's and Thrift's designs seem
isomorphic. *Is the binary format not including field-type-info something
that's set in stone, or something that's open for feedback?*

I prefer the philosophy of Avro, namely to expect schemas to be available,
use those schemas for compatibility mapping, and to support dynamic schema
parsing in any supported language. In fact, being able to parse schemas
dynamically in any language is the real draw of Avro for me. (Personally, I'd
prefer if the schemas were actually Avro IDL instead of JSON, but I understand
that might complicate implementing client stubs.)
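
To make the "dynamic schema parsing" point concrete, here is a minimal sketch of reading a JSON record schema at runtime with nothing but a JSON parser. The schema, the record name `User`, and the helper `field_types` are made up for illustration; this is not the Avro library's API, only the kind of thing a JSON schema makes easy in any language.

```python
import json

# Hypothetical schema for illustration; record and field names are made up.
SCHEMA_JSON = """
{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "name", "type": "string"}
  ]
}
"""

def field_types(schema_text):
    """Parse a JSON record schema at runtime and return (name, type) pairs."""
    schema = json.loads(schema_text)
    if schema.get("type") != "record":
        raise ValueError("expected a record schema")
    return [(field["name"], field["type"]) for field in schema["fields"]]

fields = field_types(SCHEMA_JSON)
```

No code generation step is involved: the field list is available as ordinary data as soon as the schema text is loaded.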

However, the fact that data is not tagged with any type information is
unacceptably dangerous for my application. There will be mechanisms for
mapping records to schemas, and schemas will be stored, but if a schema were
ever lost or corrupted, I can't afford for the data to turn into total junk.
Unless the data is trivially small, encoding a field type wouldn't change the
size of the encoding much, but it would provide some 'sanity checking', and
it would make it possible to recover the raw data even if a schema was lost
or the schema ID for a piece of data was corrupted.
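
As a rough illustration of the trade-off, here is a minimal sketch of a toy encoding with and without per-field type tags. The one-byte tags `T_INT` and `T_STR`, the fixed-width layout, and all function names are assumptions invented for this example; this is neither the Avro nor the Thrift wire format.

```python
import struct

# Hypothetical one-byte type tags; NOT the Avro or Thrift wire format.
T_INT, T_STR = 1, 2

def encode_untagged(record):
    """Pack values back to back; only a schema says how to read them."""
    out = b""
    for value in record:
        if isinstance(value, int):
            out += struct.pack(">q", value)  # 8-byte big-endian integer
        else:
            data = value.encode("utf-8")
            out += struct.pack(">I", len(data)) + data  # length, then bytes
    return out

def encode_tagged(record):
    """Same layout, but each field is preceded by a one-byte type tag."""
    out = b""
    for value in record:
        if isinstance(value, int):
            out += bytes([T_INT]) + struct.pack(">q", value)
        else:
            data = value.encode("utf-8")
            out += bytes([T_STR]) + struct.pack(">I", len(data)) + data
    return out

def decode_tagged(buf):
    """Recover the raw values with no schema, using only the tags."""
    values, pos = [], 0
    while pos < len(buf):
        tag = buf[pos]
        pos += 1
        if tag == T_INT:
            (number,) = struct.unpack_from(">q", buf, pos)
            pos += 8
            values.append(number)
        elif tag == T_STR:
            (length,) = struct.unpack_from(">I", buf, pos)
            pos += 4
            values.append(buf[pos:pos + length].decode("utf-8"))
            pos += length
        else:
            # An unknown tag is the 'sanity check': corruption is detected.
            raise ValueError(f"unknown type tag {tag} at offset {pos - 1}")
    return values

record = [42, "hello"]
untagged = encode_untagged(record)
tagged = encode_tagged(record)
overhead = len(tagged) - len(untagged)  # one byte per field in this toy format
recovered = decode_tagged(tagged)
```

In this toy format the tagged form costs one extra byte per field, and `decode_tagged` can walk the bytes without any schema. The untagged bytes carry no such hints, so without the schema there is no reliable way to tell where one field ends and the next begins.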

Since Avro is relatively new, I'm asking to find out if this is anathema to
the entire concept of Avro, or something that was chosen but might be
reconsidered eventually.

Going the Thrift route will mean injecting a bit of the Avro philosophy
into Thrift, namely adding a Thrift IDL parser to the language I need, so I
can save Thrift IDLs and then dynamically read them. However, doing this as
a one-off for my language is different from having a supported mechanism for
all client languages -- like in Avro.
