Re: question about completely untagged data...

Bruce Mitchener Sun, 28 Nov 2010 20:44:38 -0800

To be clear, HAvroBase stores tuples of (schema ID, data) and then looks up
the schema from that ID.  It doesn't store a full copy of each schema
alongside the corresponding data records.
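
A rough sketch of that layout in Python, for illustration only. The class
and method names here (SchemaStore, register, put, get) are made up and are
not HAvroBase's actual API, and deriving the ID from a hash of the schema
text is just one assumed way to produce an ID:

```python
import hashlib
import json


class SchemaStore:
    """Toy store: each schema is kept once, records carry only a schema ID."""

    def __init__(self):
        self.schemas = {}  # schema ID -> schema JSON text
        self.records = []  # list of (schema ID, data bytes) tuples

    def register(self, schema):
        # Derive a stable ID from the canonicalized schema text, so the
        # same schema always maps to the same ID and is stored only once.
        text = json.dumps(schema, sort_keys=True)
        schema_id = hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]
        self.schemas[schema_id] = text
        return schema_id

    def put(self, schema_id, data):
        # Refuse data whose schema is unknown; otherwise it could never
        # be decoded later.
        if schema_id not in self.schemas:
            raise KeyError("unknown schema ID: " + schema_id)
        self.records.append((schema_id, data))
        return len(self.records) - 1

    def get(self, index):
        # Look the schema up from the ID stored next to the data.
        schema_id, data = self.records[index]
        return json.loads(self.schemas[schema_id]), data
```

The point is only that the per-record overhead is a short ID rather than
the whole schema; the schema itself lives in a separate lookup table.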


HAvroBase is really pretty nice and has backends for storing data into
things other than HBase...

 - Bruce

On Mon, Nov 29, 2010 at 11:09 AM, Philip Zeyliger <[email protected]> wrote:

> Hi David,
>
> Your assessment of Thrift and Avro being isomorphic is correct, and
> you've correctly identified the major philosophical difference.  (It's
> in fact a little bit deeper than you suggest: at read time, there are
> always two schemas available: the reader's schema and the original
> schema that the data was written with.)
>
> Where are you storing the Avro records?  Avro's binary format for
> records is unlikely to change: it's pretty stable and changing would
> be a big deal.  On the other hand, Avro already has multiple ways for
> passing schema information along.  Avro's RPC implementations do one
> thing.  Avro Data Files store the schema in the header.  You could, in
> your system, always store (schema, data) tuples.  That's what Sam is
> doing in HAvroBase
> (
> http://www.javarants.com/2010/06/30/havrobase-a-searchable-evolvable-entity-store-on-top-of-hbase-and-solr/
> ).
>
> -- Philip
>
> On Sun, Nov 28, 2010 at 6:39 PM, David Jeske <[email protected]> wrote:
> > I have a storage project considering adding Thrift or Avro for record
> > packing, and I have a couple of questions.
> > Other than type-id and field-ids, Avro and Thrift's designs seem
> > isomorphic. Is the binary format not including field-type-info something
> > that's set in stone, or something that's open for feedback?
> > I prefer the philosophy of Avro, namely to expect schemas to be available,
> > use those schemas for compatibility mapping, and to support dynamic schema
> > parsing in any supported language. In fact, being able to parse schemas
> > dynamically in any language is the real draw of Avro for me. (Personally I'd
> > prefer if they were actually Avro IDL, instead of JSON, but I understand
> > that might complicate implementing client stubs.)
> > However, the fact that data is not tagged with any type information is
> > unacceptably dangerous for my application. There will be mechanisms for
> > mapping records to schemas, and schemas will be stored, but if a schema were
> > ever lost or corrupted, I can't afford for the data to turn into total junk.
> > Unless data is trivially small, encoding a field type wouldn't change the
> > size of the encoding much, but would provide some 'sanity checking' in
> > addition to being able to recover the raw data even if a schema was lost or
> > the schema ID for a piece of data was corrupted.
> > Since Avro is relatively new, I'm asking to find out if this is anathema to
> > the entire concept of Avro, or something that was chosen, but
> > might be reconsidered eventually.
> > Going the Thrift route for me will mean injecting a bit of the Avro
> > philosophy into Thrift, namely, adding a Thrift IDL parser to the language I
> > need, so I can save Thrift IDLs and then dynamically read them. However,
> > doing this as a one-off for my language is different from having a supported
> > mechanism for all client languages -- like in Avro.
> >
> >
>
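
For readers of the thread, the concern in David's original message (untagged
data turning into junk without its schema) can be illustrated with a small
Python sketch. It hand-decodes Avro's binary encoding for two types only:
a long is a zigzag variable-length integer, and a string is a long length
followed by UTF-8 bytes. The helper names (read_long, read_string, decode)
are invented for this example and are not part of any Avro library.

```python
def read_long(buf, pos):
    """Decode one zigzag varint long; return (value, next position)."""
    shift = 0
    raw = 0
    while True:
        byte = buf[pos]
        pos += 1
        raw |= (byte & 0x7F) << shift
        if not byte & 0x80:  # high bit clear: last byte of this varint
            break
        shift += 7
    return (raw >> 1) ^ -(raw & 1), pos


def read_string(buf, pos):
    """Decode a length-prefixed UTF-8 string; return (value, next position)."""
    length, pos = read_long(buf, pos)
    return buf[pos:pos + length].decode("utf-8"), pos + length


def decode(buf, field_types):
    """Decode a record whose fields are simply concatenated, with no tags."""
    readers = {"long": read_long, "string": read_string}
    pos = 0
    values = []
    for field_type in field_types:
        value, pos = readers[field_type](buf, pos)
        values.append(value)
    return values


payload = b"\x02A"  # written as a record with one string field, "A"

as_string = decode(payload, ["string"])       # ['A']
as_longs = decode(payload, ["long", "long"])  # [1, -33]
```

Both calls succeed: the same two bytes are a valid one-string record and a
valid two-long record. Nothing in the bytes says which reading is right, so
a lost or mismatched schema can go undetected. A per-field type tag, as in
Thrift's binary protocol, would let the second decode fail instead.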
