To be clear, HAvroBase stores tuples of (schema ID, data) and then looks up the schema from that ID. It doesn't store the full schema alongside each data record.
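As a rough illustration of the (schema ID, data) layout, here is a minimal Python sketch. It is not HAvroBase's actual API: `SchemaRegistry`, `RecordStore`, and the in-memory dictionaries are hypothetical stand-ins, and the payload is treated as opaque bytes rather than real Avro-encoded data.

```python
import json


class SchemaRegistry:
    """Hypothetical registry: each distinct schema is stored once, under a small ID."""

    def __init__(self):
        self._by_id = {}   # schema ID -> normalized schema text
        self._ids = {}     # normalized schema text -> schema ID

    def register(self, schema_json: str) -> int:
        # Simplification: normalize by sorting JSON keys. This is NOT
        # Avro's own canonical form, just enough to deduplicate here.
        key = json.dumps(json.loads(schema_json), sort_keys=True)
        if key not in self._ids:
            new_id = len(self._by_id) + 1
            self._ids[key] = new_id
            self._by_id[new_id] = key
        return self._ids[key]

    def lookup(self, schema_id: int) -> dict:
        return json.loads(self._by_id[schema_id])


class RecordStore:
    """Hypothetical store: each row holds (schema ID, payload), never the schema itself."""

    def __init__(self, registry: SchemaRegistry):
        self.registry = registry
        self._rows = {}

    def put(self, key: str, schema_json: str, payload: bytes) -> None:
        schema_id = self.registry.register(schema_json)
        self._rows[key] = (schema_id, payload)

    def get(self, key: str):
        schema_id, payload = self._rows[key]
        # The schema is resolved from the ID at read time.
        return self.registry.lookup(schema_id), payload


registry = SchemaRegistry()
store = RecordStore(registry)
user_schema = '{"type": "record", "name": "User", "fields": [{"name": "id", "type": "long"}]}'
store.put("row-1", user_schema, b"\x54")
store.put("row-2", user_schema, b"\x02")
writer_schema, payload = store.get("row-1")
```

Both rows end up sharing a single registered schema, so the per-record overhead is one small ID rather than the whole schema text.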
HAvroBase is really pretty nice and has backends for storing data into things other than HBase...

- Bruce

On Mon, Nov 29, 2010 at 11:09 AM, Philip Zeyliger <[email protected]> wrote:

> Hi David,
>
> Your assessment of Thrift and Avro being isomorphic is correct, and
> you've correctly identified the major philosophical difference. (It's
> in fact a little bit deeper than you suggest: at read time, there are
> always two schemas available: the reader's schema and the original
> schema that the data was written with.)
>
> Where are you storing the Avro records? Avro's binary format for
> records is unlikely to change: it's pretty stable and changing would
> be a big deal. On the other hand, Avro already has multiple ways for
> passing schema information along. Avro's RPC implementations do one
> thing. Avro Data Files store the schema in the header. You could, in
> your system, always store (schema, data) tuples. That's what Sam is
> doing in HAvroBase
> (http://www.javarants.com/2010/06/30/havrobase-a-searchable-evolvable-entity-store-on-top-of-hbase-and-solr/).
>
> -- Philip
>
> On Sun, Nov 28, 2010 at 6:39 PM, David Jeske <[email protected]> wrote:
> > I have a storage project considering adding Thrift or Avro for
> > record packing, and I have a couple questions.
> >
> > Other than type-id and field-ids, Avro and Thrift's designs seem
> > isomorphic. Is the binary format not including field-type-info
> > something that's set in stone, or something that's open for feedback?
> >
> > I prefer the philosophy of Avro, namely to expect schemas to be
> > available, use those schemas for compatibility mapping, and to
> > support dynamic schema parsing in any supported language. In fact,
> > being able to parse schemas dynamically in any language is the real
> > draw of Avro for me. (Personally I'd prefer if they were actually
> > Avro IDL, instead of JSON, but I understand that might complicate
> > implementing client stubs.)
> >
> > However, the fact that data is not tagged with any type information
> > is unacceptably dangerous for my application. There will be
> > mechanisms for mapping records to schemas, and schemas will be
> > stored, but if a schema were ever lost or corrupted, I can't afford
> > for the data to turn into total junk.
> >
> > Unless data is trivially small, encoding a field type wouldn't
> > change the size of the encoding much, but would provide some 'sanity
> > checking' in addition to making it possible to recover the raw data
> > even if a schema was lost or the schema ID for a piece of data was
> > corrupted.
> >
> > Since Avro is relatively new, I'm asking to find out if this is
> > anathema to the entire concept of Avro, or something that was
> > chosen, but might be reconsidered eventually.
> >
> > Going the Thrift route for me will mean injecting a bit of the Avro
> > philosophy into Thrift, namely, adding a Thrift IDL parser to the
> > language I need, so I can save Thrift IDLs and then dynamically read
> > them. However, doing this as a one-off for my language is different
> > from having a supported mechanism for all client languages -- like
> > in Avro.
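
To make the size and recoverability trade-off in David's question concrete, here is a small Python sketch comparing an untagged encoding with one that adds a single type byte per field. The long and string encodings follow Avro's binary format (zigzag variable-length longs, length-prefixed UTF-8 strings), but the tag bytes `TAG_LONG` and `TAG_STRING` and the tagged layout are invented for illustration; they are neither Avro's nor Thrift's wire format.

```python
def encode_long(n: int) -> bytes:
    """Zigzag plus variable-length encoding for a 64-bit long."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while True:
        low7 = z & 0x7F
        z >>= 7
        if z:
            out.append(low7 | 0x80)  # more bytes follow
        else:
            out.append(low7)
            return bytes(out)


def decode_long(buf: bytes, pos: int):
    shift = 0
    z = 0
    while True:
        b = buf[pos]
        pos += 1
        z |= (b & 0x7F) << shift
        if not (b & 0x80):
            break
        shift += 7
    return (z >> 1) ^ -(z & 1), pos  # undo zigzag


def encode_string(s: str) -> bytes:
    raw = s.encode("utf-8")
    return encode_long(len(raw)) + raw


def decode_string(buf: bytes, pos: int):
    length, pos = decode_long(buf, pos)
    return buf[pos:pos + length].decode("utf-8"), pos + length


# Hypothetical one-byte type tags (an assumption, not part of any real format).
TAG_LONG = 0x01
TAG_STRING = 0x02


def encode_untagged(fields) -> bytes:
    """Field values only; unreadable without the schema."""
    out = bytearray()
    for value in fields:
        out += encode_long(value) if isinstance(value, int) else encode_string(value)
    return bytes(out)


def encode_tagged(fields) -> bytes:
    """Same values, each preceded by a one-byte type tag."""
    out = bytearray()
    for value in fields:
        if isinstance(value, int):
            out.append(TAG_LONG)
            out += encode_long(value)
        else:
            out.append(TAG_STRING)
            out += encode_string(value)
    return bytes(out)


def decode_tagged(buf: bytes) -> list:
    """Recover raw values using only the tags, with no schema at hand."""
    pos = 0
    values = []
    while pos < len(buf):
        tag = buf[pos]
        pos += 1
        if tag == TAG_LONG:
            value, pos = decode_long(buf, pos)
        elif tag == TAG_STRING:
            value, pos = decode_string(buf, pos)
        else:
            raise ValueError(f"unknown type tag {tag:#x} at offset {pos - 1}")
        values.append(value)
    return values


record = [42, "hello", -7]
untagged = encode_untagged(record)
tagged = encode_tagged(record)
overhead_bytes = len(tagged) - len(untagged)  # one byte per field in this scheme
recovered = decode_tagged(tagged)
```

In this scheme the overhead is one byte per field, and the tagged bytes can be decoded back to raw values without a schema, although field names are still lost.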
