On Sun, Nov 28, 2010 at 8:09 PM, Philip Zeyliger <[email protected]> wrote:
> Where are you storing the Avro records?

This is part of a database/storage project. To avoid the overhead of a schema per record, I can store a schema ID per record and keep a directory of schemas in the system. However, if the place where a schema is stored gets botched (the ID gets corrupted, the schema file gets corrupted or lost, etc.), the records become completely unintelligible. That sounds like a scary prospect.

> You could, in your system, always store (schema, data) tuples. That's what
> Sam is doing in HAvroBase
> (http://www.javarants.com/2010/06/30/havrobase-a-searchable-evolvable-entity-store-on-top-of-hbase-and-solr/).

That sounds fine when records are documents and record sizes are large. In my application, however, there will be too many records comparable to (or smaller than) the schema size for this to be practical. Without compression it would double or triple the data size; with compression it's a lot of unnecessary extra work decoding and re-encoding schemas.

I suppose I could use Avro's API to dump a "dense binary type-only schema" that wouldn't have the names of types, only a packed format of the types themselves. This would essentially be the same as Thrift, except that the type information would be packed at the beginning (or end) of the record instead of interspersed with the data. In the common case Avro would be handed the "real" schemas anyway (old and new), so it wouldn't even look at this; it would just be there for safety's sake in case we needed to do some disaster recovery.
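
For concreteness, here's a rough sketch of the schema-ID-per-record idea against Avro's Java API. The SchemaIdStoreSketch class and its schemaDirectory map are just placeholders for whatever the real storage layer does, and SchemaNormalization.parsingFingerprint64 assumes a recent Avro release (an older one would need a hand-rolled fingerprint):

    // Sketch only: each record is stored as (fingerprint, payload), and a
    // separate "directory" maps fingerprints back to the full writer schema.
    // The registry map and record layout here are hypothetical, not Avro APIs.
    import java.io.ByteArrayOutputStream;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.avro.Schema;
    import org.apache.avro.SchemaNormalization;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.io.BinaryDecoder;
    import org.apache.avro.io.BinaryEncoder;
    import org.apache.avro.io.DecoderFactory;
    import org.apache.avro.io.EncoderFactory;

    public class SchemaIdStoreSketch {
      // Hypothetical schema directory: fingerprint -> writer schema.
      private final Map<Long, Schema> schemaDirectory = new HashMap<Long, Schema>();

      /** Register a schema and return the 64-bit fingerprint used as its ID. */
      public long register(Schema schema) {
        long id = SchemaNormalization.parsingFingerprint64(schema);
        schemaDirectory.put(id, schema);
        return id;
      }

      /** Encode a record; the caller stores (schemaId, bytes) rather than (schema, bytes). */
      public byte[] encode(Schema writerSchema, GenericRecord record) throws Exception {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder enc = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(writerSchema).write(record, enc);
        enc.flush();
        return out.toByteArray();
      }

      /** Decode with schema resolution: look up the writer schema by ID, read with the current schema. */
      public GenericRecord decode(long schemaId, byte[] bytes, Schema readerSchema) throws Exception {
        Schema writerSchema = schemaDirectory.get(schemaId);  // records become unintelligible if this entry is lost
        BinaryDecoder dec = DecoderFactory.get().binaryDecoder(bytes, null);
        return new GenericDatumReader<GenericRecord>(writerSchema, readerSchema).read(null, dec);
      }
    }

The decode() lookup is exactly the scary part: everything hinges on the directory entry for that fingerprint still being intact, which is why I'd like some denser fallback stored with the data itself.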
