Re: Schema not getting saved along with Data

Martin Kleppmann Wed, 02 Apr 2014 14:02:56 -0700

On 1 Apr 2014, at 11:12, Lewis John Mcgibbney wrote:
Right now we maintain only the Writer's schema, which as I mentioned is 
appended within the generated Persistent Java bean. In my own experience (and 
as you've hinted at :) ) this had/has caused us problems in the past.
For example we added a new (pretty innocent) string Field 'batchId' to our 
WebPage Schema [0] over in Nutch meaning that new Records being written 
included it and older records already within the data set did not.
{"name": "batchId", "type": "string"}
This inevitably threw an NPE when certain tools attempted to access records 
in which the batchId field and value were absent.

I have seen several people get confused about this before -- you're not alone.
I actually think the fact that you have two different schemas when reading is
the thing that most confuses people who are new to Avro. It's so different from
what most people are used to.
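
To make the two-schema idea concrete, here is a minimal sketch in plain Python. It is a simplified model of the resolution rule for record fields, not the Avro library's API: the `resolve` function, the `url` field, and the empty-string default for `batchId` are all assumptions made up for illustration. It shows an old record, written before `batchId` existed, being read with a newer reader schema.

```python
# Simplified model of Avro-style schema resolution for flat records.
# This is NOT the Avro library API; names and defaults are hypothetical.

WRITER_SCHEMA_V1 = {
    "type": "record",
    "name": "WebPage",
    "fields": [{"name": "url", "type": "string"}],
}

# Reader schema adds batchId WITH a default (the default is an assumption;
# the field quoted in this thread had none).
READER_SCHEMA_V2 = {
    "type": "record",
    "name": "WebPage",
    "fields": [
        {"name": "url", "type": "string"},
        {"name": "batchId", "type": "string", "default": ""},
    ],
}


def resolve(record, writer_schema, reader_schema):
    """Translate a record decoded with writer_schema into reader_schema's shape."""
    writer_fields = {f["name"] for f in writer_schema["fields"]}
    resolved = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in writer_fields:
            # Field was written: take the stored value.
            resolved[name] = record[name]
        elif "default" in field:
            # Field unknown to the writer: fall back to the reader's default.
            resolved[name] = field["default"]
        else:
            # No stored value and no default: resolution fails loudly
            # instead of handing back a null that blows up later.
            raise ValueError(f"no value or default for field {name!r}")
    # Fields present only in the writer schema are simply dropped.
    return resolved


old_record = {"url": "http://example.org/"}
print(resolve(old_record, WRITER_SCHEMA_V1, READER_SCHEMA_V2))
```

The point of the sketch is that the reader needs to know which schema the data was written with. If only one schema is kept around, there is nothing to compare against, and a missing field surfaces later as a null.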

So taking a bit of advice from a well-recognized voice in this area (uh hum ;))

Haha ;)

For those following along on the mailing list, Lewis quoted from my blog post:
http://martin.kleppmann.com/2012/12/05/schema-evolution-in-avro-protocol-buffers-thrift.html

Fortunately in the above example this particular Schema has only changed once
in some 2 or 3 years. However it HAS changed.

It's probably safe to assume that every schema will have to change sooner or
later.

Looks like I am also taking a lesson from this thread and we have a bit more
work to do on Gora to address the above points. This is of course unless I have
missed something!

A proposal to create a registry of Avro schemas has been a long time coming
(https://issues.apache.org/jira/browse/AVRO-1124). This would allow you to
include a small version number or hash of the schema in each record, to
indicate the writer schema that was used to encode it. That would be much lower
overhead than including the entire schema with every record.
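
As a rough illustration of that idea, here is a small Python sketch of an in-memory registry that tags each record with a short schema hash. Everything in it is hypothetical: the `SchemaRegistry` class, the 8-byte truncated SHA-256 ID, and the sorted-key JSON canonicalisation are assumptions for the example, not the design proposed in AVRO-1124. Avro's specification defines its own canonical form and fingerprints, which a real implementation would more likely use.

```python
import hashlib
import json

ID_LENGTH = 8  # bytes of hash kept per record (an arbitrary choice here)


class SchemaRegistry:
    """Hypothetical in-memory registry mapping short schema IDs to schemas."""

    def __init__(self):
        self._schemas = {}

    def register(self, schema):
        # Serialise deterministically so equal schemas get equal IDs.
        canonical = json.dumps(schema, sort_keys=True, separators=(",", ":"))
        schema_id = hashlib.sha256(canonical.encode("utf-8")).digest()[:ID_LENGTH]
        self._schemas[schema_id] = schema
        return schema_id

    def lookup(self, schema_id):
        return self._schemas[schema_id]


def tag_record(schema_id, payload):
    """Prefix the encoded record bytes with the writer schema's ID."""
    return schema_id + payload


def split_record(blob):
    """Recover (schema_id, payload) from a tagged record."""
    return blob[:ID_LENGTH], blob[ID_LENGTH:]


registry = SchemaRegistry()
writer_schema = {
    "type": "record",
    "name": "WebPage",
    "fields": [{"name": "batchId", "type": "string"}],
}
schema_id = registry.register(writer_schema)

# The payload stands in for Avro-encoded bytes; the encoding is out of scope.
stored = tag_record(schema_id, b"encoded-record-bytes")
found_id, payload = split_record(stored)
assert registry.lookup(found_id) == writer_schema
```

Each record then carries 8 extra bytes rather than the full schema, and the reader can look up the exact writer schema before decoding.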

As Gora is itself a database access layer, you can probably store the schemas
in the same database as the records. If you go ahead and implement this, it
would be great if you could keep compatibility with the AVRO-1124 schema
registry in mind.

If Gora can hide the writer/reader schema distinction from users, and just do
the right thing with schema evolution, that would be awesome!

Martin
