On 1 Apr 2014, at 11:12, Lewis John Mcgibbney <[email protected]> wrote:
Right now we maintain only the Writer's schema, which as I mentioned is 
appended within the generated Persistent Java bean. In my own experience (and 
as you've hinted at :) ) this had/has caused us problems in the past.
For example we added a new (pretty innocent) string Field 'batchId' to our 
WebPage Schema [0] over in Nutch meaning that new Records being written 
included it and older records already within the data set did not.
{"name": "batchId", "type": "string"}
This inevitably threw an NPE when certain Tools attempted to access older 
records from which the batchId Field and its value were absent.

I have seen several people get confused about this before -- you're not alone. 
I actually think the fact that you have two different schemas when reading is 
the thing that most confuses people who are new to Avro. It's so different from 
what most people are used to.
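
To make the two-schema idea concrete, here is a minimal sketch of how a record 
written with an old schema can be projected onto a newer reader schema. It is a 
toy model in plain Python, not the Avro library: resolve_record and the 
cut-down WebPage schemas are hypothetical names made up for illustration. The 
rule it models is the one from the Avro specification: a field that the reader 
expects but the writer never wrote is filled from the reader's default, and it 
is an error if there is no default.

```python
# Toy model of Avro record resolution; this is NOT the Avro library.
# resolve_record and the schemas below are hypothetical, for illustration only.

def resolve_record(datum, writer_schema, reader_schema):
    """Project a record written with writer_schema onto reader_schema."""
    writer_fields = {f["name"] for f in writer_schema["fields"]}
    resolved = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in writer_fields:
            # Field was written: take the stored value.
            resolved[name] = datum[name]
        elif "default" in field:
            # Field is new to the reader: fall back to the reader's default.
            resolved[name] = field["default"]
        else:
            # No stored value and no default: resolution must fail loudly.
            raise ValueError(
                f"field {name!r} is absent from the writer schema and has no default"
            )
    # Fields known only to the writer are dropped, since the loop never visits them.
    return resolved


# Cut-down stand-ins for the WebPage schema before and after adding batchId.
OLD_WRITER = {
    "type": "record", "name": "WebPage",
    "fields": [{"name": "baseUrl", "type": "string"}],
}
NEW_READER_NO_DEFAULT = {
    "type": "record", "name": "WebPage",
    "fields": [{"name": "baseUrl", "type": "string"},
               {"name": "batchId", "type": "string"}],
}
NEW_READER_WITH_DEFAULT = {
    "type": "record", "name": "WebPage",
    "fields": [{"name": "baseUrl", "type": "string"},
               {"name": "batchId", "type": ["null", "string"], "default": None}],
}

old_record = {"baseUrl": "http://example.org/"}
print(resolve_record(old_record, OLD_WRITER, NEW_READER_WITH_DEFAULT))
```

With NEW_READER_NO_DEFAULT the same call raises ValueError, which is roughly 
the situation behind the NPE described above: the old records carry no batchId, 
and nothing tells the reader what to substitute for it.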

So taking a bit of advice from a well recognized voice in this area (uh hum ;))

Haha ;)

For those following along on the mailing list, Lewis quoted from my blog post: 
http://martin.kleppmann.com/2012/12/05/schema-evolution-in-avro-protocol-buffers-thrift.html

Fortunately in the above example this particular Schema has only changed once 
in some 2 or 3 years. However it HAS changed.

It's probably safe to assume that every schema will have to change sooner or 
later.

Looks like I am also taking a lesson from this thread and we have a bit more 
work to do on Gora to address the above points. This is of course unless I have 
missed something!

A proposal to create a registry of Avro schemas has been a long time coming 
(https://issues.apache.org/jira/browse/AVRO-1124). This would allow you to 
include a small version number or hash of the schema in each record, to 
indicate the writer schema that was used to encode it. That would be much lower 
overhead than including the entire schema with every record.
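
As a rough sketch of that idea (not the AVRO-1124 design, whose API I am not 
assuming anything about), the following plain-Python example prefixes every 
encoded record with a short hash of its writer schema and keeps the full 
schemas in a lookup table. SchemaRegistry, encode and decode are hypothetical 
names; the hash is SHA-256 over key-sorted JSON, truncated to 8 bytes, which is 
only a stand-in for Avro's own schema fingerprints, and JSON stands in for the 
Avro binary encoding.

```python
import hashlib
import json

FINGERPRINT_SIZE = 8  # bytes of the hash stored with each record (an assumption)


class SchemaRegistry:
    """Hypothetical in-memory schema store; a real one would live in the database."""

    def __init__(self):
        self._schemas = {}

    def register(self, schema):
        # Serialise deterministically so that equal schemas hash identically.
        canonical = json.dumps(schema, sort_keys=True, separators=(",", ":"))
        digest = hashlib.sha256(canonical.encode("utf-8")).digest()
        fingerprint = digest[:FINGERPRINT_SIZE]
        self._schemas[fingerprint] = schema
        return fingerprint

    def lookup(self, fingerprint):
        return self._schemas[fingerprint]


def encode(registry, writer_schema, record):
    """Return the schema fingerprint followed by the record payload."""
    fingerprint = registry.register(writer_schema)
    payload = json.dumps(record).encode("utf-8")  # stand-in for Avro binary
    return fingerprint + payload


def decode(registry, data):
    """Recover the writer schema from the prefix, then the record itself."""
    fingerprint, payload = data[:FINGERPRINT_SIZE], data[FINGERPRINT_SIZE:]
    writer_schema = registry.lookup(fingerprint)
    return writer_schema, json.loads(payload.decode("utf-8"))


registry = SchemaRegistry()
schema_v1 = {"type": "record", "name": "WebPage",
             "fields": [{"name": "baseUrl", "type": "string"}]}
stored = encode(registry, schema_v1, {"baseUrl": "http://example.org/"})
writer_schema, record = decode(registry, stored)
print(len(stored) - len(json.dumps(record)), "bytes of per-record overhead")
```

The per-record cost is a fixed 8 bytes however large the schema grows, and the 
reader always learns exactly which writer schema to resolve against.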

As Gora is itself a database access layer, you can probably store the schemas 
in the same database as the records. If you go ahead and implement this, it 
would be great if you could keep compatibility with the AVRO-1124 schema 
registry in mind.

If Gora can hide the writer/reader schema distinction from users, and just do 
the right thing with schema evolution, that would be awesome!

Martin
