Schema evolution in Avro requires access to both the schema used when writing the data and the desired schema for reading the data.
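To make that concrete, here is a minimal sketch of schema resolution with both schemas: the reader is constructed with the writer schema *and* the reader schema, so the new defaulted field is filled in. (Schema/field names here are illustrative, loosely following the example below.)

```java
import java.io.ByteArrayOutputStream;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.generic.GenericRecordBuilder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.Decoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class ResolveDemo {
  public static void main(String[] args) throws Exception {
    // Writer schema: what the bytes were originally produced with.
    Schema writer = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"test\",\"namespace\":\"avro.test\",\"fields\":["
      + "{\"name\":\"project\",\"type\":\"string\"}]}");

    // Reader schema: same record plus a new field with a default value.
    Schema reader = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"test\",\"namespace\":\"avro.test\",\"fields\":["
      + "{\"name\":\"project\",\"type\":\"string\"},"
      + "{\"name\":\"newField\",\"type\":[\"null\",\"string\"],\"default\":null}]}");

    // Write a record with the writer schema only.
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    BinaryEncoder enc = EncoderFactory.get().binaryEncoder(out, null);
    GenericRecord rec = new GenericRecordBuilder(writer).set("project", "ff").build();
    new GenericDatumWriter<GenericRecord>(writer).write(rec, enc);
    enc.flush();

    // Read back: the two-argument constructor performs schema resolution,
    // so "newField" is populated from its default instead of hitting EOF.
    Decoder dec = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
    GenericRecord result =
        new GenericDatumReader<GenericRecord>(writer, reader).read(null, dec);
    System.out.println(result);
  }
}
```

Constructing the reader with only the (new) reader schema is what produces the EOFException in the question below: the decoder tries to read bytes for a field the writer never wrote.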
Normally, Avro data is stored in some container format (i.e. the one in the spec[1]) and the parsing library takes care of pulling the schema used when writing out of said container. If you are using Avro data in some other location, you must have the writer schema as well. One common use case is a shared messaging system focused on small messages (but that doesn't use Avro RPC). For such cases, Doug Cutting has some guidance he's previously given (quoted with permission, albeit very late):

> A best practice for things like this is to prefix each Avro record
> with a (small) numeric schema ID. This is used as the key for a
> shared database of schemas. The schema corresponding to a key never
> changes, so the database can be cached heavily. It never gets very
> big either. It could be as simple as a .java file, with the
> constraint that you'd need to upgrade things downstream before
> upstream, or as complicated as an enterprise-wide REST schema service
> (AVRO-1124). A variation is to use schema fingerprints as keys.
>
> Potentially relevant stuff:
>
> https://issues.apache.org/jira/browse/AVRO-1124
> http://avro.apache.org/docs/current/spec.html#Schema+Fingerprints

If you take the integer schema ID approach, you can use Avro's built-in utilities for zig-zag encoding, which will ensure that most of the time your identifier only takes a small amount of space.

[1]: http://avro.apache.org/docs/current/spec.html#Object+Container+Files

On Tue, Feb 3, 2015 at 5:57 AM, Burak Emre <[email protected]> wrote:
> I added a field with a default value to an Avro schema which was previously
> used for writing data. Is it possible to read the previous data using *only
> the new schema* which has that new field at the end?
>
> I tried this scenario but unfortunately it throws an EOFException while
> reading the third field.
> Even though it has a default value and the previous
> fields are read successfully, I'm not able to deserialize the record back
> without providing the writer schema I used previously.
>
>     Schema schema = Schema.createRecord("test", null, "avro.test", false);
>     schema.setFields(Lists.newArrayList(
>         new Field("project", Schema.create(Type.STRING), null, null),
>         new Field("city",
>             Schema.createUnion(Lists.newArrayList(Schema.create(Type.NULL),
>                 Schema.create(Type.STRING))), null, NullNode.getInstance())));
>
>     GenericData.Record record = new GenericRecordBuilder(schema)
>         .set("project", "ff").build();
>     GenericDatumWriter w = new GenericDatumWriter(schema);
>     ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
>     BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(outputStream, null);
>
>     w.write(record, encoder);
>     encoder.flush();
>
>     schema = Schema.createRecord("test", null, "avro.test", false);
>     schema.setFields(Lists.newArrayList(
>         new Field("project", Schema.create(Type.STRING), null, null),
>         new Field("city",
>             Schema.createUnion(Lists.newArrayList(Schema.create(Type.NULL),
>                 Schema.create(Type.STRING))), null, NullNode.getInstance()),
>         new Field("newField",
>             Schema.createUnion(Lists.newArrayList(Schema.create(Type.NULL),
>                 Schema.create(Type.STRING))), null, NullNode.getInstance())));
>
>     DatumReader<GenericRecord> reader = new GenericDatumReader<>(schema);
>     Decoder decoder = DecoderFactory.get().binaryDecoder(outputStream.toByteArray(), null);
>     GenericRecord result = reader.read(null, decoder);

--
Sean
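To make the schema-ID prefix idea quoted above concrete, here is a minimal sketch. The in-memory `registry` map and the ID value are illustrative stand-ins for a shared schema store; the point is that `BinaryEncoder.writeInt` already uses Avro's zig-zag varint encoding, so small IDs cost only a byte or two on the wire.

```java
import java.io.ByteArrayOutputStream;
import java.util.HashMap;
import java.util.Map;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.generic.GenericRecordBuilder;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class SchemaIdPrefixDemo {
  public static void main(String[] args) throws Exception {
    Schema v1 = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"test\",\"namespace\":\"avro.test\",\"fields\":["
      + "{\"name\":\"project\",\"type\":\"string\"}]}");

    // Stand-in for the shared schema database: ID -> schema.
    // The schema bound to an ID never changes, so this can be cached heavily.
    Map<Integer, Schema> registry = new HashMap<>();
    registry.put(1, v1);

    // Write: zig-zag varint schema ID first, then the record body.
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    BinaryEncoder enc = EncoderFactory.get().binaryEncoder(out, null);
    enc.writeInt(1);  // zig-zag varint: IDs under 64 fit in one byte
    new GenericDatumWriter<GenericRecord>(v1).write(
        new GenericRecordBuilder(v1).set("project", "ff").build(), enc);
    enc.flush();

    // Read: recover the ID, look up the writer schema, then decode the body.
    BinaryDecoder dec = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
    Schema writer = registry.get(dec.readInt());
    GenericRecord result = new GenericDatumReader<GenericRecord>(writer).read(null, dec);
    System.out.println(result);
  }
}
```

A reader that has evolved its schema would pass both schemas to `GenericDatumReader(writer, readerSchema)` after the registry lookup.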
