As an alternative to the standard AvroInput/OutputFormat, I've been experimenting with how to support alternate Avro file types such as Trevni[1], which are useful when we only want to retrieve a subset of the fields of an Avro object.
Picking one of the implementations (AvroTrevniKeyInputFormat/AvroTrevniKeyOutputFormat)[2], I implemented the various Source/Target/SourceTarget classes. When I started testing it (to see if I had done any of it right), I hit the issue that the AvroKeyConverter only produces AvroWrapper objects, while the output format requires AvroKey. So I get a ClassCastException in the CrunchOutputs.write(...) method:

Caused by: java.lang.ClassCastException: org.apache.avro.mapred.AvroWrapper cannot be cast to org.apache.avro.mapred.AvroKey
    at org.apache.trevni.avro.mapreduce.AvroTrevniKeyRecordWriter.write(AvroTrevniKeyRecordWriter.java:34)
    at org.apache.crunch.io.CrunchOutputs.write(CrunchOutputs.java:129)

I was hoping that the target would be able to take any PCollection<? extends AvroType>, but it looks like I'd need to implement my own PType and force consumers to use it, just to change the converter so that it produces AvroKey instead.

Is implementing a custom PType the only way to inject an alternate converter? That seems like a high cost on the implementation side, and it forces a restriction onto others in the pipeline who are generally happy with the standard AvroType and shouldn't be burdened with how the data might be stored later on in the processing.

Thoughts?

[1] - http://avro.apache.org/docs/current/trevni/spec.html
[2] - http://avro.apache.org/docs/current/api/java/org/apache/trevni/avro/mapreduce/AvroTrevniKeyOutputFormat.html
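To make the ClassCastException above concrete, here is a minimal sketch of the mismatch as I understand it. It does not use Avro, Trevni, or Crunch at all: Wrapper and Key are hypothetical stand-ins for AvroWrapper and AvroKey (the one fact borrowed from Avro is that AvroKey extends AvroWrapper), and write(...) is a stand-in for a record writer that expects the subclass. The toKey(...) helper is also hypothetical; it only illustrates what an alternate converter would have to emit.

```java
public class ConverterMismatchSketch {
    // Hypothetical, simplified stand-in for org.apache.avro.mapred.AvroWrapper.
    static class Wrapper<T> {
        private final T datum;
        Wrapper(T datum) { this.datum = datum; }
        T datum() { return datum; }
    }

    // Hypothetical stand-in for org.apache.avro.mapred.AvroKey, a subclass of the wrapper.
    static class Key<T> extends Wrapper<T> {
        Key(T datum) { super(datum); }
    }

    // Stand-in for a record writer that expects the subclass.
    // The cast fails at runtime when it is handed a plain Wrapper.
    @SuppressWarnings("unchecked")
    static String write(Object key) {
        Key<String> k = (Key<String>) key;
        return k.datum();
    }

    // Hypothetical "alternate converter": rewraps the datum in the subclass type.
    static <T> Key<T> toKey(Wrapper<T> wrapper) {
        return new Key<>(wrapper.datum());
    }

    // Returns true when the writer rejects the given object with a ClassCastException.
    static boolean writeFails(Object key) {
        try {
            write(key);
            return false;
        } catch (ClassCastException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        Wrapper<String> fromConverter = new Wrapper<>("record");
        System.out.println(writeFails(fromConverter));        // prints "true"
        System.out.println(writeFails(toKey(fromConverter))); // prints "false"
    }
}
```

Since a Key is still a Wrapper in this sketch, a converter that emitted the subclass would presumably keep working with writers that only need the wrapper type. Whether that holds for the real AvroKeyConverter and the standard Avro output format is an assumption I have not verified.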
