We are on a fairly old Avro (1.5.4), so I am not sure my observations apply to newer versions. I noticed that when I read from Avro files in Hadoop, Avro does not require the reader's schema (fully qualified) name to be equal to the writer's schema (fully qualified) name. This lets me read files without knowing what name the schema had when it was written. According to Doug Cutting this is a bug: the read should not succeed if the reader's and writer's schemas do not have the same name. Also, when the schema names are not the same, field aliases do not work.
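For context, my understanding of how this is supposed to work is that the reader schema can list the writer's full name in a record-level "aliases" attribute (and likewise per field for renamed columns); all the names below are made up, just to show the shape:

    {
      "type": "record",
      "name": "Click",
      "namespace": "com.mycompany.events",
      "aliases": ["com.teamx.ClickEvent", "com.teamy.WebClick"],
      "fields": [
        {"name": "url", "type": "string", "aliases": ["link"]},
        {"name": "ts", "type": "long"}
      ]
    }

The catch is exactly that I would have to know every writer's name up front to populate that list.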
OK, with that out of the way, here is my situation: we create lots of Avro files that we add to large partitioned tables (a structure with subdirectories on HDFS). The people who write the files understand the importance of canonical column names (field names), but not everyone gets the idea of schema names, so in general I end up with Avro files carrying many different (writer's) schema names. I do not expect I can correct this. It is also not unusual to run a Hadoop map-reduce job that reads from many different data sources at once, using Avro's fantastic projection ability to extract just a few columns; in that case, again, the (writer's) schema names are not expected to be the same across the Avro files I am reading from.

So today all of this works, meaning I can run map-reduce jobs across all these files with different/inconsistent schema names, but only thanks to a bug, which makes me nervous that one day it will stop working. Also, field aliases do not work, which is a real limitation. So I am trying to see if I can come up with a better solution. Of course I could go find out every time what all the schema names in the Avro files are and add them all as aliases to my reader's schema, but that is a real pain, in particular since the set is not constant. I guess I could automate this by scanning all the Avro files first and extracting their schemas (I sketched what I mean at the bottom of this post), but that sounds very inelegant, so I would rather not do it.

So I have two questions:

1) Can I reasonably assume that processing in Hadoop will continue to work even if the reader's and writer's schema names are not the same (i.e., rely on this bug)? The fact that field aliases do not work in this case is too bad, but at least I would have something working...

2) Is there a better solution? For example, something where I could say in my reader's schema that the schema has an alias of * (wildcard), so that I can read all these files with different (writer's) schema names without relying on a bug, and on top of that field aliases would also work. That would be fantastic...
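In case it helps to be concrete, the automation I am reluctant to build would look roughly like this: a pre-scan pass over the input files that pulls each writer's schema out of the file header and registers its full name as an alias on my reader's schema. This is an untested sketch; the class name and the directory handling are made up, and I am assuming Schema.addAlias accepts a fully qualified name:

    import java.io.IOException;
    import java.util.List;

    import org.apache.avro.Schema;
    import org.apache.avro.file.DataFileReader;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.mapred.FsInput;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class WriterAliasScanner {
      // pre-scan pass: read the writer's schema out of each .avro file header and
      // register its full name as a record-level alias on the reader schema, so
      // that name resolution (and field aliases) work without relying on the bug
      public static Schema addWriterAliases(Schema readerSchema, List<Path> inputDirs,
                                            Configuration conf) throws IOException {
        for (Path dir : inputDirs) {
          FileSystem fs = dir.getFileSystem(conf);
          for (FileStatus stat : fs.listStatus(dir)) {
            if (!stat.getPath().getName().endsWith(".avro")) continue;
            DataFileReader<GenericRecord> in = new DataFileReader<GenericRecord>(
                new FsInput(stat.getPath(), conf), new GenericDatumReader<GenericRecord>());
            try {
              Schema writerSchema = in.getSchema();  // writer's schema from the file metadata
              if (!writerSchema.getFullName().equals(readerSchema.getFullName())) {
                readerSchema.addAlias(writerSchema.getFullName());
              }
            } finally {
              in.close();
            }
          }
        }
        return readerSchema;
      }
    }

It would probably work, but it is more moving parts than I would like, which is why I am hoping there is something like a wildcard alias instead.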
