Are the multiple schemas a series of schema evolutions? That is, is there an obvious 'reader' schema, or are they disjoint? If this represents schema evolution, it should be possible (though this may hit a current bug or limitation) to set the reader schema to the most recent version and resolve all files against that schema.
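To make the idea concrete, here is a simplified sketch in plain Python of what record resolution does -- this is an illustration only, not Avro's actual implementation, and the field names and defaults are made up. Real Avro resolution also checks types, aliases, and numeric promotions:

```python
# Simplified sketch of Avro's schema-resolution rules for records:
# fields present only in the reader schema take their declared default,
# fields present only in the writer schema are dropped.

def resolve_record(datum, writer_fields, reader_fields):
    """datum: a dict decoded with the writer schema.
    writer_fields / reader_fields: {field_name: default} mappings."""
    resolved = {}
    for name, default in reader_fields.items():
        if name in writer_fields:
            resolved[name] = datum[name]   # present in both: keep the value
        else:
            resolved[name] = default       # reader-only field: use its default
    # writer-only fields are simply ignored
    return resolved

# The old (writer) schema had only "id"; the newer (reader) schema
# added "score" with a default of 0.
old_datum = {"id": 7}
print(resolve_record(old_datum,
                     {"id": None},
                     {"id": None, "score": 0}))
# -> {'id': 7, 'score': 0}
```

This is why the most recent schema works as the reader for the whole file set, provided each evolution step added fields with defaults (or only removed fields).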
I currently run M/R jobs (not using Avro's mapreduce package -- it's a custom Pig reader) over sets of Avro data files whose schema has evolved over time -- at least two dozen variants. The reader uses the most recent version, and we have been careful to evolve our schema in a way that maintains compatibility.

On 5/11/11 11:44 AM, "Markus Weimer" <[email protected]> wrote:

>Hi,
>
>I'd like to write a mapreduce job that uses Avro throughout, but the map
>phase would need to read files with two different schemas, similar to
>what the MultipleInputFormat does in stock Hadoop. Is this a supported
>use case?
>
>A work-around would be to create a union schema that has both fields as
>optional and to convert all data into it, but that seems clumsy.
>
>Has anyone done this before?
>
>Thanks for any suggestion you can give,
>
>Markus
>
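For reference, the union work-around mentioned in the quoted message could be expressed as a single record schema in which the fields specific to each input are nullable with a default, so records from either source fit (record and field names here are hypothetical):

```json
{
  "type": "record",
  "name": "UnifiedRecord",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "fieldFromSchemaA", "type": ["null", "string"], "default": null},
    {"name": "fieldFromSchemaB", "type": ["null", "int"], "default": null}
  ]
}
```

It is clumsy, as noted, but it keeps the job down to one input schema, and the nullable fields resolve cleanly against either original writer schema.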
