On 03/17/2011 08:13 PM, 幻 wrote:
> Currently, I have two Avro files with different schemas. I found that
> I have to set the schema before running an M/R job if the files are in
> Avro format. But the schemas of the files are probably not the same. How
> can I do that without setting the schema before running a job? Thanks.

The schema you set for the job is the reader's schema. The schema in the input files is the writer's schema and need not match it exactly. The data is projected to the reader's schema, as described in the specification, particularly in the "Schema Resolution" section:

http://avro.apache.org/docs/current/spec.html#Schema+Resolution

The aliases section is also relevant:

http://avro.apache.org/docs/current/spec.html#Aliases

This can be used to extract fields from different schemas into a common data structure. For example, if your input files use the following two schemas:

  {"type":"record", "name":"a.A",
   "fields":[{"name":"foo", "type":"int"}]}

  {"type":"record", "name":"b.B",
   "fields":[{"name":"bar", "type":"int"}]}

then the following record can read both:

  {"type":"record", "name":"my.MapInput", "aliases":["a.A","b.B"],
   "fields":[{"name":"x", "type":"int", "aliases":["foo","bar"]}]
  }

The reader's schema can thus include a common subset of the fields in the inputs. It can map fields of compatible types that are named differently to a common field. It can include fields that are not in all inputs, so long as they have a default value in the reader's schema. It could even include all data from all inputs, e.g., in the above case:

  {"type":"record", "name":"my.MapInput", "aliases":["a.A","b.B"],
   "fields":[
     {"name":"foo", "type":"int", "default": -1},
     {"name":"bar", "type":"int", "default": -1}
   ]
  }

So there's a fair amount of flexibility available.

Doug
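
P.S. To make the resolution rules above concrete, here is a rough Python sketch of how a record written with one schema could be projected onto a reader's schema using aliases and defaults. It is a toy model of the rule, not the Avro library: the function name resolve_record is hypothetical, record names are compared as plain strings, and it assumes flat records with no type promotion, unions, or nested types.

```python
def resolve_record(writer_schema, reader_schema, datum):
    """Project `datum` (a dict written with writer_schema) onto reader_schema.

    Simplified illustration only: flat records, name/alias matching,
    and reader-side defaults. No type checking or promotion is done.
    """
    # The writer's record name must equal the reader's name or one of its aliases.
    accepted = {reader_schema["name"], *reader_schema.get("aliases", [])}
    if writer_schema["name"] not in accepted:
        raise ValueError("record name %r does not match reader schema"
                         % writer_schema["name"])

    writer_fields = {f["name"] for f in writer_schema["fields"]}
    result = {}
    for field in reader_schema["fields"]:
        # A reader field matches a writer field by its own name or by an alias.
        candidates = [field["name"], *field.get("aliases", [])]
        match = next((n for n in candidates if n in writer_fields), None)
        if match is not None:
            result[field["name"]] = datum[match]
        elif "default" in field:
            # Field missing from this input: fall back to the reader's default.
            result[field["name"]] = field["default"]
        else:
            raise ValueError("no value or default for field %r" % field["name"])
    # Writer fields that the reader does not name are simply ignored.
    return result


# The two writer schemas from the example above.
SCHEMA_A = {"type": "record", "name": "a.A",
            "fields": [{"name": "foo", "type": "int"}]}
SCHEMA_B = {"type": "record", "name": "b.B",
            "fields": [{"name": "bar", "type": "int"}]}

# Reader schema that maps foo and bar onto a common field x.
READER_COMMON = {"type": "record", "name": "my.MapInput",
                 "aliases": ["a.A", "b.B"],
                 "fields": [{"name": "x", "type": "int",
                             "aliases": ["foo", "bar"]}]}

# Reader schema that keeps both fields, with defaults for the missing one.
READER_ALL = {"type": "record", "name": "my.MapInput",
              "aliases": ["a.A", "b.B"],
              "fields": [{"name": "foo", "type": "int", "default": -1},
                         {"name": "bar", "type": "int", "default": -1}]}

if __name__ == "__main__":
    print(resolve_record(SCHEMA_A, READER_COMMON, {"foo": 1}))  # {'x': 1}
    print(resolve_record(SCHEMA_B, READER_COMMON, {"bar": 2}))  # {'x': 2}
    print(resolve_record(SCHEMA_A, READER_ALL, {"foo": 1}))     # {'foo': 1, 'bar': -1}
    print(resolve_record(SCHEMA_B, READER_ALL, {"bar": 2}))     # {'foo': -1, 'bar': 2}
```

With the first reader schema, both inputs arrive in the mapper as a record with a single field x. With the second, each input keeps its own field and the other one takes the default of -1.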
