Re: Hi,all. How can I involve two avro files with different schema into one M/R job?

Doug Cutting Fri, 18 Mar 2011 09:51:34 -0700

On 03/17/2011 08:13 PM, 幻 wrote:
>      Currently,I have two avro files with different schema. I found that
> I have to set the schema before running a M/R job if the files are in
> avro format.But the schema of the files are probably not the same.How
> can I do that without setting the schema before running a job? Thanks.


The schema you set for the job is the reader's schema.  The schema in
the input files is the writer's schema and need not match it exactly.
Data is projected from the writer's schema to the reader's schema, as
described in the specification, particularly in the "Schema Resolution"
section.

http://avro.apache.org/docs/current/spec.html#Schema+Resolution

The aliases section is also relevant:

http://avro.apache.org/docs/current/spec.html#Aliases

This can be used to extract fields from different schemas into a common
data structure.  For example, if your input files use the following two
schemas:

{"type":"record", "name":"a.A", "fields":[{"name":"foo", "type":"int"}]}
{"type":"record", "name":"b.B", "fields":[{"name":"bar", "type":"int"}]}

then the following record can read both:

{"type":"record", "name":"my.MapInput",
 "aliases":["a.A","b.B"],
 "fields":[{"name":"x", "type":"int", "aliases":["foo","bar"]}]
}
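
To make the alias matching above concrete, here is a minimal sketch in
plain Python.  It is a toy model, not the Avro library: the function
name project and the dict-based records are assumptions made only for
illustration, and real resolution also checks record names and types.

```python
# Toy model of alias-based field matching; this is NOT the Avro library.
# The reader's field "x" accepts writer fields named "foo" or "bar".
READER_FIELDS = [{"name": "x", "type": "int", "aliases": ["foo", "bar"]}]


def project(record, reader_fields):
    """Map a written record (a dict) onto the reader's field names."""
    out = {}
    for field in reader_fields:
        # A writer field matches on the reader's name or any of its aliases.
        candidates = [field["name"]] + field.get("aliases", [])
        for name in candidates:
            if name in record:
                out[field["name"]] = record[name]
                break
        else:
            # No matching writer field: the reader must supply a default.
            if "default" not in field:
                raise ValueError("no value or default for " + field["name"])
            out[field["name"]] = field["default"]
    return out


from_a = project({"foo": 1}, READER_FIELDS)  # record written with a.A
from_b = project({"bar": 2}, READER_FIELDS)  # record written with b.B
```

Both calls yield a record with the single field x, which is the shape
the map function would see for either input file.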

The reader's schema can thus include a common subset of the fields in
the inputs.  It can map fields of compatible types that are named
differently to a common field.  It can include fields that are not in
all inputs, so long as they have a default value in the reader's schema.
It could even include all data from all inputs, e.g., in the above case:

{"type":"record", "name":"my.MapInput",
 "aliases":["a.A","b.B"],
 "fields":[
   {"name":"foo", "type":"int", "default": -1},
   {"name":"bar", "type":"int", "default": -1}
  ]
}
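
The effect of the defaults in this schema can be sketched in a few
lines of plain Python.  Again this is a toy model rather than the Avro
library; READER_DEFAULTS and resolve are hypothetical names, and type
checking is left out.

```python
# Toy model of default filling; this is NOT the Avro library.
# Reader field name -> default value, taken from the schema above.
READER_DEFAULTS = {"foo": -1, "bar": -1}


def resolve(written):
    """Keep reader fields present in the written record, default the rest.

    Fields in the written record that the reader does not declare are
    dropped, since only the reader's fields are iterated.
    """
    return {
        name: written.get(name, default)
        for name, default in READER_DEFAULTS.items()
    }


from_a = resolve({"foo": 1})  # a.A has no "bar", so it gets -1
from_b = resolve({"bar": 2})  # b.B has no "foo", so it gets -1
```

A map function could then, as one possible convention, treat -1 as
"field absent in this input", provided -1 is never a real value.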

So there's a fair amount of flexibility available.

Doug
