Re: MapReduce: Using Avro Input/Output Formats without Specifying a schema

Ryan Tabora Wed, 30 Apr 2014 06:47:57 -0700

Thanks Rao, I understand how I could do it if I had a single schema across all 
input data. However, my question is about the case where the input data varies 
and one input can have a different schema from another.


My idea would be to use something like MultipleOutputs or partitioning to split 
up the output data by unique schema. 
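
As a rough sketch of the first step of that idea, the input files could be 
grouped by their header line, so that each group shares one schema and can be 
handled by its own job or output. The class name, method names, and file names 
below are made up for illustration, and the header split is naive (no quoted 
commas, case-sensitive column names).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: groups CSV files by their (normalized) header line.
public class SchemaGrouping {

    // Trims each column name so "id, name" and "id,name" land in the same group.
    public static String schemaKey(String headerLine) {
        List<String> cleaned = new ArrayList<>();
        for (String column : headerLine.split(",")) {
            cleaned.add(column.trim());
        }
        return String.join(",", cleaned);
    }

    // Maps each distinct header to the files that use it, in insertion order.
    public static Map<String, List<String>> groupByHeader(Map<String, String> headerByFile) {
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : headerByFile.entrySet()) {
            groups.computeIfAbsent(schemaKey(entry.getValue()), k -> new ArrayList<>())
                  .add(entry.getKey());
        }
        return groups;
    }

    public static void main(String[] args) {
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("a.csv", "id,name");     // assumed example files
        headers.put("b.csv", "id, name");
        headers.put("c.csv", "sku,price");
        System.out.println(groupByHeader(headers));
    }
}
```

Each resulting group then has a single schema, which brings it back to the 
single-schema case discussed below.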

I guess the question still stands: does anyone have any recommendations for 
dynamically generating the schema when using the Avro output formats?

Thanks,
Ryan Tabora
http://ryantabora.com

On April 29, 2014 at 11:41:51 PM, Fengyun RAO ([email protected]) wrote:

Take MapReduce for example, which requires a Runner, a Mapper, and a Reducer.

The Mapper must output a single type (or a single Avro schema). 

If you have a set of CSV files with different schemas, what output type would 
you expect?

If all the CSV files share the same schema, you could dynamically create the 
schema in the Runner before submitting the MR job.
If you look into Schema.java, you will find the create(), createRecord(), etc. 
APIs.
You could simply read the header of one CSV file and create the schema using 
these APIs.
e.g. 
    AvroJob.setMapOutputKeySchema(job, Schema.create(Schema.Type.STRING));
sets the map output key schema to a primitive string schema (a bare string, 
not a record with a string field).
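
For a CSV header with several columns, a record schema is needed instead. A 
minimal sketch, assuming every column is treated as a string and the header 
has no quoted commas or duplicate column names: build the record schema as 
JSON from the header line, then hand the result to 
new Schema.Parser().parse(json) in the Runner. The class and method names here 
are made up for illustration, and the block uses no Avro classes itself.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: turns a CSV header line into Avro record-schema JSON.
public class CsvHeaderSchema {

    // Avro names must match [A-Za-z_][A-Za-z0-9_]*, so replace anything else.
    public static String sanitize(String column) {
        String name = column.trim().replaceAll("[^A-Za-z0-9_]", "_");
        if (name.isEmpty() || Character.isDigit(name.charAt(0))) {
            name = "_" + name;
        }
        return name;
    }

    // One string-typed field per column; type inference is left out on purpose.
    public static String schemaJson(String recordName, String headerLine) {
        List<String> fields = new ArrayList<>();
        for (String column : headerLine.split(",")) {
            fields.add("{\"name\":\"" + sanitize(column) + "\",\"type\":\"string\"}");
        }
        return "{\"type\":\"record\",\"name\":\"" + sanitize(recordName)
                + "\",\"fields\":[" + String.join(",", fields) + "]}";
    }

    public static void main(String[] args) {
        // Assumed example header; in the Runner this would be read from one CSV file.
        System.out.println(schemaJson("Customer", "id,first name,email"));
    }
}
```

The parsed Schema can then be passed to the AvroJob setters in the same way as 
the string schema above.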



2014-04-30 4:56 GMT+08:00 Ryan Tabora <[email protected]>:
Hi all,

Whether you’re using Hive or MapReduce, the Avro input/output formats require 
you to specify a schema at the beginning of the job or in the table definition 
in order to work with them. Is there any way to configure the jobs so that 
the input/output formats can dynamically determine the schema from the data 
itself?

Think about a job like this. I have a set of CSV files that I want to serialize 
into Avro files. These CSV files are self-describing and each CSV file has a 
unique schema. If I want to write a job that scans over all of this data and 
serializes it into Avro, I can’t do that with today’s tools (as far as I know). 
If I can’t specify the schema up front, what can I do? Am I forced to write my 
own Avro input/output formats?

The Avro schema is stored within the Avro data file itself, so why can’t these 
input/output formats be smart enough to figure that out? Am I fundamentally 
doing something against the principles of the Avro format? I would be surprised 
if no one has run into this issue before.

Regards,
Ryan Tabora
