Hi all,

Whether you’re using Hive or MapReduce, the Avro input/output formats require
you to specify a schema up front, either in the job configuration or in the
table definition, before you can work with them. Is there any way to configure
a job so that the input/output formats determine the schema dynamically from
the data itself?
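
For concreteness, here is roughly what the wiring looks like today with the
newer org.apache.avro.mapreduce API (a minimal sketch; the job name and the
schemaJson parameter are just placeholders):

    import org.apache.avro.Schema;
    import org.apache.avro.mapreduce.AvroJob;
    import org.apache.avro.mapreduce.AvroKeyOutputFormat;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class StaticSchemaJob {
        // The output schema has to be fixed here, before submission;
        // AvroJob just stores it in the job configuration for the
        // output format to read back.
        public static Job configure(Configuration conf, String schemaJson)
                throws Exception {
            Job job = Job.getInstance(conf, "csv-to-avro");
            Schema schema = new Schema.Parser().parse(schemaJson);
            AvroJob.setOutputKeySchema(job, schema);
            job.setOutputFormatClass(AvroKeyOutputFormat.class);
            return job;
        }
    }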

Consider a job like this: I have a set of CSV files that I want to serialize
into Avro files. These CSV files are self-describing, and each one has a
unique schema. If I want to write a job that scans over all of this data and
serializes it into Avro, I can’t do that with today’s tools (as far as I
know). If I can’t specify the schema up front, what can I do? Am I forced to
write my own Avro input/output formats?
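
The closest workaround I can think of is to derive a schema per file before
the job runs, something like this rough sketch (fromHeader is a hypothetical
helper; it assumes the first line holds the column names and treats every
column as a string):

    import org.apache.avro.Schema;
    import org.apache.avro.SchemaBuilder;

    public class CsvSchemas {
        // Hypothetical helper: build a record schema from a CSV header
        // line, treating every column as a string. Real column names
        // would need sanitizing into valid Avro field names.
        public static Schema fromHeader(String headerLine) {
            SchemaBuilder.FieldAssembler<Schema> fields =
                SchemaBuilder.record("CsvRow").namespace("example").fields();
            for (String column : headerLine.split(",")) {
                fields = fields.name(column.trim())
                               .type().stringType().noDefault();
            }
            return fields.endRecord();
        }
    }

But even then I would have to run one job per schema, since the output format
only takes a single schema from the job configuration.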

The Avro schema is stored within the Avro data file itself, so why can’t
these input/output formats be smart enough to figure it out? Am I
fundamentally going against the principles of the Avro format? I would be
surprised if no one has run into this issue before.
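
After all, pulling the schema out of an existing data file only takes a
couple of lines (a sketch using DataFileReader; error handling omitted):

    import java.io.File;
    import org.apache.avro.Schema;
    import org.apache.avro.file.DataFileReader;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericRecord;

    public class EmbeddedSchema {
        // The schema comes straight out of the file header; no job
        // configuration or table definition is consulted at all.
        public static Schema read(File avroFile) throws Exception {
            try (DataFileReader<GenericRecord> reader = new DataFileReader<>(
                    avroFile, new GenericDatumReader<GenericRecord>())) {
                return reader.getSchema();
            }
        }
    }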

Regards,
Ryan Tabora
