Wow, not sure how I missed this. Thank you! :)

Regards,
Ryan Tabora
http://ryantabora.com
On Wed, Apr 30, 2014 at 9:41 PM, Fengyun RAO <[email protected]> wrote:

> We also used AvroMultipleOutputs to deal with multiple schemas.
>
> The problem stands the same: you have to set a single mapper output
> type (or schema) before submitting the MR job. Since there are
> multiple schemas, we used Schema.createUnion(List<Schema> types) as
> the mapper output schema.
>
> You could write a method that generates the list of schemas from the
> input data before submitting the MR job.
>
> 2014-04-30 21:46 GMT+08:00, Ryan Tabora <[email protected]>:
>
> > Thanks Rao, I understand how I could do it if I had a single schema
> > across all input data. However, my question is about the case where
> > my input data varies and one input can have a different schema from
> > another.
> >
> > My idea would be to use something like MultipleOutputs or
> > partitioning to split up the output data by unique schema.
> >
> > I guess the question still stands: does anyone have any
> > recommendations for dynamically generating the schema using Avro
> > output formats?
> >
> > Thanks,
> > Ryan Tabora
> > http://ryantabora.com
> >
> > On April 29, 2014 at 11:41:51 PM, Fengyun RAO ([email protected]) wrote:
> >
> > Take MapReduce for example, which requires a Runner, a Mapper, and a
> > Reducer. The Mapper must output a single type (or a single Avro
> > schema).
> >
> > If you have a set of CSV files with different schemas, what output
> > type would you expect?
> >
> > If all the CSV files share the same schema, you could dynamically
> > create the schema in the Runner before submitting the MR job.
> > If you look into Schema.java, you will find the create(),
> > createRecord(), etc. APIs. You could simply read one CSV file's
> > header and create the schema using these APIs. E.g.,
> >
> > AvroJob.setMapOutputKeySchema(job, Schema.create(Schema.Type.STRING));
> >
> > creates a plain string schema as the map output key schema.
> >
> > 2014-04-30 4:56 GMT+08:00 Ryan Tabora <[email protected]>:
> >
> > Hi all,
> >
> > Whether you're using Hive or MapReduce, Avro input/output formats
> > require you to specify a schema at the beginning of the job or in
> > the table definition in order to work with them. Is there any way to
> > configure the jobs so that the input/output formats can dynamically
> > determine the schema from the data itself?
> >
> > Think about a job like this: I have a set of CSV files that I want
> > to serialize into Avro files. These CSV files are self-describing,
> > and each CSV file has a unique schema. If I want to write a job that
> > scans over all of this data and serializes it into Avro, I can't do
> > that with today's tools (as far as I know). If I can't specify the
> > schema up front, what can I do? Am I forced to write my own Avro
> > input/output formats?
> >
> > The Avro schema is stored within the Avro data file itself, so why
> > can't these input/output formats be smart enough to figure that out?
> > Am I fundamentally doing something against the principles of the
> > Avro format? I would be surprised if no one has run into this issue
> > before.
> >
> > Regards,
> > Ryan Tabora
>
> --
> ----------------------------------------------------------------
> RAO Fengyun
> Center for Astrophysics, Tsinghua University
> Tel: +86 13810626496
> Email: [email protected]
> [email protected]
> ----------------------------------------------------------------
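
A minimal sketch of the schema-from-header idea Rao describes above,
assuming every column is typed as string and the header can be split
naively on commas. The CsvSchemas class name and the namespace are
illustrative, not from the thread; the builder calls are a convenience
layer over the Schema.createRecord()/Schema.Field APIs Rao mentions.

    import org.apache.avro.Schema;
    import org.apache.avro.SchemaBuilder;

    public class CsvSchemas {

        /**
         * Builds a record schema whose fields are the CSV header's column
         * names, all typed as string. Column names must be valid Avro
         * names ([A-Za-z_][A-Za-z0-9_]*); real input may need sanitizing.
         */
        public static Schema fromHeader(String recordName, String headerLine) {
            SchemaBuilder.FieldAssembler<Schema> fields =
                    SchemaBuilder.record(recordName).namespace("example.csv").fields();
            for (String column : headerLine.split(",")) {
                // Each CSV column becomes a required string field.
                fields = fields.requiredString(column.trim());
            }
            return fields.endRecord();
        }
    }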
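
And a sketch of the union-schema job setup from Rao's later reply, using
the org.apache.avro.mapreduce API. The record names, the header strings,
and the CsvSchemas.fromHeader() helper (defined above) are assumptions for
illustration; mapper/reducer classes and paths are omitted.

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.avro.Schema;
    import org.apache.avro.mapreduce.AvroJob;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class UnionSchemaDriver {
        public static void main(String[] args) throws Exception {
            // One schema per distinct CSV layout, derived before job
            // submission, e.g. by reading each file's header line.
            List<Schema> perInputSchemas = new ArrayList<Schema>();
            perInputSchemas.add(CsvSchemas.fromHeader("LogRecord", "host,time,status"));
            perInputSchemas.add(CsvSchemas.fromHeader("UserRecord", "id,name,email"));

            // A union schema lets the single mapper output type cover
            // every record shape in the input.
            Schema union = Schema.createUnion(perInputSchemas);

            Job job = Job.getInstance(new Configuration(), "csv-to-avro");
            AvroJob.setMapOutputValueSchema(job, union);
            // Set mapper/reducer classes and input/output paths here,
            // then submit the job.
        }
    }

On the output side, AvroMultipleOutputs.addNamedOutput() can register one
named output per schema, so records end up in per-schema files, which
matches the MultipleOutputs idea Ryan raises above.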
