Re: how to specify MultipleOutputs, MultipleInputs in using Avro mapred API

Doug Cutting Wed, 18 Aug 2010 10:38:03 -0700

On 08/18/2010 10:18 AM, ey-chih chow wrote:

Thanks. But done this way, what advantage do we get from Avro?

The Avro MapReduce API is easiest to use when both inputs and outputs are Avro data.

If inputs are not Avro data, but you want to use the rest of the Avro MR API, then you'd need to write an InputFormat that produces an AvroWrapper<T> where T is a type that Avro can serialize.
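As a rough illustration of such an InputFormat, here is a minimal sketch that reads plain text files and hands each line to the job as an AvroWrapper<Utf8>. The class name TextAsAvroInputFormat is hypothetical, and the sketch assumes the old org.apache.hadoop.mapred API, uncompressed text input, and that wrapping each line as an Avro string is good enough for your data. It is not something that ships with Avro under this name.

```java
import java.io.IOException;

import org.apache.avro.mapred.AvroWrapper;
import org.apache.avro.util.Utf8;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.LineRecordReader;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

/**
 * Hypothetical InputFormat: presents each line of a text file as an
 * AvroWrapper<Utf8> key with a NullWritable value.
 * Assumes uncompressed input, since FileInputFormat splits files by default.
 */
public class TextAsAvroInputFormat
    extends FileInputFormat<AvroWrapper<Utf8>, NullWritable> {

  @Override
  public RecordReader<AvroWrapper<Utf8>, NullWritable> getRecordReader(
      InputSplit split, JobConf job, Reporter reporter) throws IOException {

    // Delegate the actual line splitting to Hadoop's own reader.
    final LineRecordReader lines = new LineRecordReader(job, (FileSplit) split);

    return new RecordReader<AvroWrapper<Utf8>, NullWritable>() {
      private final LongWritable offset = lines.createKey();
      private final Text line = lines.createValue();

      @Override
      public boolean next(AvroWrapper<Utf8> key, NullWritable value)
          throws IOException {
        if (!lines.next(offset, line)) {
          return false; // end of this split
        }
        // Wrap the line as an Avro string datum.
        key.datum(new Utf8(line.toString()));
        return true;
      }

      @Override
      public AvroWrapper<Utf8> createKey() {
        return new AvroWrapper<Utf8>();
      }

      @Override
      public NullWritable createValue() {
        return NullWritable.get();
      }

      @Override
      public long getPos() throws IOException {
        return lines.getPos();
      }

      @Override
      public float getProgress() throws IOException {
        return lines.getProgress();
      }

      @Override
      public void close() throws IOException {
        lines.close();
      }
    };
  }
}
```

With something like this set as the job's input format, a mapper could extend AvroMapper<Utf8, OUT> and the input schema would be Avro's string schema. Inputs with more structure would need a T (and a schema) that matches them.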

Another alternative might be to first convert your inputs to Avro data files. For example, one can use Avro's 'fromtext' tool to convert line-oriented files into equivalent compressed, splittable Avro data files. This could be done as log files are loaded into HDFS, since this tool accepts Hadoop paths as output.


We hope to add more tools for this kind of conversion and ingest, e.g.:

https://issues.apache.org/jira/browse/AVRO-458

We also expect that systems like Flume will produce Avro data files.

Doug
