On 08/18/2010 10:18 AM, ey-chih chow wrote:
> Thanks. But what advantage do we get from Avro by doing it this way?
The Avro MapReduce API is easiest to use when both inputs and outputs are Avro data.
If inputs are not Avro data, but you want to use the rest of the Avro MR API, then you'd need to write an InputFormat that produces an AvroWrapper<T> where T is a type that Avro can serialize.
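To make the shape of such an adapter concrete, here is a minimal, dependency-free sketch of the record-reading half. It is not the Avro or Hadoop API: DatumWrapper is a hypothetical stand-in for org.apache.avro.mapred.AvroWrapper<T>, and LineToWrapperReader is a hypothetical class that only imitates the fill-a-reusable-key style of a Hadoop record reader. In a real job the same logic would sit inside an InputFormat's record reader and use the real Avro and Hadoop types.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical stand-in for org.apache.avro.mapred.AvroWrapper<T>:
// a mutable holder for a single datum.
class DatumWrapper<T> {
    private T datum;

    T datum() { return datum; }

    void datum(T datum) { this.datum = datum; }
}

// Hypothetical adapter: turns line-oriented text into wrapped records.
// Each call to next() refills the caller's reusable wrapper.
class LineToWrapperReader implements AutoCloseable {
    private final BufferedReader in;

    LineToWrapperReader(BufferedReader in) {
        this.in = in;
    }

    // Stores the next line in the wrapper; returns false at end of input,
    // leaving the wrapper unchanged.
    boolean next(DatumWrapper<String> key) {
        try {
            String line = in.readLine();
            if (line == null) {
                return false;
            }
            key.datum(line);
            return true;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public void close() {
        try {
            in.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A caller would create one DatumWrapper<String>, then loop on next(wrapper) and read wrapper.datum() until next returns false. Here T is String for simplicity; with Avro it would be whatever type Avro can serialize for your records.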
An alternative might be to first convert your inputs to Avro data files. For example, Avro's 'fromtext' tool converts line-oriented files into equivalent compressed, splittable Avro data files. This could be done as log files are loaded into HDFS, since the tool accepts Hadoop paths as output.
We hope to add more tools for this kind of conversion and ingest, e.g.:
https://issues.apache.org/jira/browse/AVRO-458
We also expect that systems like Flume will produce Avro data files.
Doug
