Hi Leo,
If you want everything to be vanilla Hadoop MapReduce and just want your output
to be an Avro-readable file, then I don't think the standard Avro MR support
covers that for you.
What you would need to do is:
- Set your job's output format to AvroOutputFormat.class
- Set "avro.output.schema" to the output schema that you want to use (the JSON
representation). This must be a Pair schema.
- Optionally set "avro.output.codec" to enable compression.
- Create a reducer for your job like this (this is new API style):
class MyReducer extends Reducer<K, V, AvroWrapper<OUT>, NullWritable> {
    // …implementation
}
The K and V would be the map output key and value types. OUT typically is
something like Pair<MyKey, MyValue> where MyKey and MyValue are classes
generated by Avro. This would write an Avro file that you can use as input again
for a subsequent job using AvroInputFormat.
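To make the two configuration keys above a bit more concrete, here is a minimal sketch in plain Java (no Hadoop or Avro on the classpath). AvroOutputSettingsSketch and both of its methods are hypothetical helpers made up for illustration. The record name and namespace in the schema JSON are my assumption of what the Pair class in org.apache.avro.mapred produces, and the "string"/"long" key and value types and the "deflate" codec name are just placeholders. In a real job you would more likely let Avro build the schema, e.g. with Pair.getPairSchema(keySchema, valueSchema).toString().

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Avro or Hadoop: it only assembles the
// two configuration entries named in the list above as plain strings.
class AvroOutputSettingsSketch {

    // Builds the JSON of a Pair schema from the JSON of the key and value schemas.
    // Assumption: a record named "Pair" in namespace "org.apache.avro.mapred"
    // with the fields "key" and "value".
    static String pairSchemaJson(String keySchemaJson, String valueSchemaJson) {
        return "{\"type\":\"record\",\"name\":\"Pair\","
             + "\"namespace\":\"org.apache.avro.mapred\","
             + "\"fields\":["
             + "{\"name\":\"key\",\"type\":" + keySchemaJson + "},"
             + "{\"name\":\"value\",\"type\":" + valueSchemaJson + "}]}";
    }

    // Collects the settings; the codec entry is optional, so null leaves it out.
    static Map<String, String> outputSettings(String pairSchemaJson, String codecOrNull) {
        Map<String, String> settings = new LinkedHashMap<>();
        settings.put("avro.output.schema", pairSchemaJson);
        if (codecOrNull != null) {
            settings.put("avro.output.codec", codecOrNull);
        }
        return settings;
    }

    public static void main(String[] args) {
        // Placeholder types: a string key and a long value.
        String schema = pairSchemaJson("\"string\"", "\"long\"");
        for (Map.Entry<String, String> e : outputSettings(schema, "deflate").entrySet()) {
            System.out.println(e.getKey() + " = " + e.getValue());
        }
    }
}
```

Each entry would then be copied into the job configuration, for example with conf.set(key, value).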
For what you are trying to achieve, you could probably draw some inspiration
from the implementation of the Avro mapred support. Have a look at the code for
AvroJob, HadoopMapper and HadoopReducer in org.apache.avro.mapred, as they form
the actual bridge between Avro and Hadoop. Source is browsable here:
https://github.com/apache/avro/tree/trunk/lang/java/mapred/src/main/java/org/apache/avro/mapred
For the docs on how to work with the Avro mapred support, see the package
description of the org.apache.avro.mapred package
(http://avro.apache.org/docs/1.6.1/api/java/org/apache/avro/mapred/package-summary.html).
Hope that helps,
Friso
On 28 nov. 2011, at 00:03, Leonardo Urbina wrote:
Hey everyone,
First time posting to the list. I have posted this in the hadoop user mailing
list and haven't gotten any responses yet. Any help would be appreciated.
I'm currently writing a hadoop job that will run daily and whose output will be
part of the next day's input. Also, the output will potentially be
read by other programs for later analysis. Since my program's output is used as
part of the next day's input, it would be nice if it was stored in some binary
format that is easy to read the next time around. But this format also needs to
be readable by other outside programs, not necessarily written in Java. After
searching for a while it seems that Avro is what I want to be using. In any
case, I have been looking around for a while and I can't seem to find a single
example of how to use Avro within a Hadoop job.
It seems that in order to use Avro I need to change the io.serializations
value, however I don't know which value should be specified. Furthermore, I
found that there are classes Avro{Input,Output}Format, but these, as far as I
understand, seem to need other Avro classes such as AvroWrapper, AvroKey,
AvroValue and, as far as I can tell, Avro* (with * replaced by pretty much any
Hadoop class name). It
seems however that these are used so that the Avro format is used throughout
the Hadoop process to pass objects around.
I just want to use Avro to save my output and read it again as input next time
around. So far I have been using SequenceFile{Input,Output}Format, and have
implemented the Writable interface in the relevant classes, however this is not
portable to other languages. Is there a way to use Avro without a substantial
rewrite (using Avro* classes) of my Hadoop job? Thanks in advance,
Best,
-Leo
--
Leo Urbina
Massachusetts Institute of Technology
Department of Electrical Engineering and Computer Science
Department of Mathematics
[email protected]