Thanks! That is very helpful.

Best,
-Leo
On Mon, Nov 28, 2011 at 2:55 AM, Friso van Vollenhoven <[email protected]> wrote:

> Hi Leo,
>
> If you want everything to be vanilla Hadoop MapReduce and just want your
> output to be an Avro-readable file, then I don't think the standard Avro MR
> support has that for you.
>
> What you would need to do is:
> - Set your job's output format to AvroOutputFormat.class
> - Set "avro.output.schema" to the output schema that you want to use (the
>   JSON representation). This must be a Pair schema.
> - Optionally set "avro.output.codec" to enable compression.
> - Create a reducer for your job like this (this is new API style):
>   class MyReducer extends Reducer<K, V, AvroWrapper<OUT>, NullWritable> {
>     …implementation
>   }
>
> The K and V would be the map output key and value types. OUT typically
> is something like Pair<MyKey, MyValue>, where MyKey and MyValue are classes
> generated by Avro. This would write an Avro file that you can use as input
> again for a subsequent job using AvroInputFormat.
>
> For what you are trying to achieve, you could probably draw some
> inspiration from the implementation of the Avro mapred support. Have a look
> at the code for AvroJob, HadoopMapper and HadoopReducer in
> org.apache.avro.mapred, as they form the actual bridge between Avro and
> Hadoop. Source is browsable here:
> https://github.com/apache/avro/tree/trunk/lang/java/mapred/src/main/java/org/apache/avro/mapred
>
> For the docs on how to work with the Avro mapred support, see the
> package description of the org.apache.avro.mapred package
> (http://avro.apache.org/docs/1.6.1/api/java/org/apache/avro/mapred/package-summary.html).
>
> Hope that helps,
> Friso
>
> On 28 nov. 2011, at 00:03, Leonardo Urbina wrote:
>
> Hey everyone,
>
> First time posting to the list. I have posted this in the hadoop user
> mailing list and haven't gotten any responses yet. Any help would be
> appreciated.
>
> I'm currently writing a Hadoop job that will run daily and whose output
> will be part of the next day's input. Also, the output will potentially
> be read by other programs for later analysis. Since my program's output is
> used as part of the next day's input, it would be nice if it were stored in
> some binary format that is easy to read the next time around. But this
> format also needs to be readable by other outside programs, not necessarily
> written in Java. After searching for a while, it seems that Avro is what I
> want to be using. In any case, I have been looking around for a while and I
> can't seem to find a single example of how to use Avro within a Hadoop job.
>
> It seems that in order to use Avro I need to change the io.serializations
> value; however, I don't know which value should be specified. Furthermore,
> I found that there are classes Avro{Input,Output}Format, but these, as far
> as I understand, seem to need other Avro classes such as AvroWrapper,
> AvroKey, AvroValue, and, as far as I can tell, Avro* (with * replaced by
> pretty much any Hadoop class name). It seems, however, that these are used
> so that the Avro format is used throughout the Hadoop process to pass
> objects around.
>
> I just want to use Avro to save my output and read it again as input
> next time around. So far I have been using
> SequenceFile{Input,Output}Format, and have implemented the Writable
> interface in the relevant classes; however, this is not portable to other
> languages. Is there a way to use Avro without a substantial rewrite
> (using Avro* classes) of my Hadoop job?
>
> Thanks in advance,
>
> Best,
> -Leo
>
> --
> Leo Urbina
> Massachusetts Institute of Technology
> Department of Electrical Engineering and Computer Science
> Department of Mathematics
> [email protected]

--
Leo Urbina
Massachusetts Institute of Technology
Department of Electrical Engineering and Computer Science
Department of Mathematics
[email protected]
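
For anyone reading this thread later, here is a rough sketch of the two configuration properties Friso mentions ("avro.output.schema" and "avro.output.codec"), written in plain Java so it runs without Hadoop or Avro on the classpath. The class name AvroOutputSettings and its helper methods are hypothetical, made up for illustration. The JSON layout rests on an assumption: that Avro's Pair schema is a record named org.apache.avro.mapred.Pair with a "key" and a "value" field. Please check that against Pair.getPairSchema in your Avro version. The codec name "deflate" is also an assumption about what your Avro version accepts.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Avro or Hadoop.
class AvroOutputSettings {

    // Builds the JSON for a Pair schema over two primitive Avro types
    // (e.g. "string", "long"). Assumes the Pair record layout described above.
    static String pairSchemaJson(String keyType, String valueType) {
        return "{\"type\":\"record\",\"name\":\"Pair\","
             + "\"namespace\":\"org.apache.avro.mapred\",\"fields\":["
             + "{\"name\":\"key\",\"type\":\"" + keyType + "\"},"
             + "{\"name\":\"value\",\"type\":\"" + valueType + "\"}]}";
    }

    // Collects the properties named in the thread. The codec is optional,
    // so passing null leaves "avro.output.codec" unset.
    static Map<String, String> build(String schemaJson, String codec) {
        Map<String, String> props = new LinkedHashMap<String, String>();
        props.put("avro.output.schema", schemaJson);
        if (codec != null) {
            props.put("avro.output.codec", codec);
        }
        return props;
    }

    public static void main(String[] args) {
        Map<String, String> props = build(pairSchemaJson("string", "long"), "deflate");
        for (Map.Entry<String, String> e : props.entrySet()) {
            System.out.println(e.getKey() + " = " + e.getValue());
        }
    }
}
```

In a real job you would copy each entry into the job's Configuration with conf.set(key, value), and set the output format and reducer as described in Friso's reply. With classes generated by Avro, you would normally obtain the schema JSON from the Avro API instead of building the string by hand.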
