Re: Hadoop Serialization: Avro

Leonardo Urbina Tue, 29 Nov 2011 07:23:48 -0800

Thanks! That is very helpful,

Best,
-Leo


On Mon, Nov 28, 2011 at 2:55 AM, Friso van Vollenhoven <
[email protected]> wrote:

>  Hi Leo,
>
>  If you want everything to be vanilla Hadoop MapReduce and just want your
> output to be an Avro-readable file, then I don't think the standard Avro MR
> support has that for you.
>
>  What you would need to do is:
> - Set your job's output format to AvroOutputFormat.class
> - Set "avro.output.schema" to the output schema that you want to use (the
> json representation). This must be a Pair schema.
> - Optionally set "avro.output.codec" to enable compression.
> - Create a reducer for your job like this (this is new API style):
> class MyReducer extends Reducer<K, V, AvroWrapper<OUT>, NullWritable> {
> …implementation
> }
>
>  The K and V would be the map output key and value types. OUT typically
> is something like Pair<MyKey, MyValue> where MyKey and MyValue are classes
> generated by Avro. This would write an Avro file that you can use as input
> again for a subsequent job using AvroInputFormat.
>
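To make the "json representation" of a Pair schema mentioned above more concrete, here is a minimal sketch that assembles such a string with plain Java, so it runs without Avro on the classpath. The class and method names (PairSchemaSketch, pairSchemaJson) and the string/long key and value types are hypothetical. The record name "Pair" in namespace "org.apache.avro.mapred" is an assumption about what Avro's Pair class generates and should be checked against Pair.getPairSchema in the Avro version in use.

```java
// Sketch only: builds the JSON text of a key/value "Pair" record schema
// with plain string concatenation; it does not use or validate against Avro.
final class PairSchemaSketch {

    // Assumption: record name, namespace and the "ignore" sort order on the
    // value field mirror what org.apache.avro.mapred.Pair generates.
    // keyTypeJson and valueTypeJson must already be valid Avro schema JSON.
    static String pairSchemaJson(String keyTypeJson, String valueTypeJson) {
        return "{\"type\":\"record\","
             + "\"name\":\"Pair\","
             + "\"namespace\":\"org.apache.avro.mapred\","
             + "\"fields\":["
             + "{\"name\":\"key\",\"type\":" + keyTypeJson + "},"
             + "{\"name\":\"value\",\"type\":" + valueTypeJson
             + ",\"order\":\"ignore\"}"
             + "]}";
    }

    public static void main(String[] args) {
        // Hypothetical output types: a string key and a long value.
        String schema = pairSchemaJson("\"string\"", "\"long\"");
        System.out.println(schema);
        // In the job setup, this string would be stored under the
        // "avro.output.schema" property of the job's configuration.
    }
}
```

With Avro available, it is probably simpler to let the library build the schema and call toString() on it; the point here is only to show roughly what the property value looks like.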
>  For what you are trying to achieve, you could probably draw some
> inspiration from the implementation of the Avro mapred support. Have a look
> at the code for AvroJob, HadoopMapper and HadoopReducer in
> org.apache.avro.mapred, as they form the actual bridge between Avro and
> Hadoop. Source is browsable here:
> https://github.com/apache/avro/tree/trunk/lang/java/mapred/src/main/java/org/apache/avro/mapred
>
>  For the docs on how to work with the Avro mapred support, see the
> package description of the org.apache.avro.mapred package (
> http://avro.apache.org/docs/1.6.1/api/java/org/apache/avro/mapred/package-summary.html
> ).
>
>
>  Hope that helps,
> Friso
>
>
>  On 28 nov. 2011, at 00:03, Leonardo Urbina wrote:
>
> Hey everyone,
>
>  First time posting to the list. I have posted this in the hadoop user
> mailing list and haven't gotten any responses yet. Any help would be
> appreciated.
>
>  I'm currently writing a hadoop job that will run daily and whose output
> will be part of the next day's input. Also, the output will
> potentially be read by other programs for later analysis. Since my
> program's output is used as part of the next day's input, it would be nice
> if it was stored in some binary format that is easy to read the next time
> around. But this format also needs to be readable by other outside
> programs, not necessarily written in Java. After searching for a while it
> seems that Avro is what I want to be using. In any case, I have been
> looking around for a while and I can't seem to find a single example of how
> to use Avro within a Hadoop job.
>
>  It seems that in order to use Avro I need to change
> the io.serializations value, however I don't know which value should be
> specified. Furthermore, I found that there are classes
> Avro{Input,Output}Format, but these, as far as I understand, seem to
> need the use of other Avro classes such as AvroWrapper,
> AvroKey, AvroValue, and as far as I can tell Avro* (with * replaced
> with pretty much any Hadoop class name). It seems however that these are
> used so that the Avro format is used throughout the Hadoop process to
> pass objects around.
>
>  I just want to use Avro to save my output and read it again as input
> next time around. So far I have been using
> SequenceFile{Input,Output}Format, and have implemented the Writable
> interface in the relevant classes, however this is not portable to other
> languages. Is there a way to use Avro without a substantial rewrite
> (using Avro* classes) of my Hadoop job? Thanks in advance,
>
>  Best,
>  -Leo
>
>  --
> Leo Urbina
> Massachusetts Institute of Technology
> Department of Electrical Engineering and Computer Science
> Department of Mathematics
> [email protected]
>
>
>


-- 
Leo Urbina
Massachusetts Institute of Technology
Department of Electrical Engineering and Computer Science
Department of Mathematics
[email protected]
