To me, Avro offers a big benefit to MapReduce jobs by having a well-defined file format that embeds the schema used to write its records, is splittable, compressible, and carries metadata. Ultimately, I'd like to see that file format and the binding layer sit on top of the data serialization, with more flexibility in how records are serialized (e.g., why not be able to use Avro's APIs but Protobuf for the binding?).
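Here's a minimal sketch of what I mean by a self-describing file, using the generic API (the inline schema, field names, and file name are just placeholders): the DataFileWriter stores the writer's schema and the codec in the file header, which is what makes the files self-describing, block-compressible, and splittable.

    import java.io.File;
    import org.apache.avro.Schema;
    import org.apache.avro.file.CodecFactory;
    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;

    public class WriteUsers {
      public static void main(String[] args) throws Exception {
        // Placeholder schema; in practice this would come from a .avsc file.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
            + "{\"name\":\"name\",\"type\":\"string\"},"
            + "{\"name\":\"clicks\",\"type\":\"long\"}]}");

        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "ey-chih");
        user.put("clicks", 42L);

        DataFileWriter<GenericRecord> writer = new DataFileWriter<GenericRecord>(
            new GenericDatumWriter<GenericRecord>(schema));
        writer.setCodec(CodecFactory.deflateCodec(6)); // block compression
        writer.create(schema, new File("users.avro")); // schema goes into the file header
        writer.append(user);
        writer.close();
      }
    }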
That being said, I'm curious: where do you see the CPU going in your jobs? Where in the Avro serialization is it spending its time? I suspect that GenericData isn't nearly as performant as using codegen (rough sketch of the two paths in the P.S. below).

On Sat, Jun 25, 2011 at 6:27 AM, ey-chih chow <[email protected]> wrote:

> Let me put the question in another way. Companies like Twitter use
> Protocol Buffers as their serialization tool. It seems to have better
> performance. Is there any compelling reason to prefer Avro, i.e., something
> Avro can do that Protocol Buffers cannot? Thanks.
>
> Ey-Chih
>
> ------------------------------
> From: [email protected]
> To: [email protected]
> Subject: improve performance of avro map reduce jobs
> Date: Fri, 24 Jun 2011 16:55:58 -0700
>
> Our Map/Reduce jobs are all based on Avro. We would like to enhance their
> performance. The objects collected in our mappers and reducers are mainly
> of the type GenericData.Record. Currently, most of our jobs are CPU, rather
> than IO, bound. Can anybody suggest ways to improve the performance of
> these jobs? Thanks a lot.
>
> Ey-Chih Chow

--
Ron Bodkin
CEO
Think Big Analytics <http://www.thinkbiganalytics.com>
m: +1 (415) 509-2895
@ronbodkin <http://twitter.com/#!/ronbodkin>
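P.S. Here's a rough sketch of the two paths I'm contrasting. UserEvent stands in for a hypothetical class the Avro specific compiler would generate from your schema (the generated field/setter style varies by Avro version); the generic path resolves field names and boxes values on every put(), which is where I'd expect a fair amount of your CPU to go. Of course, a profiler will tell you whether the time is actually going into the datum writer before you change anything.

    import java.io.ByteArrayOutputStream;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.io.BinaryEncoder;
    import org.apache.avro.io.EncoderFactory;
    import org.apache.avro.specific.SpecificDatumWriter;

    public class GenericVsSpecific {
      public static void main(String[] args) throws Exception {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder enc = EncoderFactory.get().binaryEncoder(out, null);

        // Generic path: name lookups and Object boxing on every put()/write().
        GenericRecord rec = new GenericData.Record(UserEvent.SCHEMA$);
        rec.put("name", "ey-chih");
        rec.put("clicks", 42L);
        new GenericDatumWriter<GenericRecord>(UserEvent.SCHEMA$).write(rec, enc);
        enc.flush();

        // Specific (codegen) path: typed fields, no per-field name resolution.
        out.reset();
        UserEvent event = new UserEvent();   // hypothetical generated class
        event.name = "ey-chih";              // generated field/setter style varies
        event.clicks = 42L;
        new SpecificDatumWriter<UserEvent>(UserEvent.class).write(event, enc);
        enc.flush();
      }
    }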
