I have not done a thorough investigation, but from what I was told, the 
amount of data pushed over the network among the nodes of our cluster is very 
low.  Most of the time appears to be spent in local computation.

Date: Sat, 25 Jun 2011 07:27:17 -0700
Subject: Re: improve performance of avro map reduce jobs
From: [email protected]
To: [email protected]

To me, Avro offers a big benefit to MapReduce jobs by having a well-defined 
file format that carries the schema the records were written with, is 
splittable, compressible, and has metadata. Ultimately, I'd like to see that, 
plus a binding layer on top of the data serialization, and more flexibility 
for serialization in Avro (e.g., why not be able to use its APIs but Protobuf 
for the binding?).
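
For illustration, here is a minimal sketch of what the container format buys 
you in practice: the schema travels with the data, the file is block-based 
(hence splittable and compressible per block), and a reader needs no 
out-of-band schema. The schema and file name below are made up for the 
example:

import java.io.File;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroContainerDemo {
  public static void main(String[] args) throws Exception {
    // Hypothetical schema; in a real job it would come from a .avsc file.
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Event\",\"fields\":["
        + "{\"name\":\"id\",\"type\":\"long\"},"
        + "{\"name\":\"msg\",\"type\":\"string\"}]}");

    File file = new File("events.avro");

    // Write: the schema is embedded in the file header, and the container
    // format is block-based, so it is splittable and can be compressed
    // per block (e.g. with the deflate codec).
    DataFileWriter<GenericRecord> writer =
        new DataFileWriter<GenericRecord>(
            new GenericDatumWriter<GenericRecord>(schema));
    writer.create(schema, file);
    GenericRecord r = new GenericData.Record(schema);
    r.put("id", 1L);
    r.put("msg", "hello");
    writer.append(r);
    writer.close();

    // Read: no schema needs to be supplied; it is read from the file.
    DataFileReader<GenericRecord> reader =
        new DataFileReader<GenericRecord>(
            file, new GenericDatumReader<GenericRecord>());
    System.out.println("writer schema: " + reader.getSchema());
    for (GenericRecord rec : reader) {
      System.out.println(rec);
    }
    reader.close();
  }
}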

That being said, I'm curious where you see the CPU going in your jobs. Where 
in the Avro serialization is it spending its time? I suspect that GenericData 
isn't nearly as performant as using codegen.
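
To make the generic-vs-codegen distinction concrete, here is a rough sketch, 
not a benchmark. The Event class in the comments is assumed to be produced by 
Avro's schema compiler, and the schema is assumed to have a long "id" field:

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

public class GenericVsSpecific {
  // Generic: fields are resolved by name at runtime and values are
  // boxed, so every access pays a lookup, a box, and a cast.
  static long genericAccess(Schema schema) {
    GenericRecord r = new GenericData.Record(schema);
    r.put("id", 42L);
    return (Long) r.get("id");
  }

  // Specific (codegen): the schema compiler generates a concrete class
  // with typed fields, so access is a plain field read with no lookup
  // or boxing. Event is hypothetical:
  //
  //   Event e = new Event();
  //   e.id = 42L;       // typed field generated from the schema
  //   long id = e.id;
}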


On Sat, Jun 25, 2011 at 6:27 AM, ey-chih chow <[email protected]> wrote:

Let me put the question another way.  Companies like Twitter use Protocol 
Buffers as their serialization tool, and it seems to have better performance.  
Is there anything compelling that Avro can do and Protocol Buffers cannot?  
Thanks.

Ey-Chih 

From: [email protected]
To: [email protected]

Subject: improve performance of avro map reduce jobs
Date: Fri, 24 Jun 2011 16:55:58 -0700

Our Map/Reduce jobs are all based on Avro, and we would like to enhance their 
performance.  The objects collected in our mappers and reducers are mainly of 
type GenericData.Record.  Currently, most of our jobs are CPU-bound rather 
than IO-bound.  Can anybody suggest ways to improve the performance of these 
jobs?  Thanks a lot.
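
As an illustration of the kind of job in question, here is a minimal sketch 
of a mapper over GenericData.Record inputs using the org.apache.avro.mapred 
API (the schema and the "userId" field are hypothetical). One common 
CPU-saving pattern shown here is reusing the output Pair across map() calls 
instead of allocating a fresh object per record:

import java.io.IOException;

import org.apache.avro.generic.GenericRecord;
import org.apache.avro.mapred.AvroCollector;
import org.apache.avro.mapred.AvroMapper;
import org.apache.avro.mapred.Pair;
import org.apache.avro.util.Utf8;
import org.apache.hadoop.mapred.Reporter;

// Sketch: counts records per user. The Pair is allocated once and
// reused across calls, which cuts per-record garbage; that matters
// most when the job is CPU-bound rather than IO-bound.
public class CountByUserMapper
    extends AvroMapper<GenericRecord, Pair<Utf8, Long>> {

  private final Pair<Utf8, Long> out =
      new Pair<Utf8, Long>(new Utf8(""), 0L);

  @Override
  public void map(GenericRecord datum,
                  AvroCollector<Pair<Utf8, Long>> collector,
                  Reporter reporter) throws IOException {
    out.key((Utf8) datum.get("userId"));  // hypothetical field
    out.value(1L);
    collector.collect(out);
  }
}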

Ey-Chih Chow


-- 
Ron Bodkin
CEO
Think Big Analytics
m: +1 (415) 509-2895
@ronbodkin