Hi Claudio,

Thank you very much for your valuable input. I will follow your suggestions to
try Giraph 0.2 (from trunk) and the worker settings.

Min

From: Claudio Martella <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Thursday, February 14, 2013 3:06 PM
To: "[email protected]" <[email protected]>
Subject: Re: General Scalability Questions for Giraph

Hi Tu,

First of all, I really suggest you run trunk, especially if you have a large
graph. That being said:

1) Yes and no; the jargon is misleading. You should have n - 1 workers (what
you call mappers for the Giraph job), where n is the maximum number of mappers
you can have in your cluster (the additional 1 goes to the master). In
general, I'd strongly suggest you have 1 mapper/worker per node/MACHINE, and
k compute threads per worker, where k is the number of cores on that machine.
You'll save Netty sending messages over the loopback, plus the additional JVM
overhead.
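The layout Claudio describes can be sketched as back-of-envelope arithmetic (the cluster sizes below are made-up assumptions for illustration, not a Giraph API):

```java
// Sketch of the suggested layout: one mapper/worker per machine, one extra
// mapper for the master, and k compute threads per worker (k = cores).
// Purely illustrative arithmetic; the numbers are hypothetical.
public class WorkerLayout {
    public static void main(String[] args) {
        int machines = 10;       // hypothetical cluster size
        int coresPerMachine = 8; // hypothetical cores per node

        // One mapper per machine; one of them runs the master,
        // so the remaining mappers are workers (n - 1).
        int mappers = machines;
        int workers = mappers - 1;

        // k compute threads per worker, with k = cores on that machine.
        int threadsPerWorker = coresPerMachine;
        int totalComputeSlots = workers * threadsPerWorker;

        System.out.println("workers=" + workers
                + " threadsPerWorker=" + threadsPerWorker
                + " totalComputeSlots=" + totalComputeSlots);
    }
}
```

With 10 machines and 8 cores each, this yields 9 workers and 72 compute slots while keeping all intra-machine traffic off the network.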

2) Yes, but I challenge you to compute those sizes beforehand :) Also consider
the size of the messages produced by your algorithm. E.g., roughly, PageRank
produces one double per edge in the graph during each superstep.
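Claudio's rule of thumb for PageRank (one double, i.e. 8 bytes, per edge per superstep) is easy to turn into an estimate; the edge count below is a hypothetical assumption:

```java
// Rough message-volume estimate for PageRank: one 8-byte double per edge
// per superstep. The edge count is a made-up example, and this ignores
// serialization and framing overhead, so treat it as a lower bound.
public class MessageVolume {
    public static void main(String[] args) {
        long edges = 1_000_000_000L; // hypothetical: 1 billion edges
        long bytesPerMessage = 8;    // one double per edge

        long bytesPerSuperstep = edges * bytesPerMessage;
        long gbPerSuperstep = bytesPerSuperstep / 1_000_000_000L;
        System.out.println("~" + gbPerSuperstep
                + " GB of messages per superstep");
    }
}
```

So a billion-edge graph generates on the order of 8 GB of message data every superstep, which has to fit in memory (or go out-of-core) alongside the graph itself.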

3) AFAIK there's no way, but I might be wrong here.

4) I'd suggest you also think in terms of nodes. Having multiple workers per
machine skews the scalability picture on certain aspects (such as network
I/O). I have been running Giraph jobs on hundreds of mappers and around 65
machines. I know others here have reached bigger numbers (~300 workers). I'd
say the upper limit to scalability is your main memory at the moment, so you
might want to have a look at the out-of-core graph and messages options.
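Claudio's point that main memory is the current ceiling suggests a quick feasibility check before launching: does the graph fit in the aggregate worker heap? All sizes below are hypothetical assumptions, not measured Giraph overheads:

```java
// Back-of-envelope memory check: compare an assumed in-memory graph
// footprint against aggregate worker heap. If it doesn't fit, that is
// the signal to look at the out-of-core graph/messages options.
// Every constant here is a hypothetical assumption.
public class MemoryFit {
    public static void main(String[] args) {
        long vertices = 100_000_000L; // hypothetical graph size
        long edges = 1_000_000_000L;
        long bytesPerVertex = 100;    // assumed in-memory overhead
        long bytesPerEdge = 20;       // assumed in-memory overhead

        long graphBytes = vertices * bytesPerVertex + edges * bytesPerEdge;

        int workers = 9;              // e.g. 10 machines minus the master
        long heapPerWorker = 4L * 1024 * 1024 * 1024; // assumed 4 GiB heaps
        long aggregateHeap = workers * heapPerWorker;

        boolean fits = graphBytes < aggregateHeap;
        System.out.println("graphGB=" + graphBytes / 1_000_000_000L
                + " heapGB=" + aggregateHeap / 1_000_000_000L
                + " fits=" + fits);
    }
}
```

Note this leaves no headroom for per-superstep message buffers (see the PageRank estimate above), so in practice you'd want the graph well under the aggregate heap, not just barely inside it.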

Hope it helps,
Claudio


On Thu, Feb 14, 2013 at 11:50 PM, Tu, Min <[email protected]> wrote:
Hi,

I have some general scalability questions for Giraph. Based on the Giraph
design, I am assuming all the mappers in a Giraph job should be running at the
same time.

If so, then

  1.  The max mappers for a Giraph job <= total mapper slots in the whole cluster
  2.  The max input data size to Giraph should be <= total mapper slots * 
mapper memory limit
  3.  If the total mapper slots in the cluster is 200, only 100 mappers are 
currently available, and the Giraph job requires 150 mappers:
     *   Without any configuration change, 100 mappers of the Giraph job will 
start, but the job will NOT run successfully
     *   Is there any configuration in Giraph to start the job ONLY when all 
the required mapper slots are available?
  4.  How is the scalability of Giraph? I can ONLY run up to 150 mappers for my 
Giraph job. Has anyone run a large Giraph job on a large cluster successfully?
     *   I am using Giraph 0.1 in my cluster

Thanks a lot for your time and input.

Min



--
   Claudio Martella
   [email protected]<mailto:[email protected]>