Hi Claudio,

Thank you very much for your valuable input. I will follow your suggestions and try Giraph 0.2 (from trunk) and the worker settings.
Min

From: Claudio Martella <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Thursday, February 14, 2013 3:06 PM
To: "[email protected]" <[email protected]>
Subject: Re: General Scalability Questions for Giraph

Hi Tu,

First of all, I really suggest you run trunk, especially if you have a large graph. That being said:

1) Yes and no; the jargon is misleading. You should have n - 1 workers (what you call mappers for the Giraph job), with n, the max number of mappers you can have in your cluster, as the upper limit (the additional one goes to the master). In general, I'd strongly suggest you have one mapper/worker per node/MACHINE, and k compute threads per worker, with k as the number of cores on that machine. You'll save Netty sending messages over the loopback, plus additional JVM overhead.

2) Yes, but I challenge you to compute those sizes beforehand :) Also consider the size of the messages produced by your algorithm. E.g., roughly, PageRank produces a double for each edge in the graph during each superstep.

3) AFAIK there's no way, but I might be wrong here.

4) I'd suggest you also talk in terms of nodes. Having multiple workers per machine misleads the scalability picture on certain aspects (such as network I/O). I have been running Giraph jobs on hundreds of mappers and around 65 machines. I know others here have done bigger numbers (~300 workers). I'd say the upper limit to scalability is your main memory ATM, so you might want to have a look at out-of-core graph and messages.

Hope it helps,
Claudio

On Thu, Feb 14, 2013 at 11:50 PM, Tu, Min <[email protected]> wrote:

Hi,

I have some general scalability questions for Giraph. Based on the Giraph design, I am assuming all the mappers in a Giraph job should be running at the same time. If so, then:

1.
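To make points 1 and 2 above concrete, here is a minimal Python sketch. Only the rules come from the reply (n - 1 workers with one slot for the master, one worker per machine, k compute threads with k = cores, and one 8-byte double per edge per superstep for PageRank); the concrete cluster and graph figures are made-up assumptions for illustration.

```python
# Back-of-the-envelope sizing based on points 1 and 2 above.
# All concrete figures below are illustrative assumptions.

BYTES_PER_DOUBLE = 8  # PageRank emits one double per edge per superstep


def suggested_layout(num_machines, cores_per_machine):
    """One worker per machine, n - 1 workers total (the remaining slot
    goes to the master), and k compute threads per worker, k = cores."""
    workers = num_machines - 1
    compute_threads = cores_per_machine
    return workers, compute_threads


def pagerank_message_bytes(num_edges):
    """Rough per-superstep message volume for PageRank."""
    return num_edges * BYTES_PER_DOUBLE


# Assumed cluster: 66 machines with 8 cores each -> 65 workers, 8 threads.
print(suggested_layout(66, 8))

# Assumed graph: 1 billion edges -> roughly 7.5 GiB of messages
# produced in every superstep, before any serialization overhead.
print(pagerank_message_bytes(1_000_000_000) / 1024 ** 3)
```

The point of the sketch is that message volume scales with edges, not vertices, so a skewed or dense graph can exhaust worker memory long before the raw input size suggests it would.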
The max mappers for a Giraph job <= total mapper slots in the whole cluster.

2. The max input data size for Giraph should be <= total mapper slots * mapper memory limit.

3. If the total mapper slots in the cluster is 200, only 100 mappers are currently available, and the Giraph job requires 150 mappers:
* Without any configuration change, the 100 mappers of the Giraph job will be started, but the job will NOT run successfully.
* Is there any configuration in Giraph to start the job ONLY when all the needed mapper slots are available?

4. How is the scalability in Giraph? I can ONLY run up to 150 mappers for my Giraph job. Has anyone run a large Giraph job on a large cluster successfully?
* I am using Giraph 0.1 in my cluster.

Thanks a lot for your time and input.

Min

--
Claudio Martella
[email protected]
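The capacity reasoning in questions 1-3 can be sketched as two simple checks. The 200/100/150 numbers mirror the scenario in question 3; the input size and per-mapper memory limit are assumed figures, not from this thread.

```python
# Sketch of the two capacity checks behind questions 1-3.
# Concrete sizes are illustrative assumptions.

GB = 1024 ** 3


def input_fits(input_bytes, mapper_slots, mapper_memory_bytes):
    """Question 2: the input must fit in aggregate mapper memory."""
    return input_bytes <= mapper_slots * mapper_memory_bytes


def job_can_proceed(required_mappers, available_slots):
    """Questions 1 and 3: Giraph needs all its mappers running at the
    same time, so a partial allocation starts but never completes."""
    return available_slots >= required_mappers


# Question 3's scenario: 150 mappers needed, only 100 slots free.
print(job_can_proceed(150, 100))          # False: 100 mappers start, job stalls

# With the full 200 slots free, the same job can run.
print(job_can_proceed(150, 200))          # True

# Assumed: 180 GB of input, 200 slots, 1 GB memory limit per mapper.
print(input_fits(180 * GB, 200, 1 * GB))  # True: 180 GB <= 200 GB
```

Note this treats the memory limit as the only bound; in practice messages and data structures (point 2 of the reply above) shrink the usable budget well below slots * limit.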
