Hi Tu, first of all, I strongly suggest you run Giraph trunk rather than 0.1, especially if you have a large graph. That being said:

1) Yes and no; the jargon is misleading. You can have at most n - 1 workers (what you call mappers for a Giraph job), where n is the maximum number of mappers your cluster can run at once; the remaining one goes to the master. In general, I'd strongly suggest you have 1 mapper/worker per node/MACHINE, and k compute threads per worker, with k the number of cores on that machine. That saves Netty from sending messages over the loopback interface and avoids the overhead of additional JVMs.

2) Yes, but I challenge you to compute those sizes beforehand :) Also consider the size of the messages produced by your algorithm. E.g., roughly, PageRank produces a double for each edge in the graph during each superstep.

3) AFAIK there's no way, but I might be wrong here.

4) I'd suggest you also think in terms of nodes. Running multiple workers per machine gives a misleading picture of scalability in some respects (such as network I/O). I have been running Giraph jobs on hundreds of mappers and around 65 machines. I know others here have run bigger jobs (~300 workers). I'd say the upper limit to scalability is your main memory ATM, so you might want to have a look at out-of-core graph and messages.

Hope it helps,
Claudio

On Thu, Feb 14, 2013 at 11:50 PM, Tu, Min <[email protected]> wrote:

> Hi,
>
> I have some general scalability questions for Giraph. Based on the
> Giraph design, I am assuming all the mappers in a Giraph job should be
> running at the same time.
>
> If so, then
>
> 1. The max mappers for a Giraph job <= total mapper slots in the whole
>    cluster
> 2. The max data input size to Giraph should be <= total mapper slots *
>    mapper memory limit
> 3. If the total mapper slots in the cluster is 200 and only 100 mappers
>    are currently available, and the Giraph job requires 150 mappers
>    1. Without any configuration change, the 100 mappers of the Giraph
>       job will be started but the job will NOT run successfully
>    2. Is there any configuration in Giraph to start the job ONLY at
>       the time when all the mapper slots are available?
> 4. How is the scalability in Giraph? I can ONLY run up to 150 mappers
>    for my Giraph job. Does anyone run a large Giraph job in a large
>    cluster successfully?
>    1. I am using Giraph 0.1 in my cluster
>
> Thanks a lot for your time and inputs.
>
> Min

--
Claudio Martella
[email protected]
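
P.S. The arithmetic behind points 1) and 2) of the reply can be sketched roughly as below. This is only a back-of-the-envelope estimate, not something Giraph computes for you: the function names and the example numbers (200 mapper slots, one billion edges) are made up for illustration, and it assumes no message combiner, an even spread of edges across workers, and counts only the raw 8-byte double payload (no JVM object overhead, vertex IDs, or serialization framing).

```python
# Back-of-the-envelope sizing for a Giraph job (illustrative only).
# All function names and example numbers here are hypothetical.

BYTES_PER_DOUBLE = 8


def max_workers(max_mappers: int) -> int:
    """One mapper slot goes to the master; the rest can be workers."""
    if max_mappers < 2:
        raise ValueError("need at least 2 mapper slots: 1 master + 1 worker")
    return max_mappers - 1


def pagerank_message_bytes(num_edges: int) -> int:
    """Raw PageRank message payload per superstep: one double per edge."""
    return num_edges * BYTES_PER_DOUBLE


def message_bytes_per_worker(num_edges: int, workers: int) -> float:
    """Average raw message payload per worker, assuming an even edge spread."""
    return pagerank_message_bytes(num_edges) / workers


if __name__ == "__main__":
    workers = max_workers(200)   # assumed: 200 mapper slots, all available
    edges = 1_000_000_000        # assumed graph size
    total_gb = pagerank_message_bytes(edges) / 1e9
    per_worker_mb = message_bytes_per_worker(edges, workers) / 1e6
    print(f"workers: {workers}")
    print(f"messages per superstep: {total_gb:.1f} GB")
    print(f"average per worker: {per_worker_mb:.1f} MB")
```

Real memory use will be noticeably higher than these raw figures, since the graph itself (vertices, edges, values) also has to fit in the workers' heaps alongside the messages.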
