Giraph partitions the vertices using a hash function that is essentially the equivalent of (hash(vertexID) mod #ofComputeNodes). You can mitigate memory issues by starting the job with only a minimal set of vertices in your input file and then adding vertices dynamically as the job progresses (assuming your job doesn't require all of the vertices up front).
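
To make that concrete, here is a minimal sketch of what that hash-based assignment amounts to, in plain Java. It is illustrative only, not Giraph's actual partitioner class: the names HashPartitionSketch and partitionFor, the use of Long's hashCode, and the worker count of 6 are all made up for the example.

    /**
     * Illustrative sketch only -- not Giraph's actual partitioner class.
     * It just shows what hash(vertexID) mod #ofComputeNodes means: each
     * vertex is assigned to a worker based solely on its ID, so vertex
     * degree plays no part in the placement.
     */
    public class HashPartitionSketch {

        /** Map a vertex ID to one of numWorkers partitions. */
        static int partitionFor(long vertexId, int numWorkers) {
            int hash = Long.valueOf(vertexId).hashCode();
            // Mask off the sign bit so negative hashes still map to a valid partition.
            return (hash & Integer.MAX_VALUE) % numWorkers;
        }

        public static void main(String[] args) {
            int numWorkers = 6;  // e.g. one Giraph worker per server
            for (long id = 0; id < 10; id++) {
                System.out.println("vertex " + id + " -> worker "
                        + partitionFor(id, numWorkers));
            }
        }
    }

One consequence for a power-law graph like yours: because placement ignores degree, the few very high-degree vertices can concentrate edge storage on whichever workers their IDs happen to hash to, which can show up as uneven memory use across workers.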
-David

On 7/16/12 4:36 AM, "Nicolas DUGUE" <[email protected]> wrote:

>Hi everybody,
>
> I'm new to Giraph, so I have a few questions about how it works and
>how to configure it so that it runs as well as possible.
> We have set up a cluster of 6 servers, each with 24 CPUs and 24GB of
>RAM, and we want to use it to experiment with Giraph.
> So far we have made a few runs and have run into memory problems; it
>seems we aren't giving the JVM enough of it (GC overhead, OutOfMemory,
>...).
> Our experiments were PageRank benchmarks, and we only succeeded in
>running them on a 100-million-edge graph by launching two virtual
>machines with 8GB of RAM each on every server.
>
> Here are our questions:
> - Which is better: launching one Giraph VM per server with 20GB of
>RAM, or launching two VMs with 10GB of RAM each?
> - Is there a way to minimize the memory used by Hadoop so that more
>memory is available to the Giraph jobs?
> - How is the graph distributed across the cluster? Our graph may be
>a power-law graph, with a few nodes that have a very large number of
>edges and many nodes with few edges. How will Giraph distribute this
>kind of graph? Does it take into account the number of edges of each
>vertex?
>
>Thanks in advance,
>Nicolas Dugué
>PhD student at the University of Orléans
