Avery,

Is there an example of overriding the partitioner in the Giraph 0.1 distribution?

Thanks,
Jon

On Tue, Jul 17, 2012 at 11:00 AM, Avery Ching <[email protected]> wrote:

> Answers inline.
>
> On 7/17/12 1:22 AM, Nicolas DUGUE wrote:
>
>> Thanks for your answer David !
>>
>> Okay, but is there a way to force Giraph to partition the graph in our
>> own way, and how do we do that ? It may be useful to minimize communication
>> between Giraph nodes.
>
> The partitioning method is very customizable. See
> GraphPartitionerFactory as the interface you need to implement.
> HashPartitionerFactory is what we use as the default, but you can implement
> your own.
>
>> You're talking about starting the job with a minimum of vertices and adding
>> new vertices later. It seems really interesting; how do we do that and how
>> does it work ?
>
> The graph is mutable as the application is running. See MutableVertex for
> all the local and remote mutations you can make.
>
>> For example, I run my Giraph job with half of the vertices and during my
>> first superstep, I add (I don't know how) some vertices to my file. Will
>> these vertices be taken into account for my first superstep or just for the
>> next superstep ?
>> And when the vertices are loaded, is it possible to remove them from
>> memory ? In other words, I can add new vertices, can I remove vertices too
>> ? So, is it possible to change the topology of my graph dynamically ?
>
> Yes, see above.
>
>> Moreover, I'm still wondering which is best: launching one VM with
>> Giraph on each server with 20GB of RAM, OR launching two of them with
>> 10GB of RAM each ?
>
> Well, in that case, I'm guessing one server with 20 GB since there would
> be no communication (most of the effort).
>
>> And finally, when I launch a Giraph job, ZooKeeper is loaded in one
>> virtual machine alone... Is there a way to run some Giraph jobs in this
>> virtual machine too ? Or to specify explicitly in which VM to run the
>> ZooKeeper job ?
>
> ZooKeeper runs in the same slot as the master process, not sure you'd
> want to do more there as it's best to balance the memory usage across the
> workers.
>
>> Best regards,
>> Nicolas
>>
>> On 16/07/2012 21:51, David Garcia wrote:
>>
>>> Giraph partitions the vertices using a hashing function that's basically
>>> the equivalent of (hash(vertexID) mod #ofComputeNodes).
>>> You can mitigate memory issues by starting the job with a minimum of
>>> vertices in your file and then adding them dynamically as your job
>>> progresses (assuming that your job doesn't require all of the vertices).
>>>
>>> -David
>>>
>>> On 7/16/12 4:36 AM, "Nicolas DUGUE" <[email protected]> wrote:
>>>
>>>> Hi everybody,
>>>>
>>>> I'm new to Giraph so I have a few questions about how it works and
>>>> how to configure it to make it work as well as possible.
>>>> We have set up a cluster of 6 servers with 24 CPUs and 24GB of RAM and
>>>> we want to use it to experiment with Giraph.
>>>> Currently, we've made a few runs and we have some problems with
>>>> memory; it seems that we don't give enough of it to the JVM (GC
>>>> overhead, OutOfMemory, ...).
>>>> Our experiments were benchmarks using PageRank; we only succeeded
>>>> in running it on a 100 million edge graph by running two virtual
>>>> machines with 8GB of RAM on each of our servers.
>>>>
>>>> Here are our questions :
>>>> - Which is best ? Launching one VM with Giraph on each server
>>>> with 20GB of RAM, OR launching two of them with 10GB of RAM each ?
>>>> - Is there a way to minimize the memory used by Hadoop to give
>>>> more memory to the Giraph jobs ?
>>>> - How is the graph distributed across the cluster ? Our graph may
>>>> be a power-law graph with a few nodes with a very large number of edges
>>>> and a lot of nodes with a few edges. How will Giraph distribute this
>>>> kind of graph ? Does it take into account the number of edges of each
>>>> vertex ?
>>>>
>>>> Thanks in advance,
>>>> Nicolas Dugué
>>>> PhD student at the University of Orléans
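
For readers following the partitioning discussion above, here is a minimal, self-contained Java sketch of the idea David describes (hash(vertexID) mod number of partitions) next to one possible custom alternative. It deliberately does not use Giraph's classes: `PartitionSketch`, `VertexPartitioner`, `HashVertexPartitioner` and `RangeVertexPartitioner` are hypothetical names made up for illustration, and the real GraphPartitionerFactory interface mentioned by Avery has its own methods that are not reproduced here. The range variant assumes non-negative vertex IDs and a known maximum ID.

```java
/**
 * Illustrative only: none of these types are Giraph classes.
 * They sketch the mapping "vertex ID -> partition index" that a
 * partitioner has to provide.
 */
public class PartitionSketch {

    /** Hypothetical stand-in for a partitioning strategy. */
    public interface VertexPartitioner {
        int partitionFor(long vertexId, int partitionCount);
    }

    /** Hash-style placement: hash(vertexId) mod partitionCount. */
    public static final class HashVertexPartitioner implements VertexPartitioner {
        @Override
        public int partitionFor(long vertexId, int partitionCount) {
            // floorMod keeps the result in [0, partitionCount) even for negative hashes.
            return Math.floorMod(Long.hashCode(vertexId), partitionCount);
        }
    }

    /**
     * Custom placement: contiguous ID ranges share a partition.
     * Assumes 0 <= vertexId <= maxVertexId.
     */
    public static final class RangeVertexPartitioner implements VertexPartitioner {
        private final long maxVertexId;

        public RangeVertexPartitioner(long maxVertexId) {
            this.maxVertexId = maxVertexId;
        }

        @Override
        public int partitionFor(long vertexId, int partitionCount) {
            // Size of each contiguous block of IDs, rounded up so the
            // largest ID still falls inside the last partition.
            long rangeSize = (maxVertexId / partitionCount) + 1;
            return (int) (vertexId / rangeSize);
        }
    }

    public static void main(String[] args) {
        VertexPartitioner hash = new HashVertexPartitioner();
        VertexPartitioner range = new RangeVertexPartitioner(99L);
        long[] sampleIds = {10L, 26L, 99L};
        for (long id : sampleIds) {
            System.out.println("vertex " + id
                + " hash=" + hash.partitionFor(id, 4)
                + " range=" + range.partitionFor(id, 4));
        }
    }
}
```

A range-style scheme like this only helps reduce communication if vertex IDs were assigned so that neighbouring vertices tend to have nearby IDs; it does nothing by itself to balance the heavy vertices of a power-law graph.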
