Thanks André, yes that file helps a lot. I changed a couple of those things now to suit my application. I'm able to save checkpoints every 50 supersteps (changed CLEANUP_CHECKPOINTS_AFTER_SUCCESS_DEFAULT to false so I can see the files).
How do I "manually" restart from, say, step 100 even though the job has finished successfully? Do I change the string variable RESTART_SUPERSTEP to _bsp/_checkpoints/job_201208071105_0007/100.finalized? I'm assuming that when the job actually fails it will restart automatically from the previous checkpoint.

Thank you.

Vishal

On Sun, Aug 12, 2012 at 3:57 AM, André Kelpe <[email protected]> wrote:
> Hi Vishal,
>
> you can control the checkpoint frequency with the setting
> "giraph.checkpointFrequency" in your JobConfiguration. The default is
> set to 0 right now, meaning no checkpoints are made. You should def.
> check out the GiraphJob [0] code, where all these tuning knobs are
> documented.
>
> --André
>
> [0]
> https://github.com/apache/giraph/blob/trunk/src/main/java/org/apache/giraph/graph/GiraphJob.java#L308
>
> 2012/8/11 Vishal Patel <[email protected]>:
> > Hi,
> >
> > How do I specify the interval for saving checkpoints? When working with
> > Amazon's Elastic Mapreduce on a large number of workers (> 80 workers,
> > 40 x m1.xlarge machines), sometimes there are RPC communication errors,
> > and Zookeeper waits on that worker for a while before timing out and
> > killing the job altogether.
> >
> > As my graph and number of workers are becoming larger, I would like to
> > learn how to save checkpoints, since that extra cost might be well
> > worth it, say every 50 supersteps. Here is the command I use currently;
> > how should I modify it?
> >
> > hadoop jar giraph-0.2-SNAPSHOT-jar-with-dependencies.jar \
> > org.apache.giraph.GiraphRunner \
> > org.apache.giraph.examples.ConnectedComponentsVertex \
> > --inputFormat org.apache.giraph.examples.IntIntNullIntTextInputFormat \
> > --inputPath giraph_in/adj_list.txt \
> > --outputFormat org.apache.giraph.examples.VertexWithComponentTextOutputFormat \
> > --outputPath giraph_out \
> > --combiner org.apache.giraph.examples.MinimumIntCombiner \
> > --workers 95
> >
> > Also, how do I restart from a specific checkpoint? The help for the
> > GiraphRunner class did not have instructions on this.
> >
> > Thank you!
> >
> > Vishal
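
For the manual-restart question in this thread, it may help to first see which supersteps actually have a completed checkpoint. The following is only a minimal sketch in Python, assuming that a completed checkpoint is marked by a file named `<superstep>.finalized`, as in the path mentioned above. The function names and the `100.some_other_file` entry are hypothetical, and the list of paths is assumed to come from a directory listing of the checkpoint directory. The sketch does not settle what value RESTART_SUPERSTEP expects; it only reports which superstep numbers are available.

```python
import re
from typing import Iterable, List, Optional

# Assumption: a completed checkpoint has a marker file "<superstep>.finalized",
# matching the path quoted in the thread. Other files are ignored.
_FINALIZED = re.compile(r"(?:^|/)(\d+)\.finalized$")


def finalized_supersteps(paths: Iterable[str]) -> List[int]:
    """Return the sorted superstep numbers that have a ".finalized" marker."""
    steps = set()
    for path in paths:
        match = _FINALIZED.search(path)
        if match:
            steps.add(int(match.group(1)))
    return sorted(steps)


def latest_checkpoint_at_or_before(paths: Iterable[str], target: int) -> Optional[int]:
    """Return the newest finalized superstep <= target, or None if there is none."""
    candidates = [step for step in finalized_supersteps(paths) if step <= target]
    return candidates[-1] if candidates else None


# Example listing; only the ".finalized" naming is taken from the thread,
# the "some_other_file" entry is a made-up placeholder.
listing = [
    "_bsp/_checkpoints/job_201208071105_0007/50.finalized",
    "_bsp/_checkpoints/job_201208071105_0007/100.finalized",
    "_bsp/_checkpoints/job_201208071105_0007/100.some_other_file",
]

print(finalized_supersteps(listing))
print(latest_checkpoint_at_or_before(listing, 120))
```

With checkpoints every 50 supersteps, asking for the newest checkpoint at or before a given superstep gives the point a restart could resume from, provided the checkpoint files were kept after the job finished.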
