Hi Vikesh,

It seems that you are trying to run benchmarks on Giraph. We made a lot of improvements in 1.1.0-SNAPSHOT (it is not yet released publicly in Maven, but at Facebook we run all our applications on the snapshot version). So you can pull the latest trunk from Giraph:

git clone https://git-wip-us.apache.org/repos/asf/giraph.git

and then try running some applications.

[You are correct, we store hostname-taskid mappings at the beginning of the run, so you can see such failures.]
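In case it is useful, a minimal sketch of building the snapshot once the trunk is cloned, assuming Maven 3 is installed; the hadoop_1 profile name is an assumption, so check the top-level pom.xml for the profile matching your Hadoop release:

# After the clone above:
cd giraph
# Build everything and skip the tests; the hadoop_1 profile name is an assumption,
# pick whichever profile in pom.xml matches your cluster's Hadoop version.
mvn -Phadoop_1 -DskipTests clean package
# The examples jar with dependencies should then appear under giraph-examples/target/.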
Date: Mon, 7 Apr 2014 16:27:09 -0700
From: [email protected]
To: [email protected]
Subject: [Solved] Giraph job hangs indefinitely and is eventually killed by JobTracker

Hi,

Thanks for the help! It turns out this was happening because /etc/hosts had an outdated (dynamic) IP address for the host that was being used as the master. Giraph was probably failing to communicate with the master throughout the run and getting stuck indefinitely.

Thanks,
Vikesh Khanna
Masters, Computer Science (Class of 2015)
Stanford University

From: "Vikesh Khanna" <[email protected]>
To: [email protected]
Sent: Monday, April 7, 2014 2:58:13 PM
Subject: Re: Giraph job hangs indefinitely and is eventually killed by JobTracker

@Pankaj, I am now running the ShortestPaths example on a tiny graph (5 nodes). It also hangs indefinitely in exactly the same way. This machine has 1 TB of memory and I have used -Xmx25g (25 GB) as Java options, so this should not be a memory limitation. [(free/total/max) = 1706.68M / 1979.75M / 25242.25M]

@Lukas, I am trying to run the example packaged with the Giraph installation, SimpleShortestPathsVertex. I haven't written any code myself yet; I am just trying to get this to work first. I am not getting any memory exception, and no dump file is being generated at the DumpPath.

$HADOOP_HOME/bin/hadoop jar ~/.local/bin/giraph-examples.jar org.apache.giraph.GiraphRunner -D giraph.logLevel="all" -libjars ~/.local/bin/giraph-core.jar org.apache.giraph.examples.SimpleShortestPathsVertex -vif org.apache.giraph.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat -vip /user/vikesh/input/tiny_graph.txt -of org.apache.giraph.io.formats.IdWithValueTextOutputFormat -op /user/vikesh/shortestPaths8 -ca SimpleShortestPathsVertex.source=2 -w 1

I am printing debug-level logs now, and I see these calls repeating indefinitely in both the ZooKeeper and worker tasks:

2014-04-07 14:45:32,325 DEBUG org.apache.hadoop.ipc.RPC: Call: statusUpdate 8
2014-04-07 14:45:35,326 DEBUG org.apache.hadoop.ipc.Client: IPC Client (47) connection to /127.0.0.1:45894 from job_201404071443_0001 sending #34
2014-04-07 14:45:35,327 DEBUG org.apache.hadoop.ipc.Client: IPC Client (47) connection to /127.0.0.1:45894 from job_201404071443_0001 got value #34
2014-04-07 14:45:35,327 DEBUG org.apache.hadoop.ipc.RPC: Call: ping 2
2014-04-07 14:45:38,328 DEBUG org.apache.hadoop.ipc.Client: IPC Client (47) connection to /127.0.0.1:45894 from job_201404071443_0001 sending #35
2014-04-07 14:45:38,329 DEBUG org.apache.hadoop.ipc.Client: IPC Client (47) connection to /127.0.0.1:45894 from job_201404071443_0001 got value #35
2014-04-07 14:45:38,329 DEBUG org.apache.hadoop.ipc.RPC: Call: ping 1
2014-04-07 14:45:38,910 DEBUG org.apache.giraph.zk.PredicateLock: waitMsecs: Got timed signaled of false
2014-04-07 14:45:38,910 DEBUG org.apache.giraph.zk.PredicateLock: waitMsecs: Wait for 0
2014-04-07 14:45:38,910 DEBUG org.apache.giraph.zk.PredicateLock: waitMsecs: Got timed signaled of false
2014-04-07 14:45:38,910 DEBUG org.apache.giraph.zk.PredicateLock: waitMsecs: Wait for 0

These calls go on for 10 minutes and then the job is killed by Hadoop.

Thanks,
Vikesh Khanna
Masters, Computer Science (Class of 2015)
Stanford University

From: "Lukas Nalezenec" <[email protected]>
To: [email protected]
Sent: Monday, April 7, 2014 4:13:23 AM
Subject: Re: Giraph job hangs indefinitely and is eventually killed by JobTracker

Hi,
Try making and analyzing a memory dump after the exception (JVM param -XX:+HeapDumpOnOutOfMemoryError).
What configuration (mainly the Partition class) do you use?
Lukas

On 7.4.2014 11:45, Vikesh Khanna wrote:

Hi,

Any ideas why Giraph waits indefinitely? I've been stuck on this for a long time now.

Thanks,
Vikesh Khanna
Masters, Computer Science (Class of 2015)
Stanford University
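For anyone hitting the same symptom, the stale /etc/hosts entry described above can be spotted with a quick sanity check on the master host. This is only a sketch, assuming a Linux box where getent and ip are available:

# What the resolver (and therefore Hadoop/Giraph) returns for this host
getent hosts $(hostname)
# The address actually assigned to the network interfaces right now
ip addr show
# The static /etc/hosts entry that may have gone stale after a DHCP change
grep -i "$(hostname)" /etc/hosts

If the resolved address no longer matches the live one, workers end up trying to reach a master address that is no longer valid, which is consistent with the endless PredicateLock waits in the log above.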
From: "Vikesh Khanna" <[email protected]>
To: [email protected]
Sent: Friday, April 4, 2014 6:06:51 AM
Subject: Re: Giraph job hangs indefinitely and is eventually killed by JobTracker

Hi Avery,

I tried both options. It does appear to be a GC problem, and the problem continues with the second option as well :(. I have attached the logs after enabling the first set of options and using 1 worker. It would be very helpful if you could take a look.

This machine has 1 TB of memory. We ran benchmarks of various other graph libraries on this machine and they worked fine (even with graphs 10x larger than the Giraph PageRank benchmark, 40 million nodes). I am sure Giraph would work fine as well; this should not be a resource constraint.

Thanks,
Vikesh Khanna
Masters, Computer Science (Class of 2015)
Stanford University

From: "Avery Ching" <[email protected]>
To: [email protected]
Sent: Thursday, April 3, 2014 7:26:56 PM
Subject: Re: Giraph job hangs indefinitely and is eventually killed by JobTracker

This is for a single worker, it appears. Most likely your worker went into GC and never returned. You can try with GC logging turned on by adding something like

-XX:+PrintGC -XX:+PrintGCDateStamps -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -verbose:gc

You could also try the concurrent mark/sweep collector:

-XX:+UseConcMarkSweepGC

Any chance you can use more workers and/or get more memory?

Avery

On 4/3/14, 5:46 PM, Vikesh Khanna wrote:

@Avery, thanks for the help. I checked the task logs, and it turns out there was a "GC overhead limit exceeded" exception, because of which the benchmarks wouldn't even load the vertices. I got around it by increasing the heap size (mapred.child.java.opts) in mapred-site.xml. The benchmark is loading vertices now. However, the job is still getting stuck indefinitely (and eventually killed). I have attached the small log for the map task on 1 worker. I would really appreciate it if you could help me understand the cause.

Thanks,
Vikesh Khanna
Masters, Computer Science (Class of 2015)
Stanford University

From: "Praveen kumar s.k" <[email protected]>
To: [email protected]
Sent: Thursday, April 3, 2014 4:40:07 PM
Subject: Re: Giraph job hangs indefinitely and is eventually killed by JobTracker

You have given -w 30; make sure that at least that many map tasks are configured in your cluster.

On Thu, Apr 3, 2014 at 6:24 PM, Avery Ching <[email protected]> wrote:
> My guess is that you don't get your resources. It would be very helpful to
> print the master log. You can find it while the job is running by looking at
> the Hadoop counters on the job UI page.
>
> Avery
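As a rough sketch of how the heap and GC options suggested above could be applied per job rather than by editing mapred-site.xml, they can be passed through mapred.child.java.opts on the command line. The jar path is the one from the original benchmark command quoted below, the heap size and worker count are only illustrative, and whether the -D generic option is honored in this position depends on the benchmark's argument parsing, so setting mapred.child.java.opts in mapred-site.xml (as described above) remains the safer route:

$HADOOP_HOME/bin/hadoop jar $GIRAPH_HOME/giraph-core/target/giraph-1.0.0-for-hadoop-0.20.203.0-jar-with-dependencies.jar \
  org.apache.giraph.benchmark.PageRankBenchmark \
  -D mapred.child.java.opts="-Xmx25g -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+UseConcMarkSweepGC" \
  -e 1 -s 3 -v -V 50000000 -w 1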
> On 4/3/14, 12:49 PM, Vikesh Khanna wrote:
>
> Hi,
>
> I am running the PageRank benchmark under giraph-examples from the giraph-1.0.0
> release. I am using the following command to run the job (as mentioned here):
>
> vikesh@madmax /lfs/madmax/0/vikesh/usr/local/giraph/giraph-examples/src/main/java/org/apache/giraph/examples
> $ $HADOOP_HOME/bin/hadoop jar $GIRAPH_HOME/giraph-core/target/giraph-1.0.0-for-hadoop-0.20.203.0-jar-with-dependencies.jar org.apache.giraph.benchmark.PageRankBenchmark -e 1 -s 3 -v -V 50000000 -w 30
>
> However, the job gets stuck at map 9% and is eventually killed by the
> JobTracker on reaching mapred.task.timeout (default 10 minutes). I tried
> increasing the timeout to a very large value, and the job went on for over
> 8 hours without completion. I also tried the ShortestPathsBenchmark, which
> also fails the same way.
>
> Any help is appreciated.
>
> ****** ---------------- ***********
>
> Machine details:
>
> Linux version 2.6.32-279.14.1.el6.x86_64
> ([email protected]) (gcc version 4.4.6 20120305 (Red Hat 4.4.6-4) (GCC))
> #1 SMP Tue Nov 6 23:43:09 UTC 2012
>
> Architecture:          x86_64
> CPU op-mode(s):        32-bit, 64-bit
> Byte Order:            Little Endian
> CPU(s):                64
> On-line CPU(s) list:   0-63
> Thread(s) per core:    1
> Core(s) per socket:    8
> CPU socket(s):         8
> NUMA node(s):          8
> Vendor ID:             GenuineIntel
> CPU family:            6
> Model:                 47
> Stepping:              2
> CPU MHz:               1064.000
> BogoMIPS:              5333.20
> Virtualization:        VT-x
> L1d cache:             32K
> L1i cache:             32K
> L2 cache:              256K
> L3 cache:              24576K
> NUMA node0 CPU(s):     1-8
> NUMA node1 CPU(s):     9-16
> NUMA node2 CPU(s):     17-24
> NUMA node3 CPU(s):     25-32
> NUMA node4 CPU(s):     0,33-39
> NUMA node5 CPU(s):     40-47
> NUMA node6 CPU(s):     48-55
> NUMA node7 CPU(s):     56-63
>
> I am using a pseudo-distributed Hadoop cluster on a single 64-core machine.
>
> *****-------------*******
>
> Thanks,
> Vikesh Khanna,
> Masters, Computer Science (Class of 2015)
> Stanford University
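Putting the thread's advice together: with the default master/worker split, a run with -w 30 needs roughly 31 simultaneous map tasks (30 workers plus a master task, which also hosts ZooKeeper unless an external quorum is configured), so on a pseudo-distributed TaskTracker the configured map slot count (mapred.tasktracker.map.tasks.maximum in mapred-site.xml) has to cover that, or the job waits for workers that are never scheduled and eventually hits mapred.task.timeout. A conservative sketch that stays within a small slot count; the worker count here is only illustrative:

# Run the same benchmark with a worker count that fits the available map slots
# (e.g. 4 workers + 1 master = 5 map tasks on this pseudo-distributed setup).
$HADOOP_HOME/bin/hadoop jar $GIRAPH_HOME/giraph-core/target/giraph-1.0.0-for-hadoop-0.20.203.0-jar-with-dependencies.jar \
  org.apache.giraph.benchmark.PageRankBenchmark \
  -e 1 -s 3 -v -V 50000000 -w 4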
