Hi Vikesh,

It seems that you are trying to run benchmarks on Giraph. We made a lot of improvements in 1.1.0-SNAPSHOT (it is not yet released publicly in Maven, but at Facebook we run all our applications on the snapshot version). So you can pull the latest trunk from Giraph:

git clone https://git-wip-us.apache.org/repos/asf/giraph.git

and then try running some applications.

[You are correct, we store hostname-taskid mappings at the beginning of the run, so you can see such failures.]
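In case it is useful, a minimal sketch of building the snapshot once the trunk is cloned, assuming Maven 3 is installed; the hadoop_1 profile name is an assumption, so check the top-level pom.xml for the profile matching your Hadoop release:

# After the clone above:
cd giraph
# Build everything and skip the tests; the hadoop_1 profile name is an assumption,
# pick whichever profile in pom.xml matches your cluster's Hadoop version.
mvn -Phadoop_1 -DskipTests clean package
# The examples jar with dependencies should then appear under giraph-examples/target/.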
Date: Mon, 7 Apr 2014 16:27:09 -0700
From: [email protected]
To: [email protected]
Subject: [Solved] Giraph job hangs indefinitely and is eventually killed by JobTracker

Hi,

Thanks for the help! It turns out this was happening because /etc/hosts had an outdated (dynamic) IP address for the host that was being used as the master. Giraph was probably failing to communicate with the master throughout the run and getting stuck indefinitely.

Thanks,
Vikesh Khanna
Masters, Computer Science (Class of 2015)
Stanford University

From: "Vikesh Khanna" <[email protected]>
To: [email protected]
Sent: Monday, April 7, 2014 2:58:13 PM
Subject: Re: Giraph job hangs indefinitely and is eventually killed by JobTracker

@Pankaj, I am now running the ShortestPaths example on a tiny graph (5 nodes). It also hangs indefinitely in exactly the same way. This machine has 1 TB of memory and I have used -Xmx25g (25 GB) as Java options, so this should not be a memory limitation. [(free/total/max) = 1706.68M / 1979.75M / 25242.25M]

@Lukas, I am trying to run the example packaged with the Giraph installation, SimpleShortestPathsVertex. I haven't written any code myself yet; I am just trying to get this to work first. I am not getting any memory exception, and no dump file is being generated at the DumpPath.

$HADOOP_HOME/bin/hadoop jar ~/.local/bin/giraph-examples.jar org.apache.giraph.GiraphRunner -D giraph.logLevel="all" -libjars ~/.local/bin/giraph-core.jar org.apache.giraph.examples.SimpleShortestPathsVertex -vif org.apache.giraph.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat -vip /user/vikesh/input/tiny_graph.txt -of org.apache.giraph.io.formats.IdWithValueTextOutputFormat -op /user/vikesh/shortestPaths8 -ca SimpleShortestPathsVertex.source=2 -w 1

I am printing debug-level logs now, and I see these calls repeating indefinitely in both the ZooKeeper and worker tasks:

2014-04-07 14:45:32,325 DEBUG org.apache.hadoop.ipc.RPC: Call: statusUpdate 8
2014-04-07 14:45:35,326 DEBUG org.apache.hadoop.ipc.Client: IPC Client (47) connection to /127.0.0.1:45894 from job_201404071443_0001 sending #34
2014-04-07 14:45:35,327 DEBUG org.apache.hadoop.ipc.Client: IPC Client (47) connection to /127.0.0.1:45894 from job_201404071443_0001 got value #34
2014-04-07 14:45:35,327 DEBUG org.apache.hadoop.ipc.RPC: Call: ping 2
2014-04-07 14:45:38,328 DEBUG org.apache.hadoop.ipc.Client: IPC Client (47) connection to /127.0.0.1:45894 from job_201404071443_0001 sending #35
2014-04-07 14:45:38,329 DEBUG org.apache.hadoop.ipc.Client: IPC Client (47) connection to /127.0.0.1:45894 from job_201404071443_0001 got value #35
2014-04-07 14:45:38,329 DEBUG org.apache.hadoop.ipc.RPC: Call: ping 1
2014-04-07 14:45:38,910 DEBUG org.apache.giraph.zk.PredicateLock: waitMsecs: Got timed signaled of false
2014-04-07 14:45:38,910 DEBUG org.apache.giraph.zk.PredicateLock: waitMsecs: Wait for 0
2014-04-07 14:45:38,910 DEBUG org.apache.giraph.zk.PredicateLock: waitMsecs: Got timed signaled of false
2014-04-07 14:45:38,910 DEBUG org.apache.giraph.zk.PredicateLock: waitMsecs: Wait for 0

These calls go on for 10 minutes and then the job is killed by Hadoop.

Thanks,
Vikesh Khanna
Masters, Computer Science (Class of 2015)
Stanford University

From: "Lukas Nalezenec" <[email protected]>
To: [email protected]
Sent: Monday, April 7, 2014 4:13:23 AM
Subject: Re: Giraph job hangs indefinitely and is eventually killed by JobTracker

Hi,
Try making and analyzing a memory dump after the exception (JVM param -XX:+HeapDumpOnOutOfMemoryError).
What configuration (mainly the Partition class) do you use?
Lukas

On 7.4.2014 11:45, Vikesh Khanna wrote:

Hi,

Any ideas why Giraph waits indefinitely? I've been stuck on this for a long time now.

Thanks,
Vikesh Khanna
Masters, Computer Science (Class of 2015)
Stanford University
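For anyone hitting the same symptom, the stale /etc/hosts entry described above can be spotted with a quick sanity check on the master host. This is only a sketch, assuming a Linux box where getent and ip are available:

# What the resolver (and therefore Hadoop/Giraph) returns for this host
getent hosts $(hostname)
# The address actually assigned to the network interfaces right now
ip addr show
# The static /etc/hosts entry that may have gone stale after a DHCP change
grep -i "$(hostname)" /etc/hosts

If the resolved address no longer matches the live one, workers end up trying to reach a master address that is no longer valid, which is consistent with the endless PredicateLock waits in the log above.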
From: "Vikesh Khanna" <[email protected]>
To: [email protected]
Sent: Friday, April 4, 2014 6:06:51 AM
Subject: Re: Giraph job hangs indefinitely and is eventually killed by JobTracker

Hi Avery,

I tried both options. It does appear to be a GC problem, and the problem continues with the second option as well :(. I have attached the logs after enabling the first set of options and using 1 worker. It would be very helpful if you could take a look.

This machine has 1 TB of memory. We ran benchmarks of various other graph libraries on this machine and they worked fine (even with graphs 10x larger than the Giraph PageRank benchmark, 40 million nodes). I am sure Giraph would work fine as well; this should not be a resource constraint.

Thanks,
Vikesh Khanna
Masters, Computer Science (Class of 2015)
Stanford University

From: "Avery Ching" <[email protected]>
To: [email protected]
Sent: Thursday, April 3, 2014 7:26:56 PM
Subject: Re: Giraph job hangs indefinitely and is eventually killed by JobTracker

This is for a single worker, it appears. Most likely your worker went into GC and never returned. You can try with GC logging turned on by adding something like

-XX:+PrintGC -XX:+PrintGCDateStamps -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -verbose:gc

You could also try the concurrent mark/sweep collector:

-XX:+UseConcMarkSweepGC

Any chance you can use more workers and/or get more memory?

Avery

On 4/3/14, 5:46 PM, Vikesh Khanna wrote:

@Avery, thanks for the help. I checked the task logs, and it turns out there was a "GC overhead limit exceeded" exception, because of which the benchmarks wouldn't even load the vertices. I got around it by increasing the heap size (mapred.child.java.opts) in mapred-site.xml. The benchmark is loading vertices now. However, the job is still getting stuck indefinitely (and eventually killed). I have attached the small log for the map task on 1 worker. I would really appreciate it if you could help me understand the cause.

Thanks,
Vikesh Khanna
Masters, Computer Science (Class of 2015)
Stanford University

From: "Praveen kumar s.k" <[email protected]>
To: [email protected]
Sent: Thursday, April 3, 2014 4:40:07 PM
Subject: Re: Giraph job hangs indefinitely and is eventually killed by JobTracker

You have given -w 30; make sure that at least that many map tasks are configured in your cluster.

On Thu, Apr 3, 2014 at 6:24 PM, Avery Ching <[email protected]> wrote:
> My guess is that you don't get your resources. It would be very helpful to
> print the master log. You can find it while the job is running by looking at
> the Hadoop counters on the job UI page.
>
> Avery
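As a rough sketch of how the heap and GC options suggested above could be applied per job rather than by editing mapred-site.xml, they can be passed through mapred.child.java.opts on the command line. The jar path is the one from the original benchmark command quoted below, the heap size and worker count are only illustrative, and whether the -D generic option is honored in this position depends on the benchmark's argument parsing, so setting mapred.child.java.opts in mapred-site.xml (as described above) remains the safer route:

$HADOOP_HOME/bin/hadoop jar $GIRAPH_HOME/giraph-core/target/giraph-1.0.0-for-hadoop-0.20.203.0-jar-with-dependencies.jar \
  org.apache.giraph.benchmark.PageRankBenchmark \
  -D mapred.child.java.opts="-Xmx25g -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+UseConcMarkSweepGC" \
  -e 1 -s 3 -v -V 50000000 -w 1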
> On 4/3/14, 12:49 PM, Vikesh Khanna wrote:
>
> Hi,
>
> I am running the PageRank benchmark under giraph-examples from the giraph-1.0.0
> release. I am using the following command to run the job (as mentioned here):
>
> vikesh@madmax /lfs/madmax/0/vikesh/usr/local/giraph/giraph-examples/src/main/java/org/apache/giraph/examples
> $ $HADOOP_HOME/bin/hadoop jar $GIRAPH_HOME/giraph-core/target/giraph-1.0.0-for-hadoop-0.20.203.0-jar-with-dependencies.jar org.apache.giraph.benchmark.PageRankBenchmark -e 1 -s 3 -v -V 50000000 -w 30
>
> However, the job gets stuck at map 9% and is eventually killed by the
> JobTracker on reaching mapred.task.timeout (default 10 minutes). I tried
> increasing the timeout to a very large value, and the job went on for over
> 8 hours without completion. I also tried the ShortestPathsBenchmark, which
> also fails the same way.
>
> Any help is appreciated.
>
> ****** ---------------- ***********
>
> Machine details:
>
> Linux version 2.6.32-279.14.1.el6.x86_64
> ([email protected]) (gcc version 4.4.6 20120305 (Red Hat 4.4.6-4) (GCC))
> #1 SMP Tue Nov 6 23:43:09 UTC 2012
>
> Architecture:          x86_64
> CPU op-mode(s):        32-bit, 64-bit
> Byte Order:            Little Endian
> CPU(s):                64
> On-line CPU(s) list:   0-63
> Thread(s) per core:    1
> Core(s) per socket:    8
> CPU socket(s):         8
> NUMA node(s):          8
> Vendor ID:             GenuineIntel
> CPU family:            6
> Model:                 47
> Stepping:              2
> CPU MHz:               1064.000
> BogoMIPS:              5333.20
> Virtualization:        VT-x
> L1d cache:             32K
> L1i cache:             32K
> L2 cache:              256K
> L3 cache:              24576K
> NUMA node0 CPU(s):     1-8
> NUMA node1 CPU(s):     9-16
> NUMA node2 CPU(s):     17-24
> NUMA node3 CPU(s):     25-32
> NUMA node4 CPU(s):     0,33-39
> NUMA node5 CPU(s):     40-47
> NUMA node6 CPU(s):     48-55
> NUMA node7 CPU(s):     56-63
>
> I am using a pseudo-distributed Hadoop cluster on a single 64-core machine.
>
> *****-------------*******
>
> Thanks,
> Vikesh Khanna,
> Masters, Computer Science (Class of 2015)
> Stanford University
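Putting the thread's advice together: with the default master/worker split, a run with -w 30 needs roughly 31 simultaneous map tasks (30 workers plus a master task, which also hosts ZooKeeper unless an external quorum is configured), so on a pseudo-distributed TaskTracker the configured map slot count (mapred.tasktracker.map.tasks.maximum in mapred-site.xml) has to cover that, or the job waits for workers that are never scheduled and eventually hits mapred.task.timeout. A conservative sketch that stays within a small slot count; the worker count here is only illustrative:

# Run the same benchmark with a worker count that fits the available map slots
# (e.g. 4 workers + 1 master = 5 map tasks on this pseudo-distributed setup).
$HADOOP_HOME/bin/hadoop jar $GIRAPH_HOME/giraph-core/target/giraph-1.0.0-for-hadoop-0.20.203.0-jar-with-dependencies.jar \
  org.apache.giraph.benchmark.PageRankBenchmark \
  -e 1 -s 3 -v -V 50000000 -w 4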
