The job eventually fails because tasks time out on heartbeats:

Task attempt_201309262131_0001_m_000058_0 failed to report status for 602 seconds. Killing!

Any suggestions?

Yingyi
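[The 602 seconds above matches Hadoop 1.x's default task timeout of 600000 ms (mapred.task.timeout). A minimal sketch of raising it in a driver, assuming the job is submitted through a custom Configuration; the class name is hypothetical, and note that a larger timeout only buys a slow superstep more time, it does not fix the underlying hang:]

import org.apache.hadoop.conf.Configuration;

public class TaskTimeoutConfig {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // mapred.task.timeout defaults to 600000 ms in Hadoop 1.x, which
        // matches the ~602 s reported before the TaskTracker kills the task.
        // Give long supersteps 30 minutes before the kill.
        conf.setInt("mapred.task.timeout", 30 * 60 * 1000);
        System.out.println("mapred.task.timeout = " + conf.get("mapred.task.timeout"));
    }
}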
On Thu, Sep 26, 2013 at 3:27 PM, Yingyi Bu <[email protected]> wrote:

> I checked several other slave machines.
> Basically the map task is waiting on this trace:
>
> "main" prio=10 tid=0x00000000098ed000 nid=0x7beb in Object.wait() [0x00000000413e7000]
>    java.lang.Thread.State: TIMED_WAITING (on object monitor)
>         at java.lang.Object.wait(Native Method)
>         - waiting on <0x0000000400108530> (a java.util.concurrent.ConcurrentHashMap)
>         at org.apache.giraph.comm.netty.NettyClient.waitSomeRequests(NettyClient.java:690)
>         - locked <0x0000000400108530> (a java.util.concurrent.ConcurrentHashMap)
>         at org.apache.giraph.comm.netty.NettyClient.waitAllRequests(NettyClient.java:666)
>         at org.apache.giraph.comm.netty.NettyWorkerClient.waitAllRequests(NettyWorkerClient.java:149)
>         at org.apache.giraph.worker.BspServiceWorker.waitForRequestsToFinish(BspServiceWorker.java:829)
>         at org.apache.giraph.worker.BspServiceWorker.finishSuperstep(BspServiceWorker.java:743)
>         at org.apache.giraph.graph.GraphTaskManager.completeSuperstepAndCollectStats(GraphTaskManager.java:387)
>         at org.apache.giraph.graph.GraphTaskManager.execute(GraphTaskManager.java:276)
>         at org.apache.giraph.graph.GraphMapper.run(GraphMapper.java:92)
>         at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:763)
>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:369)
>         at org.apache.hadoop.mapred.Child$4.run(Child.java:259)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:415)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
>         at org.apache.hadoop.mapred.Child.main(Child.java:253)
>
> Is it because I'm missing some setting?
>
> Yingyi
>
> On Thu, Sep 26, 2013 at 3:16 PM, Yingyi Bu <[email protected]> wrote:
>
>> I have 61 slave machines. Each slave machine has 16GB memory and 4 cores.
>>
>> I tried two configurations:
>> 1. Set mapred.map.child.java.opts to -Xmx4g and run the job with 4 workers per machine on average (-w 240, trying to use all the cores).
>> 2. Set mapred.map.child.java.opts to -Xmx16g and run the job with 1 worker per machine on average (-w 60).
>>
>> I used the combiner.
>> Here are the behaviors of the two configurations:
>> 1. Configuration 1 fails with an OutOfMemoryError (GC overhead limit exceeded) during superstep -1.
>> 2. Configuration 2 finishes superstep -1 but hangs at superstep 0 for a long time (more than 40 minutes). The status of each map task is "startSuperstep: WORKER_ONLY - Attempt=0, Superstep=0". I checked several slave machines -- the CPU is not used. Attached is the dumped stack trace.
>> Does anyone have experience with similar situations?
>>
>> Another question: how can I effectively use all the cores in the slave machines? Does each worker do multi-threading? (See the compute-threads sketch at the end of this thread.)
>> Thanks a lot!
>>
>> Yingyi
>>
>> On Thu, Sep 26, 2013 at 1:08 PM, Avery Ching <[email protected]> wrote:
>>
>>> Hopefully you are using combiners and also re-using objects. This can keep memory usage much lower. Also, implementing your own OutEdges can make it much more efficient.
>>>
>>> How much memory do you have?
>>>
>>> Avery
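[A minimal sketch of such a combiner for PageRank-style double messages, assuming Giraph 1.0's Combiner API (org.apache.giraph.combiner.Combiner); the class name is hypothetical:]

import org.apache.giraph.combiner.Combiner;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;

// Sum combiner sketch: collapses all messages headed to the same vertex
// into a single partial sum, cutting message memory and network traffic.
public class PageRankSumCombiner
    extends Combiner<LongWritable, DoubleWritable> {
  @Override
  public void combine(LongWritable vertexIndex,
      DoubleWritable originalMessage, DoubleWritable messageToCombine) {
    // Fold the new message into the accumulated one in place.
    originalMessage.set(originalMessage.get() + messageToCombine.get());
  }

  @Override
  public DoubleWritable createInitialMessage() {
    // Identity element for summation.
    return new DoubleWritable(0);
  }
}

[Since PageRank only needs the sum of incoming ranks, at most one message per destination vertex is buffered instead of one per incoming edge; the class is enabled by registering it as the job's combiner class. Avery's object-reuse point is sketched at the end of the thread.]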
>>>
>>> On 9/26/13 12:51 PM, Yingyi Bu wrote:
>>>
>>> >> I think you may have added the same vertex 2x?
>>> I ran the job over roughly half of the graph and saw this. However, the input is only part of the graph, so there may be target vertex ids that do not exist.
>>> When I ran the job over the entire graph, I don't see this, but the job fails with GC overhead limit exceeded (trying out-of-core now).
>>>
>>> Yingyi
>>>
>>> On Thu, Sep 26, 2013 at 12:05 PM, Avery Ching <[email protected]> wrote:
>>>
>>>> I think you may have added the same vertex 2x? That being said, I don't see why the code is this way. It should be fine. We should file a JIRA.
>>>>
>>>> On 9/26/13 11:02 AM, Yingyi Bu wrote:
>>>>
>>>> Thanks, Lukas!
>>>> I think the reason for this exception is that I ran the job over part of the graph, where some target ids do not exist.
>>>>
>>>> Yingyi
>>>>
>>>> On Thu, Sep 26, 2013 at 1:13 AM, Lukas Nalezenec <[email protected]> wrote:
>>>>
>>>>> Hi,
>>>>> Do you use partition balancing?
>>>>> Lukas
>>>>>
>>>>> On 09/26/13 05:16, Yingyi Bu wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> I ran a Giraph-1.0.0 PageRank job over a 60-machine cluster with 28GB of input data and got this exception:
>>>>>
>>>>> java.lang.IllegalStateException: run: Caught an unrecoverable exception resolveMutations: Already has missing vertex on this worker for 20464109
>>>>>         at org.apache.giraph.graph.GraphMapper.run(GraphMapper.java:102)
>>>>>         at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:763)
>>>>>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:369)
>>>>>         at org.apache.hadoop.mapred.Child$4.run(Child.java:259)
>>>>>         at java.security.AccessController.doPrivileged(Native Method)
>>>>>         at javax.security.auth.Subject.doAs(Subject.java:415)
>>>>>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
>>>>>         at org.apache.hadoop.mapred.Child.main(Child.java:253)
>>>>> Caused by: java.lang.IllegalStateException: resolveMutations: Already has missing vertex on this worker for 20464109
>>>>>         at org.apache.giraph.comm.netty.NettyWorkerServer.resolveMutations(NettyWorkerServer.java:184)
>>>>>         at org.apache.giraph.comm.netty.NettyWorkerServer.prepareSuperstep(NettyWorkerServer.java:152)
>>>>>         at org.apache.giraph.worker.BspServiceWorker.startSuperstep(BspServiceWorker.java:677)
>>>>>         at org.apache.giraph.graph.GraphTaskManager.execute(GraphTaskManager.java:249)
>>>>>         at org.apache.giraph.graph.GraphMapper.run(GraphMapper.java:92)
>>>>>         ... 7 more
>>>>>
>>>>> Does anyone know the possible cause of this exception?
>>>>>
>>>>> Thanks!
>>>>>
>>>>> Yingyi
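[On the multi-threading question earlier in the thread: Giraph 1.0 can run several compute threads inside a single worker, so one big worker per machine (-w 60 with -Xmx16g here) can still use all four cores. A minimal sketch, assuming the giraph.numComputeThreads property name from 1.0's GiraphConstants; the class name is hypothetical:]

import org.apache.giraph.conf.GiraphConfiguration;

public class ComputeThreadsConfig {
    public static void main(String[] args) {
        GiraphConfiguration conf = new GiraphConfiguration();
        // One 16GB worker per machine, each processing its partitions
        // with 4 threads instead of running 4 single-threaded workers.
        conf.setInt("giraph.numComputeThreads", 4);
        System.out.println("compute threads = "
            + conf.getInt("giraph.numComputeThreads", 1));
    }
}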

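[And on Avery's "re-using objects" advice: allocating a fresh Writable per message is a common source of GC-overhead failures like the one above. A sketch of a PageRank vertex that mutates its Writables in place, assuming Giraph 1.0's LongDoubleFloatDoubleVertex base class with compute(Iterable<DoubleWritable>); the class name and iteration cap are illustrative:]

import org.apache.giraph.vertex.LongDoubleFloatDoubleVertex;
import org.apache.hadoop.io.DoubleWritable;

public class ReusingPageRankVertex extends LongDoubleFloatDoubleVertex {
    // Reused for every outgoing message; Giraph serializes messages on
    // send, so mutating the same instance afterwards is safe.
    private final DoubleWritable reusableMessage = new DoubleWritable();

    @Override
    public void compute(Iterable<DoubleWritable> messages) {
        if (getSuperstep() >= 1) {
            double sum = 0;
            for (DoubleWritable message : messages) {
                sum += message.get();
            }
            // Mutate the existing vertex value in place instead of
            // calling setValue(new DoubleWritable(...)).
            getValue().set(0.15d / getTotalNumVertices() + 0.85d * sum);
        }
        if (getSuperstep() < 30) { // illustrative iteration cap
            reusableMessage.set(getValue().get() / getNumEdges());
            sendMessageToAllEdges(reusableMessage);
        } else {
            voteToHalt();
        }
    }
}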