Hi Kristen, thanks for posting this. During the port to YARN I ran into some race problems with the output commit sequence. The YARN implementation has to handle this a bit differently than the non-YARN one, and although we got it figured out at the time, I haven't really looked at it in many months, and non-YARN Giraph has evolved quickly since then. It wouldn't shock me if there is trouble here; if I recall correctly, the solution was a bit delicate.
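
The invariant that matters is roughly: every worker must finish saveVertices (and close its part files) before the master commits the final output or anything starts tearing down the _temporary directory. Just to make that concrete, here is a rough barrier sketch against the plain ZooKeeper API - the class name, znode path, and method names are invented for illustration, and this is not Giraph's actual coordination code, just the shape of the ordering a fix would need to preserve:

import java.util.List;
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

/**
 * Sketch only: an "all workers finished saving vertices" barrier on plain
 * ZooKeeper. The path below is invented and its parent znodes are assumed
 * to exist already. This is NOT Giraph's real coordination code.
 */
public class SaveVerticesBarrierSketch {
  private static final String BARRIER = "/_sketch/_verticesSavedDir";

  /** Each worker calls this once its saveVertices call has returned. */
  public static void markWorkerDone(ZooKeeper zk, int workerId)
      throws KeeperException, InterruptedException {
    zk.create(BARRIER + "/worker_" + workerId, new byte[0],
        ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
  }

  /** The master blocks here before committing or cleaning up any output. */
  public static void awaitAllWorkersDone(ZooKeeper zk, int workerCount)
      throws KeeperException, InterruptedException {
    while (true) {
      final CountDownLatch changed = new CountDownLatch(1);
      Watcher watcher = new Watcher() {
        @Override
        public void process(WatchedEvent event) {
          changed.countDown();
        }
      };
      // Re-read the children and re-register the watch on every pass.
      List<String> done = zk.getChildren(BARRIER, watcher);
      if (done.size() >= workerCount) {
        return; // every worker has reported in; safe to commit output now
      }
      changed.await(); // woken whenever the children of BARRIER change
    }
  }
}

If the YARN path can reach the commit/cleanup step before a barrier like that is satisfied, it would line up with the LeaseExpiredException on the part-m-* files in your worker log.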
If you have some ideas for a patch I'd be happy to review them. I'm pretty strapped for time right now, but if you post a ticket to the Giraph JIRA and no one else attempts a patch, I'm sure either Mohammed or I will take a swipe at it eventually. Thanks!

Eli

On Mon, Jan 20, 2014 at 9:01 AM, Kristen Hardwick <[email protected]> wrote:

> Sorry to bug everyone again, but does anyone have any ideas on this? Please let me know if I'm leaving out any crucial information that could get me some help.
>
> Thanks!
> Kristen
>
> On Mon, Jan 13, 2014 at 5:48 PM, Kristen Hardwick <[email protected]> wrote:
>
>> Hi all,
>>
>> I had a very productive day today getting this stuff figured out. Unfortunately, it appears that I've stumbled onto a possible race condition during the cleanup step of the application.
>>
>> I put some information here that explains why I think it is a race condition: http://pastebin.com/Qswb98dq Basically, I tried the exact same command twice, making no other changes - the first time it failed and the second time it succeeded.
>>
>> This makes me think that the LeaseExpiredException/DataStreamerException is caused because the files have been cleaned up just before they are needed - possibly inside the BspServiceMaster, but I am not at all sure about that.
>>
>> Is anyone already aware of this? Should I log it as a bug? I do have access to (DEBUG) logs of both the successful and failed attempts if anyone wants to see them.
>>
>> Thanks,
>> Kristen Hardwick
>>
>> On Mon, Jan 13, 2014 at 11:03 AM, Kristen Hardwick <[email protected]> wrote:
>>
>>> Hi Avery (or anyone else who knows),
>>>
>>> Could you please give me some details that would help me find the past threads that might address this issue? I searched Google with various combinations of "giraph datastreamer exception yarn lease expired zookeeper" and didn't really come up with anything that seemed relevant.
>>>
>>> Is it possible that it's just a memory issue on my end? I'm running inside a VM - a single-node cluster with 8 GB of memory allocated to it. Could that have anything to do with it? Right now I'm investigating the code to try to lower the amount of memory allocated to the containers.
>>>
>>> Thanks,
>>> Kristen
>>>
>>> On Fri, Jan 10, 2014 at 8:45 PM, Avery Ching <[email protected]> wrote:
>>>
>>>> This looks more like the Zookeeper/YARN issues mentioned in the past. Unfortunately, I do not have a YARN instance to test this with. Does anyone else have any insights here?
>>>>
>>>> On 1/10/14 1:48 PM, Kristen Hardwick wrote:
>>>>
>>>> Hi all, I'm requesting help again! I'm trying to get this SimpleShortestPathsComputation example working, but I'm stuck again. Now the job begins to run and seems to work until the final step (it performs 3 supersteps), but the overall job is failing.
>>>>
>>>> In the master, among other things, I see:
>>>>
>>>> ...
>>>> 14/01/10 15:04:17 INFO master.MasterThread: setup: Took 0.87 seconds.
>>>> 14/01/10 15:04:17 INFO master.MasterThread: input superstep: Took 0.708 seconds.
>>>> 14/01/10 15:04:17 INFO master.MasterThread: superstep 0: Took 0.158 seconds.
>>>> 14/01/10 15:04:17 INFO master.MasterThread: superstep 1: Took 0.344 seconds.
>>>> 14/01/10 15:04:17 INFO master.MasterThread: superstep 2: Took 0.064 seconds.
>>>> 14/01/10 15:04:17 INFO master.MasterThread: shutdown: Took 0.162 seconds.
>>>> 14/01/10 15:04:17 INFO master.MasterThread: total: Took 2.31 seconds.
>>>> 14/01/10 15:04:17 INFO yarn.GiraphYarnTask: Master is ready to commit final job output data.
>>>> 14/01/10 15:04:18 INFO yarn.GiraphYarnTask: Master has committed the final job output data.
>>>> ...
>>>>
>>>> To me, that looks promising - like the job was successful. However, in the WORKER_ONLY containers, I see these things:
>>>>
>>>> ...
>>>> 14/01/10 15:04:17 INFO graph.GraphTaskManager: cleanup: Starting for WORKER_ONLY
>>>> 14/01/10 15:04:17 WARN bsp.BspService: process: Unknown and unprocessed event (path=/_hadoopBsp/giraph_yarn_application_1389300168420_0024/_applicationAttemptsDir/0/_superstepDir/1/_addressesAndPartitions, type=NodeDeleted, state=SyncConnected)
>>>> 14/01/10 15:04:17 INFO worker.BspServiceWorker: processEvent : partitionExchangeChildrenChanged (at least one worker is done sending partitions)
>>>> 14/01/10 15:04:17 WARN bsp.BspService: process: Unknown and unprocessed event (path=/_hadoopBsp/giraph_yarn_application_1389300168420_0024/_applicationAttemptsDir/0/_superstepDir/1/_superstepFinished, type=NodeDeleted, state=SyncConnected)
>>>> 14/01/10 15:04:17 INFO netty.NettyClient: stop: reached wait threshold, 1 connections closed, releasing NettyClient.bootstrap resources now.
>>>> 14/01/10 15:04:17 INFO worker.BspServiceWorker: processEvent: Job state changed, checking to see if it needs to restart
>>>> 14/01/10 15:04:17 INFO bsp.BspService: getJobState: Job state already exists (/_hadoopBsp/giraph_yarn_application_1389300168420_0024/_masterJobState)
>>>> 14/01/10 15:04:17 INFO yarn.GiraphYarnTask: [STATUS: task-1] saveVertices: Starting to save 2 vertices using 1 threads
>>>> 14/01/10 15:04:17 INFO worker.BspServiceWorker: saveVertices: Starting to save 2 vertices using 1 threads
>>>> 14/01/10 15:04:17 INFO worker.BspServiceWorker: processEvent: Job state changed, checking to see if it needs to restart
>>>> 14/01/10 15:04:17 INFO bsp.BspService: getJobState: Job state already exists (/_hadoopBsp/giraph_yarn_application_1389300168420_0024/_masterJobState)
>>>> 14/01/10 15:04:17 INFO bsp.BspService: getJobState: Job state path is empty!
>>>> - /_hadoopBsp/giraph_yarn_application_1389300168420_0024/_masterJobState
>>>> 14/01/10 15:04:17 ERROR zookeeper.ClientCnxn: Error while calling watcher
>>>> java.lang.NullPointerException
>>>>     at java.io.StringReader.<init>(StringReader.java:50)
>>>>     at org.json.JSONTokener.<init>(JSONTokener.java:66)
>>>>     at org.json.JSONObject.<init>(JSONObject.java:402)
>>>>     at org.apache.giraph.bsp.BspService.getJobState(BspService.java:716)
>>>>     at org.apache.giraph.worker.BspServiceWorker.processEvent(BspServiceWorker.java:1563)
>>>>     at org.apache.giraph.bsp.BspService.process(BspService.java:1095)
>>>>     at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:519)
>>>>     at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:495)
>>>> 14/01/10 15:04:17 WARN bsp.BspService: process: Unknown and unprocessed event (path=/_hadoopBsp/giraph_yarn_application_1389300168420_0024/_vertexInputSplitsAllReady, type=NodeDeleted, state=SyncConnected)
>>>> 14/01/10 15:04:17 WARN bsp.BspService: process: Unknown and unprocessed event (path=/_hadoopBsp/giraph_yarn_application_1389300168420_0024/_applicationAttemptsDir/0/_superstepDir/2/_addressesAndPartitions, type=NodeDeleted, state=SyncConnected)
>>>> 14/01/10 15:04:17 INFO worker.BspServiceWorker: processEvent : partitionExchangeChildrenChanged (at least one worker is done sending partitions)
>>>> 14/01/10 15:04:17 WARN bsp.BspService: process: Unknown and unprocessed event (path=/_hadoopBsp/giraph_yarn_application_1389300168420_0024/_applicationAttemptsDir/0/_superstepDir/2/_superstepFinished, type=NodeDeleted, state=SyncConnected)
>>>> ...
>>>> 14/01/10 15:04:17 WARN hdfs.DFSClient: DataStreamer Exception
>>>> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /user/spry/Shortest/_temporary/1/_temporary/attempt_1389300168420_0024_m_000001_1/part-m-00001: File does not exist. Holder DFSClient_NONMAPREDUCE_-643344145_1 does not have any open files.
>>>>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2755)
>>>>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.analyzeFileState(FSNamesystem.java:2567)
>>>>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2480)
>>>>     at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:555)
>>>>     at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:387)
>>>> ...
>>>>
>>>> I apologize for the wall of error message, but I tried to leave in at least some of the parts that might be useful. I put the entire YARN log here: http://tny.cz/af229738
>>>>
>>>> Has anyone ever seen this before?
>>>> This is the command I'm using to run:
>>>>
>>>> hadoop jar giraph-core/target/giraph-1.1.0-SNAPSHOT-for-hadoop-2.2.0-jar-with-dependencies.jar org.apache.giraph.GiraphRunner -Dgiraph.SplitMasterWorker=false -Dgiraph.zkList="localhost:2181" -Dgiraph.zkSessionMsecTimeout=600000 -Dgiraph.useInputSplitLocality=false org.apache.giraph.examples.SimpleShortestPathsComputation -vif org.apache.giraph.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat -vip /user/spry/input -vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat -op /user/spry/Shortest -w 1
>>>>
>>>> My setup is still the same as in the other email, if you saw it. I compiled Giraph with this command, and everything built successfully except "Apache Giraph Distribution", which it doesn't seem like I need:
>>>>
>>>> mvn -Phadoop_yarn -Dhadoop.version=2.2.0 -DskipTests clean package
>>>>
>>>> I am running with the following components:
>>>>
>>>> Single node cluster
>>>> Giraph 1.1
>>>> Hadoop 2.2.0 (Hortonworks)
>>>> Java 1.7.0_45
>>>>
>>>> Thanks in advance,
>>>> -Kristen Hardwick
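
One more note on the NullPointerException buried in the worker log above: the stack trace shows getJobState handing a null string to new JSONObject(...), which is what you would expect if the _masterJobState znode has already been emptied or deleted during shutdown by the time the worker's watcher fires. Purely as an illustration of the kind of defensive check a patch might add - the class and method names below are mine, not the actual Giraph source:

import org.json.JSONException;
import org.json.JSONObject;

/**
 * Sketch only: tolerate a job-state znode that has already been emptied or
 * deleted during shutdown instead of passing null into JSONObject.
 */
public class JobStateGuardSketch {
  /**
   * @param rawJobState raw data read from the _masterJobState znode; may be
   *                    null or empty if the master has already cleaned up
   * @return the parsed state, or null when there is nothing left to parse
   */
  public static JSONObject parseJobStateSafely(String rawJobState) {
    if (rawJobState == null || rawJobState.isEmpty()) {
      // Shutdown race: the znode is gone, so there is no state to act on.
      return null;
    }
    try {
      return new JSONObject(rawJobState);
    } catch (JSONException e) {
      // Partially written or concurrently deleted data; treat as no state.
      return null;
    }
  }
}

Whether a guard like that is the real fix or just papers over the ordering problem is exactly the sort of thing a JIRA ticket would be good for.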
