Hello,

I’m trying to run a Giraph job with 180092160 vertices on an 18-node, 440G-memory cluster. I used 144 workers with the default partitioning. However, the job is always killed after superstep 0 with the following error:

2015-05-22 05:20:57,668 ERROR [org.apache.giraph.master.MasterThread] org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: Missing chosen workers [Worker(hostname=bespin05.umiacs.umd.edu, MRtaskID=2, port=30002), Worker(hostname=bespin04d.umiacs.umd.edu, MRtaskID=6, port=30006), Worker(hostname=bespin03a.umiacs.umd.edu, MRtaskID=14, port=30014)] on superstep 0
2015-05-22 05:20:57,668 INFO [org.apache.giraph.master.MasterThread] org.apache.giraph.master.MasterThread: masterThread: Coordination of superstep 0 took 77.624 seconds ended with state WORKER_FAILURE and is now on superstep 0
2015-05-22 05:20:57,673 FATAL [org.apache.giraph.master.MasterThread] org.apache.giraph.master.BspServiceMaster: getLastGoodCheckpoint: No last good checkpoints can be found, killing the job.
java.io.FileNotFoundException: File hdfs://bespinrm.umiacs.umd.edu:8020/user/hlan/_bsp/_checkpoints/job_1432262104001_0015 does not exist.
    at org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:658)
    at org.apache.hadoop.hdfs.DistributedFileSystem.access$600(DistributedFileSystem.java:104)
    at org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:716)
    at org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:712)
    at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
    at org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:712)
    at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1485)
    at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1525)
    at org.apache.giraph.utils.CheckpointingUtils.getLastCheckpointedSuperstep(CheckpointingUtils.java:106)
    at org.apache.giraph.bsp.BspService.getLastCheckpointedSuperstep(BspService.java:1196)
    at org.apache.giraph.master.BspServiceMaster.getLastGoodCheckpoint(BspServiceMaster.java:1289)
    at org.apache.giraph.master.MasterThread.run(MasterThread.java:148)

The job works fine with custom partitioning, using 144 workers with each worker partitioned into 144/72/180 by vertex ID. The default partitioning also works fine for a job with a 100051200-vertex input.

Could anyone help?

Many thanks.

Best wishes,
Hai
