Hello,

I’m trying to run a Giraph job with 180,092,160 vertices on an 18-node cluster
with 440 GB of memory. I used 144 workers with the default partitioning.
However, the job is always killed after superstep 0 with the following error:

2015-05-22 05:20:57,668 ERROR [org.apache.giraph.master.MasterThread] org.apache.giraph.master.BspServiceMaster: barrierOnWorkerList: Missing chosen workers [Worker(hostname=bespin05.umiacs.umd.edu, MRtaskID=2, port=30002), Worker(hostname=bespin04d.umiacs.umd.edu, MRtaskID=6, port=30006), Worker(hostname=bespin03a.umiacs.umd.edu, MRtaskID=14, port=30014)] on superstep 0
2015-05-22 05:20:57,668 INFO [org.apache.giraph.master.MasterThread] org.apache.giraph.master.MasterThread: masterThread: Coordination of superstep 0 took 77.624 seconds ended with state WORKER_FAILURE and is now on superstep 0
2015-05-22 05:20:57,673 FATAL [org.apache.giraph.master.MasterThread] org.apache.giraph.master.BspServiceMaster: getLastGoodCheckpoint: No last good checkpoints can be found, killing the job.
java.io.FileNotFoundException: File hdfs://bespinrm.umiacs.umd.edu:8020/user/hlan/_bsp/_checkpoints/job_1432262104001_0015 does not exist.
        at org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:658)
        at org.apache.hadoop.hdfs.DistributedFileSystem.access$600(DistributedFileSystem.java:104)
        at org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:716)
        at org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:712)
        at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
        at org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:712)
        at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1485)
        at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1525)
        at org.apache.giraph.utils.CheckpointingUtils.getLastCheckpointedSuperstep(CheckpointingUtils.java:106)
        at org.apache.giraph.bsp.BspService.getLastCheckpointedSuperstep(BspService.java:1196)
        at org.apache.giraph.master.BspServiceMaster.getLastGoodCheckpoint(BspServiceMaster.java:1289)
        at org.apache.giraph.master.MasterThread.run(MasterThread.java:148)
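
As far as I can tell, the FileNotFoundException is just a consequence of the
worker failure: I have not enabled checkpointing, so there is no checkpoint for
the master to restart from. In case it matters, here is a minimal sketch of how
I understand checkpointing would be turned on, assuming
giraph.checkpointFrequency is the right key (my understanding is that the value
is the number of supersteps between checkpoints, with 0 disabling it):

    import org.apache.hadoop.conf.Configuration;

    public class CheckpointSetup {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // Write a checkpoint every 2 supersteps so the master can restart
            // a failed job from the last good checkpoint instead of killing it.
            // I believe the default is 0, which disables checkpointing, and
            // that is why no checkpoint directory exists in my case.
            conf.setInt("giraph.checkpointFrequency", 2);
        }
    }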

This job runs fine with customized partitioning on the same 144 workers, with
each worker's data split 144/72/180 ways by vertex id (see the sketch below).
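
For reference, here is roughly how I plug in the custom partitioning. The
factory class name is my own (not a Giraph built-in), and I am assuming
giraph.graphPartitionerFactoryClass is the key Giraph reads for this:

    import org.apache.hadoop.conf.Configuration;

    public class PartitionerSetup {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // Point Giraph at my vertex-id range partitioner factory
            // (RangePartitionerFactory is my own class, not a built-in;
            // the default, as far as I know, is hash partitioning).
            conf.set("giraph.graphPartitionerFactoryClass",
                    "edu.umd.hlan.RangePartitionerFactory");
        }
    }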

Also, with the default partitioning, a job with 100,051,200 vertices as input works fine too.

Could anyone help?

Many thanks

Best wishes

Hai
