I ran into the same issue with the out-of-core features on the 1.0.0-RC3 branch, even for trivial datasets. The job would appear to finish all supersteps, but then hang during the final output of data to HDFS. When I switched to the latest code in trunk (which required some rewriting to match the new interface), my jobs finished fine.
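In case it helps to pin down where a worker stalls, here is a minimal sketch of a script that scans a worker's task log for the last entry, the number of offloadPartition writes, and the most recent saveVertices progress line. The function name `summarize` and the regular expressions are my own assumptions, based only on the log format quoted below; they are not part of Giraph.

```python
import re
from datetime import datetime

# Timestamp at the start of each log entry, e.g. "2013-10-14 10:24:35,184".
TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}")
# Progress line written by BspServiceWorker while saving vertices.
SAVED = re.compile(
    r"saveVertices: Saved (\d+) out of (\d+) vertices, "
    r"on partition (\d+) out of (\d+)"
)
# Entry written each time a partition is spilled to local disk.
OFFLOAD = re.compile(r"offloadPartition: writing partition vertices \d+")


def summarize(log_text):
    """Summarize a worker log (hypothetical helper, not a Giraph API)."""
    # Collapse whitespace so entries wrapped across lines still match.
    flat = " ".join(log_text.split())
    stamps = [
        datetime.strptime(s, "%Y-%m-%d %H:%M:%S,%f")
        for s in TIMESTAMP.findall(flat)
    ]
    progress = [tuple(int(g) for g in m.groups()) for m in SAVED.finditer(flat)]
    return {
        "last_entry": max(stamps) if stamps else None,
        "offloads": len(OFFLOAD.findall(flat)),
        # (saved vertices, total vertices, partition index, partition count)
        "progress": progress[-1] if progress else None,
    }


# Three entries copied from the log quoted below.
SAMPLE = """
2013-10-14 10:24:20,144 INFO org.apache.giraph.worker.BspServiceWorker:
saveVertices: Starting to save 26146422 vertices
2013-10-14 10:24:35,184 INFO org.apache.giraph.worker.BspServiceWorker:
saveVertices: Saved 978047 out of 26146422 vertices, on partition 5 out of 160
2013-10-14 10:24:41,391 INFO
org.apache.giraph.partition.DiskBackedPartitionStore: offloadPartition:
writing partition vertices 482 to
/mnt/diskg/mapred/local/taskTracker/sundi133/jobcache/job_201310130212_0013/attempt_201310130212_0013_m_000060_0/work/_bsp/_partitions/job_201310130212_0013/partition-482_vertices
"""

print(summarize(SAMPLE))
```

Comparing `last_entry` with the time the task was killed shows how long the worker was silent, and `progress` shows how far saveVertices got before the hang.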

On Mon, Oct 14, 2013 at 11:13 AM, Jyotirmoy Sundi <[email protected]> wrote:

> Hi folks,
> We are successfully able to run Giraph for 1B vertices and
> around 20B edges in our cluster. This is great. But when we run it over 5B
> vertices over the actual data and around 50B edges we see some issues in
> the final step while offloading the partitions. Since the dataset is huge
> for our cluster, we are using giraph.useOutOfCoreGraph and
> giraph.useOutOfCoreMessages
> to spill the data when overloaded. With this setup all the supersteps
> finished within around 4 hours. But in the final step after reporting
> saving vertices in task status, it hangs after writing a few partitions, it
> is happening consistently in our case. I played with all the config
> params and nothing is helping out, any suggestions from you will be really
> helpful. Thanks a lot.
>
> The log snippet:
>
> 2013-10-14 10:24:20,144 INFO org.apache.giraph.worker.BspServiceWorker:
> saveVertices: Starting to save 26146422 vertices
> 2013-10-14 10:24:20,183 INFO
> org.apache.giraph.partition.DiskBackedPartitionStore: offloadPartition:
> writing partition vertices 1922 to
> /mnt/diskg/mapred/local/taskTracker/sundi133/jobcache/job_201310130212_0013/attempt_201310130212_0013_m_000060_0/work/_bsp/_partitions/job_201310130212_0013/partition-1922_vertices
> 2013-10-14 10:24:20,307 WARN org.apache.giraph.bsp.BspService: process:
> Unknown and unprocessed event
> (path=/_hadoopBsp/job_201310130212_0013/_applicationAttemptsDir/0/_superstepDir/15/_addressesAndPartitions,
> type=NodeDeleted, state=SyncConnected)
> 2013-10-14 10:24:20,431 WARN org.apache.giraph.bsp.BspService: process:
> Unknown and unprocessed event
> (path=/_hadoopBsp/job_201310130212_0013/_applicationAttemptsDir/0/_superstepDir/15/_superstepFinished,
> type=NodeDeleted, state=SyncConnected)
> 2013-10-14 10:24:20,555 INFO org.apache.giraph.worker.BspServiceWorker:
> processEvent: Job state changed, checking to see if it needs to restart
> 2013-10-14 10:24:20,640 INFO org.apache.giraph.bsp.BspService: getJobState:
> Job state already exists (/_hadoopBsp/job_201310130212_0013/_masterJobState)
> 2013-10-14 10:24:22,928 INFO
> org.apache.giraph.partition.DiskBackedPartitionStore: offloadPartition:
> writing partition vertices 13762 to
> /mnt/diskg/mapred/local/taskTracker/sundi133/jobcache/job_201310130212_0013/attempt_201310130212_0013_m_000060_0/work/_bsp/_partitions/job_201310130212_0013/partition-13762_vertices
> 2013-10-14 10:24:27,648 INFO
> org.apache.giraph.partition.DiskBackedPartitionStore: offloadPartition:
> writing partition vertices 23682 to
> /mnt/diskg/mapred/local/taskTracker/sundi133/jobcache/job_201310130212_0013/attempt_201310130212_0013_m_000060_0/work/_bsp/_partitions/job_201310130212_0013/partition-23682_vertices
> 2013-10-14 10:24:30,557 INFO
> org.apache.giraph.partition.DiskBackedPartitionStore: offloadPartition:
> writing partition vertices 14882 to
> /mnt/diskg/mapred/local/taskTracker/sundi133/jobcache/job_201310130212_0013/attempt_201310130212_0013_m_000060_0/work/_bsp/_partitions/job_201310130212_0013/partition-14882_vertices
> 2013-10-14 10:24:32,935 INFO
> org.apache.giraph.partition.DiskBackedPartitionStore: offloadPartition:
> writing partition vertices 11842 to
> /mnt/diskg/mapred/local/taskTracker/sundi133/jobcache/job_201310130212_0013/attempt_201310130212_0013_m_000060_0/work/_bsp/_partitions/job_201310130212_0013/partition-11842_vertices
> 2013-10-14 10:24:33,714 INFO
> org.apache.giraph.partition.DiskBackedPartitionStore: offloadPartition:
> writing partition vertices 962 to
> /mnt/diskg/mapred/local/taskTracker/sundi133/jobcache/job_201310130212_0013/attempt_201310130212_0013_m_000060_0/work/_bsp/_partitions/job_201310130212_0013/partition-962_vertices
> 2013-10-14 10:24:35,184 INFO org.apache.giraph.worker.BspServiceWorker:
> saveVertices: Saved 978047 out of 26146422 vertices, on partition 5 out of 160
> 2013-10-14 10:24:35,187 INFO
> org.apache.giraph.partition.DiskBackedPartitionStore: offloadPartition:
> writing partition vertices 22722 to
> /mnt/diskg/mapred/local/taskTracker/sundi133/jobcache/job_201310130212_0013/attempt_201310130212_0013_m_000060_0/work/_bsp/_partitions/job_201310130212_0013/partition-22722_vertices
> 2013-10-14 10:24:37,276 INFO
> org.apache.giraph.partition.DiskBackedPartitionStore: offloadPartition:
> writing partition vertices 21762 to
> /mnt/diskg/mapred/local/taskTracker/sundi133/jobcache/job_201310130212_0013/attempt_201310130212_0013_m_000060_0/work/_bsp/_partitions/job_201310130212_0013/partition-21762_vertices
> 2013-10-14 10:24:39,868 INFO
> org.apache.giraph.partition.DiskBackedPartitionStore: offloadPartition:
> writing partition vertices 11362 to
> /mnt/diskg/mapred/local/taskTracker/sundi133/jobcache/job_201310130212_0013/attempt_201310130212_0013_m_000060_0/work/_bsp/_partitions/job_201310130212_0013/partition-11362_vertices
> 2013-10-14 10:24:41,391 INFO
> org.apache.giraph.partition.DiskBackedPartitionStore: offloadPartition:
> writing partition vertices 482 to
> /mnt/diskg/mapred/local/taskTracker/sundi133/jobcache/job_201310130212_0013/attempt_201310130212_0013_m_000060_0/work/_bsp/_partitions/job_201310130212_0013/partition-482_vertices
>
> ------------------------------
>
> *The error show in the job failure page for each attempt*
>
> FAILED
>
> Task attempt_201310130212_0013_m_000001_0 failed to report status for 7200
> seconds. Killing!
>
> --
> Best Regards,
> Jyotirmoy Sundi
> Data Engineer,
> Admobius
>
> San Francisco, CA 94158
