Hi Denis,

For the "java.lang.OutOfMemoryError: GC overhead limit exceeded" error, I hope the conversation at the link below can help you:
www.mail-archive.com/user@giraph.apache.org/msg02938.html
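For quick reference, these are the settings most often adjusted for that error, using the property names that appear in the launch command later in this thread (the values below are illustrative only, not tested recommendations):

```shell
# "GC overhead limit exceeded" means the JVM spent >98% of its time in GC
# while recovering <2% of the heap. Typical knobs, as a sketch:

# 1. A larger per-container heap (property taken from the launch command
#    in this thread; 48000 was the original value):
#      -ca giraph.yarn.task.heap.mb=56000

# 2. More workers, so each one holds fewer partitions in memory
#    (the original run used -w 10):
#      -w 14

# 3. For diagnosis only, the overhead check itself can be disabled via a
#    standard HotSpot flag, trading the early error for a slow crawl:
#      -XX:-UseGCOverheadLimit
```

None of these address the root cause if the heap is simply undersized for the message load, but they can help separate a GC tuning problem from a genuine leak.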
Regards,
Xenia

2016-11-08 12:25 GMT+02:00 Denis Dudinski <denis.dudin...@gmail.com>:
> Hi Hassan,
>
> Thank you for the really quick response!
>
> I changed "giraph.isStaticGraph" to false and the error disappeared.
> As expected, the iteration became slower and wrote edges to disk once
> again in superstep 1.
>
> However, the computation failed at superstep 2 with the error
> "java.lang.OutOfMemoryError: GC overhead limit exceeded". It seems to
> be unrelated to the "isStaticGraph" issue, but I think it is worth
> mentioning to see the picture as a whole.
>
> Are there any other tests/information I could run/check to help
> pinpoint the "isStaticGraph" problem?
>
> Best Regards,
> Denis Dudinski
>
>
> 2016-11-07 20:00 GMT+03:00 Hassan Eslami <hsn.esl...@gmail.com>:
> > Hi Denis,
> >
> > Thanks for bringing up the issue. In the previous conversation
> > thread, a similar problem was reported even with a simpler
> > connected-components computation, although back then we were still
> > developing other performance-critical components of OOC.
> >
> > Let's debug this issue together to make the new OOC more stable. I
> > suspect the problem is with "giraph.isStaticGraph=true" (as this is
> > only an optimization, and most of our end-to-end testing was on cases
> > where the graph could change). Let's get rid of it for now and see if
> > the problem still exists.
> >
> > Best,
> > Hassan
> >
> > On Mon, Nov 7, 2016 at 6:24 AM, Denis Dudinski <denis.dudin...@gmail.com>
> > wrote:
> >>
> >> Hello,
> >>
> >> We are trying to calculate PageRank on a huge graph, which does not
> >> fit into memory. For the calculation to succeed we tried to turn on
> >> the out-of-core feature of Giraph, but every launch we tried resulted
> >> in "com.esotericsoftware.kryo.KryoException: Buffer underflow".
> >> Each time it happens on different servers, but always right after the
> >> start of superstep 1.
> >>
> >> We are using Giraph 1.2.0 on Hadoop 2.7.3 (our production version; we
> >> can't step back to Giraph's officially supported Hadoop version and
> >> had to patch Giraph a little), deployed on 11 servers + 3 master
> >> servers (namenodes etc.) with a separate ZooKeeper cluster.
> >>
> >> Our launch command:
> >>
> >> hadoop jar /opt/giraph-1.2.0/pr-job-jar-with-dependencies.jar \
> >>   org.apache.giraph.GiraphRunner com.prototype.di.pr.PageRankComputation \
> >>   -mc com.prototype.di.pr.PageRankMasterCompute \
> >>   -yj pr-job-jar-with-dependencies.jar \
> >>   -vif com.belprime.di.pr.input.HBLongVertexInputFormat \
> >>   -vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat \
> >>   -op /user/hadoop/output/pr_test \
> >>   -w 10 \
> >>   -c com.prototype.di.pr.PRDoubleCombiner \
> >>   -wc com.prototype.di.pr.PageRankWorkerContext \
> >>   -ca hbase.rootdir=hdfs://namenode1.webmeup.com:8020/hbase \
> >>   -ca giraph.logLevel=info \
> >>   -ca hbase.mapreduce.inputtable=di_test \
> >>   -ca hbase.mapreduce.scan.columns=di:n \
> >>   -ca hbase.defaults.for.version.skip=true \
> >>   -ca hbase.table.row.textkey=false \
> >>   -ca giraph.yarn.task.heap.mb=48000 \
> >>   -ca giraph.isStaticGraph=true \
> >>   -ca giraph.SplitMasterWorker=false \
> >>   -ca giraph.oneToAllMsgSending=true \
> >>   -ca giraph.metrics.enable=true \
> >>   -ca giraph.jmap.histo.enable=true \
> >>   -ca giraph.vertexIdClass=com.prototype.di.pr.DomainPartAwareLongWritable \
> >>   -ca giraph.outgoingMessageValueClass=org.apache.hadoop.io.DoubleWritable \
> >>   -ca giraph.inputOutEdgesClass=org.apache.giraph.edge.LongNullArrayEdges \
> >>   -ca giraph.useOutOfCoreGraph=true \
> >>   -ca giraph.waitForPerWorkerRequests=true \
> >>   -ca giraph.maxNumberOfUnsentRequests=1000 \
> >>   -ca giraph.vertexInputFilterClass=com.prototype.di.pr.input.PagesFromSameDomainLimiter \
> >>   -ca giraph.useInputSplitLocality=true \
> >>   -ca hbase.mapreduce.scan.cachedrows=10000 \
> >>   -ca giraph.minPartitionsPerComputeThread=60 \
> >>   -ca giraph.graphPartitionerFactoryClass=com.prototype.di.pr.DomainAwareGraphPartitionerFactory \
> >>   -ca giraph.numInputThreads=1 \
> >>   -ca giraph.inputSplitSamplePercent=20 \
> >>   -ca giraph.pr.maxNeighborsPerVertex=50 \
> >>   -ca giraph.partitionClass=org.apache.giraph.partition.ByteArrayPartition \
> >>   -ca giraph.vertexClass=org.apache.giraph.graph.ByteValueVertex \
> >>   -ca giraph.partitionsDirectory=/disk1/_bsp/_partitions,/disk2/_bsp/_partitions
> >>
> >> Logs excerpt:
> >>
> >> 16/11/06 15:47:15 INFO pr.PageRankWorkerContext: Pre superstep in worker context
> >> 16/11/06 15:47:15 INFO graph.GraphTaskManager: execute: 60 partitions to process with 1 compute thread(s), originally 1 thread(s) on superstep 1
> >> 16/11/06 15:47:15 INFO ooc.OutOfCoreEngine: startIteration: with 60 partitions in memory and 1 active threads
> >> 16/11/06 15:47:15 INFO pr.PageRankComputation: Pre superstep1 in PR computation
> >> 16/11/06 15:47:15 INFO policy.ThresholdBasedOracle: getNextIOActions: usedMemoryFraction = 0.75
> >> 16/11/06 15:47:16 INFO ooc.OutOfCoreEngine: updateActiveThreadsFraction: updating the number of active threads to 1
> >> 16/11/06 15:47:16 INFO policy.ThresholdBasedOracle: updateRequestsCredit: updating the credit to 20
> >> 16/11/06 15:47:17 INFO graph.GraphTaskManager: installGCMonitoring: name = PS Scavenge, action = end of minor GC, cause = Allocation Failure, duration = 937ms
> >> 16/11/06 15:47:17 INFO policy.ThresholdBasedOracle: getNextIOActions: usedMemoryFraction = 0.72
> >> 16/11/06 15:47:18 INFO policy.ThresholdBasedOracle: getNextIOActions: usedMemoryFraction = 0.74
> >> 16/11/06 15:47:18 INFO ooc.OutOfCoreEngine: updateActiveThreadsFraction: updating the number of active threads to 1
> >> 16/11/06 15:47:18 INFO policy.ThresholdBasedOracle: updateRequestsCredit: updating the credit to 20
> >> 16/11/06 15:47:19 INFO policy.ThresholdBasedOracle: getNextIOActions: usedMemoryFraction = 0.76
> >> 16/11/06 15:47:19 INFO ooc.OutOfCoreEngine: doneProcessingPartition: processing partition 234 is done!
> >> 16/11/06 15:47:20 INFO policy.ThresholdBasedOracle: getNextIOActions: usedMemoryFraction = 0.79
> >> 16/11/06 15:47:21 INFO ooc.OutOfCoreEngine: updateActiveThreadsFraction: updating the number of active threads to 1
> >> 16/11/06 15:47:21 INFO policy.ThresholdBasedOracle: updateRequestsCredit: updating the credit to 18
> >> 16/11/06 15:47:21 INFO handler.RequestDecoder: decode: Server window metrics MBytes/sec received = 1.0994, MBytesReceived = 33.0459, ave received req MBytes = 0.0138, secs waited = 30.058
> >> 16/11/06 15:47:21 INFO policy.ThresholdBasedOracle: getNextIOActions: usedMemoryFraction = 0.82
> >> 16/11/06 15:47:21 INFO ooc.OutOfCoreIOCallable: call: thread 0's next IO command is: StorePartitionIOCommand: (partitionId = 234)
> >> 16/11/06 15:47:21 INFO ooc.OutOfCoreIOCallable: call: thread 0's command StorePartitionIOCommand: (partitionId = 234) completed: bytes= 64419740, duration=351, bandwidth=175.03, bandwidth (excluding GC time)=175.03
> >> 16/11/06 15:47:21 INFO policy.ThresholdBasedOracle: getNextIOActions: usedMemoryFraction = 0.83
> >> 16/11/06 15:47:21 INFO ooc.OutOfCoreIOCallable: call: thread 0's next IO command is: StoreIncomingMessageIOCommand: (partitionId = 234)
> >> 16/11/06 15:47:21 INFO ooc.OutOfCoreIOCallable: call: thread 0's command StoreIncomingMessageIOCommand: (partitionId = 234) completed: bytes= 0, duration=0, bandwidth=NaN, bandwidth (excluding GC time)=NaN
> >> 16/11/06 15:47:21 INFO policy.ThresholdBasedOracle: getNextIOActions: usedMemoryFraction = 0.83
> >> 16/11/06 15:47:40 INFO graph.GraphTaskManager: installGCMonitoring: name = PS Scavenge, action = end of minor GC, cause = Allocation Failure, duration = 3107ms
> >> 16/11/06 15:47:40 INFO graph.GraphTaskManager: installGCMonitoring: name = PS MarkSweep, action = end of major GC, cause = Ergonomics, duration = 15064ms
> >> 16/11/06 15:47:40 INFO ooc.OutOfCoreEngine: updateActiveThreadsFraction: updating the number of active threads to 1
> >> 16/11/06 15:47:40 INFO policy.ThresholdBasedOracle: updateRequestsCredit: updating the credit to 20
> >> 16/11/06 15:47:40 INFO policy.ThresholdBasedOracle: getNextIOActions: usedMemoryFraction = 0.71
> >> 16/11/06 15:47:40 INFO ooc.OutOfCoreIOCallable: call: thread 0's next IO command is: LoadPartitionIOCommand: (partitionId = 234, superstep = 2)
> >> JMap histo dump at Sun Nov 06 15:47:41 CET 2016
> >> 16/11/06 15:47:41 INFO ooc.OutOfCoreEngine: doneProcessingPartition: processing partition 364 is done!
> >> 16/11/06 15:47:48 INFO ooc.OutOfCoreEngine: updateActiveThreadsFraction: updating the number of active threads to 1
> >> 16/11/06 15:47:48 INFO policy.ThresholdBasedOracle: updateRequestsCredit: updating the credit to 20
> >> --
> >> -- num     #instances         #bytes  class name
> >> -- ----------------------------------------------
> >> --  1:     224004229    10752202992  java.util.concurrent.ConcurrentHashMap$Node
> >> --  2:      19751666     6645730528  [B
> >> --  3:     222135985     5331263640  com.belprime.di.pr.DomainPartAwareLongWritable
> >> --  4:     214686483     5152475592  org.apache.hadoop.io.DoubleWritable
> >> --  5:           353     4357261784  [Ljava.util.concurrent.ConcurrentHashMap$Node;
> >> --  6:        486266      204484688  [I
> >> --  7:       6017652      192564864  org.apache.giraph.utils.UnsafeByteArrayOutputStream
> >> --  8:       3986203      159448120  org.apache.giraph.utils.UnsafeByteArrayInputStream
> >> --  9:       2064182      148621104  org.apache.giraph.graph.ByteValueVertex
> >> -- 10:       2064182       82567280  org.apache.giraph.edge.ByteArrayEdges
> >> -- 11:       1886875       45285000  java.lang.Integer
> >> -- 12:        349409       30747992  java.util.concurrent.ConcurrentHashMap$TreeNode
> >> -- 13:        916970       29343040  java.util.Collections$1
> >> -- 14:        916971       22007304  java.util.Collections$SingletonSet
> >> -- 15:         47270        3781600  java.util.concurrent.ConcurrentHashMap$TreeBin
> >> -- 16:         26201        2590912  [C
> >> -- 17:         34175        1367000  org.apache.giraph.edge.ByteArrayEdges$ByteArrayEdgeIterator
> >> -- 18:          6143        1067704  java.lang.Class
> >> -- 19:         25953         830496  java.lang.String
> >> -- 20:         34175         820200  org.apache.giraph.edge.EdgeNoValue
> >> -- 21:          4488         703400  [Ljava.lang.Object;
> >> -- 22:            70         395424  [Ljava.nio.channels.SelectionKey;
> >> -- 23:          2052         328320  java.lang.reflect.Method
> >> -- 24:          6600         316800  org.apache.giraph.utils.ByteArrayVertexIdMessages
> >> -- 25:          5781         277488  java.util.HashMap$Node
> >> -- 26:          5651         271248  java.util.Hashtable$Entry
> >> -- 27:          6604         211328  org.apache.giraph.factories.DefaultMessageValueFactory
> >> 16/11/06 15:47:49 ERROR utils.LogStacktraceCallable: Execution of callable failed
> >> java.lang.RuntimeException: call: execution of IO command LoadPartitionIOCommand: (partitionId = 234, superstep = 2) failed!
> >>     at org.apache.giraph.ooc.OutOfCoreIOCallable.call(OutOfCoreIOCallable.java:115)
> >>     at org.apache.giraph.ooc.OutOfCoreIOCallable.call(OutOfCoreIOCallable.java:36)
> >>     at org.apache.giraph.utils.LogStacktraceCallable.call(LogStacktraceCallable.java:67)
> >>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> >>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> >>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> >>     at java.lang.Thread.run(Thread.java:745)
> >> Caused by: com.esotericsoftware.kryo.KryoException: Buffer underflow.
> >>     at com.esotericsoftware.kryo.io.Input.require(Input.java:199)
> >>     at com.esotericsoftware.kryo.io.UnsafeInput.readLong(UnsafeInput.java:112)
> >>     at com.esotericsoftware.kryo.io.KryoDataInput.readLong(KryoDataInput.java:91)
> >>     at org.apache.hadoop.io.LongWritable.readFields(LongWritable.java:47)
> >>     at org.apache.giraph.ooc.data.DiskBackedPartitionStore.readOutEdges(DiskBackedPartitionStore.java:245)
> >>     at org.apache.giraph.ooc.data.DiskBackedPartitionStore.loadInMemoryPartitionData(DiskBackedPartitionStore.java:278)
> >>     at org.apache.giraph.ooc.data.DiskBackedDataStore.loadPartitionDataProxy(DiskBackedDataStore.java:234)
> >>     at org.apache.giraph.ooc.data.DiskBackedPartitionStore.loadPartitionData(DiskBackedPartitionStore.java:311)
> >>     at org.apache.giraph.ooc.command.LoadPartitionIOCommand.execute(LoadPartitionIOCommand.java:66)
> >>     at org.apache.giraph.ooc.OutOfCoreIOCallable.call(OutOfCoreIOCallable.java:99)
> >>     ... 6 more
> >> 16/11/06 15:47:49 FATAL graph.GraphTaskManager: uncaughtException: OverrideExceptionHandler on thread ooc-io-0, msg = call: execution of IO command LoadPartitionIOCommand: (partitionId = 234, superstep = 2) failed!, exiting...
> >> java.lang.RuntimeException: call: execution of IO command LoadPartitionIOCommand: (partitionId = 234, superstep = 2) failed!
> >>     at org.apache.giraph.ooc.OutOfCoreIOCallable.call(OutOfCoreIOCallable.java:115)
> >>     at org.apache.giraph.ooc.OutOfCoreIOCallable.call(OutOfCoreIOCallable.java:36)
> >>     at org.apache.giraph.utils.LogStacktraceCallable.call(LogStacktraceCallable.java:67)
> >>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> >>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> >>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> >>     at java.lang.Thread.run(Thread.java:745)
> >> Caused by: com.esotericsoftware.kryo.KryoException: Buffer underflow.
> >>     at com.esotericsoftware.kryo.io.Input.require(Input.java:199)
> >>     at com.esotericsoftware.kryo.io.UnsafeInput.readLong(UnsafeInput.java:112)
> >>     at com.esotericsoftware.kryo.io.KryoDataInput.readLong(KryoDataInput.java:91)
> >>     at org.apache.hadoop.io.LongWritable.readFields(LongWritable.java:47)
> >>     at org.apache.giraph.ooc.data.DiskBackedPartitionStore.readOutEdges(DiskBackedPartitionStore.java:245)
> >>     at org.apache.giraph.ooc.data.DiskBackedPartitionStore.loadInMemoryPartitionData(DiskBackedPartitionStore.java:278)
> >>     at org.apache.giraph.ooc.data.DiskBackedDataStore.loadPartitionDataProxy(DiskBackedDataStore.java:234)
> >>     at org.apache.giraph.ooc.data.DiskBackedPartitionStore.loadPartitionData(DiskBackedPartitionStore.java:311)
> >>     at org.apache.giraph.ooc.command.LoadPartitionIOCommand.execute(LoadPartitionIOCommand.java:66)
> >>     at org.apache.giraph.ooc.OutOfCoreIOCallable.call(OutOfCoreIOCallable.java:99)
> >>     ... 6 more
> >> 16/11/06 15:47:49 ERROR worker.BspServiceWorker: unregisterHealth: Got failure, unregistering health on /_hadoopBsp/giraph_yarn_application_1478342673283_0009/_applicationAttemptsDir/0/_superstepDir/1/_workerHealthyDir/datanode6.webmeup.com_5 on superstep 1
> >>
> >> We looked into one thread,
> >> http://mail-archives.apache.org/mod_mbox/giraph-user/201607.mbox/%3CCAECWHa3MOqubf8--wMVhzqOYwwZ0ZuP6_iiqTE_xT%3DoLJAAPQw%40mail.gmail.com%3E
> >> but it is rather old, and at that time the answer was "do not use it
> >> yet" (see reply
> >> http://mail-archives.apache.org/mod_mbox/giraph-user/201607.mbox/%3CCAH1LQfdbpbZuaKsu1b7TCwOzGMxi_vf9vYi6Xg_Bp8o43H7u%2Bw%40mail.gmail.com%3E).
> >> Does that still hold today? We would like to use the new advanced
> >> adaptive OOC approach if possible...
> >>
> >> Thank you in advance, any help or hint would be really appreciated.
> >>
> >> Best Regards,
> >> Denis Dudinski
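A closing note on the "Buffer underflow": Kryo raises it when a reader asks for more bytes than the serialized partition actually contains, which fits the observation earlier in this thread that the error disappears once "giraph.isStaticGraph=false" forces edges to be written to disk again. A minimal JDK-only sketch of that failure mode follows; the class and method names are illustrative, not Giraph code:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;

public class UnderflowSketch {

    // Serialize `written` longs, then try to read `expected` longs back.
    // Returns how many longs were actually read before the stream ran dry.
    static int readBack(int written, int expected) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        for (int i = 0; i < written; i++) {
            out.writeLong(42L); // e.g. edge data written once, in an early superstep
        }
        out.close();

        DataInputStream in =
                new DataInputStream(new ByteArrayInputStream(buf.toByteArray()));
        int read = 0;
        try {
            for (int i = 0; i < expected; i++) {
                in.readLong(); // reader trusts a count larger than what was stored
                read++;
            }
        } catch (EOFException e) {
            // JDK analogue of Kryo's "Buffer underflow." in the stack traces above
        }
        return read;
    }

    public static void main(String[] args) throws IOException {
        // Writer stored 1 long, reader expects 2: the second read underflows.
        System.out.println(readBack(1, 2)); // prints 1
    }
}
```

If the store side skips re-writing edges while the load side still expects them, the reader ends up in exactly this situation, so checking what `StorePartitionIOCommand` wrote versus what `LoadPartitionIOCommand` expects for partition 234 may be a useful next step.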