Hi Denis,

For the "java.lang.OutOfMemoryError: GC overhead limit exceeded" error, I
hope the conversation at the link below can help you:
 www.mail-archive.com/user@giraph.apache.org/msg02938.html
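
In general, that error means the JVM was spending nearly all of its time
in garbage collection while reclaiming very little memory. The usual
first step is to give each worker more heap, e.g. (the value below is
only an illustration and would need tuning for your cluster):

  -ca giraph.yarn.task.heap.mb=56000 \

The HotSpot flag -XX:-UseGCOverheadLimit would only suppress this
particular error message; the job would then typically die later with a
plain OutOfMemoryError, so more heap (or fewer partitions held in
memory) is the real fix.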

Regards,
Xenia

2016-11-08 12:25 GMT+02:00 Denis Dudinski <denis.dudin...@gmail.com>:

> Hi Hassan,
>
> Thank you for the really quick response!
>
> I changed "giraph.isStaticGraph" to false and the error disappeared.
> As expected, the iteration became slower and the edges were written to
> disk once again in superstep 1.
>
> However, the computation failed at superstep 2 with the error
> "java.lang.OutOfMemoryError: GC overhead limit exceeded". It seems to
> be unrelated to the "isStaticGraph" issue, but I think it is worth
> mentioning to see the picture as a whole.
>
> Are there any other tests I can run, or information I can check, to
> help pinpoint the "isStaticGraph" problem?
>
> Best Regards,
> Denis Dudinski
>
>
> 2016-11-07 20:00 GMT+03:00 Hassan Eslami <hsn.esl...@gmail.com>:
> > Hi Denis,
> >
> > Thanks for bringing up the issue. In the earlier conversation thread,
> > a similar problem was reported even with a simpler example, a
> > connected-components calculation. Back then, though, we were still
> > developing other performance-critical components of OOC.
> >
> > Let's debug this issue together to make the new OOC more stable. I
> > suspect the problem is with "giraph.isStaticGraph=true" (as this is
> > only an optimization, and most of our end-to-end testing was on cases
> > where the graph could change). Let's get rid of it for now and see if
> > the problem still exists.
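> >
> > Concretely, that means changing one line in your launch command:
> >
> >   -ca giraph.isStaticGraph=false \
> >
> > (or dropping the option altogether; it defaults to false).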
> >
> > Best,
> > Hassan
> >
> > On Mon, Nov 7, 2016 at 6:24 AM, Denis Dudinski <denis.dudin...@gmail.com> wrote:
> >>
> >> Hello,
> >>
> >> We are trying to calculate PageRank on a huge graph which does not
> >> fit into memory. For the calculation to succeed we tried to turn on
> >> the out-of-core feature of Giraph, but every launch we tried resulted
> >> in com.esotericsoftware.kryo.KryoException: Buffer underflow.
> >> Each time it happens on a different server, but always right after
> >> the start of superstep 1.
> >>
> >> We are using Giraph 1.2.0 on Hadoop 2.7.3 (our production version; we
> >> can't step back to Giraph's officially supported Hadoop version and
> >> had to patch Giraph a little), deployed on 11 servers plus 3 master
> >> servers (namenodes etc.), with a separate ZooKeeper cluster.
> >>
> >> Our launch command:
> >>
> >> hadoop jar /opt/giraph-1.2.0/pr-job-jar-with-dependencies.jar \
> >> org.apache.giraph.GiraphRunner com.prototype.di.pr.PageRankComputation \
> >> -mc com.prototype.di.pr.PageRankMasterCompute \
> >> -yj pr-job-jar-with-dependencies.jar \
> >> -vif com.belprime.di.pr.input.HBLongVertexInputFormat \
> >> -vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat \
> >> -op /user/hadoop/output/pr_test \
> >> -w 10 \
> >> -c com.prototype.di.pr.PRDoubleCombiner \
> >> -wc com.prototype.di.pr.PageRankWorkerContext \
> >> -ca hbase.rootdir=hdfs://namenode1.webmeup.com:8020/hbase \
> >> -ca giraph.logLevel=info \
> >> -ca hbase.mapreduce.inputtable=di_test \
> >> -ca hbase.mapreduce.scan.columns=di:n \
> >> -ca hbase.defaults.for.version.skip=true \
> >> -ca hbase.table.row.textkey=false \
> >> -ca giraph.yarn.task.heap.mb=48000 \
> >> -ca giraph.isStaticGraph=true \
> >> -ca giraph.SplitMasterWorker=false \
> >> -ca giraph.oneToAllMsgSending=true \
> >> -ca giraph.metrics.enable=true \
> >> -ca giraph.jmap.histo.enable=true \
> >> -ca giraph.vertexIdClass=com.prototype.di.pr.DomainPartAwareLongWritable \
> >> -ca giraph.outgoingMessageValueClass=org.apache.hadoop.io.DoubleWritable \
> >> -ca giraph.inputOutEdgesClass=org.apache.giraph.edge.LongNullArrayEdges \
> >> -ca giraph.useOutOfCoreGraph=true \
> >> -ca giraph.waitForPerWorkerRequests=true \
> >> -ca giraph.maxNumberOfUnsentRequests=1000 \
> >> -ca giraph.vertexInputFilterClass=com.prototype.di.pr.input.PagesFromSameDomainLimiter \
> >> -ca giraph.useInputSplitLocality=true \
> >> -ca hbase.mapreduce.scan.cachedrows=10000 \
> >> -ca giraph.minPartitionsPerComputeThread=60 \
> >> -ca giraph.graphPartitionerFactoryClass=com.prototype.di.pr.DomainAwareGraphPartitionerFactory \
> >> -ca giraph.numInputThreads=1 \
> >> -ca giraph.inputSplitSamplePercent=20 \
> >> -ca giraph.pr.maxNeighborsPerVertex=50 \
> >> -ca giraph.partitionClass=org.apache.giraph.partition.ByteArrayPartition \
> >> -ca giraph.vertexClass=org.apache.giraph.graph.ByteValueVertex \
> >> -ca giraph.partitionsDirectory=/disk1/_bsp/_partitions,/disk2/_bsp/_partitions
> >>
> >> Log excerpt:
> >>
> >> 16/11/06 15:47:15 INFO pr.PageRankWorkerContext: Pre superstep in worker context
> >> 16/11/06 15:47:15 INFO graph.GraphTaskManager: execute: 60 partitions to process with 1 compute thread(s), originally 1 thread(s) on superstep 1
> >> 16/11/06 15:47:15 INFO ooc.OutOfCoreEngine: startIteration: with 60 partitions in memory and 1 active threads
> >> 16/11/06 15:47:15 INFO pr.PageRankComputation: Pre superstep1 in PR computation
> >> 16/11/06 15:47:15 INFO policy.ThresholdBasedOracle: getNextIOActions: usedMemoryFraction = 0.75
> >> 16/11/06 15:47:16 INFO ooc.OutOfCoreEngine: updateActiveThreadsFraction: updating the number of active threads to 1
> >> 16/11/06 15:47:16 INFO policy.ThresholdBasedOracle: updateRequestsCredit: updating the credit to 20
> >> 16/11/06 15:47:17 INFO graph.GraphTaskManager: installGCMonitoring: name = PS Scavenge, action = end of minor GC, cause = Allocation Failure, duration = 937ms
> >> 16/11/06 15:47:17 INFO policy.ThresholdBasedOracle: getNextIOActions: usedMemoryFraction = 0.72
> >> 16/11/06 15:47:18 INFO policy.ThresholdBasedOracle: getNextIOActions: usedMemoryFraction = 0.74
> >> 16/11/06 15:47:18 INFO ooc.OutOfCoreEngine: updateActiveThreadsFraction: updating the number of active threads to 1
> >> 16/11/06 15:47:18 INFO policy.ThresholdBasedOracle: updateRequestsCredit: updating the credit to 20
> >> 16/11/06 15:47:19 INFO policy.ThresholdBasedOracle: getNextIOActions: usedMemoryFraction = 0.76
> >> 16/11/06 15:47:19 INFO ooc.OutOfCoreEngine: doneProcessingPartition: processing partition 234 is done!
> >> 16/11/06 15:47:20 INFO policy.ThresholdBasedOracle: getNextIOActions: usedMemoryFraction = 0.79
> >> 16/11/06 15:47:21 INFO ooc.OutOfCoreEngine: updateActiveThreadsFraction: updating the number of active threads to 1
> >> 16/11/06 15:47:21 INFO policy.ThresholdBasedOracle: updateRequestsCredit: updating the credit to 18
> >> 16/11/06 15:47:21 INFO handler.RequestDecoder: decode: Server window metrics MBytes/sec received = 1.0994, MBytesReceived = 33.0459, ave received req MBytes = 0.0138, secs waited = 30.058
> >> 16/11/06 15:47:21 INFO policy.ThresholdBasedOracle: getNextIOActions: usedMemoryFraction = 0.82
> >> 16/11/06 15:47:21 INFO ooc.OutOfCoreIOCallable: call: thread 0's next IO command is: StorePartitionIOCommand: (partitionId = 234)
> >> 16/11/06 15:47:21 INFO ooc.OutOfCoreIOCallable: call: thread 0's command StorePartitionIOCommand: (partitionId = 234) completed: bytes= 64419740, duration=351, bandwidth=175.03, bandwidth (excluding GC time)=175.03
> >> 16/11/06 15:47:21 INFO policy.ThresholdBasedOracle: getNextIOActions: usedMemoryFraction = 0.83
> >> 16/11/06 15:47:21 INFO ooc.OutOfCoreIOCallable: call: thread 0's next IO command is: StoreIncomingMessageIOCommand: (partitionId = 234)
> >> 16/11/06 15:47:21 INFO ooc.OutOfCoreIOCallable: call: thread 0's command StoreIncomingMessageIOCommand: (partitionId = 234) completed: bytes= 0, duration=0, bandwidth=NaN, bandwidth (excluding GC time)=NaN
> >> 16/11/06 15:47:21 INFO policy.ThresholdBasedOracle: getNextIOActions: usedMemoryFraction = 0.83
> >> 16/11/06 15:47:40 INFO graph.GraphTaskManager: installGCMonitoring: name = PS Scavenge, action = end of minor GC, cause = Allocation Failure, duration = 3107ms
> >> 16/11/06 15:47:40 INFO graph.GraphTaskManager: installGCMonitoring: name = PS MarkSweep, action = end of major GC, cause = Ergonomics, duration = 15064ms
> >> 16/11/06 15:47:40 INFO ooc.OutOfCoreEngine: updateActiveThreadsFraction: updating the number of active threads to 1
> >> 16/11/06 15:47:40 INFO policy.ThresholdBasedOracle: updateRequestsCredit: updating the credit to 20
> >> 16/11/06 15:47:40 INFO policy.ThresholdBasedOracle: getNextIOActions: usedMemoryFraction = 0.71
> >> 16/11/06 15:47:40 INFO ooc.OutOfCoreIOCallable: call: thread 0's next IO command is: LoadPartitionIOCommand: (partitionId = 234, superstep = 2)
> >> JMap histo dump at Sun Nov 06 15:47:41 CET 2016
> >> 16/11/06 15:47:41 INFO ooc.OutOfCoreEngine: doneProcessingPartition: processing partition 364 is done!
> >> 16/11/06 15:47:48 INFO ooc.OutOfCoreEngine: updateActiveThreadsFraction: updating the number of active threads to 1
> >> 16/11/06 15:47:48 INFO policy.ThresholdBasedOracle: updateRequestsCredit: updating the credit to 20
> >> --
> >> -- num     #instances         #bytes  class name
> >> -- ----------------------------------------------
> >> --   1:     224004229    10752202992  java.util.concurrent.ConcurrentHashMap$Node
> >> --   2:      19751666     6645730528  [B
> >> --   3:     222135985     5331263640  com.belprime.di.pr.DomainPartAwareLongWritable
> >> --   4:     214686483     5152475592  org.apache.hadoop.io.DoubleWritable
> >> --   5:           353     4357261784  [Ljava.util.concurrent.ConcurrentHashMap$Node;
> >> --   6:        486266      204484688  [I
> >> --   7:       6017652      192564864  org.apache.giraph.utils.UnsafeByteArrayOutputStream
> >> --   8:       3986203      159448120  org.apache.giraph.utils.UnsafeByteArrayInputStream
> >> --   9:       2064182      148621104  org.apache.giraph.graph.ByteValueVertex
> >> --  10:       2064182       82567280  org.apache.giraph.edge.ByteArrayEdges
> >> --  11:       1886875       45285000  java.lang.Integer
> >> --  12:        349409       30747992  java.util.concurrent.ConcurrentHashMap$TreeNode
> >> --  13:        916970       29343040  java.util.Collections$1
> >> --  14:        916971       22007304  java.util.Collections$SingletonSet
> >> --  15:         47270        3781600  java.util.concurrent.ConcurrentHashMap$TreeBin
> >> --  16:         26201        2590912  [C
> >> --  17:         34175        1367000  org.apache.giraph.edge.ByteArrayEdges$ByteArrayEdgeIterator
> >> --  18:          6143        1067704  java.lang.Class
> >> --  19:         25953         830496  java.lang.String
> >> --  20:         34175         820200  org.apache.giraph.edge.EdgeNoValue
> >> --  21:          4488         703400  [Ljava.lang.Object;
> >> --  22:            70         395424  [Ljava.nio.channels.SelectionKey;
> >> --  23:          2052         328320  java.lang.reflect.Method
> >> --  24:          6600         316800  org.apache.giraph.utils.ByteArrayVertexIdMessages
> >> --  25:          5781         277488  java.util.HashMap$Node
> >> --  26:          5651         271248  java.util.Hashtable$Entry
> >> --  27:          6604         211328  org.apache.giraph.factories.DefaultMessageValueFactory
> >> 16/11/06 15:47:49 ERROR utils.LogStacktraceCallable: Execution of callable failed
> >> java.lang.RuntimeException: call: execution of IO command LoadPartitionIOCommand: (partitionId = 234, superstep = 2) failed!
> >> at org.apache.giraph.ooc.OutOfCoreIOCallable.call(OutOfCoreIOCallable.java:115)
> >> at org.apache.giraph.ooc.OutOfCoreIOCallable.call(OutOfCoreIOCallable.java:36)
> >> at org.apache.giraph.utils.LogStacktraceCallable.call(LogStacktraceCallable.java:67)
> >> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> >> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> >> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> >> at java.lang.Thread.run(Thread.java:745)
> >> Caused by: com.esotericsoftware.kryo.KryoException: Buffer underflow.
> >> at com.esotericsoftware.kryo.io.Input.require(Input.java:199)
> >> at com.esotericsoftware.kryo.io.UnsafeInput.readLong(UnsafeInput.java:112)
> >> at com.esotericsoftware.kryo.io.KryoDataInput.readLong(KryoDataInput.java:91)
> >> at org.apache.hadoop.io.LongWritable.readFields(LongWritable.java:47)
> >> at org.apache.giraph.ooc.data.DiskBackedPartitionStore.readOutEdges(DiskBackedPartitionStore.java:245)
> >> at org.apache.giraph.ooc.data.DiskBackedPartitionStore.loadInMemoryPartitionData(DiskBackedPartitionStore.java:278)
> >> at org.apache.giraph.ooc.data.DiskBackedDataStore.loadPartitionDataProxy(DiskBackedDataStore.java:234)
> >> at org.apache.giraph.ooc.data.DiskBackedPartitionStore.loadPartitionData(DiskBackedPartitionStore.java:311)
> >> at org.apache.giraph.ooc.command.LoadPartitionIOCommand.execute(LoadPartitionIOCommand.java:66)
> >> at org.apache.giraph.ooc.OutOfCoreIOCallable.call(OutOfCoreIOCallable.java:99)
> >> ... 6 more
> >> 16/11/06 15:47:49 FATAL graph.GraphTaskManager: uncaughtException: OverrideExceptionHandler on thread ooc-io-0, msg = call: execution of IO command LoadPartitionIOCommand: (partitionId = 234, superstep = 2) failed!, exiting...
> >> java.lang.RuntimeException: call: execution of IO command LoadPartitionIOCommand: (partitionId = 234, superstep = 2) failed!
> >> at org.apache.giraph.ooc.OutOfCoreIOCallable.call(OutOfCoreIOCallable.java:115)
> >> at org.apache.giraph.ooc.OutOfCoreIOCallable.call(OutOfCoreIOCallable.java:36)
> >> at org.apache.giraph.utils.LogStacktraceCallable.call(LogStacktraceCallable.java:67)
> >> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> >> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> >> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> >> at java.lang.Thread.run(Thread.java:745)
> >> Caused by: com.esotericsoftware.kryo.KryoException: Buffer underflow.
> >> at com.esotericsoftware.kryo.io.Input.require(Input.java:199)
> >> at com.esotericsoftware.kryo.io.UnsafeInput.readLong(UnsafeInput.java:112)
> >> at com.esotericsoftware.kryo.io.KryoDataInput.readLong(KryoDataInput.java:91)
> >> at org.apache.hadoop.io.LongWritable.readFields(LongWritable.java:47)
> >> at org.apache.giraph.ooc.data.DiskBackedPartitionStore.readOutEdges(DiskBackedPartitionStore.java:245)
> >> at org.apache.giraph.ooc.data.DiskBackedPartitionStore.loadInMemoryPartitionData(DiskBackedPartitionStore.java:278)
> >> at org.apache.giraph.ooc.data.DiskBackedDataStore.loadPartitionDataProxy(DiskBackedDataStore.java:234)
> >> at org.apache.giraph.ooc.data.DiskBackedPartitionStore.loadPartitionData(DiskBackedPartitionStore.java:311)
> >> at org.apache.giraph.ooc.command.LoadPartitionIOCommand.execute(LoadPartitionIOCommand.java:66)
> >> at org.apache.giraph.ooc.OutOfCoreIOCallable.call(OutOfCoreIOCallable.java:99)
> >> ... 6 more
> >> 16/11/06 15:47:49 ERROR worker.BspServiceWorker: unregisterHealth: Got failure, unregistering health on /_hadoopBsp/giraph_yarn_application_1478342673283_0009/_applicationAttemptsDir/0/_superstepDir/1/_workerHealthyDir/datanode6.webmeup.com_5 on superstep 1
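> >>
> >> In case it helps pinpoint things: the spilled partition files go to
> >> the directories we set in giraph.partitionsDirectory, so we can
> >> inspect them directly on a worker node, e.g.:
> >>
> >> ls -lh /disk1/_bsp/_partitions /disk2/_bsp/_partitions
> >>
> >> We are happy to collect the sizes of those files before and after the
> >> failure if that would be useful.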
> >>
> >> We looked into one earlier thread,
> >> http://mail-archives.apache.org/mod_mbox/giraph-user/201607.mbox/%3CCAECWHa3MOqubf8--wMVhzqOYwwZ0ZuP6_iiqTE_xT%3DoLJAAPQw%40mail.gmail.com%3E
> >> but it is rather old, and at that time the answer was "do not use it
> >> yet" (see the reply at
> >> http://mail-archives.apache.org/mod_mbox/giraph-user/201607.mbox/%3CCAH1LQfdbpbZuaKsu1b7TCwOzGMxi_vf9vYi6Xg_Bp8o43H7u%2Bw%40mail.gmail.com%3E).
> >> Does that advice still hold today? We would like to use the new
> >> advanced adaptive OOC approach if possible...
> >>
> >> Thank you in advance; any help or hint would be really appreciated.
> >>
> >> Best Regards,
> >> Denis Dudinski
> >
> >
>
