[jira] [Commented] (GIRAPH-114) Inconsistent message map handling in BasicRPCCommunications.LargeMessageFlushExecutor
[ https://issues.apache.org/jira/browse/GIRAPH-114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13174293#comment-13174293 ] Hudson commented on GIRAPH-114: --- Integrated in Giraph-trunk-Commit #57 (See [https://builds.apache.org/job/Giraph-trunk-Commit/57/]) GIRAPH-114: Inconsistent message map handling in BasicRPCCommunications.LargeMessageFlushExecutor. (ssc via aching) aching : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1221836 Files : * /incubator/giraph/trunk/CHANGELOG * /incubator/giraph/trunk/src/main/java/org/apache/giraph/comm/BasicRPCCommunications.java > Inconsistent message map handling in > BasicRPCCommunications.LargeMessageFlushExecutor > - > > Key: GIRAPH-114 > URL: https://issues.apache.org/jira/browse/GIRAPH-114 > Project: Giraph > Issue Type: Bug >Affects Versions: 0.70.0 >Reporter: Sebastian Schelter >Priority: Critical > Attachments: GIRAPH-114.patch > > > I'm currently implementing a simple algorithm to identify all the connected > components of a graph. The algorithm ran well in a local IDE unit tests on > toy data and in a local single node hadoop instance using a graph of ~100k > edges. > When I tested it on a real cluster with the wikipedia pagelink graph (5.7M > vertices, 130M edges), I ran into strange exceptions like this: > {noformat} > 2011-12-21 12:03:57,015 INFO org.apache.hadoop.mapred.TaskInProgress: Error > from attempt_201112131541_0034_m_27_0: java.lang.IllegalStateException: > run: Caught an unrecoverable exception flush: Got ExecutionException > at org.apache.giraph.graph.GraphMapper.run(GraphMapper.java:641) > at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:763) > at org.apache.hadoop.mapred.MapTask.run(MapTask.java:369) > at org.apache.hadoop.mapred.Child$4.run(Child.java:259) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:396) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059) > at org.apache.hadoop.mapred.Child.main(Child.java:253) > Caused by: java.lang.IllegalStateException: flush: Got ExecutionException > at > org.apache.giraph.comm.BasicRPCCommunications.flush(BasicRPCCommunications.java:946) > at > org.apache.giraph.graph.BspServiceWorker.finishSuperstep(BspServiceWorker.java:916) > at org.apache.giraph.graph.GraphMapper.map(GraphMapper.java:588) > at org.apache.giraph.graph.GraphMapper.run(GraphMapper.java:632) > ... 7 more > Caused by: java.util.concurrent.ExecutionException: > java.lang.IllegalStateException: run: Impossible for no messages in 1603276 > at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:222) > at java.util.concurrent.FutureTask.get(FutureTask.java:83) > at > org.apache.giraph.comm.BasicRPCCommunications.flush(BasicRPCCommunications.java:941) > ... 10 more > Caused by: java.lang.IllegalStateException: run: Impossible for no messages > in 1603276 > at > org.apache.giraph.comm.BasicRPCCommunications$PeerFlushExecutor.run(BasicRPCCommunications.java:245) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441) > at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) > at java.util.concurrent.FutureTask.run(FutureTask.java:138) > at > java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) > at java.lang.Thread.run(Thread.java:662) > {noformat} > The exception is thrown because a vertex with no message to send to is found > in the datastructure holding the outgoing messages. > I tracked this behavior down: > In *BasicRPCCommunications:541-546* the map holding the outgoing messages for > vertices of a particular machine is created. It's stored in two places > _BasicRPCCommunications.outMessages_ and as member variable > _outMessagesPerPeer_ of its _PeerConnection_ : > {noformat} > outMsgMap = new HashMap>(); > outMessages.put(addrUnresolved, outMsgMap); > PeerConnection peerConnection = new PeerConnection(outMsgMap, peer, isProxy); > {noformat} > > In case that there are a lot of messages available for a particular vertex, a > large flush is trigged via _LargeMessageFlushExecutor_ (I guess this only > happened in the wikipedia test). During this flush the list of messages for > the vertex is sent out and replaced with an empty list in > *BasicRPCCommunications:341* > {noformat} > outMessageList = peerConnection.outMessagesPerPeer.get(destVertex); > peerConnection.outMessagesPerPeer.put
[jira] [Commented] (GIRAPH-114) Inconsistent message map handling in BasicRPCCommunications.LargeMessageFlushExecutor
[ https://issues.apache.org/jira/browse/GIRAPH-114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13174283#comment-13174283 ] Avery Ching commented on GIRAPH-114: +1, nice find! The whole RPC thing is a bit messy now, agreed. > Inconsistent message map handling in > BasicRPCCommunications.LargeMessageFlushExecutor > - > > Key: GIRAPH-114 > URL: https://issues.apache.org/jira/browse/GIRAPH-114 > Project: Giraph > Issue Type: Bug >Affects Versions: 0.70.0 >Reporter: Sebastian Schelter >Priority: Critical > Attachments: GIRAPH-114.patch > > > I'm currently implementing a simple algorithm to identify all the connected > components of a graph. The algorithm ran well in a local IDE unit tests on > toy data and in a local single node hadoop instance using a graph of ~100k > edges. > When I tested it on a real cluster with the wikipedia pagelink graph (5.7M > vertices, 130M edges), I ran into strange exceptions like this: > {noformat} > 2011-12-21 12:03:57,015 INFO org.apache.hadoop.mapred.TaskInProgress: Error > from attempt_201112131541_0034_m_27_0: java.lang.IllegalStateException: > run: Caught an unrecoverable exception flush: Got ExecutionException > at org.apache.giraph.graph.GraphMapper.run(GraphMapper.java:641) > at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:763) > at org.apache.hadoop.mapred.MapTask.run(MapTask.java:369) > at org.apache.hadoop.mapred.Child$4.run(Child.java:259) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:396) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059) > at org.apache.hadoop.mapred.Child.main(Child.java:253) > Caused by: java.lang.IllegalStateException: flush: Got ExecutionException > at > org.apache.giraph.comm.BasicRPCCommunications.flush(BasicRPCCommunications.java:946) > at > org.apache.giraph.graph.BspServiceWorker.finishSuperstep(BspServiceWorker.java:916) > at org.apache.giraph.graph.GraphMapper.map(GraphMapper.java:588) > at org.apache.giraph.graph.GraphMapper.run(GraphMapper.java:632) > ... 7 more > Caused by: java.util.concurrent.ExecutionException: > java.lang.IllegalStateException: run: Impossible for no messages in 1603276 > at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:222) > at java.util.concurrent.FutureTask.get(FutureTask.java:83) > at > org.apache.giraph.comm.BasicRPCCommunications.flush(BasicRPCCommunications.java:941) > ... 10 more > Caused by: java.lang.IllegalStateException: run: Impossible for no messages > in 1603276 > at > org.apache.giraph.comm.BasicRPCCommunications$PeerFlushExecutor.run(BasicRPCCommunications.java:245) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441) > at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) > at java.util.concurrent.FutureTask.run(FutureTask.java:138) > at > java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) > at java.lang.Thread.run(Thread.java:662) > {noformat} > The exception is thrown because a vertex with no message to send to is found > in the datastructure holding the outgoing messages. > I tracked this behavior down: > In *BasicRPCCommunications:541-546* the map holding the outgoing messages for > vertices of a particular machine is created. It's stored in two places > _BasicRPCCommunications.outMessages_ and as member variable > _outMessagesPerPeer_ of its _PeerConnection_ : > {noformat} > outMsgMap = new HashMap>(); > outMessages.put(addrUnresolved, outMsgMap); > PeerConnection peerConnection = new PeerConnection(outMsgMap, peer, isProxy); > {noformat} > > In case that there are a lot of messages available for a particular vertex, a > large flush is trigged via _LargeMessageFlushExecutor_ (I guess this only > happened in the wikipedia test). During this flush the list of messages for > the vertex is sent out and replaced with an empty list in > *BasicRPCCommunications:341* > {noformat} > outMessageList = peerConnection.outMessagesPerPeer.get(destVertex); > peerConnection.outMessagesPerPeer.put(destVertex, new MsgList()); > {noformat} > Now in the last flush that is trigggered at the end of the superstep we > encounter an empty message list for the vertex and therefore the exception is > thrown in *BasicRPCCommunications:228-247* > {noformat} > for (Entry> entry : > peerConnection.outMessagesPerPeer.entrySet()) { > ... > if (entry.getValue().isEmpty