Tried GraphFrames. Still faced the same - job died after a few hours . The errors I see (And I see tons of them) are - (I ran with 3 times the partitions as well, which was 12 times number of executors , but still the same.)
------------------------------------- ERROR NativeAzureFileSystem: Encountered Storage Exception for write on Blob : hdp/spark2-events/application_1478717432179_0021.inprogress Exception details: null Error Code : RequestBodyTooLarge ------------------------------------- 16/11/11 09:21:46 ERROR TransportResponseHandler: Still have 3 requests outstanding when connection from /10.0.0.95:43301 is closed 16/11/11 09:21:46 INFO RetryingBlockFetcher: Retrying fetch (1/3) for 2 outstanding blocks after 5000 ms 16/11/11 09:21:46 INFO ShuffleBlockFetcherIterator: Getting 1500 non-empty blocks out of 1500 blocks 16/11/11 09:21:46 ERROR OneForOneBlockFetcher: Failed while starting block fetches java.io.IOException: Connection from /10.0.0.95:43301 closed ------------------------------------- 16/11/11 09:21:46 ERROR OneForOneBlockFetcher: Failed while starting block fetches java.lang.RuntimeException: java.io.FileNotFoundException: /mnt/resource/hadoop/yarn/local/usercache/shreyagrssh/appcache/application_1478717432179_0021/blockmgr-b1dde30d-359e-4932-b7a4-a5e138a52360/37/shuffle_1346_21_0.index (No such file or directory) ------------------------------------- org.apache.spark.SparkException: Exception thrown in awaitResult at org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77) at org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:75) at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36) at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59) at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59) at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167) at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83) at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102) at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:518) at org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:547) at org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:547) at org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:547) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1857) at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:547) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.util.ConcurrentModificationException at java.util.ArrayList.writeObject(ArrayList.java:766) at sun.reflect.GeneratedMethodAccessor25.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028) at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496) at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) ------------------------------------- 16/11/11 13:21:54 WARN Executor: Issue communicating with driver in heartbeater org.apache.spark.SparkException: Error sending message [message = Heartbeat(537,[Lscala.Tuple2;@2999dae4,BlockManagerId(537, 10.0.0.103, 36162))] at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:119) at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:518) at org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:547) at org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:547) at org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:547) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1857) at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:547) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10 seconds]. This timeout is controlled by spark.executor.heartbeatInterval at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48) at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63) at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59) at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167) at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83) at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102) ... 13 more Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10 seconds] at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190) From: Shreya Agarwal Sent: Thursday, November 10, 2016 8:16 PM To: 'Felix Cheung' <felixcheun...@hotmail.com>; user@spark.apache.org Subject: RE: Strongly Connected Components Yesterday's run died sometime during the night, without any errors. Today, I am running it using GraphFrames instead. It is still spawning new tasks, so there is progress. From: Felix Cheung [mailto:felixcheun...@hotmail.com] Sent: Thursday, November 10, 2016 7:50 PM To: user@spark.apache.org<mailto:user@spark.apache.org>; Shreya Agarwal <shrey...@microsoft.com<mailto:shrey...@microsoft.com>> Subject: Re: Strongly Connected Components It is possible it is dead. Could you check the Spark UI to see if there is any progress? _____________________________ From: Shreya Agarwal <shrey...@microsoft.com<mailto:shrey...@microsoft.com>> Sent: Thursday, November 10, 2016 12:45 AM Subject: RE: Strongly Connected Components To: <user@spark.apache.org<mailto:user@spark.apache.org>> Bump. Anyone? Its been running for 10 hours now. No results. From: Shreya Agarwal Sent: Tuesday, November 8, 2016 9:05 PM To: user@spark.apache.org<mailto:user@spark.apache.org> Subject: Strongly Connected Components Hi, I am running this on a graph with >5B edges and >3B edges and have 2 questions - 1. What is the optimal number of iterations? 2. I am running it for 1 iteration right now on a beefy 100 node cluster, with 300 executors each having 30GB RAM and 5 cores. I have persisted the graph to MEMORY_AND_DISK. And it has been running for 3 hours already. Any ideas on how to speed this up? Regards, Shreya