Re: Spark 2.4.0 Master going down
Hi Akshay,

Thanks for the response. Please find below the answers to your questions.

1. We are running Spark in cluster mode, the cluster manager being Spark's standalone cluster manager.
2. All the ports are open; we preconfigure which ports the communication should happen on, and we modify firewall rules to allow traffic on those ports. (The functionality is fine until the Spark master goes down after 60 mins.)
3. Memory consumption of all the components (jstat -gcutil style output):

Spark Master:
  S0     S1     E      O      M      CCS    YGC   YGCT     FGC  FGCT    GCT
  0.00   0.00   12.91  35.11  97.08  95.80  5     0.239    2    0.197   0.436

Spark Worker:
  S0     S1     E      O      M      CCS    YGC   YGCT     FGC  FGCT    GCT
  51.64  0.00   46.66  27.44  97.57  95.85  10    0.381    2    0.233   0.613

Spark Submit Process (Driver):
  S0     S1     E      O      M      CCS    YGC   YGCT     FGC  FGCT    GCT
  0.00   63.57  93.82  26.29  98.24  97.53  4663  124.648  109  20.910  145.558

Spark Executor (Coarse-grained):
  S0     S1     E      O      M      CCS    YGC   YGCT     FGC  FGCT    GCT
  0.00   69.77  17.74  31.13  95.67  90.44  7353  556.888  5    1.572   558.460

--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
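The per-component figures above match the columns of `jstat -gcutil`. A minimal sketch of how such numbers can be collected for the standalone master, assuming a JDK (`jps`/`jstat`) is on PATH and that the master JVM shows up under the class name "Master" in `jps` output:

```shell
# Sketch: locate the standalone Master's PID via jps, then sample its GC stats.
# Assumes a JDK on PATH; "Master" is the class name jps reports for the
# standalone master process.
find_master_pid() {
  # jps prints "<pid> <class>" per line; keep only the Master line's PID.
  jps | awk '/^[0-9]+ Master$/ {print $1}'
}

# Example invocation (only meaningful while a master is actually running):
#   jstat -gcutil "$(find_master_pid)" 5000
```

The same pattern works for the Worker, driver, and CoarseGrainedExecutorBackend JVMs by changing the class name matched in the `awk` pattern.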
Re: Spark 2.4.0 Master going down
Hi Lokesh,

Please provide further information to help identify the issue.

1) Are you running in standalone mode or cluster mode? If cluster mode, is the cluster manager Spark standalone master/slave or YARN/Mesos?
2) Have you checked that all ports between your master and the machine with IP 192.168.43.167 are accessible?
3) Have you checked the memory consumption of the executors/driver running in the cluster?

Akshay Bhardwaj
+91-97111-33849

On Wed, Feb 27, 2019 at 8:27 PM lokeshkumar wrote:

> Hi All
>
> We are running Spark version 2.4.0 and we run a few Spark streaming jobs
> listening on Kafka topics. We receive an average of 10-20 msgs per second,
> and the Spark master has been going down after 1-2 hours of running.
> Along with that, the Spark executors also get killed.
>
> This was not happening with Spark 2.1.1; it started happening with Spark
> 2.4.0. Any help/suggestion is appreciated.
>
> The exception that we see is:
>
> Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1713)
>         at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:64)
>         at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:188)
>         at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:281)
>         at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
> Caused by: org.apache.spark.rpc.RpcTimeoutException: Cannot receive any
> reply from 192.168.43.167:40007 in 120 seconds. This timeout is controlled
> by spark.rpc.askTimeout
>         at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:47)
>         at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:62)
>         at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:58)
>         at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
>         at scala.util.Failure$$anonfun$recover$1.apply(Try.scala:216)
>         at scala.util.Try$.apply(Try.scala:192)
>         at scala.util.Failure.recover(Try.scala:216)
>         at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:326)
>         at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:326)
>         at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
>         at org.spark_project.guava.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:293)
>         at scala.concurrent.impl.ExecutionContextImpl$$anon$1.execute(ExecutionContextImpl.scala:136)
>         at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44)
>         at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252)
>         at scala.concurrent.Promise$class.complete(Promise.scala:55)
>         at scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:157)
>         at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:237)
>         at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:237)
>         at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
>         at scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.processBatch$1(BatchingExecutor.scala:63)
>         at scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:78)
>         at scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:55)
>         at scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:55)
>         at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
>         at scala.concurrent.BatchingExecutor$Batch.run(BatchingExecutor.scala:54)
>         at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
>         at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:106)
>         at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
>         at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44)
>         at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252)
>         at scala.concurrent.Promise$class.tryFailure(Promise.scala:112)
>         at scala.concurrent.impl.Promise$DefaultPromise.tryFailure(Promise.scala:157)
>         at org.apache.spark.rpc.netty.NettyRpcEnv.org$apache$spark$rpc$netty$NettyRpcEnv$$onFailure$1(NettyRpcEnv.scala:206)
>         at org.apache.spark.rpc.netty.NettyRpcEnv$$anon$1.run(NettyRpcEnv.scala:243)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>         at java.util.concurrent
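The trace names spark.rpc.askTimeout as the setting behind the 120-second limit. As a stop-gap while the root cause is investigated, that timeout (and the broader network timeout it falls back to) can be raised in conf/spark-defaults.conf; this is only a sketch and the 600s value is a placeholder assumption, not a recommendation:

```
# Sketch: lengthen the RPC ask timeout and the general network timeout.
# spark.rpc.askTimeout and spark.network.timeout are standard Spark settings;
# 600s is an arbitrary placeholder value.
spark.rpc.askTimeout     600s
spark.network.timeout    600s
```

The same keys can be passed per application via `--conf` on spark-submit. Note that a longer timeout only masks the symptom; if the master JVM itself is dying after ~60 minutes, its logs and GC behaviour are the place to look.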