Regarding "Executor failed to connect with master": are you sure the
--master URL, spark://*.*.*.*:7077, is correct?

Is it the same URL you copied from the Spark master's web UI? If the URL
is wrong, the connection to the Spark master can fail.
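
A quick way to sanity-check the URL before starting a long training run is
sketched below. It is only an illustration: the helper names
`parse_master_url` and `can_reach` are made up for this example, and a
successful TCP connection only shows that the port is open from the machine
where you run it. It does not confirm that the string matches what the
master shows in its web UI.

```python
import socket
from urllib.parse import urlparse


def parse_master_url(url):
    """Split a standalone master URL of the form spark://host:port.

    Hypothetical helper: raises ValueError if the scheme is not 'spark'
    or if the host or port is missing.
    """
    parsed = urlparse(url)
    if parsed.scheme != "spark" or not parsed.hostname or parsed.port is None:
        raise ValueError(f"not a spark://host:port URL: {url!r}")
    return parsed.hostname, parsed.port


def can_reach(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution errors.
        return False
```

Run from the machine that launches `pio train`, something like
`can_reach(*parse_master_url("spark://<your-master-host>:7077"))` should
return True if the master port is reachable.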

Thanks

On Thu, Oct 26, 2017 at 12:02 PM, Abhimanyu Nagrath <
[email protected]> wrote:

> I am new to PredictionIO. I am using the template
> https://github.com/EmergentOrder/template-scala-probabilistic-classifier-batch-lbfgs.
>
> My training dataset has 1184603 records with approximately 6500 features.
> I am using an EC2 r4.8xlarge instance (240 GB RAM, 32 cores, 200 GB swap).
>
>
> I tried two ways of training:
>
>  1. Running the command
>
> > pio train -- --driver-memory 120G --executor-memory 100G -- conf
> > spark.network.timeout=10000000
>
>   It throws an exception after 3-4 hours:
>
>
>     Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 15, localhost, executor driver): ExecutorLostFailure (executor driver exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 181529 ms
>     Driver stacktrace:
>             at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
>             at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
>             at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
>             at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>             at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>             at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
>             at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
>             at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
>             at scala.Option.foreach(Option.scala:257)
>             at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)
>             at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650)
>             at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
>             at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
>             at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>             at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)
>             at org.apache.spark.SparkContext.runJob(SparkContext.scala:1918)
>             at org.apache.spark.SparkContext.runJob(SparkContext.scala:1931)
>             at org.apache.spark.SparkContext.runJob(SparkContext.scala:1944)
>             at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1353)
>             at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>             at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>             at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
>             at org.apache.spark.rdd.RDD.take(RDD.scala:1326)
>             at org.example.classification.LogisticRegressionWithLBFGSAlgorithm.train(LogisticRegressionWithLBFGSAlgorithm.scala:28)
>             at org.example.classification.LogisticRegressionWithLBFGSAlgorithm.train(LogisticRegressionWithLBFGSAlgorithm.scala:21)
>             at org.apache.predictionio.controller.P2LAlgorithm.trainBase(P2LAlgorithm.scala:49)
>             at org.apache.predictionio.controller.Engine$$anonfun$18.apply(Engine.scala:692)
>             at org.apache.predictionio.controller.Engine$$anonfun$18.apply(Engine.scala:692)
>             at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>             at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>             at scala.collection.immutable.List.foreach(List.scala:381)
>             at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>             at scala.collection.immutable.List.map(List.scala:285)
>             at org.apache.predictionio.controller.Engine$.train(Engine.scala:692)
>             at org.apache.predictionio.controller.Engine.train(Engine.scala:177)
>             at org.apache.predictionio.workflow.CoreWorkflow$.runTrain(CoreWorkflow.scala:67)
>             at org.apache.predictionio.workflow.CreateWorkflow$.main(CreateWorkflow.scala:250)
>             at org.apache.predictionio.workflow.CreateWorkflow.main(CreateWorkflow.scala)
>             at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>             at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>             at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>             at java.lang.reflect.Method.invoke(Method.java:498)
>             at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:738)
>             at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
>             at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
>             at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
>             at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>
> 2. I started a Spark standalone cluster with 1 master and 3 workers and
> ran the command
>
> > pio train -- --master spark://*.*.*.*:7077 --driver-memory 50G
> > --executor-memory 50G
>
> After some time I get the error "Executor failed to connect with
> master" and training stops.
>
> I reduced the feature count from 6500 to 500 and the result is the
> same. Can anyone suggest what I am missing?
>
> During training I also get continuous warnings like:
>
> > [WARN] [ScannerCallable] Ignore, probably already closed
>
>
> Regards,
> Abhimanyu
>
>
