On Oct 26 2017, at 12:39 pm, Vaghawan Ojha <[email protected]> wrote:

Hi Abhimanyu,

I don't think this template works with version 0.11.0. Per the template's own notes:

    update for PredictionIO 0.9.2, including:
I don't think it supports the latest pio. You'd rather switch to 0.9.2 if you want to experiment with it.

On Thu, Oct 26, 2017 at 12:52 PM, Abhimanyu Nagrath <[email protected]> wrote:

Hi Vaghawan,

I am using v0.11.0-incubating (with Elasticsearch v5.2.1, HBase 1.2.6, Spark 2.1.0).

Regards,
Abhimanyu

On Thu, Oct 26, 2017 at 12:31 PM, Vaghawan Ojha <[email protected]> wrote:

Hi Abhimanyu,
Ok, which version of pio is this? Because the template looks old to me.

On Thu, Oct 26, 2017 at 12:44 PM, Abhimanyu Nagrath <[email protected]> wrote:

Hi Vaghawan,

Yes, the Spark master connection string is correct. I get "executor failed to connect to spark master" after 4-5 hours.

Regards,
Abhimanyu

On Thu, Oct 26, 2017 at 12:17 PM, Sachin Kamkar <[email protected]> wrote:

It should be correct, as the user got the exception after 3-4 hours of starting. So it looks like something else broke. OOM?

With Regards,
Sachin
⚜KTBFFH⚜

On Thu, Oct 26, 2017 at 12:15 PM, Vaghawan Ojha <[email protected]> wrote:

"Executor failed to connect with master" - are you sure the --master spark://*.*.*.*:7077 is correct? Like the one you copied from the Spark master's web UI? Sometimes having that wrong makes the connection to the Spark master fail.

Thanks

On Thu, Oct 26, 2017 at 12:02 PM, Abhimanyu Nagrath <[email protected]> wrote:

I am new to PredictionIO. I am using the template https://github.com/EmergentOrder/template-scala-probabilistic-classifier-batch-lbfgs. My training dataset has 1,184,603 rows with approximately 6,500 features. I am using an EC2 r4.8xlarge instance (240 GB RAM, 32 cores, 200 GB swap). I tried two ways of training:
1. The command:

    pio train -- --driver-memory 120G --executor-memory 100G --conf spark.network.timeout=10000000

It throws an exception after 3-4 hours:

Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 15, localhost, executor driver): ExecutorLostFailure (executor driver exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 181529 ms
Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
    at scala.Option.foreach(Option.scala:257)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1918)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1931)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1944)
    at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1353)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
    at org.apache.spark.rdd.RDD.take(RDD.scala:1326)
    at org.example.classification.LogisticRegressionWithLBFGSAlgorithm.train(LogisticRegressionWithLBFGSAlgorithm.scala:28)
    at org.example.classification.LogisticRegressionWithLBFGSAlgorithm.train(LogisticRegressionWithLBFGSAlgorithm.scala:21)
    at org.apache.predictionio.controller.P2LAlgorithm.trainBase(P2LAlgorithm.scala:49)
    at org.apache.predictionio.controller.Engine$$anonfun$18.apply(Engine.scala:692)
    at org.apache.predictionio.controller.Engine$$anonfun$18.apply(Engine.scala:692)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
    at scala.collection.immutable.List.foreach(List.scala:381)
    at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
    at scala.collection.immutable.List.map(List.scala:285)
    at org.apache.predictionio.controller.Engine$.train(Engine.scala:692)
    at org.apache.predictionio.controller.Engine.train(Engine.scala:177)
    at org.apache.predictionio.workflow.CoreWorkflow$.runTrain(CoreWorkflow.scala:67)
    at org.apache.predictionio.workflow.CreateWorkflow$.main(CreateWorkflow.scala:250)
    at org.apache.predictionio.workflow.CreateWorkflow.main(CreateWorkflow.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:738)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

2. I started a Spark standalone cluster with 1 master and 3 workers and executed the command:

    pio train -- --master spark://*.*.*.*:7077 --driver-memory 50G --executor-memory 50G

After some time I get the error "Executor failed to connect with master" and training stops.

I have changed the feature count from 6,500 to 500 and the behaviour is the same. Can anyone suggest what I am missing?

In between, training emits continuous warnings like:

    [WARN] [ScannerCallable] Ignore, probably already closed

Regards,
Abhimanyu
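A note for future readers of this thread: the failure in run 1 is an executor heartbeat timeout, which in a single-node setup usually points to driver GC pauses or memory pressure rather than a real network problem. One commonly suggested next step is to raise Spark's heartbeat and network timeouts together (spark.network.timeout is also the fallback for several other timeouts and should stay well above the heartbeat interval). The sketch below is illustrative only: the flag values are guesses, not something tested against this workload, though spark.network.timeout and spark.executor.heartbeatInterval are standard Spark configuration properties.

```shell
# Hypothetical tuning sketch (values illustrative, untested on this dataset):
# keep the heartbeat interval well below the network timeout, and give the
# timeouts explicit time units so Spark parses them as intended.
pio train -- \
  --driver-memory 120G \
  --executor-memory 100G \
  --conf spark.network.timeout=800s \
  --conf spark.executor.heartbeatInterval=60s
```

If the driver still loses executors after this, checking the driver's GC logs (e.g. via --driver-java-options "-verbose:gc") would help confirm whether long pauses are the cause.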
