Hi Abhimanyu,

I don't think this template works with PredictionIO 0.11.0. The template's own changelog says "update for PredictionIO 0.9.2, including:", so I don't think it supports the latest pio. You'd rather switch to 0.9.2 if you want to experiment with it.
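For reference, newer PredictionIO templates declare the pio release they were built against in a template.json at the repo root; a minimal sketch of what to look for (the field layout follows the standard templates, and the value below is illustrative rather than copied from this repo):

    {
      "pio": {
        "version": { "min": "0.9.2" }
      }
    }

If the declared minimum is far behind the installed pio, the build may still succeed while training fails in non-obvious ways, so this is a cheap first check.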
On Thu, Oct 26, 2017 at 12:52 PM, Abhimanyu Nagrath <[email protected]> wrote:

> Hi Vaghawan,
>
> I am using v0.11.0-incubating with ES v5.2.1, HBase v1.2.6 and Spark v2.1.0.
>
> Regards,
> Abhimanyu
>
> On Thu, Oct 26, 2017 at 12:31 PM, Vaghawan Ojha <[email protected]> wrote:
>
>> Hi Abhimanyu,
>>
>> OK, which version of pio is this? Because the template looks old to me.
>>
>> On Thu, Oct 26, 2017 at 12:44 PM, Abhimanyu Nagrath <[email protected]> wrote:
>>
>>> Hi Vaghawan,
>>>
>>> Yes, the Spark master connection string is correct. The executor fails to connect to the Spark master after 4-5 hours.
>>>
>>> Regards,
>>> Abhimanyu
>>>
>>> On Thu, Oct 26, 2017 at 12:17 PM, Sachin Kamkar <[email protected]> wrote:
>>>
>>>> It should be correct, as the user got the exception after 3-4 hours of starting. So it looks like something else broke. OOM?
>>>>
>>>> With Regards,
>>>> Sachin
>>>> ⚜KTBFFH⚜
>>>>
>>>> On Thu, Oct 26, 2017 at 12:15 PM, Vaghawan Ojha <[email protected]> wrote:
>>>>
>>>>> "Executor failed to connect with master" - are you sure the --master spark://*.*.*.*:7077 is correct?
>>>>>
>>>>> Like the one you copied from the Spark master's web UI? Sometimes getting that wrong makes the executors fail to connect to the Spark master.
>>>>>
>>>>> Thanks
>>>>>
>>>>> On Thu, Oct 26, 2017 at 12:02 PM, Abhimanyu Nagrath <[email protected]> wrote:
>>>>>
>>>>>> I am new to PredictionIO. I am using the template https://github.com/EmergentOrder/template-scala-probabilistic-classifier-batch-lbfgs.
>>>>>>
>>>>>> My training dataset has 1,184,603 rows with approx. 6,500 features. I am using an EC2 r4.8xlarge instance (240 GB RAM, 32 cores, 200 GB swap).
>>>>>>
>>>>>> I tried two ways of training.
>>>>>>
>>>>>> 1. The command
>>>>>>
>>>>>>     pio train -- --driver-memory 120G --executor-memory 100G --conf spark.network.timeout=10000000
>>>>>>
>>>>>> throws the following exception after 3-4 hours:
>>>>>>
>>>>>> Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 15, localhost, executor driver): ExecutorLostFailure (executor driver exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 181529 ms
>>>>>> Driver stacktrace:
>>>>>> at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
>>>>>> at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
>>>>>> at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
>>>>>> at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>>>>>> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>>>>>> at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
>>>>>> at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
>>>>>> at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
>>>>>> at scala.Option.foreach(Option.scala:257)
>>>>>> at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)
>>>>>> at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650)
>>>>>> at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
>>>>>> at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
>>>>>> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>>>>>> at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)
>>>>>> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1918)
>>>>>> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1931)
>>>>>> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1944)
>>>>>> at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1353)
>>>>>> at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>>>>>> at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>>>>>> at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
>>>>>> at org.apache.spark.rdd.RDD.take(RDD.scala:1326)
>>>>>> at org.example.classification.LogisticRegressionWithLBFGSAlgorithm.train(LogisticRegressionWithLBFGSAlgorithm.scala:28)
>>>>>> at org.example.classification.LogisticRegressionWithLBFGSAlgorithm.train(LogisticRegressionWithLBFGSAlgorithm.scala:21)
>>>>>> at org.apache.predictionio.controller.P2LAlgorithm.trainBase(P2LAlgorithm.scala:49)
>>>>>> at org.apache.predictionio.controller.Engine$$anonfun$18.apply(Engine.scala:692)
>>>>>> at org.apache.predictionio.controller.Engine$$anonfun$18.apply(Engine.scala:692)
>>>>>> at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>>>>>> at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>>>>>> at scala.collection.immutable.List.foreach(List.scala:381)
>>>>>> at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>>>>>> at scala.collection.immutable.List.map(List.scala:285)
>>>>>> at org.apache.predictionio.controller.Engine$.train(Engine.scala:692)
>>>>>> at org.apache.predictionio.controller.Engine.train(Engine.scala:177)
>>>>>> at org.apache.predictionio.workflow.CoreWorkflow$.runTrain(CoreWorkflow.scala:67)
>>>>>> at org.apache.predictionio.workflow.CreateWorkflow$.main(CreateWorkflow.scala:250)
>>>>>> at org.apache.predictionio.workflow.CreateWorkflow.main(CreateWorkflow.scala)
>>>>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>>> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>>>>>> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>>> at java.lang.reflect.Method.invoke(Method.java:498)
>>>>>> at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:738)
>>>>>> at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
>>>>>> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
>>>>>> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
>>>>>> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>>>>>>
>>>>>> 2. I started a standalone Spark cluster with 1 master and 3 workers and executed
>>>>>>
>>>>>>     pio train -- --master spark://*.*.*.*:7077 --driver-memory 50G --executor-memory 50G
>>>>>>
>>>>>> After some time I get the error "Executor failed to connect with master" and the training stops.
>>>>>>
>>>>>> I have reduced the feature count from 6,500 to 500 and the behaviour is the same, so can anyone suggest what I am missing?
>>>>>>
>>>>>> In between, the training emits continuous warnings like:
>>>>>>
>>>>>>     [WARN] [ScannerCallable] Ignore, probably already closed
>>>>>>
>>>>>> Regards,
>>>>>> Abhimanyu
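The "Executor heartbeat timed out after 181529 ms" failure in attempt 1 is typical of long GC pauses on a very large heap: spark.network.timeout alone does not govern heartbeats, which are sent every spark.executor.heartbeatInterval and must stay well below the network timeout. A minimal sketch of that tuning, with illustrative sizes rather than recommended ones:

    # hedged example: smaller heaps shorten GC pauses; both properties exist in Spark 2.1
    pio train -- --driver-memory 32G --executor-memory 32G \
      --conf spark.executor.heartbeatInterval=60s \
      --conf spark.network.timeout=600s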
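For attempt 2, one cheap way to rule the master URL in or out (the question raised upthread) is to submit a trivial job against the same URL before a multi-hour pio train run; a sketch assuming the stated Spark 2.1.0 distribution, with <master-host> standing in for the address shown at the top of the master's web UI (port 8080):

    # SparkPi ships with the Spark binary distribution
    $SPARK_HOME/bin/spark-submit \
      --master spark://<master-host>:7077 \
      --class org.apache.spark.examples.SparkPi \
      $SPARK_HOME/examples/jars/spark-examples_2.11-2.1.0.jar 100

If this completes but pio train still loses executors hours in, the URL is fine and memory pressure on the workers, as suggested upthread, is the more likely culprit.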
