Hi Vaghawan, thanks for the reply. Yes, I already tried that, but I'm still getting the same error.
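
For reference, what I tried after reading that link was along these lines (the heartbeatInterval setting is the one the usual advice adds alongside the network timeout; the value here is illustrative, not the exact one I used):

> pio train -- --driver-memory 120G --executor-memory 100G --conf spark.network.timeout=10000000 --conf spark.executor.heartbeatInterval=100s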
Regards,
Abhimanyu

On Thu, Oct 26, 2017 at 2:40 PM, Vaghawan Ojha <vaghawan...@gmail.com> wrote:

> Hi Abhimanyu,
>
> I've never tried the classification template, so I'm not sure exactly how much time it would take. But as per your error, your model is not getting past stage 1: "Task 0 in stage 1.0 failed 1 times".
>
> Probably something to do with OOMs. Did you see this?
> https://stackoverflow.com/questions/37260230/spark-cluster-full-of-heartbeat-timeouts-executors-exiting-on-their-own
>
> On Thu, Oct 26, 2017 at 1:57 PM, Abhimanyu Nagrath <abhimanyunagr...@gmail.com> wrote:
>
>> Hi Vaghawan,
>>
>> For debugging I made one change: I reduced the number of features to 1, with the record count staying the same at 1 million, on the same hardware (240 GB RAM, 32 cores, 100 GB swap), and training has now been running for 2 hours. Is this expected behavior? On which factors does the training time depend?
>>
>> Regards,
>> Abhimanyu
>>
>> On Thu, Oct 26, 2017 at 12:41 PM, Abhimanyu Nagrath <abhimanyunagr...@gmail.com> wrote:
>>
>>> Hi Vaghawan,
>>>
>>> I have made the template compatible with the version mentioned above: I changed the versions in engine.json and renamed the packages.
>>>
>>> Regards,
>>> Abhimanyu
>>>
>>> On Thu, Oct 26, 2017 at 12:39 PM, Vaghawan Ojha <vaghawan...@gmail.com> wrote:
>>>
>>>> Hi Abhimanyu,
>>>>
>>>> I don't think this template works with version 0.11.0. As per the template:
>>>>
>>>> "update for PredictionIO 0.9.2, including:"
>>>>
>>>> I don't think it supports the latest pio. You should rather switch to 0.9.2 if you want to experiment with it.
>>>>
>>>> On Thu, Oct 26, 2017 at 12:52 PM, Abhimanyu Nagrath <abhimanyunagr...@gmail.com> wrote:
>>>>
>>>>> Hi Vaghawan,
>>>>>
>>>>> I am using v0.11.0-incubating with ES v5.2.1, HBase 1.2.6, and Spark 2.1.0.
>>>>>
>>>>> Regards,
>>>>> Abhimanyu
>>>>>
>>>>> On Thu, Oct 26, 2017 at 12:31 PM, Vaghawan Ojha <vaghawan...@gmail.com> wrote:
>>>>>
>>>>>> Hi Abhimanyu,
>>>>>>
>>>>>> OK, which version of pio is this? The template looks old to me.
>>>>>>
>>>>>> On Thu, Oct 26, 2017 at 12:44 PM, Abhimanyu Nagrath <abhimanyunagr...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Vaghawan,
>>>>>>>
>>>>>>> Yes, the Spark master connection string is correct. I get "executor failed to connect to spark master" after 4-5 hours.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Abhimanyu
>>>>>>>
>>>>>>> On Thu, Oct 26, 2017 at 12:17 PM, Sachin Kamkar <sachinkam...@gmail.com> wrote:
>>>>>>>
>>>>>>>> It should be correct, as the user got the exception 3-4 hours after starting. So it looks like something else broke. OOM?
>>>>>>>>
>>>>>>>> With Regards,
>>>>>>>> Sachin
>>>>>>>> ⚜KTBFFH⚜
>>>>>>>>
>>>>>>>> On Thu, Oct 26, 2017 at 12:15 PM, Vaghawan Ojha <vaghawan...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> "Executor failed to connect with master": are you sure the --master spark://*.*.*.*:7077 is correct?
>>>>>>>>>
>>>>>>>>> Like the one you copied from the Spark master's web UI? Sometimes getting that wrong makes the connection to the Spark master fail.
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>>
>>>>>>>>> On Thu, Oct 26, 2017 at 12:02 PM, Abhimanyu Nagrath <abhimanyunagr...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> I am new to PredictionIO. I am using the template https://github.com/EmergentOrder/template-scala-probabilistic-classifier-batch-lbfgs.
>>>>>>>>>>
>>>>>>>>>> My training dataset has 1,184,603 records with approximately 6,500 features. I am using an EC2 r4.8xlarge instance (240 GB RAM, 32 cores, 200 GB swap).
>>>>>>>>>>
>>>>>>>>>> I tried two ways of training.
>>>>>>>>>>
>>>>>>>>>> 1. The command
>>>>>>>>>>
>>>>>>>>>> > pio train -- --driver-memory 120G --executor-memory 100G --conf spark.network.timeout=10000000
>>>>>>>>>>
>>>>>>>>>> throws the following exception after 3-4 hours:
>>>>>>>>>>
>>>>>>>>>> Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 15, localhost, executor driver): ExecutorLostFailure (executor driver exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 181529 ms
>>>>>>>>>> Driver stacktrace:
>>>>>>>>>>   at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
>>>>>>>>>>   at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
>>>>>>>>>>   at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
>>>>>>>>>>   at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>>>>>>>>>>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>>>>>>>>>>   at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
>>>>>>>>>>   at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
>>>>>>>>>>   at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
>>>>>>>>>>   at scala.Option.foreach(Option.scala:257)
>>>>>>>>>>   at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)
>>>>>>>>>>   at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650)
>>>>>>>>>>   at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
>>>>>>>>>>   at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
>>>>>>>>>>   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>>>>>>>>>>   at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)
>>>>>>>>>>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1918)
>>>>>>>>>>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1931)
>>>>>>>>>>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1944)
>>>>>>>>>>   at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1353)
>>>>>>>>>>   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>>>>>>>>>>   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>>>>>>>>>>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
>>>>>>>>>>   at org.apache.spark.rdd.RDD.take(RDD.scala:1326)
>>>>>>>>>>   at org.example.classification.LogisticRegressionWithLBFGSAlgorithm.train(LogisticRegressionWithLBFGSAlgorithm.scala:28)
>>>>>>>>>>   at org.example.classification.LogisticRegressionWithLBFGSAlgorithm.train(LogisticRegressionWithLBFGSAlgorithm.scala:21)
>>>>>>>>>>   at org.apache.predictionio.controller.P2LAlgorithm.trainBase(P2LAlgorithm.scala:49)
>>>>>>>>>>   at org.apache.predictionio.controller.Engine$$anonfun$18.apply(Engine.scala:692)
>>>>>>>>>>   at org.apache.predictionio.controller.Engine$$anonfun$18.apply(Engine.scala:692)
>>>>>>>>>>   at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>>>>>>>>>>   at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>>>>>>>>>>   at scala.collection.immutable.List.foreach(List.scala:381)
>>>>>>>>>>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>>>>>>>>>>   at scala.collection.immutable.List.map(List.scala:285)
>>>>>>>>>>   at org.apache.predictionio.controller.Engine$.train(Engine.scala:692)
>>>>>>>>>>   at org.apache.predictionio.controller.Engine.train(Engine.scala:177)
>>>>>>>>>>   at org.apache.predictionio.workflow.CoreWorkflow$.runTrain(CoreWorkflow.scala:67)
>>>>>>>>>>   at org.apache.predictionio.workflow.CreateWorkflow$.main(CreateWorkflow.scala:250)
>>>>>>>>>>   at org.apache.predictionio.workflow.CreateWorkflow.main(CreateWorkflow.scala)
>>>>>>>>>>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>>>>>>>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>>>>>>>>>>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>>>>>>>   at java.lang.reflect.Method.invoke(Method.java:498)
>>>>>>>>>>   at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:738)
>>>>>>>>>>   at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
>>>>>>>>>>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
>>>>>>>>>>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
>>>>>>>>>>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>>>>>>>>>>
>>>>>>>>>> 2. I started a standalone Spark cluster with 1 master and 3 workers and executed
>>>>>>>>>>
>>>>>>>>>> > pio train -- --master spark://*.*.*.*:7077 --driver-memory 50G --executor-memory 50G
>>>>>>>>>>
>>>>>>>>>> After some time the executor fails to connect with the master and training stops.
>>>>>>>>>>
>>>>>>>>>> I have changed the feature count from 6,500 to 500 and the situation is still the same. So can anyone suggest what I am missing?
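>>>>>>>>>>
>>>>>>>>>> For context, the spot the trace points at (LogisticRegressionWithLBFGSAlgorithm.scala:28) is the algorithm's train method. Paraphrased from memory rather than copied from the template source, and with the wrapper object name made up for illustration, it has roughly this shape:
>>>>>>>>>>
>>>>>>>>>> import org.apache.spark.mllib.classification.{LogisticRegressionModel, LogisticRegressionWithLBFGS}
>>>>>>>>>> import org.apache.spark.mllib.regression.LabeledPoint
>>>>>>>>>> import org.apache.spark.rdd.RDD
>>>>>>>>>>
>>>>>>>>>> object LbfgsTrainSketch { // hypothetical wrapper, not the template's class
>>>>>>>>>>   def train(data: RDD[LabeledPoint], numClasses: Int): LogisticRegressionModel = {
>>>>>>>>>>     // Cache before handing the RDD to the iterative LBFGS optimizer:
>>>>>>>>>>     // every iteration re-reads the full dataset, so an uncached RDD is
>>>>>>>>>>     // recomputed from the event store on each pass, stretching tasks
>>>>>>>>>>     // toward the heartbeat timeout.
>>>>>>>>>>     val cached = data.cache()
>>>>>>>>>>     // An RDD.take such as the one at line 28 of the trace forces the
>>>>>>>>>>     // first full materialization, which is where my job dies
>>>>>>>>>>     // ("Task 0 in stage 1.0").
>>>>>>>>>>     require(cached.take(1).nonEmpty, "no training data")
>>>>>>>>>>     new LogisticRegressionWithLBFGS()
>>>>>>>>>>       .setNumClasses(numClasses)
>>>>>>>>>>       .run(cached)
>>>>>>>>>>   }
>>>>>>>>>> }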
>>>>>>>>>>
>>>>>>>>>> Also, in between, training produces continuous warnings like:
>>>>>>>>>>
>>>>>>>>>> > [WARN] [ScannerCallable] Ignore, probably already closed
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Abhimanyu
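>>>>>>>>>>
>>>>>>>>>> P.S. A rough back-of-the-envelope on the raw data size, assuming dense double-precision features (my own estimate, not measured):
>>>>>>>>>>
>>>>>>>>>> val rows = 1184603L
>>>>>>>>>> val features = 6500L
>>>>>>>>>> val denseBytes = rows * features * 8L // 8 bytes per Double
>>>>>>>>>> val denseGB = denseBytes / 1e9        // ~61.6 GB for one dense copy
>>>>>>>>>>
>>>>>>>>>> So a single dense, cached copy is already around 60 GB, in the same range as the executor memory I am allotting; if extra copies get materialized during LBFGS, running out of memory seems plausible.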