Hi Abhimanyu,

Is there more information in the Spark web UI, or in pio.log on the machine where you run the `pio train` command? Also, sharing your full modifications somewhere on GitHub would be very helpful.
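If the heartbeat timeout is the real problem, and not just a symptom of the executor dying under memory pressure, raising the heartbeat interval together with the network timeout sometimes helps. This is only a rough sketch with placeholder values; adjust the memory sizes and timeouts for your cluster:

    pio train -- --driver-memory 100G --executor-memory 100G \
        --conf spark.network.timeout=600s \
        --conf spark.executor.heartbeatInterval=60s

spark.executor.heartbeatInterval should stay well below spark.network.timeout. If the process is actually running out of memory, though, no timeout setting will save it, which is another reason the executor logs in the Spark web UI are worth a look.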
Regards,
Donald

On Thu, Oct 26, 2017 at 2:22 AM Abhimanyu Nagrath <[email protected]> wrote:

> Hi Vaghawan,
> Thanks for the reply. Yes, I already tried that, but I am still getting the
> same error.
>
> Regards,
> Abhimanyu
>
> On Thu, Oct 26, 2017 at 2:40 PM, Vaghawan Ojha <[email protected]> wrote:
>
>> Hi Abhimanyu,
>>
>> I've never tried the classification template, so I'm not sure how much
>> time it should take. But as per your error, your model is not getting
>> past stage 1: "Task 0 in stage 1.0 failed 1 times".
>>
>> Probably something to do with OOMs.
>> https://stackoverflow.com/questions/37260230/spark-cluster-full-of-heartbeat-timeouts-executors-exiting-on-their-own
>>
>> Did you see this?
>>
>> On Thu, Oct 26, 2017 at 1:57 PM, Abhimanyu Nagrath <[email protected]> wrote:
>>
>>> Hi Vaghawan,
>>>
>>> For debugging I made one change: I reduced the number of features to 1,
>>> with the record count still 1 million; the hardware is 240 GB RAM,
>>> 32 cores and 100 GB swap; and training has now been running for 2 hours.
>>> Is this expected behavior? Which factors does the training time depend on?
>>>
>>> Regards,
>>> Abhimanyu
>>>
>>> On Thu, Oct 26, 2017 at 12:41 PM, Abhimanyu Nagrath <[email protected]> wrote:
>>>
>>>> Hi Vaghawan,
>>>>
>>>> I have made that template compatible with the version mentioned above:
>>>> I changed the versions in engine.json and the package names.
>>>>
>>>> Regards,
>>>> Abhimanyu
>>>>
>>>> On Thu, Oct 26, 2017 at 12:39 PM, Vaghawan Ojha <[email protected]> wrote:
>>>>
>>>>> Hi Abhimanyu,
>>>>>
>>>>> I don't think this template works with version 0.11.0. As per the
>>>>> template:
>>>>>
>>>>> update for PredictionIO 0.9.2, including:
>>>>>
>>>>> I don't think it supports the latest pio. You would be better off
>>>>> switching to 0.9.2 if you want to experiment with it.
>>>>>
>>>>> On Thu, Oct 26, 2017 at 12:52 PM, Abhimanyu Nagrath <[email protected]> wrote:
>>>>>
>>>>>> Hi Vaghawan,
>>>>>>
>>>>>> I am using v0.11.0-incubating with ES v5.2.1, HBase 1.2.6 and
>>>>>> Spark 2.1.0.
>>>>>>
>>>>>> Regards,
>>>>>> Abhimanyu
>>>>>>
>>>>>> On Thu, Oct 26, 2017 at 12:31 PM, Vaghawan Ojha <[email protected]> wrote:
>>>>>>
>>>>>>> Hi Abhimanyu,
>>>>>>>
>>>>>>> OK, which version of pio is this? Because the template looks old to me.
>>>>>>>
>>>>>>> On Thu, Oct 26, 2017 at 12:44 PM, Abhimanyu Nagrath <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi Vaghawan,
>>>>>>>>
>>>>>>>> Yes, the Spark master connection string is correct. I am getting
>>>>>>>> "executor failed to connect to Spark master" after 4-5 hrs.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Abhimanyu
>>>>>>>>
>>>>>>>> On Thu, Oct 26, 2017 at 12:17 PM, Sachin Kamkar <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> It should be correct, as the user got the exception after 3-4
>>>>>>>>> hours of starting. So it looks like something else broke. OOM?
>>>>>>>>>
>>>>>>>>> With Regards,
>>>>>>>>>
>>>>>>>>> Sachin
>>>>>>>>> ⚜KTBFFH⚜
>>>>>>>>>
>>>>>>>>> On Thu, Oct 26, 2017 at 12:15 PM, Vaghawan Ojha <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> "Executor failed to connect with master": are you sure the --master
>>>>>>>>>> spark://*.*.*.*:7077 is correct?
>>>>>>>>>>
>>>>>>>>>> Is it the one you copied from the Spark master's web UI? Sometimes
>>>>>>>>>> getting that wrong makes executors fail to connect to the Spark master.
>>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>>
>>>>>>>>>> On Thu, Oct 26, 2017 at 12:02 PM, Abhimanyu Nagrath <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> I am new to PredictionIO. I am using the template
>>>>>>>>>>> https://github.com/EmergentOrder/template-scala-probabilistic-classifier-batch-lbfgs
>>>>>>>>>>> .
>>>>>>>>>>>
>>>>>>>>>>> My training dataset has 1,184,603 records with approximately 6,500
>>>>>>>>>>> features. I am using an EC2 r4.8xlarge instance (240 GB RAM, 32 cores,
>>>>>>>>>>> 200 GB swap).
>>>>>>>>>>>
>>>>>>>>>>> I have tried two ways of training.
>>>>>>>>>>>
>>>>>>>>>>> 1. The command
>>>>>>>>>>>
>>>>>>>>>>> > pio train -- --driver-memory 120G --executor-memory 100G --conf
>>>>>>>>>>> > spark.network.timeout=10000000
>>>>>>>>>>>
>>>>>>>>>>> It throws the following exception after 3-4 hours:
>>>>>>>>>>>
>>>>>>>>>>> Exception in thread "main" org.apache.spark.SparkException:
>>>>>>>>>>> Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times,
>>>>>>>>>>> most recent failure: Lost task 0.0 in stage 1.0 (TID 15, localhost,
>>>>>>>>>>> executor driver): ExecutorLostFailure (executor driver exited caused by
>>>>>>>>>>> one of the running tasks) Reason: Executor heartbeat timed out after 181529 ms
>>>>>>>>>>> Driver stacktrace:
>>>>>>>>>>> at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
>>>>>>>>>>> at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
>>>>>>>>>>> at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
>>>>>>>>>>> at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>>>>>>>>>>> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>>>>>>>>>>> at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
>>>>>>>>>>> at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
>>>>>>>>>>> at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
>>>>>>>>>>> at scala.Option.foreach(Option.scala:257)
>>>>>>>>>>> at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)
>>>>>>>>>>> at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650)
>>>>>>>>>>> at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
>>>>>>>>>>> at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
>>>>>>>>>>> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>>>>>>>>>>> at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)
>>>>>>>>>>> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1918)
>>>>>>>>>>> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1931)
>>>>>>>>>>> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1944)
>>>>>>>>>>> at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1353)
>>>>>>>>>>> at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>>>>>>>>>>> at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>>>>>>>>>>> at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
>>>>>>>>>>> at org.apache.spark.rdd.RDD.take(RDD.scala:1326)
>>>>>>>>>>> at org.example.classification.LogisticRegressionWithLBFGSAlgorithm.train(LogisticRegressionWithLBFGSAlgorithm.scala:28)
>>>>>>>>>>> at org.example.classification.LogisticRegressionWithLBFGSAlgorithm.train(LogisticRegressionWithLBFGSAlgorithm.scala:21)
>>>>>>>>>>> at org.apache.predictionio.controller.P2LAlgorithm.trainBase(P2LAlgorithm.scala:49)
>>>>>>>>>>> at org.apache.predictionio.controller.Engine$$anonfun$18.apply(Engine.scala:692)
>>>>>>>>>>> at org.apache.predictionio.controller.Engine$$anonfun$18.apply(Engine.scala:692)
>>>>>>>>>>> at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>>>>>>>>>>> at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>>>>>>>>>>> at scala.collection.immutable.List.foreach(List.scala:381)
>>>>>>>>>>> at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>>>>>>>>>>> at scala.collection.immutable.List.map(List.scala:285)
>>>>>>>>>>> at org.apache.predictionio.controller.Engine$.train(Engine.scala:692)
>>>>>>>>>>> at org.apache.predictionio.controller.Engine.train(Engine.scala:177)
>>>>>>>>>>> at org.apache.predictionio.workflow.CoreWorkflow$.runTrain(CoreWorkflow.scala:67)
>>>>>>>>>>> at org.apache.predictionio.workflow.CreateWorkflow$.main(CreateWorkflow.scala:250)
>>>>>>>>>>> at org.apache.predictionio.workflow.CreateWorkflow.main(CreateWorkflow.scala)
>>>>>>>>>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>>>>>>>> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>>>>>>>>>>> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>>>>>>>> at java.lang.reflect.Method.invoke(Method.java:498)
>>>>>>>>>>> at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:738)
>>>>>>>>>>> at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
>>>>>>>>>>> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
>>>>>>>>>>> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
>>>>>>>>>>> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>>>>>>>>>>>
>>>>>>>>>>> 2. I started a Spark standalone cluster with 1 master and 3
>>>>>>>>>>> workers and executed the command
>>>>>>>>>>>
>>>>>>>>>>> > pio train -- --master spark://*.*.*.*:7077 --driver-memory 50G
>>>>>>>>>>> > --executor-memory 50G
>>>>>>>>>>>
>>>>>>>>>>> After some time I get the error "Executor failed to connect with
>>>>>>>>>>> master" and training stops.
>>>>>>>>>>>
>>>>>>>>>>> I have reduced the feature count from 6,500 to 500 and the behavior
>>>>>>>>>>> is still the same.
>>>>>>>>>>> Can anyone suggest what I might be missing?
>>>>>>>>>>>
>>>>>>>>>>> During training I also get continuous warnings like:
>>>>>>>>>>>
>>>>>>>>>>> > [WARN] [ScannerCallable] Ignore, probably already closed
>>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>> Abhimanyu
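Coming back to the stack trace for a moment: the failed stage is backing a take() call at LogisticRegressionWithLBFGSAlgorithm.scala:28, so Spark dies while computing the data for that take(), not inside LBFGS itself. Note also that this run was in local mode ("localhost, executor driver"), where --executor-memory is effectively ignored and the heap is just what --driver-memory gives the single JVM. One thing worth trying in your modified train() is persisting the labeled points to memory-and-disk before anything materializes them, so a large dataset can spill instead of being recomputed or held entirely in RAM. This is only a sketch of the idea, not the template's actual code; the method and variable names here are illustrative:

    import org.apache.spark.mllib.classification.{LogisticRegressionModel, LogisticRegressionWithLBFGS}
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.rdd.RDD
    import org.apache.spark.storage.StorageLevel

    def train(data: RDD[LabeledPoint]): LogisticRegressionModel = {
      // Spill partitions to disk instead of holding or recomputing everything in memory.
      val cached = data.persist(StorageLevel.MEMORY_AND_DISK)
      // Materialize a single record as a sanity check; avoid collecting more than that.
      require(cached.take(1).nonEmpty, "no training data found")
      new LogisticRegressionWithLBFGS()
        .setNumClasses(2) // assumption: binary labels; set this to your actual class count
        .run(cached)
    }

Whether that is enough for 1.2 million records with 6,500 features on one machine is hard to say without the executor logs, which is why the Spark web UI output is the first thing to check.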
