Hi Abhimanyu,

Is there more information in the Spark web UI, or in pio.log on the machine where you run the `pio train` command? Also, sharing your full modifications somewhere on GitHub would be very helpful.
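If the heartbeat timeout is the real problem, and not just a symptom of the executor dying under memory pressure, raising the heartbeat interval together with the network timeout sometimes helps. This is only a rough sketch with placeholder values; adjust the memory sizes and timeouts for your cluster:

    pio train -- --driver-memory 100G --executor-memory 100G \
        --conf spark.network.timeout=600s \
        --conf spark.executor.heartbeatInterval=60s

spark.executor.heartbeatInterval should stay well below spark.network.timeout. If the process is actually running out of memory, though, no timeout setting will save it, which is another reason the executor logs in the Spark web UI are worth a look.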
Regards,
Donald

On Thu, Oct 26, 2017 at 2:22 AM Abhimanyu Nagrath <[email protected]> wrote:

> Hi Vaghawan,
> Thanks for the reply. Yes, I already tried that, but I am still getting the
> same error.
>
> Regards,
> Abhimanyu
>
> On Thu, Oct 26, 2017 at 2:40 PM, Vaghawan Ojha <[email protected]> wrote:
>
>> Hi Abhimanyu,
>>
>> I've never tried the classification template, so I'm not sure how much
>> time it should take. But as per your error, your model is not getting
>> past stage 1: "Task 0 in stage 1.0 failed 1 times".
>>
>> Probably something to do with OOMs.
>> https://stackoverflow.com/questions/37260230/spark-cluster-full-of-heartbeat-timeouts-executors-exiting-on-their-own
>>
>> Did you see this?
>>
>> On Thu, Oct 26, 2017 at 1:57 PM, Abhimanyu Nagrath <[email protected]> wrote:
>>
>>> Hi Vaghawan,
>>>
>>> For debugging I made one change: I reduced the number of features to 1,
>>> with the record count still 1 million; the hardware is 240 GB RAM,
>>> 32 cores and 100 GB swap; and training has now been running for 2 hours.
>>> Is this expected behavior? Which factors does the training time depend on?
>>>
>>> Regards,
>>> Abhimanyu
>>>
>>> On Thu, Oct 26, 2017 at 12:41 PM, Abhimanyu Nagrath <[email protected]> wrote:
>>>
>>>> Hi Vaghawan,
>>>>
>>>> I have made that template compatible with the version mentioned above:
>>>> I changed the versions in engine.json and the package names.
>>>>
>>>> Regards,
>>>> Abhimanyu
>>>>
>>>> On Thu, Oct 26, 2017 at 12:39 PM, Vaghawan Ojha <[email protected]> wrote:
>>>>
>>>>> Hi Abhimanyu,
>>>>>
>>>>> I don't think this template works with version 0.11.0. As per the
>>>>> template:
>>>>>
>>>>> update for PredictionIO 0.9.2, including:
>>>>>
>>>>> I don't think it supports the latest pio. You would be better off
>>>>> switching to 0.9.2 if you want to experiment with it.
>>>>>
>>>>> On Thu, Oct 26, 2017 at 12:52 PM, Abhimanyu Nagrath <[email protected]> wrote:
>>>>>
>>>>>> Hi Vaghawan,
>>>>>>
>>>>>> I am using v0.11.0-incubating with ES v5.2.1, HBase 1.2.6 and
>>>>>> Spark 2.1.0.
>>>>>>
>>>>>> Regards,
>>>>>> Abhimanyu
>>>>>>
>>>>>> On Thu, Oct 26, 2017 at 12:31 PM, Vaghawan Ojha <[email protected]> wrote:
>>>>>>
>>>>>>> Hi Abhimanyu,
>>>>>>>
>>>>>>> OK, which version of pio is this? Because the template looks old to me.
>>>>>>>
>>>>>>> On Thu, Oct 26, 2017 at 12:44 PM, Abhimanyu Nagrath <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi Vaghawan,
>>>>>>>>
>>>>>>>> Yes, the Spark master connection string is correct. I am getting
>>>>>>>> "executor failed to connect to Spark master" after 4-5 hrs.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Abhimanyu
>>>>>>>>
>>>>>>>> On Thu, Oct 26, 2017 at 12:17 PM, Sachin Kamkar <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> It should be correct, as the user got the exception after 3-4
>>>>>>>>> hours of starting. So it looks like something else broke. OOM?
>>>>>>>>>
>>>>>>>>> With Regards,
>>>>>>>>>
>>>>>>>>> Sachin
>>>>>>>>> ⚜KTBFFH⚜
>>>>>>>>>
>>>>>>>>> On Thu, Oct 26, 2017 at 12:15 PM, Vaghawan Ojha <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> "Executor failed to connect with master": are you sure the --master
>>>>>>>>>> spark://*.*.*.*:7077 is correct?
>>>>>>>>>>
>>>>>>>>>> Is it the one you copied from the Spark master's web UI? Sometimes
>>>>>>>>>> getting that wrong makes executors fail to connect to the Spark master.
>>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>>
>>>>>>>>>> On Thu, Oct 26, 2017 at 12:02 PM, Abhimanyu Nagrath <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> I am new to PredictionIO. I am using the template
>>>>>>>>>>> https://github.com/EmergentOrder/template-scala-probabilistic-classifier-batch-lbfgs
>>>>>>>>>>> .
>>>>>>>>>>>
>>>>>>>>>>> My training dataset has 1,184,603 records with approximately 6,500
>>>>>>>>>>> features. I am using an EC2 r4.8xlarge instance (240 GB RAM, 32 cores,
>>>>>>>>>>> 200 GB swap).
>>>>>>>>>>>
>>>>>>>>>>> I have tried two ways of training.
>>>>>>>>>>>
>>>>>>>>>>> 1. The command
>>>>>>>>>>>
>>>>>>>>>>> > pio train -- --driver-memory 120G --executor-memory 100G --conf
>>>>>>>>>>> > spark.network.timeout=10000000
>>>>>>>>>>>
>>>>>>>>>>> It throws the following exception after 3-4 hours:
>>>>>>>>>>>
>>>>>>>>>>> Exception in thread "main" org.apache.spark.SparkException:
>>>>>>>>>>> Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times,
>>>>>>>>>>> most recent failure: Lost task 0.0 in stage 1.0 (TID 15, localhost,
>>>>>>>>>>> executor driver): ExecutorLostFailure (executor driver exited caused by
>>>>>>>>>>> one of the running tasks) Reason: Executor heartbeat timed out after 181529 ms
>>>>>>>>>>> Driver stacktrace:
>>>>>>>>>>> at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
>>>>>>>>>>> at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
>>>>>>>>>>> at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
>>>>>>>>>>> at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>>>>>>>>>>> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>>>>>>>>>>> at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
>>>>>>>>>>> at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
>>>>>>>>>>> at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
>>>>>>>>>>> at scala.Option.foreach(Option.scala:257)
>>>>>>>>>>> at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)
>>>>>>>>>>> at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650)
>>>>>>>>>>> at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
>>>>>>>>>>> at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
>>>>>>>>>>> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>>>>>>>>>>> at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)
>>>>>>>>>>> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1918)
>>>>>>>>>>> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1931)
>>>>>>>>>>> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1944)
>>>>>>>>>>> at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1353)
>>>>>>>>>>> at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>>>>>>>>>>> at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>>>>>>>>>>> at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
>>>>>>>>>>> at org.apache.spark.rdd.RDD.take(RDD.scala:1326)
>>>>>>>>>>> at org.example.classification.LogisticRegressionWithLBFGSAlgorithm.train(LogisticRegressionWithLBFGSAlgorithm.scala:28)
>>>>>>>>>>> at org.example.classification.LogisticRegressionWithLBFGSAlgorithm.train(LogisticRegressionWithLBFGSAlgorithm.scala:21)
>>>>>>>>>>> at org.apache.predictionio.controller.P2LAlgorithm.trainBase(P2LAlgorithm.scala:49)
>>>>>>>>>>> at org.apache.predictionio.controller.Engine$$anonfun$18.apply(Engine.scala:692)
>>>>>>>>>>> at org.apache.predictionio.controller.Engine$$anonfun$18.apply(Engine.scala:692)
>>>>>>>>>>> at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>>>>>>>>>>> at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>>>>>>>>>>> at scala.collection.immutable.List.foreach(List.scala:381)
>>>>>>>>>>> at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>>>>>>>>>>> at scala.collection.immutable.List.map(List.scala:285)
>>>>>>>>>>> at org.apache.predictionio.controller.Engine$.train(Engine.scala:692)
>>>>>>>>>>> at org.apache.predictionio.controller.Engine.train(Engine.scala:177)
>>>>>>>>>>> at org.apache.predictionio.workflow.CoreWorkflow$.runTrain(CoreWorkflow.scala:67)
>>>>>>>>>>> at org.apache.predictionio.workflow.CreateWorkflow$.main(CreateWorkflow.scala:250)
>>>>>>>>>>> at org.apache.predictionio.workflow.CreateWorkflow.main(CreateWorkflow.scala)
>>>>>>>>>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>>>>>>>> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>>>>>>>>>>> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>>>>>>>> at java.lang.reflect.Method.invoke(Method.java:498)
>>>>>>>>>>> at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:738)
>>>>>>>>>>> at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
>>>>>>>>>>> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
>>>>>>>>>>> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
>>>>>>>>>>> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>>>>>>>>>>>
>>>>>>>>>>> 2. I started a Spark standalone cluster with 1 master and 3
>>>>>>>>>>> workers and executed the command
>>>>>>>>>>>
>>>>>>>>>>> > pio train -- --master spark://*.*.*.*:7077 --driver-memory 50G
>>>>>>>>>>> > --executor-memory 50G
>>>>>>>>>>>
>>>>>>>>>>> After some time I get the error "Executor failed to connect with
>>>>>>>>>>> master" and training stops.
>>>>>>>>>>>
>>>>>>>>>>> I have reduced the feature count from 6,500 to 500 and the behavior
>>>>>>>>>>> is still the same.
>>>>>>>>>>> Can anyone suggest what I might be missing?
>>>>>>>>>>>
>>>>>>>>>>> During training I also get continuous warnings like:
>>>>>>>>>>>
>>>>>>>>>>> > [WARN] [ScannerCallable] Ignore, probably already closed
>>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>> Abhimanyu
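Coming back to the stack trace for a moment: the failed stage is backing a take() call at LogisticRegressionWithLBFGSAlgorithm.scala:28, so Spark dies while computing the data for that take(), not inside LBFGS itself. Note also that this run was in local mode ("localhost, executor driver"), where --executor-memory is effectively ignored and the heap is just what --driver-memory gives the single JVM. One thing worth trying in your modified train() is persisting the labeled points to memory-and-disk before anything materializes them, so a large dataset can spill instead of being recomputed or held entirely in RAM. This is only a sketch of the idea, not the template's actual code; the method and variable names here are illustrative:

    import org.apache.spark.mllib.classification.{LogisticRegressionModel, LogisticRegressionWithLBFGS}
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.rdd.RDD
    import org.apache.spark.storage.StorageLevel

    def train(data: RDD[LabeledPoint]): LogisticRegressionModel = {
      // Spill partitions to disk instead of holding or recomputing everything in memory.
      val cached = data.persist(StorageLevel.MEMORY_AND_DISK)
      // Materialize a single record as a sanity check; avoid collecting more than that.
      require(cached.take(1).nonEmpty, "no training data found")
      new LogisticRegressionWithLBFGS()
        .setNumClasses(2) // assumption: binary labels; set this to your actual class count
        .run(cached)
    }

Whether that is enough for 1.2 million records with 6,500 features on one machine is hard to say without the executor logs, which is why the Spark web UI output is the first thing to check.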
