Hi Vaghawan, thanks for the reply. Yes, I already tried that, but I'm still getting the same error.
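
For reference, what I tried after reading that link was along these lines (the heartbeatInterval setting is the one the usual advice adds alongside the network timeout; the value here is illustrative, not the exact one I used):

> pio train -- --driver-memory 120G --executor-memory 100G --conf spark.network.timeout=10000000 --conf spark.executor.heartbeatInterval=100s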
Regards,
Abhimanyu

On Thu, Oct 26, 2017 at 2:40 PM, Vaghawan Ojha <vaghawan...@gmail.com> wrote:

> Hi Abhimanyu,
>
> I've never tried the classification template, so I'm not sure exactly how much time it would take. But as per your error, your model is not getting past stage 1: "Task 0 in stage 1.0 failed 1 times".
>
> Probably something to do with OOMs. Did you see this?
> https://stackoverflow.com/questions/37260230/spark-cluster-full-of-heartbeat-timeouts-executors-exiting-on-their-own
>
> On Thu, Oct 26, 2017 at 1:57 PM, Abhimanyu Nagrath <abhimanyunagr...@gmail.com> wrote:
>
>> Hi Vaghawan,
>>
>> For debugging I made one change: I reduced the number of features to 1, with the record count staying the same at 1 million, on the same hardware (240 GB RAM, 32 cores, 100 GB swap), and training has now been running for 2 hours. Is this expected behavior? On which factors does the training time depend?
>>
>> Regards,
>> Abhimanyu
>>
>> On Thu, Oct 26, 2017 at 12:41 PM, Abhimanyu Nagrath <abhimanyunagr...@gmail.com> wrote:
>>
>>> Hi Vaghawan,
>>>
>>> I have made the template compatible with the version mentioned above: I changed the versions in engine.json and renamed the packages.
>>>
>>> Regards,
>>> Abhimanyu
>>>
>>> On Thu, Oct 26, 2017 at 12:39 PM, Vaghawan Ojha <vaghawan...@gmail.com> wrote:
>>>
>>>> Hi Abhimanyu,
>>>>
>>>> I don't think this template works with version 0.11.0. As per the template:
>>>>
>>>> "update for PredictionIO 0.9.2, including:"
>>>>
>>>> I don't think it supports the latest pio. You should rather switch to 0.9.2 if you want to experiment with it.
>>>>
>>>> On Thu, Oct 26, 2017 at 12:52 PM, Abhimanyu Nagrath <abhimanyunagr...@gmail.com> wrote:
>>>>
>>>>> Hi Vaghawan,
>>>>>
>>>>> I am using v0.11.0-incubating with ES v5.2.1, HBase 1.2.6, and Spark 2.1.0.
>>>>>
>>>>> Regards,
>>>>> Abhimanyu
>>>>>
>>>>> On Thu, Oct 26, 2017 at 12:31 PM, Vaghawan Ojha <vaghawan...@gmail.com> wrote:
>>>>>
>>>>>> Hi Abhimanyu,
>>>>>>
>>>>>> OK, which version of pio is this? The template looks old to me.
>>>>>>
>>>>>> On Thu, Oct 26, 2017 at 12:44 PM, Abhimanyu Nagrath <abhimanyunagr...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Vaghawan,
>>>>>>>
>>>>>>> Yes, the Spark master connection string is correct. I get "executor failed to connect to spark master" after 4-5 hours.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Abhimanyu
>>>>>>>
>>>>>>> On Thu, Oct 26, 2017 at 12:17 PM, Sachin Kamkar <sachinkam...@gmail.com> wrote:
>>>>>>>
>>>>>>>> It should be correct, as the user got the exception 3-4 hours after starting. So it looks like something else broke. OOM?
>>>>>>>>
>>>>>>>> With Regards,
>>>>>>>> Sachin
>>>>>>>> ⚜KTBFFH⚜
>>>>>>>>
>>>>>>>> On Thu, Oct 26, 2017 at 12:15 PM, Vaghawan Ojha <vaghawan...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> "Executor failed to connect with master": are you sure the --master spark://*.*.*.*:7077 is correct?
>>>>>>>>>
>>>>>>>>> Like the one you copied from the Spark master's web UI? Sometimes getting that wrong makes the connection to the Spark master fail.
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>>
>>>>>>>>> On Thu, Oct 26, 2017 at 12:02 PM, Abhimanyu Nagrath <abhimanyunagr...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> I am new to PredictionIO. I am using the template https://github.com/EmergentOrder/template-scala-probabilistic-classifier-batch-lbfgs.
>>>>>>>>>>
>>>>>>>>>> My training dataset has 1,184,603 records with approximately 6,500 features. I am using an EC2 r4.8xlarge instance (240 GB RAM, 32 cores, 200 GB swap).
>>>>>>>>>>
>>>>>>>>>> I tried two ways of training.
>>>>>>>>>>
>>>>>>>>>> 1. The command
>>>>>>>>>>
>>>>>>>>>> > pio train -- --driver-memory 120G --executor-memory 100G --conf spark.network.timeout=10000000
>>>>>>>>>>
>>>>>>>>>> throws the following exception after 3-4 hours:
>>>>>>>>>>
>>>>>>>>>> Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 15, localhost, executor driver): ExecutorLostFailure (executor driver exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 181529 ms
>>>>>>>>>> Driver stacktrace:
>>>>>>>>>>   at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
>>>>>>>>>>   at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
>>>>>>>>>>   at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
>>>>>>>>>>   at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>>>>>>>>>>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>>>>>>>>>>   at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
>>>>>>>>>>   at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
>>>>>>>>>>   at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
>>>>>>>>>>   at scala.Option.foreach(Option.scala:257)
>>>>>>>>>>   at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)
>>>>>>>>>>   at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650)
>>>>>>>>>>   at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
>>>>>>>>>>   at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
>>>>>>>>>>   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>>>>>>>>>>   at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)
>>>>>>>>>>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1918)
>>>>>>>>>>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1931)
>>>>>>>>>>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1944)
>>>>>>>>>>   at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1353)
>>>>>>>>>>   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>>>>>>>>>>   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>>>>>>>>>>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
>>>>>>>>>>   at org.apache.spark.rdd.RDD.take(RDD.scala:1326)
>>>>>>>>>>   at org.example.classification.LogisticRegressionWithLBFGSAlgorithm.train(LogisticRegressionWithLBFGSAlgorithm.scala:28)
>>>>>>>>>>   at org.example.classification.LogisticRegressionWithLBFGSAlgorithm.train(LogisticRegressionWithLBFGSAlgorithm.scala:21)
>>>>>>>>>>   at org.apache.predictionio.controller.P2LAlgorithm.trainBase(P2LAlgorithm.scala:49)
>>>>>>>>>>   at org.apache.predictionio.controller.Engine$$anonfun$18.apply(Engine.scala:692)
>>>>>>>>>>   at org.apache.predictionio.controller.Engine$$anonfun$18.apply(Engine.scala:692)
>>>>>>>>>>   at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>>>>>>>>>>   at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>>>>>>>>>>   at scala.collection.immutable.List.foreach(List.scala:381)
>>>>>>>>>>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>>>>>>>>>>   at scala.collection.immutable.List.map(List.scala:285)
>>>>>>>>>>   at org.apache.predictionio.controller.Engine$.train(Engine.scala:692)
>>>>>>>>>>   at org.apache.predictionio.controller.Engine.train(Engine.scala:177)
>>>>>>>>>>   at org.apache.predictionio.workflow.CoreWorkflow$.runTrain(CoreWorkflow.scala:67)
>>>>>>>>>>   at org.apache.predictionio.workflow.CreateWorkflow$.main(CreateWorkflow.scala:250)
>>>>>>>>>>   at org.apache.predictionio.workflow.CreateWorkflow.main(CreateWorkflow.scala)
>>>>>>>>>>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>>>>>>>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>>>>>>>>>>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>>>>>>>   at java.lang.reflect.Method.invoke(Method.java:498)
>>>>>>>>>>   at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:738)
>>>>>>>>>>   at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
>>>>>>>>>>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
>>>>>>>>>>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
>>>>>>>>>>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>>>>>>>>>>
>>>>>>>>>> 2. I started a standalone Spark cluster with 1 master and 3 workers and executed
>>>>>>>>>>
>>>>>>>>>> > pio train -- --master spark://*.*.*.*:7077 --driver-memory 50G --executor-memory 50G
>>>>>>>>>>
>>>>>>>>>> After some time the executor fails to connect with the master and training stops.
>>>>>>>>>>
>>>>>>>>>> I have changed the feature count from 6,500 to 500 and the situation is still the same. So can anyone suggest what I am missing?
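>>>>>>>>>>
>>>>>>>>>> For context, the spot the trace points at (LogisticRegressionWithLBFGSAlgorithm.scala:28) is the algorithm's train method. Paraphrased from memory rather than copied from the template source, and with the wrapper object name made up for illustration, it has roughly this shape:
>>>>>>>>>>
>>>>>>>>>> import org.apache.spark.mllib.classification.{LogisticRegressionModel, LogisticRegressionWithLBFGS}
>>>>>>>>>> import org.apache.spark.mllib.regression.LabeledPoint
>>>>>>>>>> import org.apache.spark.rdd.RDD
>>>>>>>>>>
>>>>>>>>>> object LbfgsTrainSketch { // hypothetical wrapper, not the template's class
>>>>>>>>>>   def train(data: RDD[LabeledPoint], numClasses: Int): LogisticRegressionModel = {
>>>>>>>>>>     // Cache before handing the RDD to the iterative LBFGS optimizer:
>>>>>>>>>>     // every iteration re-reads the full dataset, so an uncached RDD is
>>>>>>>>>>     // recomputed from the event store on each pass, stretching tasks
>>>>>>>>>>     // toward the heartbeat timeout.
>>>>>>>>>>     val cached = data.cache()
>>>>>>>>>>     // An RDD.take such as the one at line 28 of the trace forces the
>>>>>>>>>>     // first full materialization, which is where my job dies
>>>>>>>>>>     // ("Task 0 in stage 1.0").
>>>>>>>>>>     require(cached.take(1).nonEmpty, "no training data")
>>>>>>>>>>     new LogisticRegressionWithLBFGS()
>>>>>>>>>>       .setNumClasses(numClasses)
>>>>>>>>>>       .run(cached)
>>>>>>>>>>   }
>>>>>>>>>> }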
>>>>>>>>>>
>>>>>>>>>> Also, in between, training produces continuous warnings like:
>>>>>>>>>>
>>>>>>>>>> > [WARN] [ScannerCallable] Ignore, probably already closed
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Abhimanyu
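>>>>>>>>>>
>>>>>>>>>> P.S. A rough back-of-the-envelope on the raw data size, assuming dense double-precision features (my own estimate, not measured):
>>>>>>>>>>
>>>>>>>>>> val rows = 1184603L
>>>>>>>>>> val features = 6500L
>>>>>>>>>> val denseBytes = rows * features * 8L // 8 bytes per Double
>>>>>>>>>> val denseGB = denseBytes / 1e9        // ~61.6 GB for one dense copy
>>>>>>>>>>
>>>>>>>>>> So a single dense, cached copy is already around 60 GB, in the same range as the executor memory I am allotting; if extra copies get materialized during LBFGS, running out of memory seems plausible.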