Hi Abhimanyu,

I've never tried the classification template, so I'm not sure how much
time it would take exactly. But judging from your error, your job is not
getting past stage 1: "Task 0 in stage 1.0 failed 1 times".

It's probably something to do with OOMs and the executor heartbeat timeout:
https://stackoverflow.com/questions/37260230/spark-cluster-full-of-heartbeat-timeouts-executors-exiting-on-their-own

Did you see this?
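
If the timeout is the culprit, one thing to try (a rough sketch; the values
are illustrative and not tuned for your data) is raising the executor
heartbeat interval and network timeout when launching training, e.g.:

    pio train -- --driver-memory 120G --executor-memory 100G \
      --conf spark.network.timeout=800s \
      --conf spark.executor.heartbeatInterval=60s

spark.executor.heartbeatInterval should stay well below spark.network.timeout.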

On Thu, Oct 26, 2017 at 1:57 PM, Abhimanyu Nagrath <
abhimanyunagr...@gmail.com> wrote:

> Hi Vaghawan,
>
> For debugging I made a change: I reduced the number of features to 1,
> with the record count staying the same at 1 million and the hardware
> unchanged (240 GB RAM, 32 cores and 100 GB swap), and training has still
> been running for 2 hours. Is that expected behavior? Which factors does
> the training time depend on?
>
>
> Regards,
> Abhimanyu
>
>
> On Thu, Oct 26, 2017 at 12:41 PM, Abhimanyu Nagrath <
> abhimanyunagr...@gmail.com> wrote:
>
>> Hi Vaghawan,
>>
>> I have made that template compatible with the version mentioned above:
>> I changed the versions in engine.json and the build, and renamed the
>> packages.
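
For reference, the kind of change involved is roughly the following
dependency bump (a sketch with assumed artifact and version strings for
PredictionIO 0.11.0-incubating and Spark 2.1.0, not this template's actual
build file), plus renaming the old io.prediction.* imports to
org.apache.predictionio.*:

    // build.sbt (sketch): assumed coordinates for PIO 0.11.0-incubating + Spark 2.1.0
    libraryDependencies ++= Seq(
      "org.apache.predictionio" %% "apache-predictionio-core" % "0.11.0-incubating" % "provided",
      "org.apache.spark"        %% "spark-core"               % "2.1.0"             % "provided",
      "org.apache.spark"        %% "spark-mllib"              % "2.1.0"             % "provided")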
>>
>>
>> Regards,
>> Abhimanyu
>>
>> On Thu, Oct 26, 2017 at 12:39 PM, Vaghawan Ojha <vaghawan...@gmail.com>
>> wrote:
>>
>>> Hi Abhimanyu,
>>>
>>> I don't think this template works with version 0.11.0. As per the
>>> template's notes:
>>>
>>> "update for PredictionIO 0.9.2, including:"
>>>
>>> I don't think it supports the latest PIO. You'd rather switch to 0.9.2
>>> if you want to experiment with it.
>>>
>>> On Thu, Oct 26, 2017 at 12:52 PM, Abhimanyu Nagrath <
>>> abhimanyunagr...@gmail.com> wrote:
>>>
>>>> Hi Vaghawan ,
>>>>
>>>> I am using v0.11.0-incubating with ES v5.2.1, HBase 1.2.6 and Spark
>>>> 2.1.0.
>>>>
>>>> Regards,
>>>> Abhimanyu
>>>>
>>>> On Thu, Oct 26, 2017 at 12:31 PM, Vaghawan Ojha <vaghawan...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Abhimanyu,
>>>>>
>>>>> Ok, which version of pio is this? Because the template looks old to
>>>>> me.
>>>>>
>>>>> On Thu, Oct 26, 2017 at 12:44 PM, Abhimanyu Nagrath <
>>>>> abhimanyunagr...@gmail.com> wrote:
>>>>>
>>>>>> Hi Vaghawan,
>>>>>>
>>>>>> Yes, the Spark master connection string is correct. I am getting
>>>>>> "executor failed to connect to Spark master" after 4-5 hrs.
>>>>>>
>>>>>>
>>>>>> Regards,
>>>>>> Abhimanyu
>>>>>>
>>>>>> On Thu, Oct 26, 2017 at 12:17 PM, Sachin Kamkar <
>>>>>> sachinkam...@gmail.com> wrote:
>>>>>>
>>>>>>> It should be correct, as the user got the exception only after 3-4
>>>>>>> hours of running. So it looks like something else broke. OOM?
>>>>>>>
>>>>>>> With Regards,
>>>>>>>
>>>>>>>      Sachin
>>>>>>> ⚜KTBFFH⚜
>>>>>>>
>>>>>>> On Thu, Oct 26, 2017 at 12:15 PM, Vaghawan Ojha <
>>>>>>> vaghawan...@gmail.com> wrote:
>>>>>>>
>>>>>>>> "Executor failed to connect with master ", are you sure the --master
>>>>>>>> spark://*.*.*.*:7077 is correct?
>>>>>>>>
>>>>>>>> Like the one you copied from the spark master's web ui? sometimes
>>>>>>>> having that wrong fails to connect with the spark master.
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>>
>>>>>>>> On Thu, Oct 26, 2017 at 12:02 PM, Abhimanyu Nagrath <
>>>>>>>> abhimanyunagr...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> I am new to PredictionIO. I am using the template
>>>>>>>>> https://github.com/EmergentOrder/template-scala-probabilistic-classifier-batch-lbfgs.
>>>>>>>>>
>>>>>>>>> My training dataset has 1,184,603 records with approx. 6,500
>>>>>>>>> features. I am using an EC2 r4.8xlarge system (240 GB RAM, 32 cores,
>>>>>>>>> 200 GB swap).
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I tried two ways of training:
>>>>>>>>>
>>>>>>>>>  1. The command
>>>>>>>>>
>>>>>>>>> > pio train -- --driver-memory 120G --executor-memory 100G --conf
>>>>>>>>> > spark.network.timeout=10000000
>>>>>>>>>
>>>>>>>>>    It throws an exception after 3-4 hours:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>     Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 15, localhost, executor driver): ExecutorLostFailure (executor driver exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 181529 ms
>>>>>>>>>     Driver stacktrace:
>>>>>>>>>             at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
>>>>>>>>>             at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
>>>>>>>>>             at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
>>>>>>>>>             at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>>>>>>>>>             at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>>>>>>>>>             at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
>>>>>>>>>             at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
>>>>>>>>>             at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
>>>>>>>>>             at scala.Option.foreach(Option.scala:257)
>>>>>>>>>             at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)
>>>>>>>>>             at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650)
>>>>>>>>>             at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
>>>>>>>>>             at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
>>>>>>>>>             at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>>>>>>>>>             at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)
>>>>>>>>>             at org.apache.spark.SparkContext.runJob(SparkContext.scala:1918)
>>>>>>>>>             at org.apache.spark.SparkContext.runJob(SparkContext.scala:1931)
>>>>>>>>>             at org.apache.spark.SparkContext.runJob(SparkContext.scala:1944)
>>>>>>>>>             at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1353)
>>>>>>>>>             at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>>>>>>>>>             at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>>>>>>>>>             at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
>>>>>>>>>             at org.apache.spark.rdd.RDD.take(RDD.scala:1326)
>>>>>>>>>             at org.example.classification.LogisticRegressionWithLBFGSAlgorithm.train(LogisticRegressionWithLBFGSAlgorithm.scala:28)
>>>>>>>>>             at org.example.classification.LogisticRegressionWithLBFGSAlgorithm.train(LogisticRegressionWithLBFGSAlgorithm.scala:21)
>>>>>>>>>             at org.apache.predictionio.controller.P2LAlgorithm.trainBase(P2LAlgorithm.scala:49)
>>>>>>>>>             at org.apache.predictionio.controller.Engine$$anonfun$18.apply(Engine.scala:692)
>>>>>>>>>             at org.apache.predictionio.controller.Engine$$anonfun$18.apply(Engine.scala:692)
>>>>>>>>>             at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>>>>>>>>>             at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>>>>>>>>>             at scala.collection.immutable.List.foreach(List.scala:381)
>>>>>>>>>             at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>>>>>>>>>             at scala.collection.immutable.List.map(List.scala:285)
>>>>>>>>>             at org.apache.predictionio.controller.Engine$.train(Engine.scala:692)
>>>>>>>>>             at org.apache.predictionio.controller.Engine.train(Engine.scala:177)
>>>>>>>>>             at org.apache.predictionio.workflow.CoreWorkflow$.runTrain(CoreWorkflow.scala:67)
>>>>>>>>>             at org.apache.predictionio.workflow.CreateWorkflow$.main(CreateWorkflow.scala:250)
>>>>>>>>>             at org.apache.predictionio.workflow.CreateWorkflow.main(CreateWorkflow.scala)
>>>>>>>>>             at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>>>>>>             at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>>>>>>>>>             at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>>>>>>             at java.lang.reflect.Method.invoke(Method.java:498)
>>>>>>>>>             at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:738)
>>>>>>>>>             at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
>>>>>>>>>             at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
>>>>>>>>>             at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
>>>>>>>>>             at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>>>>>>>>>
>>>>>>>>> 2. I started a Spark standalone cluster with 1 master and 3 workers
>>>>>>>>> and executed the command
>>>>>>>>>
>>>>>>>>> > pio train -- --master spark://*.*.*.*:7077 --driver-memory 50G
>>>>>>>>> > --executor-memory 50G
>>>>>>>>>
>>>>>>>>> After some time I get the error "Executor failed to connect with
>>>>>>>>> master" and training stops.
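
For context, a sketch of how a standalone master and workers are typically
brought up with Spark's bundled scripts (host names and $SPARK_HOME are
placeholders; the spark:// URL to pass to --master is the one shown on the
master's web UI):

    # on the master node
    $SPARK_HOME/sbin/start-master.sh
    # on each worker node, pointing at the master URL from the web UI
    $SPARK_HOME/sbin/start-slave.sh spark://<master-host>:7077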
>>>>>>>>>
>>>>>>>>> I have changed the feature count from 6500 to 500 and the situation
>>>>>>>>> is still the same. Can anyone suggest whether I am missing something?
>>>>>>>>> In between, training keeps emitting continuous warnings like:
>>>>>>>>>
>>>>>>>>> > [WARN] [ScannerCallable] Ignore, probably already closed
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Abhimanyu
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
