Hi Abhimanyu,

I don't think this template works with PredictionIO 0.11.0. The template's own changelog says "update for PredictionIO 0.9.2, including:", so I don't think it supports the latest pio. You'd rather switch to 0.9.2 if you want to experiment with it.
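For reference, newer PredictionIO templates declare the pio release they were built against in a template.json at the repo root; a minimal sketch of what to look for (the field layout follows the standard templates, and the value below is illustrative rather than copied from this repo):

    {
      "pio": {
        "version": { "min": "0.9.2" }
      }
    }

If the declared minimum is far behind the installed pio, the build may still succeed while training fails in non-obvious ways, so this is a cheap first check.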
On Thu, Oct 26, 2017 at 12:52 PM, Abhimanyu Nagrath <[email protected]> wrote:

> Hi Vaghawan,
>
> I am using v0.11.0-incubating with ES v5.2.1, HBase v1.2.6 and Spark v2.1.0.
>
> Regards,
> Abhimanyu
>
> On Thu, Oct 26, 2017 at 12:31 PM, Vaghawan Ojha <[email protected]> wrote:
>
>> Hi Abhimanyu,
>>
>> OK, which version of pio is this? Because the template looks old to me.
>>
>> On Thu, Oct 26, 2017 at 12:44 PM, Abhimanyu Nagrath <[email protected]> wrote:
>>
>>> Hi Vaghawan,
>>>
>>> Yes, the Spark master connection string is correct. The executor fails to connect to the Spark master after 4-5 hours.
>>>
>>> Regards,
>>> Abhimanyu
>>>
>>> On Thu, Oct 26, 2017 at 12:17 PM, Sachin Kamkar <[email protected]> wrote:
>>>
>>>> It should be correct, as the user got the exception after 3-4 hours of starting. So it looks like something else broke. OOM?
>>>>
>>>> With Regards,
>>>> Sachin
>>>> ⚜KTBFFH⚜
>>>>
>>>> On Thu, Oct 26, 2017 at 12:15 PM, Vaghawan Ojha <[email protected]> wrote:
>>>>
>>>>> "Executor failed to connect with master" - are you sure the --master spark://*.*.*.*:7077 is correct?
>>>>>
>>>>> Like the one you copied from the Spark master's web UI? Sometimes getting that wrong makes the executors fail to connect to the Spark master.
>>>>>
>>>>> Thanks
>>>>>
>>>>> On Thu, Oct 26, 2017 at 12:02 PM, Abhimanyu Nagrath <[email protected]> wrote:
>>>>>
>>>>>> I am new to PredictionIO. I am using the template https://github.com/EmergentOrder/template-scala-probabilistic-classifier-batch-lbfgs.
>>>>>>
>>>>>> My training dataset has 1,184,603 rows with approx. 6,500 features. I am using an EC2 r4.8xlarge instance (240 GB RAM, 32 cores, 200 GB swap).
>>>>>>
>>>>>> I tried two ways of training.
>>>>>>
>>>>>> 1. The command
>>>>>>
>>>>>>     pio train -- --driver-memory 120G --executor-memory 100G --conf spark.network.timeout=10000000
>>>>>>
>>>>>> throws the following exception after 3-4 hours:
>>>>>>
>>>>>> Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 15, localhost, executor driver): ExecutorLostFailure (executor driver exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 181529 ms
>>>>>> Driver stacktrace:
>>>>>> at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
>>>>>> at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
>>>>>> at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
>>>>>> at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>>>>>> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>>>>>> at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
>>>>>> at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
>>>>>> at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
>>>>>> at scala.Option.foreach(Option.scala:257)
>>>>>> at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)
>>>>>> at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650)
>>>>>> at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
>>>>>> at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
>>>>>> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>>>>>> at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)
>>>>>> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1918)
>>>>>> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1931)
>>>>>> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1944)
>>>>>> at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1353)
>>>>>> at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>>>>>> at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>>>>>> at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
>>>>>> at org.apache.spark.rdd.RDD.take(RDD.scala:1326)
>>>>>> at org.example.classification.LogisticRegressionWithLBFGSAlgorithm.train(LogisticRegressionWithLBFGSAlgorithm.scala:28)
>>>>>> at org.example.classification.LogisticRegressionWithLBFGSAlgorithm.train(LogisticRegressionWithLBFGSAlgorithm.scala:21)
>>>>>> at org.apache.predictionio.controller.P2LAlgorithm.trainBase(P2LAlgorithm.scala:49)
>>>>>> at org.apache.predictionio.controller.Engine$$anonfun$18.apply(Engine.scala:692)
>>>>>> at org.apache.predictionio.controller.Engine$$anonfun$18.apply(Engine.scala:692)
>>>>>> at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>>>>>> at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>>>>>> at scala.collection.immutable.List.foreach(List.scala:381)
>>>>>> at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>>>>>> at scala.collection.immutable.List.map(List.scala:285)
>>>>>> at org.apache.predictionio.controller.Engine$.train(Engine.scala:692)
>>>>>> at org.apache.predictionio.controller.Engine.train(Engine.scala:177)
>>>>>> at org.apache.predictionio.workflow.CoreWorkflow$.runTrain(CoreWorkflow.scala:67)
>>>>>> at org.apache.predictionio.workflow.CreateWorkflow$.main(CreateWorkflow.scala:250)
>>>>>> at org.apache.predictionio.workflow.CreateWorkflow.main(CreateWorkflow.scala)
>>>>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>>> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>>>>>> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>>> at java.lang.reflect.Method.invoke(Method.java:498)
>>>>>> at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:738)
>>>>>> at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
>>>>>> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
>>>>>> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
>>>>>> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>>>>>>
>>>>>> 2. I started a standalone Spark cluster with 1 master and 3 workers and executed
>>>>>>
>>>>>>     pio train -- --master spark://*.*.*.*:7077 --driver-memory 50G --executor-memory 50G
>>>>>>
>>>>>> After some time I get the error "Executor failed to connect with master" and the training stops.
>>>>>>
>>>>>> I have reduced the feature count from 6,500 to 500 and the behaviour is the same, so can anyone suggest what I am missing?
>>>>>>
>>>>>> In between, the training emits continuous warnings like:
>>>>>>
>>>>>>     [WARN] [ScannerCallable] Ignore, probably already closed
>>>>>>
>>>>>> Regards,
>>>>>> Abhimanyu
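The "Executor heartbeat timed out after 181529 ms" failure in attempt 1 is typical of long GC pauses on a very large heap: spark.network.timeout alone does not govern heartbeats, which are sent every spark.executor.heartbeatInterval and must stay well below the network timeout. A minimal sketch of that tuning, with illustrative sizes rather than recommended ones:

    # hedged example: smaller heaps shorten GC pauses; both properties exist in Spark 2.1
    pio train -- --driver-memory 32G --executor-memory 32G \
      --conf spark.executor.heartbeatInterval=60s \
      --conf spark.network.timeout=600s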
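For attempt 2, one cheap way to rule the master URL in or out (the question raised upthread) is to submit a trivial job against the same URL before a multi-hour pio train run; a sketch assuming the stated Spark 2.1.0 distribution, with <master-host> standing in for the address shown at the top of the master's web UI (port 8080):

    # SparkPi ships with the Spark binary distribution
    $SPARK_HOME/bin/spark-submit \
      --master spark://<master-host>:7077 \
      --class org.apache.spark.examples.SparkPi \
      $SPARK_HOME/examples/jars/spark-examples_2.11-2.1.0.jar 100

If this completes but pio train still loses executors hours in, the URL is fine and memory pressure on the workers, as suggested upthread, is the more likely culprit.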
