On Oct 26 2017, at 12:39 pm, Vaghawan Ojha <[email protected]> wrote:

Hi Abhimanyu,

I don't think this template works with version 0.11.0. Per the template's own notes:

    update for PredictionIO 0.9.2, including:
I don't think it supports the latest pio. You'd rather switch to 0.9.2 if you want to experiment with it.

On Thu, Oct 26, 2017 at 12:52 PM, Abhimanyu Nagrath <[email protected]> wrote:

Hi Vaghawan,

I am using v0.11.0-incubating (with Elasticsearch v5.2.1, HBase 1.2.6, Spark 2.1.0).

Regards,
Abhimanyu

On Thu, Oct 26, 2017 at 12:31 PM, Vaghawan Ojha <[email protected]> wrote:

Hi Abhimanyu,
Ok, which version of pio is this? Because the template looks old to me.

On Thu, Oct 26, 2017 at 12:44 PM, Abhimanyu Nagrath <[email protected]> wrote:

Hi Vaghawan,

Yes, the Spark master connection string is correct. I get "executor failed to connect to spark master" after 4-5 hours.

Regards,
Abhimanyu

On Thu, Oct 26, 2017 at 12:17 PM, Sachin Kamkar <[email protected]> wrote:

It should be correct, as the user got the exception after 3-4 hours of starting. So it looks like something else broke. OOM?

With Regards,
Sachin
⚜KTBFFH⚜

On Thu, Oct 26, 2017 at 12:15 PM, Vaghawan Ojha <[email protected]> wrote:

"Executor failed to connect with master" - are you sure the --master spark://*.*.*.*:7077 is correct? Like the one you copied from the Spark master's web UI? Sometimes having that wrong makes the connection to the Spark master fail.

Thanks

On Thu, Oct 26, 2017 at 12:02 PM, Abhimanyu Nagrath <[email protected]> wrote:

I am new to PredictionIO. I am using the template https://github.com/EmergentOrder/template-scala-probabilistic-classifier-batch-lbfgs. My training dataset has 1,184,603 rows with approximately 6,500 features. I am using an EC2 r4.8xlarge instance (240 GB RAM, 32 cores, 200 GB swap). I tried two ways of training:
1. The command:

    pio train -- --driver-memory 120G --executor-memory 100G --conf spark.network.timeout=10000000

It throws an exception after 3-4 hours:

Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 15, localhost, executor driver): ExecutorLostFailure (executor driver exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 181529 ms
Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
    at scala.Option.foreach(Option.scala:257)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1918)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1931)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1944)
    at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1353)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
    at org.apache.spark.rdd.RDD.take(RDD.scala:1326)
    at org.example.classification.LogisticRegressionWithLBFGSAlgorithm.train(LogisticRegressionWithLBFGSAlgorithm.scala:28)
    at org.example.classification.LogisticRegressionWithLBFGSAlgorithm.train(LogisticRegressionWithLBFGSAlgorithm.scala:21)
    at org.apache.predictionio.controller.P2LAlgorithm.trainBase(P2LAlgorithm.scala:49)
    at org.apache.predictionio.controller.Engine$$anonfun$18.apply(Engine.scala:692)
    at org.apache.predictionio.controller.Engine$$anonfun$18.apply(Engine.scala:692)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
    at scala.collection.immutable.List.foreach(List.scala:381)
    at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
    at scala.collection.immutable.List.map(List.scala:285)
    at org.apache.predictionio.controller.Engine$.train(Engine.scala:692)
    at org.apache.predictionio.controller.Engine.train(Engine.scala:177)
    at org.apache.predictionio.workflow.CoreWorkflow$.runTrain(CoreWorkflow.scala:67)
    at org.apache.predictionio.workflow.CreateWorkflow$.main(CreateWorkflow.scala:250)
    at org.apache.predictionio.workflow.CreateWorkflow.main(CreateWorkflow.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:738)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

2. I started a Spark standalone cluster with 1 master and 3 workers and executed the command:

    pio train -- --master spark://*.*.*.*:7077 --driver-memory 50G --executor-memory 50G

After some time I get the error "Executor failed to connect with master" and training stops.

I have changed the feature count from 6,500 to 500 and the behaviour is the same. Can anyone suggest what I am missing?

In between, training emits continuous warnings like:

    [WARN] [ScannerCallable] Ignore, probably already closed

Regards,
Abhimanyu
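A note for future readers of this thread: the failure in run 1 is an executor heartbeat timeout, which in a single-node setup usually points to driver GC pauses or memory pressure rather than a real network problem. One commonly suggested next step is to raise Spark's heartbeat and network timeouts together (spark.network.timeout is also the fallback for several other timeouts and should stay well above the heartbeat interval). The sketch below is illustrative only: the flag values are guesses, not something tested against this workload, though spark.network.timeout and spark.executor.heartbeatInterval are standard Spark configuration properties.

```shell
# Hypothetical tuning sketch (values illustrative, untested on this dataset):
# keep the heartbeat interval well below the network timeout, and give the
# timeouts explicit time units so Spark parses them as intended.
pio train -- \
  --driver-memory 120G \
  --executor-memory 100G \
  --conf spark.network.timeout=800s \
  --conf spark.executor.heartbeatInterval=60s
```

If the driver still loses executors after this, checking the driver's GC logs (e.g. via --driver-java-options "-verbose:gc") would help confirm whether long pauses are the cause.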
