Hi Vaghawan, I am using v0.11.0-incubating with ES 5.2.1, HBase 1.2.6, and Spark 2.1.0.
Regards,
Abhimanyu

On Thu, Oct 26, 2017 at 12:31 PM, Vaghawan Ojha <[email protected]> wrote:
> Hi Abhimanyu,
>
> Ok, which version of pio is this? Because the template looks old to me.
>
> On Thu, Oct 26, 2017 at 12:44 PM, Abhimanyu Nagrath <[email protected]> wrote:
>> Hi Vaghawan,
>>
>> Yes, the Spark master connection string is correct. I am getting "executor fails to connect to spark master" after 4-5 hrs.
>>
>> Regards,
>> Abhimanyu
>>
>> On Thu, Oct 26, 2017 at 12:17 PM, Sachin Kamkar <[email protected]> wrote:
>>> It should be correct, as the user got the exception after 3-4 hours of starting. So it looks like something else broke. OOM?
>>>
>>> With Regards,
>>> Sachin
>>> ⚜KTBFFH⚜
>>>
>>> On Thu, Oct 26, 2017 at 12:15 PM, Vaghawan Ojha <[email protected]> wrote:
>>>> "Executor failed to connect with master": are you sure the --master spark://*.*.*.*:7077 is correct?
>>>>
>>>> Like the one you copied from the Spark master's web UI? Sometimes having that wrong fails to connect with the Spark master.
>>>>
>>>> Thanks
>>>>
>>>> On Thu, Oct 26, 2017 at 12:02 PM, Abhimanyu Nagrath <[email protected]> wrote:
>>>>> I am new to PredictionIO. I am using the template https://github.com/EmergentOrder/template-scala-probabilistic-classifier-batch-lbfgs.
>>>>>
>>>>> My training dataset count is 1184603, with approx. 6500 features. I am using an EC2 r4.8xlarge instance (240 GB RAM, 32 cores, 200 GB swap).
>>>>>
>>>>> I tried two ways of training:
>>>>>
>>>>> 1. The command
>>>>>
>>>>> pio train -- --driver-memory 120G --executor-memory 100G --conf spark.network.timeout=10000000
>>>>>
>>>>> It throws an exception after 3-4 hours:
>>>>> Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 15, localhost, executor driver): ExecutorLostFailure (executor driver exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 181529 ms
>>>>> Driver stacktrace:
>>>>>   at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
>>>>>   at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
>>>>>   at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
>>>>>   at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>>>>>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>>>>>   at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
>>>>>   at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
>>>>>   at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
>>>>>   at scala.Option.foreach(Option.scala:257)
>>>>>   at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)
>>>>>   at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650)
>>>>>   at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
>>>>>   at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
>>>>>   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>>>>>   at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)
>>>>>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1918)
>>>>>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1931)
>>>>>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1944)
>>>>>   at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1353)
>>>>>   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>>>>>   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>>>>>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
>>>>>   at org.apache.spark.rdd.RDD.take(RDD.scala:1326)
>>>>>   at org.example.classification.LogisticRegressionWithLBFGSAlgorithm.train(LogisticRegressionWithLBFGSAlgorithm.scala:28)
>>>>>   at org.example.classification.LogisticRegressionWithLBFGSAlgorithm.train(LogisticRegressionWithLBFGSAlgorithm.scala:21)
>>>>>   at org.apache.predictionio.controller.P2LAlgorithm.trainBase(P2LAlgorithm.scala:49)
>>>>>   at org.apache.predictionio.controller.Engine$$anonfun$18.apply(Engine.scala:692)
>>>>>   at org.apache.predictionio.controller.Engine$$anonfun$18.apply(Engine.scala:692)
>>>>>   at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>>>>>   at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>>>>>   at scala.collection.immutable.List.foreach(List.scala:381)
>>>>>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>>>>>   at scala.collection.immutable.List.map(List.scala:285)
>>>>>   at org.apache.predictionio.controller.Engine$.train(Engine.scala:692)
>>>>>   at org.apache.predictionio.controller.Engine.train(Engine.scala:177)
>>>>>   at org.apache.predictionio.workflow.CoreWorkflow$.runTrain(CoreWorkflow.scala:67)
>>>>>   at org.apache.predictionio.workflow.CreateWorkflow$.main(CreateWorkflow.scala:250)
>>>>>   at org.apache.predictionio.workflow.CreateWorkflow.main(CreateWorkflow.scala)
>>>>>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>>>>>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>>   at java.lang.reflect.Method.invoke(Method.java:498)
>>>>>   at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:738)
>>>>>   at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
>>>>>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
>>>>>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
>>>>>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>>>>>
>>>>> 2. I started a Spark standalone cluster with 1 master and 3 workers and executed the command
>>>>>
>>>>> pio train -- --master spark://*.*.*.*:7077 --driver-memory 50G --executor-memory 50G
>>>>>
>>>>> After some time I get the error "Executor failed to connect with master" and training stops.
>>>>>
>>>>> I have changed the feature count from 6500 to 500 and the behavior is the same. Can anyone suggest what I am missing?
>>>>>
>>>>> In between, training emits continuous warnings like:
>>>>>
>>>>> [WARN] [ScannerCallable] Ignore, probably already closed
>>>>>
>>>>> Regards,
>>>>> Abhimanyu
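For reference, the "Executor heartbeat timed out" failure in the trace above is often approached by raising Spark's heartbeat interval and network timeout together rather than only `spark.network.timeout`. The sketch below shows how those settings could be passed through `pio train` for the standalone setup described in the thread; the master host and the timeout values are illustrative assumptions, not settings taken from the original messages, and `spark.network.timeout` must stay larger than `spark.executor.heartbeatInterval`.

```shell
# Sketch only: illustrative values, assuming Spark 2.1 standalone mode.
# Everything after the bare "--" is forwarded by pio to spark-submit.
pio train -- \
  --master spark://<master-host>:7077 \        # replace with the URL shown on the master's web UI
  --driver-memory 50G \
  --executor-memory 50G \
  --conf spark.executor.heartbeatInterval=120s \  # default is 10s in Spark 2.1
  --conf spark.network.timeout=800s               # must exceed the heartbeat interval
```

If executors are dying from memory pressure (as Sachin suggests), longer timeouts only delay the failure; checking the executor logs for OOM kills before tuning timeouts would narrow this down.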
