Hi Donald,

I checked pio.log and found the following errors while training:
1. ERROR org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher [main] - hconnection-0x21325036, quorum=localhost:2181, baseZNode=/hbase Received unexpected KeeperException, re-throwing exception
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/hbaseid
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
        at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1045)
        at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.exists(RecoverableZooKeeper.java:199)
        at org.apache.hadoop.hbase.zookeeper.ZKUtil.checkExists(ZKUtil.java:479)
        at org.apache.hadoop.hbase.zookeeper.ZKClusterId.readClusterIdZNode(ZKClusterId.java:65)
        at org.apache.hadoop.hbase.client.ZooKeeperRegistry.getClusterId(ZooKeeperRegistry.java:83)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.retrieveClusterId(HConnectionManager.java:852)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.<init>(HConnectionManager.java:657)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
        at org.apache.hadoop.hbase.client.HConnectionManager.createConnection(HConnectionManager.java:409)
        at org.apache.hadoop.hbase.client.HConnectionManager.createConnection(HConnectionManager.java:388)
        at org.apache.hadoop.hbase.client.HConnectionManager.getConnection(HConnectionManager.java:269)
        at org.apache.hadoop.hbase.client.HBaseAdmin.checkHBaseAvailable(HBaseAdmin.java:2338)
        at org.apache.predictionio.data.storage.hbase.StorageClient.<init>(StorageClient.scala:53)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
        at org.apache.predictionio.data.storage.Storage$.getClient(Storage.scala:223)
        at org.apache.predictionio.data.storage.Storage$.org$apache$predictionio$data$storage$Storage$$updateS2CM(Storage.scala:254)
        at org.apache.predictionio.data.storage.Storage$$anonfun$sourcesToClientMeta$1.apply(Storage.scala:215)
        at org.apache.predictionio.data.storage.Storage$$anonfun$sourcesToClientMeta$1.apply(Storage.scala:215)
        at scala.collection.mutable.MapLike$class.getOrElseUpdate(MapLike.scala:194)
        at scala.collection.mutable.AbstractMap.getOrElseUpdate(Map.scala:80)
        at org.apache.predictionio.data.storage.Storage$.sourcesToClientMeta(Storage.scala:215)
        at org.apache.predictionio.data.storage.Storage$.getDataObject(Storage.scala:284)
        at org.apache.predictionio.data.storage.Storage$.getDataObjectFromRepo(Storage.scala:269)
        at org.apache.predictionio.data.storage.Storage$.getLEvents(Storage.scala:417)
        at org.apache.predictionio.data.api.EventServer$.createEventServer(EventServer.scala:616)
        at org.apache.predictionio.tools.commands.Management$.eventserver(Management.scala:77)
        at org.apache.predictionio.tools.console.Pio$.eventserver(Pio.scala:113)
        at org.apache.predictionio.tools.console.Console$$anonfun$main$1.apply(Console.scala:650)
        at org.apache.predictionio.tools.console.Console$$anonfun$main$1.apply(Console.scala:611)
        at scala.Option.map(Option.scala:146)
        at org.apache.predictionio.tools.console.Console$.main(Console.scala:611)
        at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2196)
        at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:112)
        at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:133)
        at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:108)
        at java.lang.Thread.run(Thread.java:748)
        at org.apache.hadoop.hbase.ipc.RpcClient.call(RpcClient.java:1457)
        at org.apache.hadoop.hbase.ipc.RpcClient.callBlockingMethod(RpcClient.java:1661)
        at org.apache.hadoop.hbase.ipc.RpcClient$BlockingRpcChannelImplementation.callBlockingMethod(RpcClient.java:1719)
        at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$BlockingStub.scan(ClientProtos.java:29990)
        at org.apache.hadoop.hbase.client.ScannerCallable.close(ScannerCallable.java:291)
        at org.apache.hadoop.hbase.client.ScannerCallable.call(ScannerCallable.java:160)
        at org.apache.hadoop.hbase.client.ScannerCallable.call(ScannerCallable.java:59)
        at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:114)
        at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:90)
        at org.apache.hadoop.hbase.client.ClientScanner.close(ClientScanner.java:450)
        at org.apache.hadoop.hbase.mapreduce.TableRecordReaderImpl.restart(TableRecordReaderImpl.java:88)
        at org.apache.hadoop.hbase.mapreduce.TableRecordReaderImpl.nextKeyValue(TableRecordReaderImpl.java:229)
        at org.apache.hadoop.hbase.mapreduce.TableRecordReader.nextKeyValue(TableRecordReader.java:138)
        at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:199)
        at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
        at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:191)
        at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
        at org.apache.spark.scheduler.Task.run(Task.scala:99)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
2. ERROR org.apache.spark.scheduler.TaskSchedulerImpl [dispatcher-event-loop-26] - Lost executor driver on localhost: Executor heartbeat timed out after 181529 ms
3. 2017-10-25 22:46:19,747 ERROR org.apache.spark.storage.BlockManager [driver-heartbeater] - Failed to report rdd_8_0 to master; giving up.
4. ERROR org.apache.spark.scheduler.LiveListenerBus [Thread-3] - SparkListenerBus has already stopped! Dropping event SparkListenerStageCompleted(org.apache.spark.scheduler.StageInfo@5621bf01)
Are these due to resource limitations?
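In case it is relevant: one workaround often suggested for executor heartbeat timeouts is to raise Spark's heartbeat interval and network timeout together. This is a sketch only; the values below are assumptions, not tested settings, and the heartbeat interval must stay well below the network timeout:

```shell
# Sketch: pass extra Spark conf through `pio train` (values are guesses).
# spark.executor.heartbeatInterval must be much smaller than spark.network.timeout.
pio train -- --driver-memory 120G --executor-memory 100G \
  --conf spark.network.timeout=800s \
  --conf spark.executor.heartbeatInterval=60s
```

It may also be worth confirming that HBase and ZooKeeper are still reachable (e.g. with `pio status`), since error 1 is a plain ZooKeeper ConnectionLoss.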
On Thu, Oct 26, 2017 at 9:30 PM, Donald Szeto <[email protected]> wrote:
> Hi Abhimanyu,
>
> Is there more information from Spark web UI, or pio.log from where you run
> the `pio train` command? Also, sharing your full modifications somewhere on
> GitHub will be very helpful.
>
> Regards,
> Donald
>
> On Thu, Oct 26, 2017 at 2:22 AM Abhimanyu Nagrath <
> [email protected]> wrote:
>
>> HI Vaghawan,
>> Thanks for the reply. Yes, I already tried that, but I am still getting
>> the same error.
>>
>> Regards,
>> Abhimanyu
>>
>> On Thu, Oct 26, 2017 at 2:40 PM, Vaghawan Ojha <[email protected]>
>> wrote:
>>
>>> Hi Abhimanyu,
>>>
>>> I've never tried the classification template, so I'm not sure exactly
>>> how much time it would take. But as per your error, your model is not
>>> getting past stage 1: "Task 0 in stage 1.0 failed 1 times".
>>>
>>> Probably something to do with the OOMs:
>>> https://stackoverflow.com/questions/37260230/spark-cluster-full-of-heartbeat-timeouts-executors-exiting-on-their-own
>>>
>>> did you see this?
>>>
>>> On Thu, Oct 26, 2017 at 1:57 PM, Abhimanyu Nagrath <
>>> [email protected]> wrote:
>>>
>>>> Hi Vaghawan,
>>>>
>>>> For debugging, I made one change: I reduced the number of features to
>>>> 1, with the record count staying the same at 1 million, on the same
>>>> hardware (240 GB RAM, 32 cores and 100 GB swap), and training has now
>>>> been running for 2 hrs. Is this expected behavior? On which factors
>>>> does the training time depend?
>>>>
>>>>
>>>> Regards,
>>>> Abhimanyu
>>>>
>>>>
>>>> On Thu, Oct 26, 2017 at 12:41 PM, Abhimanyu Nagrath <
>>>> [email protected]> wrote:
>>>>
>>>>> Hi Vaghawan,
>>>>>
>>>>> I have made the template compatible with the version mentioned
>>>>> above: I updated the versions in engine.json and changed the package
>>>>> names.
>>>>>
>>>>>
>>>>> Regards,
>>>>> Abhimanyu
>>>>>
>>>>> On Thu, Oct 26, 2017 at 12:39 PM, Vaghawan Ojha <[email protected]
>>>>> > wrote:
>>>>>
>>>>>> Hi Abhimanyu,
>>>>>>
>>>>>> I don't think this template works with version 0.11.0. As per the
>>>>>> template:
>>>>>>
>>>>>> update for PredictionIO 0.9.2, including:
>>>>>>
>>>>>> I don't think it supports the latest pio. You should rather switch
>>>>>> to 0.9.2 if you want to experiment with it.
>>>>>>
>>>>>> On Thu, Oct 26, 2017 at 12:52 PM, Abhimanyu Nagrath <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> Hi Vaghawan ,
>>>>>>>
>>>>>>> I am using v0.11.0-incubating with (ES - v5.2.1 , Hbase - 1.2.6 ,
>>>>>>> Spark - 2.1.0).
>>>>>>>
>>>>>>> Regards,
>>>>>>> Abhimanyu
>>>>>>>
>>>>>>> On Thu, Oct 26, 2017 at 12:31 PM, Vaghawan Ojha <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>> Hi Abhimanyu,
>>>>>>>>
>>>>>>>> Ok, which version of pio is this? Because the template looks old to
>>>>>>>> me.
>>>>>>>>
>>>>>>>> On Thu, Oct 26, 2017 at 12:44 PM, Abhimanyu Nagrath <
>>>>>>>> [email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi Vaghawan,
>>>>>>>>>
>>>>>>>>> Yes, the Spark master connection string is correct. I am getting
>>>>>>>>> "executor fails to connect to spark master" after 4-5 hrs.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Abhimanyu
>>>>>>>>>
>>>>>>>>> On Thu, Oct 26, 2017 at 12:17 PM, Sachin Kamkar <
>>>>>>>>> [email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> It should be correct, as the user got the exception 3-4 hours
>>>>>>>>>> after starting. So it looks like something else broke. OOM?
>>>>>>>>>>
>>>>>>>>>> With Regards,
>>>>>>>>>>
>>>>>>>>>> Sachin
>>>>>>>>>> ⚜KTBFFH⚜
>>>>>>>>>>
>>>>>>>>>> On Thu, Oct 26, 2017 at 12:15 PM, Vaghawan Ojha <
>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> "Executor failed to connect with master ", are you sure the --master
>>>>>>>>>>> spark://*.*.*.*:7077 is correct?
>>>>>>>>>>>
>>>>>>>>>>> Like the one you copied from the Spark master's web UI?
>>>>>>>>>>> Sometimes having that wrong causes the connection to the Spark
>>>>>>>>>>> master to fail.
>>>>>>>>>>>
>>>>>>>>>>> Thanks
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Oct 26, 2017 at 12:02 PM, Abhimanyu Nagrath <
>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I am new to PredictionIO. I am using the template
>>>>>>>>>>>> https://github.com/EmergentOrder/template-scala-probabilistic-classifier-batch-lbfgs.
>>>>>>>>>>>>
>>>>>>>>>>>> My training dataset has 1,184,603 records with approx. 6,500
>>>>>>>>>>>> features. I am using an EC2 r4.8xlarge instance (240 GB RAM,
>>>>>>>>>>>> 32 cores, 200 GB swap).
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> I tried two ways of training:
>>>>>>>>>>>>
>>>>>>>>>>>> 1. The command:
>>>>>>>>>>>>
>>>>>>>>>>>> > pio train -- --driver-memory 120G --executor-memory 100G
>>>>>>>>>>>> > --conf spark.network.timeout=10000000
>>>>>>>>>>>>
>>>>>>>>>>>> It throws an exception after 3-4 hours:
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 15, localhost, executor driver): ExecutorLostFailure (executor driver exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 181529 ms
>>>>>>>>>>>> Driver stacktrace:
>>>>>>>>>>>> at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
>>>>>>>>>>>> at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
>>>>>>>>>>>> at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
>>>>>>>>>>>> at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>>>>>>>>>>>> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>>>>>>>>>>>> at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
>>>>>>>>>>>> at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
>>>>>>>>>>>> at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
>>>>>>>>>>>> at scala.Option.foreach(Option.scala:257)
>>>>>>>>>>>> at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)
>>>>>>>>>>>> at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650)
>>>>>>>>>>>> at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
>>>>>>>>>>>> at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
>>>>>>>>>>>> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>>>>>>>>>>>> at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)
>>>>>>>>>>>> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1918)
>>>>>>>>>>>> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1931)
>>>>>>>>>>>> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1944)
>>>>>>>>>>>> at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1353)
>>>>>>>>>>>> at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>>>>>>>>>>>> at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>>>>>>>>>>>> at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
>>>>>>>>>>>> at org.apache.spark.rdd.RDD.take(RDD.scala:1326)
>>>>>>>>>>>> at org.example.classification.LogisticRegressionWithLBFGSAlgorithm.train(LogisticRegressionWithLBFGSAlgorithm.scala:28)
>>>>>>>>>>>> at org.example.classification.LogisticRegressionWithLBFGSAlgorithm.train(LogisticRegressionWithLBFGSAlgorithm.scala:21)
>>>>>>>>>>>> at org.apache.predictionio.controller.P2LAlgorithm.trainBase(P2LAlgorithm.scala:49)
>>>>>>>>>>>> at org.apache.predictionio.controller.Engine$$anonfun$18.apply(Engine.scala:692)
>>>>>>>>>>>> at org.apache.predictionio.controller.Engine$$anonfun$18.apply(Engine.scala:692)
>>>>>>>>>>>> at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>>>>>>>>>>>> at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>>>>>>>>>>>> at scala.collection.immutable.List.foreach(List.scala:381)
>>>>>>>>>>>> at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>>>>>>>>>>>> at scala.collection.immutable.List.map(List.scala:285)
>>>>>>>>>>>> at org.apache.predictionio.controller.Engine$.train(Engine.scala:692)
>>>>>>>>>>>> at org.apache.predictionio.controller.Engine.train(Engine.scala:177)
>>>>>>>>>>>> at org.apache.predictionio.workflow.CoreWorkflow$.runTrain(CoreWorkflow.scala:67)
>>>>>>>>>>>> at org.apache.predictionio.workflow.CreateWorkflow$.main(CreateWorkflow.scala:250)
>>>>>>>>>>>> at org.apache.predictionio.workflow.CreateWorkflow.main(CreateWorkflow.scala)
>>>>>>>>>>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>>>>>>>>> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>>>>>>>>>>>> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>>>>>>>>> at java.lang.reflect.Method.invoke(Method.java:498)
>>>>>>>>>>>> at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:738)
>>>>>>>>>>>> at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
>>>>>>>>>>>> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
>>>>>>>>>>>> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
>>>>>>>>>>>> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>>>>>>>>>>>>
>>>>>>>>>>>> 2. I started a Spark standalone cluster with 1 master and 3
>>>>>>>>>>>> workers and executed the command:
>>>>>>>>>>>>
>>>>>>>>>>>> > pio train -- --master spark://*.*.*.*:7077 --driver-memory 50G
>>>>>>>>>>>> > --executor-memory 50G
>>>>>>>>>>>>
>>>>>>>>>>>> And after some time I get the error "Executor failed to
>>>>>>>>>>>> connect with master" and training stops.
>>>>>>>>>>>>
>>>>>>>>>>>> I have changed the feature count from 6500 -> 500 and the
>>>>>>>>>>>> condition is still the same. So can anyone suggest what I am
>>>>>>>>>>>> missing?
>>>>>>>>>>>>
>>>>>>>>>>>> In between, training keeps emitting warnings like:
>>>>>>>>>>>>
>>>>>>>>>>>> > [WARN] [ScannerCallable] Ignore, probably already closed
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Regards,
>>>>>>>>>>>> Abhimanyu
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>