[Worker Crashing] OutOfMemoryError: GC overhead limit exceeded

2017-03-24 Thread bsikander
heap: 1024 MB) Both of these worker processes were in a hung state. We restarted them to bring them back to a normal state. Here is the complete exception: https://gist.github.com/bsikander/84f1a0f3cc831c7a120225a71e435d91 Master's spark-defaults.conf file: https://gist.github.com/bsikander

Re: Programmatically get status of job (WAITING/RUNNING)

2017-11-07 Thread bsikander
Anyone?

Re: Programmatically get status of job (WAITING/RUNNING)

2017-11-08 Thread bsikander
Thank you for the reply. I am currently not using SparkLauncher to launch my driver. Rather, I am using the old-fashioned spark-submit, and moving to SparkLauncher is not an option right now. Do I have any options there?
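
One option (a hedged sketch, not an official status API for spark-submit): in standalone mode the Master web UI also serves its state as JSON, which includes per-application state such as WAITING/RUNNING. The host, port and endpoint path below are assumptions about a typical deployment:

    import scala.io.Source

    // Assumption: standalone Master UI on master-host:8080 exposing /json.
    // The response lists applications (e.g. under "activeapps") with a
    // "state" field; parse it with the JSON library of your choice.
    val masterJson = Source.fromURL("http://master-host:8080/json").mkString
    println(masterJson)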

Re: Programmatically get status of job (WAITING/RUNNING)

2017-12-07 Thread bsikander
See the image. I am referring to this state when I say "Application State".

Re: Programmatically get status of job (WAITING/RUNNING)

2017-12-05 Thread bsikander
Thank you for the reply. I am not a Spark expert but I was reading through the code and I thought that the state was changed from SUBMITTED to RUNNING only after executors (CoarseGrainedExecutorBackend) were registered.

Re: Programmatically get status of job (WAITING/RUNNING)

2017-12-08 Thread bsikander
Qiao, Richard wrote
> Comparing #1 and #3, my understanding of “submitted” is “the jar is
> submitted to executors”. With this concept, you may define your own
> status.
In SparkLauncher, SUBMITTED means that the Driver was able to acquire cores from the Spark cluster and the Launcher is waiting for

Re: Programmatically get status of job (WAITING/RUNNING)

2017-12-08 Thread bsikander
Qiao, Richard wrote
> For your question of example, the answer is yes.
Perfect. I am assuming that this is true for Spark standalone/YARN/Mesos.

Re: Programmatically get status of job (WAITING/RUNNING)

2017-12-04 Thread bsikander
So, I tried to use SparkAppHandle.Listener with SparkLauncher as you suggested. The behavior of Launcher is not what I expected. 1- If I start the job (using SparkLauncher) and my Spark cluster has enough cores available, I receive events in my class extending SparkAppHandle.Listener and I see
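
For reference, a minimal sketch of the setup being described, assuming placeholder paths, class names and master URL:

    import org.apache.spark.launcher.{SparkAppHandle, SparkLauncher}

    val handle: SparkAppHandle = new SparkLauncher()
      .setAppResource("/path/to/app.jar")          // placeholder
      .setMainClass("com.example.Main")            // placeholder
      .setMaster("spark://master-host:6066")       // assumption
      .setDeployMode("cluster")
      .startApplication(new SparkAppHandle.Listener {
        // Fired on launcher-side state transitions
        // (CONNECTED, SUBMITTED, RUNNING, FINISHED, ...).
        override def stateChanged(h: SparkAppHandle): Unit =
          println(s"state -> ${h.getState}")
        override def infoChanged(h: SparkAppHandle): Unit =
          println(s"appId -> ${h.getAppId}")
      })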

Re: Programmatically get status of job (WAITING/RUNNING)

2017-12-07 Thread bsikander
Marcelo Vanzin wrote
> I'm not sure I follow you here. This is something that you are
> defining, not Spark.
Yes, you are right. In my code, 1) my notion of RUNNING is that both driver + executors are in RUNNING state. 2) my notion of WAITING is if any one of driver/executor is in WAITING state.
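
To make that notion of RUNNING concrete, here is a minimal sketch (my own illustration, not Spark's definition) that counts registered executors from inside the driver, so "RUNNING" can be defined as driver up AND at least one executor registered:

    import java.util.concurrent.atomic.AtomicInteger
    import org.apache.spark.scheduler.{SparkListener, SparkListenerExecutorAdded, SparkListenerExecutorRemoved}

    class ExecutorCounter extends SparkListener {
      private val count = new AtomicInteger(0)
      override def onExecutorAdded(e: SparkListenerExecutorAdded): Unit =
        println(s"executors registered: ${count.incrementAndGet()}")
      override def onExecutorRemoved(e: SparkListenerExecutorRemoved): Unit =
        println(s"executors registered: ${count.decrementAndGet()}")
    }

    // Register on an existing SparkContext:
    // sc.addSparkListener(new ExecutorCounter())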

Re: [SparkLauncher] stateChanged event not received in standalone cluster mode

2018-06-08 Thread bsikander
Thanks.

Re: [SparkLauncher] stateChanged event not received in standalone cluster mode

2018-06-06 Thread bsikander
Any help would be appreciated.

[ClusterMode] -Dspark.master with missing secondary master IP

2018-06-27 Thread bsikander
We recently transitioned from client mode to cluster mode with a Spark Standalone deployment. We are using 2.2.1. We are also using SparkLauncher to launch the driver. The problem is that when my Driver is launched, the spark.master property (-Dspark.master) is set to only the primary master IP.

Re: [ClusterMode] -Dspark.master with missing secondary master IP

2018-06-27 Thread bsikander
We switched the port from 7077 to 6066 because we were losing 20 seconds each time we launched a driver: 10 seconds for failing to submit the driver on :7077. After losing 20 seconds, it used to fall back to an older way of submitting drivers. With 6066 we don't lose any time.

[SparkContext] will application immediately stop after sc.stop()?

2018-07-29 Thread bsikander
Is it possible that a job keeps on running for some time after onApplicationEnd is fired? For example, I have a Spark job which still has 10 batches to process, and let's say that processing them will take 10 minutes. If I execute sparkContext.stop(), I will receive onApplicationEnd
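
A minimal sketch of hooking onApplicationEnd (assuming an existing SparkContext `sc`); note this callback fires when the context stops, which is not a guarantee that all in-flight batches have finished:

    import org.apache.spark.scheduler.{SparkListener, SparkListenerApplicationEnd}

    class EndListener extends SparkListener {
      override def onApplicationEnd(end: SparkListenerApplicationEnd): Unit =
        println(s"application ended at ${end.time}")
    }

    // sc.addSparkListener(new EndListener())
    // ... later: sc.stop()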

Strange behavior of Spark Masters during rolling update

2018-07-05 Thread bsikander
We have a Spark standalone cluster running on 2.2.1 in HA mode using Zookeeper. Occasionally, we have a rolling update where first the Primary master goes down, then the Secondary master, and then the Zookeeper nodes running on their own VMs. In the image below,

Re: Strange behavior of Spark Masters during rolling update

2018-07-09 Thread bsikander
Anyone?

[REST API] Rest API unusable due to application id changing

2018-07-09 Thread bsikander
Spark provides a nice REST API to get metrics: https://spark.apache.org/docs/latest/monitoring.html#rest-api The problem is that this API is based on the application id, which can change if we are running in supervise mode. Any application which is built on top of the REST API has to deal with changing
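
A hedged sketch of one workaround: resolve the current application id at runtime from the applications list instead of hard-coding it. The host/port are assumptions, and a real client should use a proper JSON parser rather than this illustrative regex:

    import scala.io.Source

    // Assumption: driver UI (or history server) reachable at this address.
    val json = Source.fromURL("http://driver-host:4040/api/v1/applications").mkString

    // Illustrative extraction of the "id" fields; match on the stable
    // application name to pick the current id after a supervise restart.
    val idPattern = "\"id\"\\s*:\\s*\"([^\"]+)\"".r
    idPattern.findAllMatchIn(json).foreach(m => println(m.group(1)))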

Re: Properly stop applications or jobs within the application

2018-03-08 Thread bsikander
Any help would be much appreciated. This seems to be a common problem.

Re: Properly stop applications or jobs within the application

2018-03-08 Thread bsikander
I have scenarios for both. So, I want to kill both batch and streaming jobs midway, if required. Use case: normally, if everything is okay we don't kill the application, but sometimes while accessing external resources (like Kafka) something can go wrong. In that case, the application can become useless
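
A minimal sketch of the in-process shutdown paths I have in mind (assuming existing `sc`/`ssc` handles; whether this counts as "properly" stopping is exactly my question):

    import org.apache.spark.SparkContext
    import org.apache.spark.streaming.StreamingContext

    def killBatch(sc: SparkContext): Unit = {
      sc.cancelAllJobs() // abort in-flight jobs
      sc.stop()          // tear down the application
    }

    def killStreaming(ssc: StreamingContext): Unit =
      // Deliberately not graceful: stop receivers and the SparkContext now.
      ssc.stop(stopSparkContext = true, stopGracefully = false)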

Re: Properly stop applications or jobs within the application

2018-03-08 Thread bsikander
I am running in Spark standalone mode. No YARN. Anyway, yarn application -kill is a manual process; I do not want that. I want to properly kill the driver/application programmatically.

Re: Properly stop applications or jobs within the application

2018-03-06 Thread bsikander
It seems to be related to this issue from Kafka: https://issues.apache.org/jira/browse/KAFKA-1894

Re: [ClusterMode] -Dspark.master with missing secondary master IP

2018-06-28 Thread bsikander
I did some further investigation. If I launch a driver in cluster mode with master IPs like spark://:7077,:7077, then the driver is launched with both IPs and the -Dspark.master property has both IPs. But within the logs I see the following; it causes a 20-second delay while launching each driver

Re: [SparkLauncher] -Dspark.master with missing secondary master IP

2018-06-29 Thread bsikander
This is what my Driver launch command looks like; it only contains one master in the -Dspark.master property, whereas from the Launcher I am passing two with port 6066. Launch Command: "/path/to/java" "-cp" "" "-Xmx1024M" "-Dspark.executor.extraJavaOptions=-Dlog4j.configuration=file:log4j-server.properties"
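
For context, a sketch of the Launcher side (host names are placeholders); both masters are passed in a single comma-separated master URL on the REST port:

    import org.apache.spark.launcher.SparkLauncher

    val launcher = new SparkLauncher()
      .setAppResource("/path/to/app.jar")               // placeholder
      .setMainClass("com.example.Main")                 // placeholder
      .setMaster("spark://master1:6066,master2:6066")   // both masters, REST port
      .setDeployMode("cluster")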

Re: [SparkLauncher] -Dspark.master with missing secondary master IP

2018-06-29 Thread bsikander
Can anyone please help?

Re: [Spark-Core] Long scheduling delays (1+ hour)

2018-11-12 Thread bsikander
Forgot to add the link: https://jira.apache.org/jira/browse/KAFKA-5649

[Spark-Core] Long scheduling delays (1+ hour)

2018-11-07 Thread bsikander
We are facing an issue with very long scheduling delays in Spark (up to 1+ hours). We are using Spark standalone. The data is being pulled from Kafka. Any help would be much appreciated. I have attached the screenshots.

Re: [Spark-Core] Long scheduling delays (1+ hour)

2018-11-07 Thread bsikander
Actually, our job runs fine for 17-18 hours and this behavior just suddenly starts happening after that. We found the following ticket which is exactly what is happening in our Kafka cluster also. WARN Failed to send SSL Close message (org.apache.kafka.common.network.SslTransportLayer) You

Re: [Spark-Core] Long scheduling delays (1+ hour)

2018-11-09 Thread bsikander
Could you please give some feedback?

[Spark UI] find driver for an application

2018-09-24 Thread bsikander
Hello, I am having some trouble using the Spark Master UI to figure out some basic information; the process is too tedious. I am using Spark 2.2.1 with Spark standalone. - In cluster mode, how do I figure out which driver is related to which application? - In supervise mode, how do I track the

Re: Streaming job, catch exceptions

2019-05-21 Thread bsikander
Just to add to my previous message: I am using the Spark 2.2.2 standalone cluster manager and deploying the jobs in cluster mode.

Re: Streaming job, catch exceptions

2019-05-21 Thread bsikander
in the driver/executor logs and terminates the Spark job (which is expected), but in the other job (QueueStream), I see the exceptions in the driver/executor logs but no exception is thrown by the awaitTermination method and the job continues. https://github.com/bsikander/spark-reproduce/ I am trying to understand

Re: Streaming job, catch exceptions

2019-05-21 Thread bsikander
Umm, I am not sure if I got this fully. Is it a design decision to not have context.stop() right after awaitTermination throws an exception? So the ideology is that if a task fails after n tries (default 4), Spark should fail fast and let the user know? Is this correct? As you mentioned there
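
For reference, the retry count in question is (to my understanding) governed by spark.task.maxFailures, default 4; a one-line sketch of setting it explicitly:

    import org.apache.spark.SparkConf

    val conf = new SparkConf().set("spark.task.maxFailures", "4")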

Re: Streaming job, catch exceptions

2019-05-21 Thread bsikander
Ok great. I understood the ideology, thanks.

Re: Streaming job, catch exceptions

2019-05-21 Thread bsikander
Ok, I found the reason. In my QueueStream example, I have a while(true) which keeps on adding the RDDs; my awaitTermination call is after the while loop. Since the while loop never exits, awaitTermination never gets fired and the exceptions never get reported. The above was just the problem
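
A sketch of the fix, under the assumption that `enqueue` adds one RDD to the queue: feed the queue from a daemon thread so the main thread actually reaches awaitTermination() and failures surface there:

    import org.apache.spark.streaming.StreamingContext

    def run(ssc: StreamingContext, enqueue: () => Unit): Unit = {
      ssc.start()
      val feeder = new Thread(new Runnable {
        override def run(): Unit =
          while (true) { enqueue(); Thread.sleep(1000) }
      })
      feeder.setDaemon(true)
      feeder.start()
      ssc.awaitTermination() // now reachable; rethrows job failures
    }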

Re: Streaming job, catch exceptions

2019-05-15 Thread bsikander
Any help would be much appreciated. The error and question are quite generic; I believe that most experienced users will be able to answer.

Re: Streaming job, catch exceptions

2019-05-12 Thread bsikander
Hi, Anyone? This should be a straightforward one :)

Re: Streaming job, catch exceptions

2019-05-12 Thread bsikander
>> Code would be very helpful
I will try to put together something to post here.
>> 1. Writing in Java
I am using Scala.
>> Wrapping the entire app in a try/catch
Once the SparkContext object is created, a Future is started where actions and transformations are defined and the streaming context is
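
A hedged sketch of that structure (`startStreaming` is a hypothetical stand-in for the setup described above): block the main thread on the Future so any exception thrown inside it, including from awaitTermination, is rethrown in one place:

    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration.Duration

    // Hypothetical: defines the DStream logic, starts the StreamingContext,
    // and blocks in awaitTermination().
    def startStreaming(): Unit = ???

    def main(args: Array[String]): Unit = {
      val job: Future[Unit] = Future { startStreaming() }
      Await.result(job, Duration.Inf) // rethrows failures from the Future
    }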

Spark 2.4.4 with Hadoop 3.2.0

2019-11-19 Thread bsikander
Hi, Are Spark 2.4.4 and Hadoop 3.2.0 compatible? I tried to search the mailing list but couldn't find anything relevant.

Re: Spark 2.4.4 with Hadoop 3.2.0

2019-11-26 Thread bsikander
It could be that CDH6 has the integration, but somehow I am getting the following very frequently while building Spark 2.4.4 with Hadoop 3.2.0 and running the Spark tests: caused by: java.lang.IllegalArgumentException: Unrecognized Hadoop major version number: 3.2.0 at

Re: Problems during upgrade 2.2.2 -> 2.4.4

2020-01-24 Thread bsikander
Any help would be much appreciated.

Re: Problems during upgrade 2.2.2 -> 2.4.4

2020-01-31 Thread bsikander
Thank you for your reply. Which resource manager has support for rolling updates? YARN? Also, where can I find this information in the documentation?

Re: Problems during upgrade 2.2.2 -> 2.4.4

2020-01-29 Thread bsikander
Anyone? This question is not regarding my application running on top of Spark. The question is about the upgrade of Spark itself from 2.2 to 2.4. I expected at least that Spark would recover from upgrades gracefully and recover its own persisted objects.

Problems during upgrade 2.2.2 -> 2.4.4

2020-01-22 Thread bsikander
A few details about the clusters: - Current version: 2.2 - Resource manager: Spark standalone - Modes: cluster + supervise - HA setup: Zookeeper - Expected version after upgrade: 2.4.4 Note: Before and after the upgrade, everything works fine. During the upgrade, I see a number of issues. - Spark

Re: Problems during upgrade 2.2.2 -> 2.4.4

2020-01-22 Thread bsikander
After digging deeper, we found that the apps/workers inside Zookeeper are not deserializable but the drivers are. Due to this, the driver comes up (mysteriously). The deserialization is failing due to "RpcEndpointRef". I think somebody should be able to point me to a solution now, I guess.