[Worker Crashing] OutOfMemoryError: GC overhead limit exceeded

2017-03-24 Thread bsikander
Spark version: 1.6.2
Hadoop: 2.6.0

Cluster:
All VMs are deployed on AWS.
1 Master (t2.large)
1 Secondary Master (t2.large)
5 Workers (m4.xlarge)
Zookeeper (t2.large)

Recently, 2 of our workers went down with an out-of-memory exception.
java.lang.OutOfMemoryError: GC overhead limit exceeded (max heap: 1024 MB)

Both of these worker processes were in a hung state. We restarted them to
bring them back to a normal state.

Here is the complete exception:
https://gist.github.com/bsikander/84f1a0f3cc831c7a120225a71e435d91

Master's spark-defaults.conf file:
https://gist.github.com/bsikander/4027136f6a6c91eabad576495c4d797d

Master's spark-env.sh
https://gist.github.com/bsikander/42f76d7a8e4079098d8a2df3cdee8ee0

Slave's spark-defaults.conf file:
https://gist.github.com/bsikander/54264349b49e6227c6912eb14d344b8c

So, what could be the reason for our workers crashing with an OutOfMemoryError?
How can we avoid that in the future?
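
For reference, the 1024 MB max heap in the error matches the default heap of the standalone Master/Worker daemons. A minimal spark-env.sh sketch for raising it, assuming it is the Worker daemon JVM itself (not an executor) that is running out of memory:

# spark-env.sh on each worker VM: raise the standalone daemon (Master/Worker)
# heap from the 1g default, and dump the heap on OOM for later diagnosis.
SPARK_DAEMON_MEMORY=2g
SPARK_DAEMON_JAVA_OPTS="-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/spark-daemon.hprof"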






Re: Programmatically get status of job (WAITING/RUNNING)

2017-11-07 Thread bsikander
Anyone?






Re: Programmatically get status of job (WAITING/RUNNING)

2017-11-08 Thread bsikander
Thank you for the reply.

I am currently not using SparkLauncher to launch my driver. Rather, I am
using the old-fashioned spark-submit, and moving to SparkLauncher is not an
option right now.
Do I have any options there?






Re: Programmatically get status of job (WAITING/RUNNING)

2017-12-07 Thread bsikander


See the image. I am referring to this state when I say "Application State".






Re: Programmatically get status of job (WAITING/RUNNING)

2017-12-05 Thread bsikander
Thank you for the reply.

I am not a Spark expert, but I was reading through the code and I thought
that the state was changed from SUBMITTED to RUNNING only after the executors
(CoarseGrainedExecutorBackend) were registered.
https://github.com/apache/spark/commit/015f7ef503d5544f79512b626749a1f0c48b#diff-a755f3d892ff2506a7aa7db52022d77cR95

Since you mentioned that the Launcher has no idea about executors, my
understanding is probably not correct.



SparkListener is an option, but it has its own pitfalls.
1) If I use spark.extraListeners, I get all the events but I cannot
customize the Listener, since I have to pass the class as a string to
spark-submit/Launcher.
2) If I use context.addSparkListener, I can customize the listener, but then
I miss the onApplicationStart event. Also, I don't know Spark's logic for
changing the state of an application from WAITING -> RUNNING.

Maybe you can answer this:
if I have a Spark job which needs 3 executors and the cluster can only
provide 1 executor, will the application be in the WAITING or RUNNING state?
If I know Spark's logic, then I can program something with the
SparkListener.onExecutorAdded event to correctly figure out the state, along
the lines of the sketch below.
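
A minimal sketch of such a listener (the executor-count bookkeeping is my own; only the callback names come from the Spark API):

import java.util.concurrent.atomic.AtomicInteger
import org.apache.spark.scheduler.{SparkListener, SparkListenerExecutorAdded, SparkListenerExecutorRemoved}

// Tracks how many executors have registered so far; the job can be treated
// as "really running" once at least one executor is up.
class ExecutorTrackingListener extends SparkListener {
  private val executorCount = new AtomicInteger(0)

  override def onExecutorAdded(event: SparkListenerExecutorAdded): Unit = {
    val count = executorCount.incrementAndGet()
    println(s"Executor ${event.executorId} added, total = $count")
  }

  override def onExecutorRemoved(event: SparkListenerExecutorRemoved): Unit = {
    executorCount.decrementAndGet()
  }

  def hasExecutors: Boolean = executorCount.get() > 0
}

// Registered programmatically: sparkContext.addSparkListener(new ExecutorTrackingListener())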

Another alternative could be to use the Spark Master JSON endpoint
(http://<>:8080/json), but the problem with this is that it returns
everything and I was not able to find any way to filter it.






Re: Programmatically get status of job (WAITING/RUNNING)

2017-12-08 Thread bsikander
Qiao, Richard wrote
> Comparing #1 and #3, my understanding of “submitted” is “the jar is
> submitted to executors”. With this concept, you may define your own
> status.

In SparkLauncher, SUBMITTED means that the Driver was able to acquire cores
from the Spark cluster and the Launcher is waiting for the Driver to connect
back. Once it connects back, the state of the Driver is changed to CONNECTED.
As Marcelo mentioned, the Launcher can only tell me about the Driver state;
it is not possible to guess the state of the "application (executors)". For
the state of the executors we can use a SparkListener.

With the combination of both Launcher + Listener, I have a solution. As you
mentioned, even if only 1 executor is allocated to the "application", the
state will change to RUNNING. So in my application, I change the status of my
job to RUNNING only if I receive RUNNING from the Launcher and an
onExecutorAdded event from the SparkListener, roughly as sketched below.
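
A sketch of that combination logic, assuming a listener like the one shown earlier reports the executor count (the derived state names are mine, not Spark's):

import org.apache.spark.launcher.SparkAppHandle

// Derive our own job state from the Launcher's driver state plus the
// executor count reported by our SparkListener.
def deriveState(driverState: SparkAppHandle.State, executorsAdded: Int): String =
  driverState match {
    case SparkAppHandle.State.RUNNING if executorsAdded > 0 => "RUNNING"
    case SparkAppHandle.State.RUNNING => "WAITING" // driver up, but no executors yet
    case SparkAppHandle.State.SUBMITTED | SparkAppHandle.State.CONNECTED => "SUBMITTED"
    case other => other.toString
  }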






Re: Programmatically get status of job (WAITING/RUNNING)

2017-12-08 Thread bsikander
Qiao, Richard wrote
> For your question of example, the answer is yes.

Perfect. I am assuming that this is true for Spark standalone/YARN/Mesos.







Re: Programmatically get status of job (WAITING/RUNNING)

2017-12-04 Thread bsikander
So, I tried to use SparkAppHandle.Listener with SparkLauncher as you
suggested. The behavior of the Launcher is not what I expected.

1- If I start the job (using SparkLauncher) and my Spark cluster has enough
cores available, I receive events in my class extending
SparkAppHandle.Listener and I see the status changing from
UNKNOWN -> CONNECTED -> SUBMITTED -> RUNNING. All good here.

2- If my Spark cluster has cores only for my Driver process (running in
cluster mode) but no cores for my executors, then I still receive the RUNNING
event. I was expecting something else: since my executors have no cores and
the Master UI shows the WAITING state for the executors, the listener should
report SUBMITTED instead of RUNNING.

3- If my Spark cluster has no cores even for the driver process, then
SparkLauncher invokes no events at all. The state stays UNKNOWN. I would have
expected it to be in the SUBMITTED state at least.

*Is there any way I can reliably get the WAITING state of a job?*
Driver=RUNNING, executor=RUNNING: overall state should be RUNNING
Driver=RUNNING, executor=WAITING: overall state should be SUBMITTED/WAITING
Driver=WAITING, executor=WAITING: overall state should be
CONNECTED/SUBMITTED/WAITING
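
For context, this is roughly how the listener is attached (the resource path, class name, and master host are placeholders, not my real values):

import org.apache.spark.launcher.{SparkAppHandle, SparkLauncher}

val handle = new SparkLauncher()
  .setAppResource("/path/to/jar/file.jar")  // placeholder
  .setMainClass("my.main.class")            // placeholder
  .setMaster("spark://master-host:6066")    // placeholder
  .setDeployMode("cluster")
  .startApplication(new SparkAppHandle.Listener {
    // Fired on UNKNOWN -> CONNECTED -> SUBMITTED -> RUNNING -> ... transitions.
    override def stateChanged(h: SparkAppHandle): Unit =
      println(s"driver state changed: ${h.getState}")
    override def infoChanged(h: SparkAppHandle): Unit = ()
  })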










Re: Programmatically get status of job (WAITING/RUNNING)

2017-12-07 Thread bsikander
Marcelo Vanzin wrote
> I'm not sure I follow you here. This is something that you are
> defining, not Spark.

Yes, you are right. In my code:
1) my notion of RUNNING is that both the driver and the executors are in the
RUNNING state.
2) my notion of WAITING is that any one of the driver/executors is in the
WAITING state.

So:
- SparkLauncher provides me the details about the "driver":
RUNNING/SUBMITTED/WAITING.
- SparkListener provides me the details about the "executors" via
onExecutorAdded/onExecutorRemoved.

I want to combine both SparkLauncher + SparkListener to achieve my view of
RUNNING/WAITING.

The only thing confusing me here is that I don't know how Spark internally
moves an application from the WAITING to the RUNNING state.
For example, if an application wanted 4 executors
(spark.executor.instances=4) but the Spark cluster can only provide 1
executor, then I will only receive 1 onExecutorAdded event. Will the
application state change to RUNNING (even though only 1 executor was
allocated)?

Once I am clear on this logic, I can implement my feature.






Re: [SparkLauncher] stateChanged event not received in standalone cluster mode

2018-06-08 Thread bsikander
Thanks.






Re: [SparkLauncher] stateChanged event not received in standalone cluster mode

2018-06-06 Thread bsikander
Any help would be appreciated.






[ClusterMode] -Dspark.master with missing secondary master IP

2018-06-27 Thread bsikander
We recently transitioned from client mode to cluster mode with a Spark
Standalone deployment. We are using 2.2.1. We are also using SparkLauncher
to launch the driver.

The problem is that when my Driver is launched, the spark.master property
(-Dspark.master) is set to only the primary master IP, something like
"-Dspark.master=spark://:7077", even though I am passing the IPs and ports
of both the primary and the secondary master to the Launcher, roughly as
shown below.
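
A sketch of how the masters are passed (hostnames are placeholders; standalone HA expects a single comma-separated master URL):

import org.apache.spark.launcher.SparkLauncher

// Both REST endpoints go into one master URL, per the standalone HA format.
val launcher = new SparkLauncher()
  .setMaster("spark://primary-host:6066,secondary-host:6066") // placeholders
  .setDeployMode("cluster")
  .setAppResource("/path/to/jar/file.jar")                    // placeholder
  .setMainClass("my.main.class")                              // placeholder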

*Check the following output of the launcher:*

18/06/27 10:06:08 INFO RestSubmissionClient: Submitting a request to launch
an application in spark://:6066,:6066.
18/06/27 10:06:08 INFO RestSubmissionClient: Submission successfully created
as driver-20180627100608-0012. Polling submission state...
18/06/27 10:06:08 INFO RestSubmissionClient: Submitting a request for the
status of submission driver-20180627100608-0012 in
spark://:6066,:6066.
18/06/27 10:06:08 ERROR RestSubmissionClient: Error: Server responded with
message of unexpected type SubmissionStatusResponse.
18/06/27 10:06:08 INFO RestSubmissionClient: State of driver
driver-20180627100608-0012 is now RUNNING.
18/06/27 10:06:08 INFO RestSubmissionClient: Driver is running on worker
worker-20180627090529--44156 at :44156.
18/06/27 10:06:08 INFO RestSubmissionClient: Server responded with
CreateSubmissionResponse:
{
  "action" : "CreateSubmissionResponse",
  "message" : "Driver successfully submitted as driver-20180627100608-0012",
  "serverSparkVersion" : "2.2.1",
  "submissionId" : "driver-20180627100608-0012",
  "success" : true
}


*Here is the verbose output of my SparkLauncher:*

 18/06/27 10:06:08 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to
another address
Parsed arguments:
  master  spark://:6066,:6066
  deployMode  cluster
  executorMemory  null
  executorCores   null
  totalExecutorCores  null
  propertiesFile  null
  driverMemory1g
  driverCores null
  driverExtraClassPathnull
  driverExtraLibraryPath  null
  driverExtraJavaOptions  
  supervise   true
  queue   null
  numExecutorsnull
  files   null
  pyFiles null
  archivesnull
  mainClass   my.main.class
  primaryResource file:/path/to/jar/file.jar
  nametestName
  childArgs   [akka.tcp://MyTestSystem@:2552
jobManager-8acaf907-8696-4d8d-8127-fcb35ebae9fa random.conf]
  jarsnull
  packagesnull
  packagesExclusions  null
  repositoriesnull
  verbose true

Spark properties used, including those specified through
 --conf and those from the properties file null:
  (spark.driver.memory,1g)
 
(spark.executor.extraJavaOptions,-Dlog4j.configuration=file:log4j-server.properties)
  (spark.driver.extraJavaOptions,  )


Running Spark using the REST application submission protocol.
Main class:
org.apache.spark.deploy.rest.RestSubmissionClient
Arguments:
file:/path/to/jar/file.jar
my.main.class
akka.tcp://MyTestSystem@:2552
jobManager-8acaf907-8696-4d8d-8127-fcb35ebae9fa
random.conf
System properties:
(spark.driver.memory,1g)
(SPARK_SUBMIT,true)
(spark.executor.extraJavaOptions,-Dlog4j.configuration=file:log4j-server.properties)
(spark.driver.supervise,true)
(spark.app.name,testName)
(spark.driver.extraJavaOptions, )
(spark.jars,file:/path/to/jar/file.jar)
(spark.submit.deployMode,cluster)
(spark.master,spark://:6066,:6066)
Classpath elements:

*A few things that I find strange:*
1- I don't understand the following error:
18/06/27 10:06:08 ERROR RestSubmissionClient: Error: Server responded with
message of unexpected type SubmissionStatusResponse.

2- Why is Spark not using both IPs to launch the driver?

Any help or guidance would be much appreciated.







Re: [ClusterMode] -Dspark.master with missing secondary master IP

2018-06-27 Thread bsikander
We switched the port from 7077 to 6066 because we were losing 20 seconds each
time we launched a driver: 10 seconds per failed attempt to submit the driver
on :7077. After losing those 20 seconds, it used to fall back to the legacy
way of driver submission.

With 6066 we don't lose any time.






[SparkContext] will application immediately stop after sc.stop()?

2018-07-29 Thread bsikander
Is it possible that a job keeps on running for some time after
onApplicationEnd is fired?

For example, I have a Spark job which still has 10 batches to process, and
let's say that processing them will take 10 minutes. If I execute
sparkContext.stop(), I will receive onApplicationEnd immediately according to
this code. But after receiving the event, will my Spark job still continue to
run for 10 minutes and finish the batch execution?
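
For reference, a sketch of the two stop variants on the streaming side (assuming a StreamingContext named ssc; the graceful variant is the one I would expect to let pending batches finish):

// Immediate stop: the contexts are torn down right away; queued/in-flight
// batches are not guaranteed to complete.
ssc.stop(stopSparkContext = true, stopGracefully = false)

// Graceful stop: stop receiving new data and wait for already-received
// batches to be processed before shutting down.
ssc.stop(stopSparkContext = true, stopGracefully = true)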



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Strange behavior of Spark Masters during rolling update

2018-07-05 Thread bsikander
We have a Spark standalone cluster running on 2.2.1 in HA mode using
Zookeeper. Occasionally, we have a rolling update where first the Primary
master goes down, then the Secondary master, and then the Zookeeper nodes
running on their own VMs. See the image below.

[image: timeline of the two masters' states]

Legend:
- Green line shows Secondary Master
- Yellow line shows Primary Master
- "1.0" on vertical axis shows STANDBY
- "5.0" on vertical axis shows UP

In the image you can see the strange behavior of the Spark Masters:
- At 16:25, why did the masters switch roles between STANDBY and ALIVE?

Any help would be appreciated.






Re: Strange behavior of Spark Masters during rolling update

2018-07-09 Thread bsikander
Anyone? 






[REST API] Rest API unusable due to application id changing

2018-07-09 Thread bsikander
Spark provides a nice REST API to get metrics:
https://spark.apache.org/docs/latest/monitoring.html#rest-api

The problem is that this API is keyed by application ID, which can change if
we are running in supervise mode.

Any application built on top of the REST API has to deal with the changing
application ID.

Is there any solution to this problem?
I am using Spark standalone with cluster mode and supervise mode.
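
For illustration, one possible workaround: ask the running driver's own UI for its current application ID, a rough sketch (driver host/port are placeholders, the JSON parsing is deliberately naive, and in supervise mode the driver host itself may move after a restart):

import scala.io.Source

// The application's own UI reports the id of the current incarnation.
val json = Source.fromURL("http://driver-host:4040/api/v1/applications").mkString
// Naive extraction of the first "id" field; real code would use a JSON library.
val appId = """"id"\s*:\s*"([^"]+)"""".r.findFirstMatchIn(json).map(_.group(1))
println(appId)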







Re: Properly stop applications or jobs within the application

2018-03-08 Thread bsikander
Any help would be much appreciated. This seems to be a common problem.






Re: Properly stop applications or jobs within the application

2018-03-08 Thread bsikander
I have scenarios for both.
So, I want to kill both batch and streaming jobs midway, if required.

Use case:
Normally, if everything is okay, we don't kill the application, but sometimes
while accessing external resources (like Kafka) something can go wrong. In
that case, the application can become useless because it is not doing
anything useful, so we want to kill it (midway). In such a case, when we
kill it, sometimes the application becomes a zombie and doesn't get killed
programmatically (at least, this is what we found). A kill through the Master
UI or a manual kill -9 is required to clean up the zombies.






Re: Properly stop applications or jobs within the application

2018-03-08 Thread bsikander
I am running in Spark standalone mode. No YARN.

Anyway, yarn application -kill is a manual process. I do not want that. I
want to properly kill the driver/application programmatically.






Re: Properly stop applications or jobs within the application

2018-03-06 Thread bsikander
It seems to be related to this issue from Kafka
https://issues.apache.org/jira/browse/KAFKA-1894






Re: [ClusterMode] -Dspark.master with missing secondary master IP

2018-06-28 Thread bsikander
I did some further investigation.

If I launch a driver in cluster mode with master IPs like
spark://:7077,:7077, then the driver is launched with both IPs and the
-Dspark.master property has both IPs.

But within the logs I see the following; it causes a 20-second delay while
launching each driver:
18/06/28 08:19:34 INFO RestSubmissionClient: Submitting a request to launch
an application in spark://:7077,:7077.
18/06/28 08:19:44 WARN RestSubmissionClient: Unable to connect to server
spark://:7077.
18/06/28 08:19:54 WARN RestSubmissionClient: Unable to connect to server
spark://:7077.
Warning: Master endpoint spark://:7077,:7077 was not a
REST server. Falling back to legacy submission gateway instead.








Re: [SparkLauncher] -Dspark.master with missing secondary master IP

2018-06-29 Thread bsikander
This is what my Driver launch command looks like; it contains only 1 master
in the -Dspark.master property, whereas from the Launcher I am passing 2
masters with port 6066.

Launch Command: "/path/to/java" "-cp" "" "-Xmx1024M"
"-Dspark.executor.extraJavaOptions=-Dlog4j.configuration=file:log4j-server.properties"
"-Dspark.driver.extraJavaOptions=-XX:+UseConcMarkSweepGC -verbose:gc
-XX:+PrintGCTimeStamps -XX:+CMSClassUnloadingEnabled
-XX:MaxDirectMemorySize=512M -XX:+HeapDumpOnOutOfMemoryError
-Djava.net.preferIPv4Stack=true 
-Djava.io.tmpdir=/tmp/spark -Dorg.xerial.snappy.tempdir=/tmp/spark
-Dlog4j.configuration=file:log4j-server.properties  "
"-Dspark.submit.deployMode=cluster"
"-Dspark.master=spark://:7077" 
"-Dspark.driver.supervise=true" 
"-Dspark.driver.memory=1g" 
"-Dspark.app.name=myClass"
"-Dspark.jars=file:myJar.jar" 
"-XX:+UseConcMarkSweepGC" "-verbose:gc" "-XX:+PrintGCTimeStamps"
"-XX:+CMSClassUnloadingEnabled" "-XX:MaxDirectMemorySize=512M"
"-XX:+HeapDumpOnOutOfMemoryError" "-Djava.net.preferIPv4Stack=true"
"org.apache.spark.deploy.worker.DriverWrapper"
"spark://Worker@:36057"
"/path/to/spark/worker_dir/driver-20180629101706-0064/myJar.jar" "myClass"
"arg1" "arg2" "arg3"


*NOTE: I am running my standalone Spark cluster in HA mode using Zookeeper.*

The reason why I want both masters in the driver command:
if the driver loses its connection to the only master it knows about, it
kills itself, as you can see in the following logs:

2018-06-27 09:09:03,390 INFO appclient-register-master-threadpool-0
org.apache.spark.deploy.client.StandaloneAppClient$ClientEndpoint []:
Connecting to master spark://:7077...
2018-06-27 09:09:23,390 INFO appclient-register-master-threadpool-0
org.apache.spark.deploy.client.StandaloneAppClient$ClientEndpoint []:
Connecting to master spark://:7077...
2018-06-27 09:09:43,392 ERROR appclient-registration-retry-thread
org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend []:
Application has been killed. Reason: All masters are unresponsive! Giving
up.
2018-06-27 09:09:43,392 WARN JobServer-akka.actor.default-dispatcher-15
org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend []:
Application ID is not initialized yet.






Re: [SparkLauncher] -Dspark.master with missing secondary master IP

2018-06-29 Thread bsikander
Can anyone please help?






Re: [Spark-Core] Long scheduling delays (1+ hour)

2018-11-12 Thread bsikander
Forgot to add the link
https://jira.apache.org/jira/browse/KAFKA-5649






[Spark-Core] Long scheduling delays (1+ hour)

2018-11-07 Thread bsikander
We are facing an issue with very long scheduling delays in Spark (up to 1+
hour).
We are using Spark standalone. The data is being pulled from Kafka.

Any help would be much appreciated.

I have attached the screenshots.









Re: [Spark-Core] Long scheduling delays (1+ hour)

2018-11-07 Thread bsikander
Actually, our job runs fine for 17-18 hours and this behavior just suddenly
starts happening after that.

We found the following ticket, which describes exactly what is happening in
our Kafka cluster as well:
WARN Failed to send SSL Close message
(org.apache.kafka.common.network.SslTransportLayer)

You also replied to this ticket with a problem very similar to ours.

What fix did you apply to avoid these SSL Close exceptions and the long
delays in the Spark job?






Re: [Spark-Core] Long scheduling delays (1+ hour)

2018-11-09 Thread bsikander
Could you please give some feedback?






[Spark UI] find driver for an application

2018-09-24 Thread bsikander
Hello,
I am having some trouble using the Spark Master UI to figure out some basic
information; the process is too tedious.
I am using Spark 2.2.1 with Spark standalone.

- In cluster mode, how do I figure out which driver is related to which
application?
- In supervise mode, how do I track the restarts? How many times was the
driver restarted, what are the app IDs of all the applications after a
restart, and which VM IP was it running on?

Any help would be much appreciated.







Re: Streaming job, catch exceptions

2019-05-21 Thread bsikander
Just to add to my previous message:
I am using the Spark 2.2.2 standalone cluster manager and deploying the jobs
in cluster mode.






Re: Streaming job, catch exceptions

2019-05-21 Thread bsikander
I was able to reproduce the problem.

In the repository below, I have 2 sample jobs. Both execute 1/0
(ArithmeticException) on the executor side. In the case of the
NetworkWordCount job, awaitTermination throws the same exception (Job aborted
due to stage failure ...) that I can see in the driver/executor logs and
terminates the Spark job (which is expected), but in the other job
(QueueStream), I see the exceptions in the driver/executor logs, yet no
exception is thrown by the awaitTermination method and the job continues.

https://github.com/bsikander/spark-reproduce/

I am trying to understand why this behavior happens. Any help would be
appreciated.






Re: Streaming job, catch exceptions

2019-05-21 Thread bsikander
Umm, I am not sure I fully got this.

Is it a design decision not to have context.stop() right after
awaitTermination throws an exception?

So the philosophy is that if a task fails after n tries (default 4), Spark
should fail fast and let the user know? Is this correct?

As you mentioned, there are many error classes, so the chances of getting an
exception are quite high. If the above philosophy is correct, then it makes
it really hard to keep the job up and running all the time, especially in
streaming cases.







Re: Streaming job, catch exceptions

2019-05-21 Thread bsikander
OK, great. I understood the philosophy, thanks.






Re: Streaming job, catch exceptions

2019-05-21 Thread bsikander
Ok, I found the reason.

In my QueueStream example, I have a while(true) loop which keeps on adding
RDDs; my awaitTermination call is after the while loop. Since the while loop
never exits, awaitTermination never gets invoked and the exceptions are never
reported.

That was just a problem with the code I used to demonstrate my issue.

My real problem was due to the shutdown behavior of Spark. Spark Streaming
does the following:

- context.start() triggers the pipeline, and context.awaitTermination()
blocks the current thread; whenever an exception is reported,
awaitTermination rethrows it. Since we generally have no code after
awaitTermination, the shutdown hooks get called, which stop the Spark
context.

- I am using spark-jobserver. When an exception is reported by
awaitTermination, jobserver catches it and updates the status of the job in
the database, but the driver process keeps on running because the main thread
in the driver is waiting for an Akka actor belonging to jobserver to shut
down. Since it never shuts down, the driver keeps on running and nobody
executes context.stop(). Since context.stop() is never executed, the job
scheduler and generator keep on running, and the job keeps on going.

This implicit behavior of Spark, where it relies on shutdown hooks to close
the context, is a bit strange. I believe that as soon as an exception is
reported, Spark should just execute context.stop(). This behavior can have
serious consequences, e.g. data loss. We will work around it on our side,
though.

What is your opinion on stopping the context as soon as an exception is
raised?
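
Concretely, the pattern I have in mind, a sketch (the app name and batch interval are placeholders):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(new SparkConf().setAppName("example"), Seconds(1))
// ... define the streaming pipeline here ...
try {
  ssc.start()
  ssc.awaitTermination() // rethrows any exception reported by the streaming job
} catch {
  case e: Exception =>
    // Stop the context explicitly instead of relying on shutdown hooks,
    // so the scheduler/generator cannot keep running after the failure.
    ssc.stop(stopSparkContext = true, stopGracefully = false)
    throw e
}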






Re: Streaming job, catch exceptions

2019-05-15 Thread bsikander
Any help would be much appreciated.

The error and question is quite generic, i believe that most experienced
users will be able to answer.







Re: Streaming job, catch exceptions

2019-05-12 Thread bsikander
Hi,
Anyone? This should be a straightforward one :)






Re: Streaming job, catch exceptions

2019-05-12 Thread bsikander
>> Code would be very helpful,
I will try to put together something to post here.

>> 1. Writing in Java
I am using Scala.


>> Wrapping the entire app in a try/catch
Once the SparkContext object is created, a Future is started in which the
actions and transformations are defined and the streaming context is started.
I am using spark-jobserver, and this is how the job is started (job.runJob()
defines all the actions/transformations and starts the streaming context).
See this. As mentioned in my original message, I am sometimes able to catch
the exception in this block.
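
For illustration, the rough shape of that launch path, a sketch (runJob and ssc are stand-ins here, not the actual jobserver API):

import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global
import scala.util.{Failure, Success}

// The job is kicked off asynchronously; exceptions thrown inside the Future
// (including those rethrown by awaitTermination) surface in onComplete on a
// pool thread, not in the submitting thread, which is why they are only
// sometimes caught where expected.
val jobFuture = Future {
  runJob(ssc) // stand-in: defines actions/transformations, starts the context
  ssc.awaitTermination()
}
jobFuture.onComplete {
  case Success(_) => println("job finished cleanly")
  case Failure(e) => println(s"job failed: ${e.getMessage}")
}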

>> 3. Executing in local mode
I am running in cluster mode.


>> The code that is throwing the exceptions is not executed locally in the
>> driver process. Spark is executing the failing code on the cluster.
Yes, the code is executing in the executors, but once it fails 4 times, the
exception seems to get thrown on the driver side.








Spark 2.4.4 with Hadoop 3.2.0

2019-11-19 Thread bsikander
Hi,
Are Spark 2.4.4 and Hadoop 3.2.0 compatible?
I tried to search the mailing list but couldn't find anything relevant.







Re: Spark 2.4.4 with Hadoop 3.2.0

2019-11-26 Thread bsikander
It could be that CDH6 has the integration, but somehow I am getting the
following very frequently while building Spark 2.4.4 with Hadoop 3.2.0 and
running the Spark tests:

caused by: java.lang.IllegalArgumentException: Unrecognized Hadoop major
version number: 3.2.0
        at org.apache.hadoop.hive.shims.ShimLoader.getMajorVersion(ShimLoader.java:174)
        at org.apache.hadoop.hive.shims.ShimLoader.loadShims(ShimLoader.java:139)
        at org.apache.hadoop.hive.shims.ShimLoader.getHadoopShims(ShimLoader.java:100)
        at org.apache.hadoop.hive.conf.HiveConf$ConfVars.<clinit>(HiveConf.java:368)








Re: Problems during upgrade 2.2.2 -> 2.4.4

2020-01-24 Thread bsikander
Any help would be much appreciated.






Re: Problems during upgrade 2.2.2 -> 2.4.4

2020-01-31 Thread bsikander
Thank you for your reply.

Which resource manager has support for rolling updates? YARN?
Also, where can I find this information in the documentation?






Re: Problems during upgrade 2.2.2 -> 2.4.4

2020-01-29 Thread bsikander
Anyone?
This question is not about my application running on top of Spark.
The question is about the upgrade of Spark itself from 2.2 to 2.4.

I expected at least that Spark would recover from upgrades gracefully and
recover its own persisted objects.






Problems during upgrade 2.2.2 -> 2.4.4

2020-01-22 Thread bsikander
A few details about the clusters:

- Current Version 2.2
- Resource manager: Spark standalone
- Modes: cluster + supervise
- HA setup: Zookeeper
- Expected version after upgrade: 2.4.4

Note: Before and after the upgrade, everything works fine.

During the upgrade, I see a number of issues:
- The Spark master on version 2.4.4 tries to recover itself from Zookeeper,
fails to deserialize the driver/app/worker objects, and throws
InvalidClassException.
- The Spark master (2.4.4), after failing to deserialize, deletes all the
information about drivers/apps/workers from Zookeeper and loses all contact
with the running JVMs.
- Sometimes it mysteriously respawns the drivers, but with new IDs, having no
clue about the old ones. Sometimes multiple "same" drivers are running at the
same time with different IDs.
- The old Spark workers (2.2) fail to communicate with the new Spark master
(2.4.4).

I checked the release notes and couldn't find anything regarding upgrades.

Could someone please help me answer the questions above and maybe point me to
some documentation regarding upgrades? Or, if upgrades are not supported,
some documentation which explains this would be helpful.

Exceptions as seen on master:
2020-01-21 23:58:09,010 INFO dispatcher-event-loop-1-EventThread org.apache.spark.deploy.master.ZooKeeperLeaderElectionAgent: We have gained leadership
2020-01-21 23:58:09,073 ERROR dispatcher-event-loop-1 org.apache.spark.util.Utils: Exception encountered
java.io.InvalidClassException: org.apache.spark.rpc.RpcEndpointRef; local class incompatible: stream classdesc serialVersionUID = 1835832137613908542, local class serialVersionUID = -1329125091869941550
        at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:687)
        at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1880)
        at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1746)
        at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1880)
        at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1746)
        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2037)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1568)
        at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2282)
        at java.io.ObjectInputStream.defaultReadObject(ObjectInputStream.java:558)
        at org.apache.spark.deploy.master.ApplicationInfo$$anonfun$readObject$1.apply$mcV$sp(ApplicationInfo.scala:55)
        at org.apache.spark.deploy.master.ApplicationInfo$$anonfun$readObject$1.apply(ApplicationInfo.scala:54)
        at org.apache.spark.deploy.master.ApplicationInfo$$anonfun$readObject$1.apply(ApplicationInfo.scala:54)
        at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1326)
        at org.apache.spark.deploy.master.ApplicationInfo.readObject(ApplicationInfo.scala:54)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1160)
        at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2173)
        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2064)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1568)
        at java.io.ObjectInputStream.readObject(ObjectInputStream.java:428)
        at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:75)
        at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:108)
        at org.apache.spark.deploy.master.ZooKeeperPersistenceEngine.org$apache$spark$deploy$master$ZooKeeperPersistenceEngine$$deserializeFromFile(ZooKeeperPersistenceEngine.scala:76)
        at org.apache.spark.deploy.master.ZooKeeperPersistenceEngine$$anonfun$read$2.apply(ZooKeeperPersistenceEngine.scala:59)
        at org.apache.spark.deploy.master.ZooKeeperPersistenceEngine$$anonfun$read$2.apply(ZooKeeperPersistenceEngine.scala:59)
        at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
        at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
        at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
        at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
        at org.apache.spark.deploy.master.ZooKeeperPersistenceEngine.read(ZooKeeperPersistenceEngine.scala:59)
        at

Re: Problems during upgrade 2.2.2 -> 2.4.4

2020-01-22 Thread bsikander
After digging deeper, we found that the apps/workers inside Zookeeper are not
deserializable, but the drivers are.
Because of this, the driver comes up (mysteriously).

The deserialization is failing on "RpcEndpointRef".

I think somebody should be able to point me to a solution now, I guess.


