Re: Spark Window Documentation

2020-05-08 Thread Jacek Laskowski
Hi Neeraj,

I'd start from "Contributing Documentation Changes" in
https://spark.apache.org/contributing.html
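Roughly, if memory serves (please double-check against that guide), the guide
sources live under docs/ in the Spark repo and are built with Jekyll, while the
API docs are generated from the docstrings in the source. A sketch of the flow:

    git clone https://github.com/apache/spark.git
    cd spark/docs
    # preview guide changes locally; SKIP_API skips regenerating the API docs
    SKIP_API=1 bundle exec jekyll serve
    # then open a pull request against apache/spark on GitHub, as for code changes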

Pozdrawiam,
Jacek Laskowski

https://about.me/JacekLaskowski
"The Internals Of" Online Books 
Follow me on https://twitter.com/jaceklaskowski




On Fri, May 8, 2020 at 10:37 PM neeraj bhadani 
wrote:

> Thanks, Jacek, for sharing the details. I can see some examples here
> https://github.com/apache/spark/blob/master/python/pyspark/sql/window.py#L83
> as mentioned in the original email, but I am not sure where this is reflected
> in the Spark documentation. Also, what would be the process to contribute to
> the Spark docs? I checked the section "Contributing Documentation Changes" at
> https://spark.apache.org/contributing.html but couldn't find a way to
> contribute. I might be missing something here.
>
> If someone can help with how to contribute to the Spark docs, that would be great.
>
> Regards,
> Neeraj
>
>
> On Fri, May 8, 2020 at 12:39 PM Jacek Laskowski  wrote:
>
>> Hi Neeraj,
>>
>> I'm not a committer so I might be wrong, but there is no "blessed way" to
>> include examples.
>>
>> There are some examples in the official documentation at
>> http://spark.apache.org/docs/latest/sql-programming-guide.html, but these
>> show how to use the general concepts, not specific operators.
>>
>> There are some examples at http://spark.apache.org/examples.html
>>
>> I think the best way would be to include examples as close to the methods
>> as possible and scaladoc/javadoc would be best IMHO.
>>
>> p.s. Just yesterday there was this thread "What open source projects have
>> the best docs?" on twitter @
>> https://twitter.com/adamwathan/status/1257641015835611138. You could
>> borrow some ideas from the docs that are claimed to be "the best".
>>
>> Pozdrawiam,
>> Jacek Laskowski
>> 
>> https://about.me/JacekLaskowski
>> "The Internals Of" Online Books 
>> Follow me on https://twitter.com/jaceklaskowski
>>
>> 
>>
>>
>> On Fri, May 8, 2020 at 11:34 AM neeraj bhadani <
>> bhadani.neeraj...@gmail.com> wrote:
>>
>>> Hi Team,
>>> I was looking for a Spark window function example in the documentation.
>>>
>>> For example, I can see that the function definition and params are explained
>>> nicely here:
>>> https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Window.rowsBetween
>>>
>>> and this is the source, which is available since Spark version 2.1:
>>> https://spark.apache.org/docs/latest/api/python/_modules/pyspark/sql/window.html#Window.rowsBetween
>>>
>>> But I couldn't find an example which helps to understand how it works.
>>>
>>> However, while browsing the GitHub code I found an example here:
>>> https://github.com/apache/spark/blob/master/python/pyspark/sql/window.py#L83
>>>
>>> which I couldn't find on the official Spark doc page. Where and how is this
>>> example linked to the official Spark documentation?
>>>
>>> If such examples are not available, could you please share the process
>>> for contributing examples to the Spark documentation?
>>>
>>> Regards,
>>> Neeraj
>>>
>>


Re: Spark Window Documentation

2020-05-08 Thread neeraj bhadani
Thanks, Jacek, for sharing the details. I can see some examples here
https://github.com/apache/spark/blob/master/python/pyspark/sql/window.py#L83
as mentioned in the original email, but I am not sure where this is reflected
in the Spark documentation. Also, what would be the process to contribute to
the Spark docs? I checked the section "Contributing Documentation Changes" at
https://spark.apache.org/contributing.html but couldn't find a way to
contribute. I might be missing something here.

If someone can help with how to contribute to the Spark docs, that would be great.

Regards,
Neeraj


On Fri, May 8, 2020 at 12:39 PM Jacek Laskowski  wrote:

> Hi Neeraj,
>
> I'm not a committer so I might be wrong, but there is no "blessed way" to
> include examples.
>
> There are some examples in the official documentation at
> http://spark.apache.org/docs/latest/sql-programming-guide.html, but these
> show how to use the general concepts, not specific operators.
>
> There are some examples at http://spark.apache.org/examples.html
>
> I think the best way would be to include examples as close to the methods
> as possible and scaladoc/javadoc would be best IMHO.
>
> p.s. Just yesterday there was this thread "What open source projects have
> the best docs?" on twitter @
> https://twitter.com/adamwathan/status/1257641015835611138. You could
> borrow some ideas from the docs that are claimed to be "the best".
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://about.me/JacekLaskowski
> "The Internals Of" Online Books 
> Follow me on https://twitter.com/jaceklaskowski
>
> 
>
>
> On Fri, May 8, 2020 at 11:34 AM neeraj bhadani <
> bhadani.neeraj...@gmail.com> wrote:
>
>> Hi Team,
>> I was looking for a Spark window function example in the documentation.
>>
>> For example, I can see that the function definition and params are explained
>> nicely here:
>> https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Window.rowsBetween
>>
>> and this is the source, which is available since Spark version 2.1:
>> https://spark.apache.org/docs/latest/api/python/_modules/pyspark/sql/window.html#Window.rowsBetween
>>
>> But I couldn't find an example which helps to understand how it works.
>>
>> However, while browsing the GitHub code I found an example here:
>> https://github.com/apache/spark/blob/master/python/pyspark/sql/window.py#L83
>>
>> which I couldn't find on the official Spark doc page. Where and how is this
>> example linked to the official Spark documentation?
>>
>> If such examples are not available, could you please share the process for
>> contributing examples to the Spark documentation?
>>
>> Regards,
>> Neeraj
>>
>


Re: java.lang.OutOfMemoryError Spark Worker

2020-05-08 Thread Russell Spitzer
The error is in the Spark Standalone Worker. It's hitting an OOM while
launching/running an executor process. Specifically it's running out of
memory when parsing the hadoop configuration trying to figure out the
env/command line to run

https://github.com/apache/spark/blob/branch-2.4/core/src/main/scala/org/apache/spark/deploy/worker/ExecutorRunner.scala#L142-L149

Now this is usually something I wouldn't expect to happen, since a Spark Worker
is generally a very lightweight process. Unless it was accumulating a lot of
state it should be relatively small, and it should be very unlikely that
generating a command-line string would cause this error unless the application
configuration was gigantic. So while it's possible you just have very large
Hadoop XML config files, it is probably not this specific action that is
OOMing; rather, this is the straw that broke the camel's back and the worker
just has too much other state.

This may not be pathological; it may just be that you are running a lot of
executors, or that the worker is keeping track of lots of started and shut-down
executor metadata or something like that, and it's not a big deal.
You could fix this by limiting the amount of metadata preserved after jobs are
run (see the spark.deploy.* options for retaining apps and the Spark Worker
cleanup settings), or by increasing the Spark Worker's heap (SPARK_DAEMON_MEMORY).

If I hit this I would start by bumping the daemon memory.
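For illustration only, a sketch of where those knobs usually live (the values
are made up; check the defaults and exact property names for your Spark version):

    # conf/spark-env.sh on each worker host
    export SPARK_DAEMON_MEMORY=2g   # heap for the standalone Master/Worker daemons
    export SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true \
      -Dspark.worker.cleanup.appDataTtl=604800 \
      -Dspark.worker.ui.retainedExecutors=200 \
      -Dspark.worker.ui.retainedDrivers=200"

The cleanup.* settings clean finished applications' work directories on disk,
while the retained* limits and the daemon heap are what matter for the worker
process's own memory.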

On Fri, May 8, 2020 at 11:59 AM Hrishikesh Mishra 
wrote:

> We submit the Spark job through the spark-submit command, like the one below.
>
>
> sudo /var/lib/pf-spark/bin/spark-submit \
> --total-executor-cores 30 \
> --driver-cores 2 \
> --class com.hrishikesh.mishra.Main \
> --master spark://XX.XX.XXX.19:6066  \
> --deploy-mode cluster  \
> --supervise
> http://XX.XX.XXX.19:90/jar/fk-runner-framework-1.0-SNAPSHOT.jar
>
>
>
>
> We have a Python HTTP server where we host all the jars.
>
> The user killed the driver driver-20200508153502-1291, and it's visible in the
> log too, but this is not the problem. The OOM is separate from this.
>
> 20/05/08 15:36:55 INFO Worker: Asked to kill driver
> driver-20200508153502-1291
>
> 20/05/08 15:36:55 INFO DriverRunner: Killing driver process!
>
> 20/05/08 15:36:55 INFO CommandUtils: Redirection to
> /grid/1/spark/work/driver-20200508153502-1291/stderr closed: Stream closed
>
> 20/05/08 15:36:55 INFO CommandUtils: Redirection to
> /grid/1/spark/work/driver-20200508153502-1291/stdout closed: Stream closed
>
> 20/05/08 15:36:55 INFO ExternalShuffleBlockResolver: Application
> app-20200508153654-11776 removed, cleanupLocalDirs = true
>
> 20/05/08 15:36:55 INFO Worker: Driver driver-20200508153502-1291 was
> killed by user
>
> *20/05/08 15:43:06 WARN AbstractChannelHandlerContext: An exception
> 'java.lang.OutOfMemoryError: Java heap space' [enable DEBUG level for full
> stacktrace] was thrown by a user handler's exceptionCaught() method while
> handling the following exception:*
>
> *java.lang.OutOfMemoryError: Java heap space*
>
> *20/05/08 15:43:23 ERROR SparkUncaughtExceptionHandler: Uncaught exception
> in thread Thread[dispatcher-event-loop-6,5,main]*
>
> *java.lang.OutOfMemoryError: Java heap space*
>
> *20/05/08 15:43:17 WARN AbstractChannelHandlerContext: An exception
> 'java.lang.OutOfMemoryError: Java heap space' [enable DEBUG level for full
> stacktrace] was thrown by a user handler's exceptionCaught() method while
> handling the following exception:*
>
> *java.lang.OutOfMemoryError: Java heap space*
>
> 20/05/08 15:43:33 INFO ExecutorRunner: Killing process!
>
> 20/05/08 15:43:33 INFO ExecutorRunner: Killing process!
>
> 20/05/08 15:43:33 INFO ExecutorRunner: Killing process!
>
> 20/05/08 15:43:33 INFO ShutdownHookManager: Shutdown hook called
>
> 20/05/08 15:43:33 INFO ShutdownHookManager: Deleting directory
> /grid/1/spark/local/spark-e045e069-e126-4cff-9512-d36ad30ee922
>
>
> On Fri, May 8, 2020 at 9:27 PM Jacek Laskowski  wrote:
>
>> Hi,
>>
>> It's been a while since I worked with Spark Standalone, but I'd check the
>> logs of the workers. How do you spark-submit the app?
>>
>> DId you check /grid/1/spark/work/driver-20200508153502-1291 directory?
>>
>> Pozdrawiam,
>> Jacek Laskowski
>> 
>> https://about.me/JacekLaskowski
>> "The Internals Of" Online Books 
>> Follow me on https://twitter.com/jaceklaskowski
>>
>> 
>>
>>
>> On Fri, May 8, 2020 at 2:32 PM Hrishikesh Mishra 
>> wrote:
>>
>>> Thanks, Jacek, for the quick response.
>>> Due to our system constraints, we can't move to Structured Streaming
>>> now. But YARN can definitely be tried out.
>>>
>>> But my problem is that I'm not able to figure out where the issue is: the
>>> driver, the executor, or the worker. Even the exceptions are clueless. Please
>>> see the exception below; I'm unable to spot the cause of the OOM.
>>>
>>> 20/05/08 15:36:55 INFO Worker: Asked to kill driver
>>> driver-20200508153502-1291
>>>
>>> 20/05/08 15:36:55 INFO DriverRunner: Killing driver process!
>>>
>>> 20/05/08 15:36:55 INFO 

Re: java.lang.OutOfMemoryError Spark Worker

2020-05-08 Thread Hrishikesh Mishra
We submit the Spark job through the spark-submit command, like the one below.


sudo /var/lib/pf-spark/bin/spark-submit \
--total-executor-cores 30 \
--driver-cores 2 \
--class com.hrishikesh.mishra.Main \
--master spark://XX.XX.XXX.19:6066  \
--deploy-mode cluster  \
--supervise http://XX.XX.XXX.19:90/jar/fk-runner-framework-1.0-SNAPSHOT.jar




We have a Python HTTP server where we host all the jars.

The user killed the driver driver-20200508153502-1291, and it's visible in the
log too, but this is not the problem. The OOM is separate from this.

20/05/08 15:36:55 INFO Worker: Asked to kill driver
driver-20200508153502-1291

20/05/08 15:36:55 INFO DriverRunner: Killing driver process!

20/05/08 15:36:55 INFO CommandUtils: Redirection to
/grid/1/spark/work/driver-20200508153502-1291/stderr closed: Stream closed

20/05/08 15:36:55 INFO CommandUtils: Redirection to
/grid/1/spark/work/driver-20200508153502-1291/stdout closed: Stream closed

20/05/08 15:36:55 INFO ExternalShuffleBlockResolver: Application
app-20200508153654-11776 removed, cleanupLocalDirs = true

20/05/08 15:36:55 INFO Worker: Driver driver-20200508153502-1291 was
killed by user

*20/05/08 15:43:06 WARN AbstractChannelHandlerContext: An exception
'java.lang.OutOfMemoryError: Java heap space' [enable DEBUG level for full
stacktrace] was thrown by a user handler's exceptionCaught() method while
handling the following exception:*

*java.lang.OutOfMemoryError: Java heap space*

*20/05/08 15:43:23 ERROR SparkUncaughtExceptionHandler: Uncaught exception
in thread Thread[dispatcher-event-loop-6,5,main]*

*java.lang.OutOfMemoryError: Java heap space*

*20/05/08 15:43:17 WARN AbstractChannelHandlerContext: An exception
'java.lang.OutOfMemoryError: Java heap space' [enable DEBUG level for full
stacktrace] was thrown by a user handler's exceptionCaught() method while
handling the following exception:*

*java.lang.OutOfMemoryError: Java heap space*

20/05/08 15:43:33 INFO ExecutorRunner: Killing process!

20/05/08 15:43:33 INFO ExecutorRunner: Killing process!

20/05/08 15:43:33 INFO ExecutorRunner: Killing process!

20/05/08 15:43:33 INFO ShutdownHookManager: Shutdown hook called

20/05/08 15:43:33 INFO ShutdownHookManager: Deleting directory
/grid/1/spark/local/spark-e045e069-e126-4cff-9512-d36ad30ee922


On Fri, May 8, 2020 at 9:27 PM Jacek Laskowski  wrote:

> Hi,
>
> It's been a while since I worked with Spark Standalone, but I'd check the
> logs of the workers. How do you spark-submit the app?
>
> DId you check /grid/1/spark/work/driver-20200508153502-1291 directory?
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://about.me/JacekLaskowski
> "The Internals Of" Online Books 
> Follow me on https://twitter.com/jaceklaskowski
>
> 
>
>
> On Fri, May 8, 2020 at 2:32 PM Hrishikesh Mishra 
> wrote:
>
>> Thanks, Jacek, for the quick response.
>> Due to our system constraints, we can't move to Structured Streaming now.
>> But YARN can definitely be tried out.
>>
>> But my problem is that I'm not able to figure out where the issue is: the
>> driver, the executor, or the worker. Even the exceptions are clueless. Please
>> see the exception below; I'm unable to spot the cause of the OOM.
>>
>> 20/05/08 15:36:55 INFO Worker: Asked to kill driver
>> driver-20200508153502-1291
>>
>> 20/05/08 15:36:55 INFO DriverRunner: Killing driver process!
>>
>> 20/05/08 15:36:55 INFO CommandUtils: Redirection to
>> /grid/1/spark/work/driver-20200508153502-1291/stderr closed: Stream closed
>>
>> 20/05/08 15:36:55 INFO CommandUtils: Redirection to
>> /grid/1/spark/work/driver-20200508153502-1291/stdout closed: Stream closed
>>
>> 20/05/08 15:36:55 INFO ExternalShuffleBlockResolver: Application
>> app-20200508153654-11776 removed, cleanupLocalDirs = true
>>
>> 20/05/08 15:36:55 INFO Worker: Driver driver-20200508153502-1291 was
>> killed by user
>>
>> *20/05/08 15:43:06 WARN AbstractChannelHandlerContext: An exception
>> 'java.lang.OutOfMemoryError: Java heap space' [enable DEBUG level for full
>> stacktrace] was thrown by a user handler's exceptionCaught() method while
>> handling the following exception:*
>>
>> *java.lang.OutOfMemoryError: Java heap space*
>>
>> *20/05/08 15:43:23 ERROR SparkUncaughtExceptionHandler: Uncaught
>> exception in thread Thread[dispatcher-event-loop-6,5,main]*
>>
>> *java.lang.OutOfMemoryError: Java heap space*
>>
>> *20/05/08 15:43:17 WARN AbstractChannelHandlerContext: An exception
>> 'java.lang.OutOfMemoryError: Java heap space' [enable DEBUG level for full
>> stacktrace] was thrown by a user handler's exceptionCaught() method while
>> handling the following exception:*
>>
>> *java.lang.OutOfMemoryError: Java heap space*
>>
>> 20/05/08 15:43:33 INFO ExecutorRunner: Killing process!
>>
>> 20/05/08 15:43:33 INFO ExecutorRunner: Killing process!
>>
>> 20/05/08 15:43:33 INFO ExecutorRunner: Killing process!
>>
>> 20/05/08 15:43:33 INFO ShutdownHookManager: Shutdown hook called
>>
>> 20/05/08 15:43:33 INFO ShutdownHookManager: Deleting directory
>

Re: java.lang.OutOfMemoryError Spark Worker

2020-05-08 Thread Jacek Laskowski
Hi,

It's been a while since I worked with Spark Standalone, but I'd check the
logs of the workers. How do you spark-submit the app?

DId you check /grid/1/spark/work/driver-20200508153502-1291 directory?

Pozdrawiam,
Jacek Laskowski

https://about.me/JacekLaskowski
"The Internals Of" Online Books 
Follow me on https://twitter.com/jaceklaskowski




On Fri, May 8, 2020 at 2:32 PM Hrishikesh Mishra 
wrote:

> Thanks, Jacek, for the quick response.
> Due to our system constraints, we can't move to Structured Streaming now.
> But YARN can definitely be tried out.
>
> But my problem is that I'm not able to figure out where the issue is: the
> driver, the executor, or the worker. Even the exceptions are clueless. Please
> see the exception below; I'm unable to spot the cause of the OOM.
>
> 20/05/08 15:36:55 INFO Worker: Asked to kill driver
> driver-20200508153502-1291
>
> 20/05/08 15:36:55 INFO DriverRunner: Killing driver process!
>
> 20/05/08 15:36:55 INFO CommandUtils: Redirection to
> /grid/1/spark/work/driver-20200508153502-1291/stderr closed: Stream closed
>
> 20/05/08 15:36:55 INFO CommandUtils: Redirection to
> /grid/1/spark/work/driver-20200508153502-1291/stdout closed: Stream closed
>
> 20/05/08 15:36:55 INFO ExternalShuffleBlockResolver: Application
> app-20200508153654-11776 removed, cleanupLocalDirs = true
>
> 20/05/08 15:36:55 INFO Worker: Driver driver-20200508153502-1291 was
> killed by user
>
> *20/05/08 15:43:06 WARN AbstractChannelHandlerContext: An exception
> 'java.lang.OutOfMemoryError: Java heap space' [enable DEBUG level for full
> stacktrace] was thrown by a user handler's exceptionCaught() method while
> handling the following exception:*
>
> *java.lang.OutOfMemoryError: Java heap space*
>
> *20/05/08 15:43:23 ERROR SparkUncaughtExceptionHandler: Uncaught exception
> in thread Thread[dispatcher-event-loop-6,5,main]*
>
> *java.lang.OutOfMemoryError: Java heap space*
>
> *20/05/08 15:43:17 WARN AbstractChannelHandlerContext: An exception
> 'java.lang.OutOfMemoryError: Java heap space' [enable DEBUG level for full
> stacktrace] was thrown by a user handler's exceptionCaught() method while
> handling the following exception:*
>
> *java.lang.OutOfMemoryError: Java heap space*
>
> 20/05/08 15:43:33 INFO ExecutorRunner: Killing process!
>
> 20/05/08 15:43:33 INFO ExecutorRunner: Killing process!
>
> 20/05/08 15:43:33 INFO ExecutorRunner: Killing process!
>
> 20/05/08 15:43:33 INFO ShutdownHookManager: Shutdown hook called
>
> 20/05/08 15:43:33 INFO ShutdownHookManager: Deleting directory
> /grid/1/spark/local/spark-e045e069-e126-4cff-9512-d36ad30ee922
>
>
>
>
> On Fri, May 8, 2020 at 5:14 PM Jacek Laskowski  wrote:
>
>> Hi,
>>
>> Sorry for being perhaps too harsh, but when you asked "Am I missing
>> something. " and I noticed this "Kafka Direct Stream" and "Spark Standalone
>> Cluster. " I immediately thought "Yeah...please upgrade your Spark env to
>> use Spark Structured Streaming at the very least and/or use YARN as the
>> cluster manager".
>>
>> Another thought was that the user code (your code) could be leaking
>> resources so Spark eventually reports heap-related errors that may not
>> necessarily be Spark's.
>>
>> Pozdrawiam,
>> Jacek Laskowski
>> 
>> https://about.me/JacekLaskowski
>> "The Internals Of" Online Books 
>> Follow me on https://twitter.com/jaceklaskowski
>>
>> 
>>
>>
>> On Thu, May 7, 2020 at 1:12 PM Hrishikesh Mishra 
>> wrote:
>>
>>> Hi
>>>
>>> I am getting an out-of-memory error in the worker log in streaming jobs
>>> every couple of hours, after which the worker dies. There is no shuffle, no
>>> aggregation, and no caching in the job; it's just a transformation.
>>> I'm not able to identify where the problem is, the driver or the executor.
>>> And why does the worker die after the OOM? The streaming job should be the
>>> one to die. Am I missing something?
>>>
>>> Driver Memory:  2g
>>> Executor memory: 4g
>>>
>>> Spark Version:  2.4
>>> Kafka Direct Stream
>>> Spark Standalone Cluster.
>>>
>>>
>>> 20/05/06 12:52:20 INFO SecurityManager: SecurityManager: authentication
>>> disabled; ui acls disabled; users  with view permissions: Set(root); groups
>>> with view permissions: Set(); users  with modify permissions: Set(root);
>>> groups with modify permissions: Set()
>>>
>>> 20/05/06 12:53:03 ERROR SparkUncaughtExceptionHandler: Uncaught
>>> exception in thread Thread[ExecutorRunner for
>>> app-20200506124717-10226/0,5,main]
>>>
>>> java.lang.OutOfMemoryError: Java heap space
>>>
>>> at org.apache.xerces.util.XMLStringBuffer.append(Unknown Source)
>>>
>>> at org.apache.xerces.impl.XMLEntityScanner.scanData(Unknown Source)
>>>
>>> at org.apache.xerces.impl.XMLScanner.scanComment(Unknown Source)
>>>
>>> at
>>> org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanComment(Unknown
>>> Source)
>>>
>>> at
>>> org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown
>>> Source)

Re: java.lang.OutOfMemoryError Spark Worker

2020-05-08 Thread Hrishikesh Mishra
Thanks, Jacek, for the quick response.
Due to our system constraints, we can't move to Structured Streaming now.
But YARN can definitely be tried out.

But my problem is that I'm not able to figure out where the issue is: the
driver, the executor, or the worker. Even the exceptions are clueless. Please
see the exception below; I'm unable to spot the cause of the OOM.

20/05/08 15:36:55 INFO Worker: Asked to kill driver
driver-20200508153502-1291

20/05/08 15:36:55 INFO DriverRunner: Killing driver process!

20/05/08 15:36:55 INFO CommandUtils: Redirection to
/grid/1/spark/work/driver-20200508153502-1291/stderr closed: Stream closed

20/05/08 15:36:55 INFO CommandUtils: Redirection to
/grid/1/spark/work/driver-20200508153502-1291/stdout closed: Stream closed

20/05/08 15:36:55 INFO ExternalShuffleBlockResolver: Application
app-20200508153654-11776 removed, cleanupLocalDirs = true

20/05/08 15:36:55 INFO Worker: Driver driver-20200508153502-1291 was killed
by user

*20/05/08 15:43:06 WARN AbstractChannelHandlerContext: An exception
'java.lang.OutOfMemoryError: Java heap space' [enable DEBUG level for full
stacktrace] was thrown by a user handler's exceptionCaught() method while
handling the following exception:*

*java.lang.OutOfMemoryError: Java heap space*

*20/05/08 15:43:23 ERROR SparkUncaughtExceptionHandler: Uncaught exception
in thread Thread[dispatcher-event-loop-6,5,main]*

*java.lang.OutOfMemoryError: Java heap space*

*20/05/08 15:43:17 WARN AbstractChannelHandlerContext: An exception
'java.lang.OutOfMemoryError: Java heap space' [enable DEBUG level for full
stacktrace] was thrown by a user handler's exceptionCaught() method while
handling the following exception:*

*java.lang.OutOfMemoryError: Java heap space*

20/05/08 15:43:33 INFO ExecutorRunner: Killing process!

20/05/08 15:43:33 INFO ExecutorRunner: Killing process!

20/05/08 15:43:33 INFO ExecutorRunner: Killing process!

20/05/08 15:43:33 INFO ShutdownHookManager: Shutdown hook called

20/05/08 15:43:33 INFO ShutdownHookManager: Deleting directory
/grid/1/spark/local/spark-e045e069-e126-4cff-9512-d36ad30ee922




On Fri, May 8, 2020 at 5:14 PM Jacek Laskowski  wrote:

> Hi,
>
> Sorry for being perhaps too harsh, but when you asked "Am I missing
> something. " and I noticed this "Kafka Direct Stream" and "Spark Standalone
> Cluster. " I immediately thought "Yeah...please upgrade your Spark env to
> use Spark Structured Streaming at the very least and/or use YARN as the
> cluster manager".
>
> Another thought was that the user code (your code) could be leaking
> resources so Spark eventually reports heap-related errors that may not
> necessarily be Spark's.
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://about.me/JacekLaskowski
> "The Internals Of" Online Books 
> Follow me on https://twitter.com/jaceklaskowski
>
> 
>
>
> On Thu, May 7, 2020 at 1:12 PM Hrishikesh Mishra 
> wrote:
>
>> Hi
>>
>> I am getting an out-of-memory error in the worker log in streaming jobs
>> every couple of hours, after which the worker dies. There is no shuffle, no
>> aggregation, and no caching in the job; it's just a transformation.
>> I'm not able to identify where the problem is, the driver or the executor.
>> And why does the worker die after the OOM? The streaming job should be the
>> one to die. Am I missing something?
>>
>> Driver Memory:  2g
>> Executor memory: 4g
>>
>> Spark Version:  2.4
>> Kafka Direct Stream
>> Spark Standalone Cluster.
>>
>>
>> 20/05/06 12:52:20 INFO SecurityManager: SecurityManager: authentication
>> disabled; ui acls disabled; users  with view permissions: Set(root); groups
>> with view permissions: Set(); users  with modify permissions: Set(root);
>> groups with modify permissions: Set()
>>
>> 20/05/06 12:53:03 ERROR SparkUncaughtExceptionHandler: Uncaught exception
>> in thread Thread[ExecutorRunner for app-20200506124717-10226/0,5,main]
>>
>> java.lang.OutOfMemoryError: Java heap space
>>
>> at org.apache.xerces.util.XMLStringBuffer.append(Unknown Source)
>>
>> at org.apache.xerces.impl.XMLEntityScanner.scanData(Unknown Source)
>>
>> at org.apache.xerces.impl.XMLScanner.scanComment(Unknown Source)
>>
>> at
>> org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanComment(Unknown
>> Source)
>>
>> at
>> org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown
>> Source)
>>
>> at
>> org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown
>> Source)
>>
>> at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
>>
>> at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
>>
>> at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
>>
>> at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
>>
>> at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
>>
>> at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:150)
>>
>> at org.apache.hadoop.conf.Configuration.parse(Configuration.java:2480)
>>
>> at org.apache.hadoop.conf.

Re: Spark structured streaming - performance tuning

2020-05-08 Thread Srinivas V
Can anyone else answer the questions below on performance tuning for Structured
Streaming?
@Jacek?

On Sun, May 3, 2020 at 12:07 AM Srinivas V  wrote:

> Hi Alex, I read the book; it is a good one, but I don't see the things I
> strongly want to understand.
> You are right on the partitions and tasks.
> 1. How to use coalesce with Spark Structured Streaming?
>
> Also, I want to ask a few more questions:
> 2. How to restrict the number of executors in Structured Streaming?
> Is --num-executors the minimum?
> To cap the max, can I use spark.dynamicAllocation.maxExecutors?
>
> 3. Do other streaming properties hold good for Structured Streaming,
> like spark.streaming.dynamicAllocation.enabled?
> If not, which ones does it take into consideration?
>
> 4. Does Structured Streaming 2.4.5 allow dynamic allocation of executors/
> cores? In the case of a Kafka consumer, when the cluster has to scale down,
> does it reconfigure the mapping of executor cores to Kafka partitions?
>
> 5. Why is the Spark Structured Streaming web UI (SQL tab) not as informative
> as the Streaming tab of Spark Streaming?
>
> It would be great if these questions were answered; otherwise, the only
> option left would be to go through the Spark code and figure it out.
>
> On Sat, Apr 18, 2020 at 1:09 PM Alex Ott  wrote:
>
>> Just to clarify - I didn't write this explicitly in my answer. When you're
>> working with Kafka, every partition in Kafka is mapped into Spark
>> partition. And in Spark, every partition is mapped into task.   But you
>> can
>> use `coalesce` to decrease the number of Spark partitions, so you'll have
>> less tasks...
>>
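For illustration, a minimal PySpark sketch of the Kafka-partition-to-task point
and of coalesce in a streaming query (the broker address, topic name, and
checkpoint path are placeholders, not values from this thread):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("kafka-coalesce-sketch").getOrCreate()

    # Each Kafka partition becomes one Spark partition, and hence one task, per micro-batch.
    kafka_df = (spark.readStream
                .format("kafka")
                .option("kafka.bootstrap.servers", "broker:9092")  # placeholder
                .option("subscribe", "events")                     # placeholder
                .load())

    # coalesce() reduces the number of partitions (and tasks) downstream,
    # trading parallelism for fewer, larger tasks per micro-batch.
    query = (kafka_df.coalesce(4)
             .selectExpr("CAST(value AS STRING) AS value")
             .writeStream
             .format("console")
             .option("checkpointLocation", "/tmp/checkpoint-sketch")  # placeholder
             .start())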
>> Srinivas V  at "Sat, 18 Apr 2020 10:32:33 +0530" wrote:
>>  SV> Thank you Alex. I will check it out and let you know if I have any
>> questions
>>
>>  SV> On Fri, Apr 17, 2020 at 11:36 PM Alex Ott  wrote:
>>
>>  SV> http://shop.oreilly.com/product/0636920047568.do has quite good
>> information
>>  SV> on it.  For Kafka, you need to start with approximation that
>> processing of
>>  SV> each partition is a separate task that need to be executed, so
>> you need to
>>  SV> plan number of cores correspondingly.
>>  SV>
>>  SV> Srinivas V  at "Thu, 16 Apr 2020 22:49:15 +0530" wrote:
>>  SV>  SV> Hello,
>>  SV>  SV> Can someone point me to a good video or document which
>> talks about performance tuning for a Structured Streaming app?
>>  SV>  SV> I am looking especially at listening to Kafka topics, say 5
>> topics each with 100 partitions.
>>  SV>  SV> Trying to figure out the best cluster size and number of
>> executors and cores required.
>>
>>
>> --
>> With best wishes,Alex Ott
>> http://alexott.net/
>> Twitter: alexott_en (English), alexott (Russian)
>>
>


Re: java.lang.OutOfMemoryError Spark Worker

2020-05-08 Thread Jacek Laskowski
Hi,

Sorry for being perhaps too harsh, but when you asked "Am I missing
something. " and I noticed this "Kafka Direct Stream" and "Spark Standalone
Cluster. " I immediately thought "Yeah...please upgrade your Spark env to
use Spark Structured Streaming at the very least and/or use YARN as the
cluster manager".

Another thought was that the user code (your code) could be leaking
resources so Spark eventually reports heap-related errors that may not
necessarily be Spark's.

Pozdrawiam,
Jacek Laskowski

https://about.me/JacekLaskowski
"The Internals Of" Online Books 
Follow me on https://twitter.com/jaceklaskowski




On Thu, May 7, 2020 at 1:12 PM Hrishikesh Mishra 
wrote:

> Hi
>
> I am getting an out-of-memory error in the worker log in streaming jobs every
> couple of hours, after which the worker dies. There is no shuffle, no
> aggregation, and no caching in the job; it's just a transformation.
> I'm not able to identify where the problem is, the driver or the executor. And
> why does the worker die after the OOM? The streaming job should be the one to
> die. Am I missing something?
>
> Driver Memory:  2g
> Executor memory: 4g
>
> Spark Version:  2.4
> Kafka Direct Stream
> Spark Standalone Cluster.
>
>
> 20/05/06 12:52:20 INFO SecurityManager: SecurityManager: authentication
> disabled; ui acls disabled; users  with view permissions: Set(root); groups
> with view permissions: Set(); users  with modify permissions: Set(root);
> groups with modify permissions: Set()
>
> 20/05/06 12:53:03 ERROR SparkUncaughtExceptionHandler: Uncaught exception
> in thread Thread[ExecutorRunner for app-20200506124717-10226/0,5,main]
>
> java.lang.OutOfMemoryError: Java heap space
>
> at org.apache.xerces.util.XMLStringBuffer.append(Unknown Source)
>
> at org.apache.xerces.impl.XMLEntityScanner.scanData(Unknown Source)
>
> at org.apache.xerces.impl.XMLScanner.scanComment(Unknown Source)
>
> at
> org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanComment(Unknown
> Source)
>
> at
> org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown
> Source)
>
> at
> org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown
> Source)
>
> at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
>
> at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
>
> at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
>
> at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
>
> at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
>
> at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:150)
>
> at org.apache.hadoop.conf.Configuration.parse(Configuration.java:2480)
>
> at org.apache.hadoop.conf.Configuration.parse(Configuration.java:2468)
>
> at
> org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:2539)
>
> at
> org.apache.hadoop.conf.Configuration.loadResources(Configuration.java:2492)
>
> at org.apache.hadoop.conf.Configuration.getProps(Configuration.java:2405)
>
> at org.apache.hadoop.conf.Configuration.set(Configuration.java:1143)
>
> at org.apache.hadoop.conf.Configuration.set(Configuration.java:1115)
>
> at
> org.apache.spark.deploy.SparkHadoopUtil$.org$apache$spark$deploy$SparkHadoopUtil$$appendS3AndSparkHadoopConfigurations(SparkHadoopUtil.scala:464)
>
> at
> org.apache.spark.deploy.SparkHadoopUtil$.newConfiguration(SparkHadoopUtil.scala:436)
>
> at
> org.apache.spark.deploy.SparkHadoopUtil.newConfiguration(SparkHadoopUtil.scala:114)
>
> at org.apache.spark.SecurityManager.<init>(SecurityManager.scala:114)
>
> at org.apache.spark.deploy.worker.ExecutorRunner.org
> $apache$spark$deploy$worker$ExecutorRunner$$fetchAndRunExecutor(ExecutorRunner.scala:149)
>
> at
> org.apache.spark.deploy.worker.ExecutorRunner$$anon$1.run(ExecutorRunner.scala:73)
>
> 20/05/06 12:53:38 INFO DriverRunner: Worker shutting down, killing driver
> driver-20200505181719-1187
>
> 20/05/06 12:53:38 INFO DriverRunner: Killing driver process!
>
>
>
>
> Regards
> Hrishi
>


Re: Spark Window Documentation

2020-05-08 Thread Jacek Laskowski
Hi Neeraj,

I'm not a committer so I might be wrong, but there is no "blessed way" to
include examples.

There are some examples in the official documentation at
http://spark.apache.org/docs/latest/sql-programming-guide.html, but these show
how to use the general concepts, not specific operators.

There are some examples at http://spark.apache.org/examples.html

I think the best way would be to include examples as close to the methods
as possible and scaladoc/javadoc would be best IMHO.
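For instance, a rough sketch of the kind of small, runnable example that could
sit next to Window.rowsBetween (the data and column names here are made up):

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.window import Window

    spark = SparkSession.builder.appName("rows-between-sketch").getOrCreate()

    df = spark.createDataFrame(
        [("a", 1), ("a", 2), ("a", 3), ("b", 10), ("b", 20)], ["key", "value"])

    # Frame of the previous row and the current row, per key, ordered by value.
    w = Window.partitionBy("key").orderBy("value").rowsBetween(-1, Window.currentRow)

    df.withColumn("sum_last_2", F.sum("value").over(w)).show()
    # For key "a" and value 3 the frame holds the rows with values 2 and 3,
    # so sum_last_2 is 5.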

p.s. Just yesterday there was this thread "What open source projects have
the best docs?" on twitter @
https://twitter.com/adamwathan/status/1257641015835611138. You could borrow
some ideas from the docs that are claimed to be "the best".

Pozdrawiam,
Jacek Laskowski

https://about.me/JacekLaskowski
"The Internals Of" Online Books 
Follow me on https://twitter.com/jaceklaskowski




On Fri, May 8, 2020 at 11:34 AM neeraj bhadani 
wrote:

> Hi Team,
> I was looking for a Spark window function example in the documentation.
>
> For example, I can see that the function definition and params are explained
> nicely here:
> https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Window.rowsBetween
>
> and this is the source, which is available since Spark version 2.1:
> https://spark.apache.org/docs/latest/api/python/_modules/pyspark/sql/window.html#Window.rowsBetween
>
> But I couldn't find an example which helps to understand how it works.
>
> However, while browsing the GitHub code I found an example here:
> https://github.com/apache/spark/blob/master/python/pyspark/sql/window.py#L83
>
> which I couldn't find on the official Spark doc page. Where and how is this
> example linked to the official Spark documentation?
>
> If such examples are not available, could you please share the process for
> contributing examples to the Spark documentation?
>
> Regards,
> Neeraj
>


Re: java.lang.OutOfMemoryError Spark Worker

2020-05-08 Thread Hrishikesh Mishra
These errors are completely clueless. There is no clue about why this OOM
exception is coming.


20/05/08 15:36:55 INFO Worker: Asked to kill driver
driver-20200508153502-1291

20/05/08 15:36:55 INFO DriverRunner: Killing driver process!

20/05/08 15:36:55 INFO CommandUtils: Redirection to
/grid/1/spark/work/driver-20200508153502-1291/stderr closed: Stream closed

20/05/08 15:36:55 INFO CommandUtils: Redirection to
/grid/1/spark/work/driver-20200508153502-1291/stdout closed: Stream closed

20/05/08 15:36:55 INFO ExternalShuffleBlockResolver: Application
app-20200508153654-11776 removed, cleanupLocalDirs = true

20/05/08 15:36:55 INFO Worker: Driver driver-20200508153502-1291 was killed
by user

*20/05/08 15:43:06 WARN AbstractChannelHandlerContext: An exception
'java.lang.OutOfMemoryError: Java heap space' [enable DEBUG level for full
stacktrace] was thrown by a user handler's exceptionCaught() method while
handling the following exception:*

*java.lang.OutOfMemoryError: Java heap space*

*20/05/08 15:43:23 ERROR SparkUncaughtExceptionHandler: Uncaught exception
in thread Thread[dispatcher-event-loop-6,5,main]*

*java.lang.OutOfMemoryError: Java heap space*

*20/05/08 15:43:17 WARN AbstractChannelHandlerContext: An exception
'java.lang.OutOfMemoryError: Java heap space' [enable DEBUG level for full
stacktrace] was thrown by a user handler's exceptionCaught() method while
handling the following exception:*

*java.lang.OutOfMemoryError: Java heap space*

20/05/08 15:43:33 INFO ExecutorRunner: Killing process!

20/05/08 15:43:33 INFO ExecutorRunner: Killing process!

20/05/08 15:43:33 INFO ExecutorRunner: Killing process!

20/05/08 15:43:33 INFO ShutdownHookManager: Shutdown hook called

20/05/08 15:43:33 INFO ShutdownHookManager: Deleting directory
/grid/1/spark/local/spark-e045e069-e126-4cff-9512-d36ad30ee922


On Thu, May 7, 2020 at 10:16 PM Hrishikesh Mishra 
wrote:

> It's only happening for the Hadoop config. The exception traces are different
> each time it dies. And jobs run for a couple of hours, then the worker dies.
>
> Another Reason:
>
> *20/05/02 02:26:34 ERROR SparkUncaughtExceptionHandler: Uncaught exception
> in thread Thread[ExecutorRunner for app-20200501213234-9846/3,5,main]*
>
> *java.lang.OutOfMemoryError: Java heap space*
>
> * at org.apache.xerces.xni.XMLString.toString(Unknown Source)*
>
> at org.apache.xerces.parsers.AbstractDOMParser.characters(Unknown Source)
>
> at org.apache.xerces.xinclude.XIncludeHandler.characters(Unknown Source)
>
> at
> org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanContent(Unknown
> Source)
>
> at
> org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown
> Source)
>
> at
> org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown
> Source)
>
> at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
>
> at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
>
> at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
>
> at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
>
> at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
>
> at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:150)
>
> at org.apache.hadoop.conf.Configuration.parse(Configuration.java:2480)
>
> at org.apache.hadoop.conf.Configuration.parse(Configuration.java:2468)
>
> at
> org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:2539)
>
> at
> org.apache.hadoop.conf.Configuration.loadResources(Configuration.java:2492)
>
> at org.apache.hadoop.conf.Configuration.getProps(Configuration.java:2405)
>
> at org.apache.hadoop.conf.Configuration.set(Configuration.java:1143)
>
> at org.apache.hadoop.conf.Configuration.set(Configuration.java:1115)
>
> at
> org.apache.spark.deploy.SparkHadoopUtil$.org$apache$spark$deploy$SparkHadoopUtil$$appendS3AndSparkHadoopConfigurations(SparkHadoopUtil.scala:464)
>
> at
> org.apache.spark.deploy.SparkHadoopUtil$.newConfiguration(SparkHadoopUtil.scala:436)
>
> at
> org.apache.spark.deploy.SparkHadoopUtil.newConfiguration(SparkHadoopUtil.scala:114)
>
> at org.apache.spark.SecurityManager.<init>(SecurityManager.scala:114)
>
> at org.apache.spark.deploy.worker.ExecutorRunner.org
> $apache$spark$deploy$worker$ExecutorRunner$$fetchAndRunExecutor(ExecutorRunner.scala:149)
>
> at
> org.apache.spark.deploy.worker.ExecutorRunner$$anon$1.run(ExecutorRunner.scala:73)
>
> *20/05/02 02:26:37 ERROR SparkUncaughtExceptionHandler: Uncaught exception
> in thread Thread[dispatcher-event-loop-3,5,main]*
>
> *java.lang.OutOfMemoryError: Java heap space*
>
> * at java.lang.Class.newInstance(Class.java:411)*
>
> at
> sun.reflect.MethodAccessorGenerator$1.run(MethodAccessorGenerator.java:403)
>
> at
> sun.reflect.MethodAccessorGenerator$1.run(MethodAccessorGenerator.java:394)
>
> at java.security.AccessController.doPrivileged(Native Method)
>
> at
> sun.reflect.MethodAccessorGenerator.generate(MethodAccessorGenerator.java:393)
>
> at
> sun.reflect.MethodAccessorGe

Spark Window Documentation

2020-05-08 Thread neeraj bhadani
Hi Team,
I was looking for a Spark window function example in the documentation.

For example, I can see that the function definition and params are explained
nicely here:
https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Window.rowsBetween

and this is the source, which is available since Spark version 2.1:
https://spark.apache.org/docs/latest/api/python/_modules/pyspark/sql/window.html#Window.rowsBetween

But I couldn't find an example which helps to understand how it works.

However, while browsing the GitHub code I found an example here:
https://github.com/apache/spark/blob/master/python/pyspark/sql/window.py#L83

which I couldn't find on the official Spark doc page. Where and how is this
example linked to the official Spark documentation?

If such examples are not available, could you please share the process for
contributing examples to the Spark documentation?

Regards,
Neeraj


Re: How to populate all possible combination values in columns using Spark SQL

2020-05-08 Thread Edgardo Szrajber
Have you checked the pivot function?
Bentzi

Sent from Yahoo Mail on Android 
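For illustration, a minimal sketch of what a pivot-based approach could look like
(the column names and values below are hypothetical, since the original data is
not shown in this thread):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("pivot-sketch").getOrCreate()

    # Hypothetical data: one row per (year, product) with a measure to spread across columns.
    df = spark.createDataFrame(
        [(2019, "A", 10), (2019, "B", 20), (2020, "A", 30)],
        ["year", "product", "sales"])

    # pivot() turns each distinct 'product' value into its own column; missing
    # (year, product) combinations come out as nulls, which can then be filled.
    (df.groupBy("year")
       .pivot("product")
       .agg(F.sum("sales"))
       .na.fill(0)
       .show())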
 
On Thu, May 7, 2020 at 22:46, Aakash Basu wrote:

Hi,
I've updated the SO question with masked data, added the year column, and other
requirements. Please take a look.
Hope this helps in solving the problem.
Thanks and regards,
AB
On Thu 7 May, 2020, 10:59 AM Sonal Goyal,  wrote:

As mentioned in the comments on SO, can you provide a (masked) sample of the
data? It will be easier to see what you are trying to do if you add the year
column.
Thanks,
Sonal
Nube Technologies 





On Thu, May 7, 2020 at 10:26 AM Aakash Basu  wrote:

Hi,
I've described the problem on Stack Overflow in a lot of detail; can you
kindly check and help if possible?
https://stackoverflow.com/q/61643910/5536733

I'd be absolutely fine if someone solves it using the Spark SQL APIs rather than
a plain Spark SQL query.
Thanks,
Aakash.

  


Re: No. of active states?

2020-05-08 Thread Edgardo Szrajber
This should open a new world of real-time metrics for you:

"How to get Spark Metrics as JSON using Spark REST API in YARN Cluster mode"
by Anbu Cheeralan: "Spark provides the metrics in UI. You can access the UI
using either port 4040 (Standalone) or using a proxy thr..."

Bentzi

On Friday, May 8, 2020, 05:30:56 AM GMT+3, Something Something wrote:

No. We are already capturing these metrics (e.g. numInputRows,
inputRowsPerSecond).
I am talking about the "No. of States" in memory at any given time.
On Thu, May 7, 2020 at 4:31 PM Jungtaek Lim  
wrote:

If you're referring total "entries" in all states in SS job, it's being 
provided via StreamingQueryListener.
http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#monitoring-streaming-queries

Hope this helps.
On Fri, May 8, 2020 at 3:26 AM Something Something  
wrote:

Is there a way to get the total no. of active states in memory at any given 
point in a Stateful Spark Structured Streaming job? We are thinking of using 
this metric for 'Auto Scaling' our Spark cluster.
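Following up on Jungtaek's pointer, a rough PySpark sketch of pulling state row
counts out of a running query's progress (this assumes "query" is an already
started stateful StreamingQuery; field availability can vary by Spark version):

    # query.lastProgress is the most recent progress report as a dict (or None).
    progress = query.lastProgress
    if progress is not None:
        for op in progress.get("stateOperators", []):
            # numRowsTotal is the number of state rows the operator is keeping.
            print(op.get("numRowsTotal"), "state rows,",
                  op.get("memoryUsedBytes"), "bytes of state in memory")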