Re: java.lang.OutOfMemoryError Spark Worker

2020-05-12 Thread Hrishikesh Mishra
Configuration:

Driver memory we tried: 2GB / 4GB / 5GB
Executor memory we tried: 4GB / 5GB
Even reduced *spark.memory.fraction* to 0.2 (we are not using cache)
VM: 32 GB memory and 8 cores
SPARK_WORKER_MEMORY we tried: 30GB / 24GB
SPARK_WORKER_CORES: 32 (because jobs are not CPU bound)
SPARK_WORKER_INSTANCES: 1


What we feel is that there is not enough space for user classes / objects, or
that cleanup for these is not happening frequently enough.
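
For reference, a minimal sketch of how settings like these are typically wired
up on a standalone cluster (the spark-env.sh entries and the values are
illustrative, taken from the ranges we tried):

# conf/spark-env.sh on each worker host
export SPARK_WORKER_MEMORY=24g
export SPARK_WORKER_CORES=32
export SPARK_WORKER_INSTANCES=1

# per-application memory settings passed at submit time
spark-submit \
  --driver-memory 4g \
  --executor-memory 4g \
  --conf spark.memory.fraction=0.2 \
  --class com.hrishikesh.mishra.Main \
  --master spark://XX.XX.XXX.19:6066 \
  --deploy-mode cluster \
  http://XX.XX.XXX.19:90/jar/fk-runner-framework-1.0-SNAPSHOT.jar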





On Sat, May 9, 2020 at 12:30 AM Amit Sharma  wrote:

> How much memory are you assigning per executor? What is the driver memory
> configuration?
>
>
> Thanks
> Amit
>
> On Fri, May 8, 2020 at 12:59 PM Hrishikesh Mishra 
> wrote:
>
>> We submit the Spark job through the spark-submit command, like the one below.
>>
>>
>> sudo /var/lib/pf-spark/bin/spark-submit \
>> --total-executor-cores 30 \
>> --driver-cores 2 \
>> --class com.hrishikesh.mishra.Main \
>> --master spark://XX.XX.XXX.19:6066  \
>> --deploy-mode cluster  \
>> --supervise
>> http://XX.XX.XXX.19:90/jar/fk-runner-framework-1.0-SNAPSHOT.jar
>>
>>
>>
>>
>> We have a Python HTTP server where all the jars are hosted.
>>
>> The user killed the driver driver-20200508153502-1291, and it's visible in
>> the log as well, but this is not the problem. The OOM is separate from this.
>>
>> 20/05/08 15:36:55 INFO Worker: Asked to kill driver
>> driver-20200508153502-1291
>>
>> 20/05/08 15:36:55 INFO DriverRunner: Killing driver process!
>>
>> 20/05/08 15:36:55 INFO CommandUtils: Redirection to
>> /grid/1/spark/work/driver-20200508153502-1291/stderr closed: Stream closed
>>
>> 20/05/08 15:36:55 INFO CommandUtils: Redirection to
>> /grid/1/spark/work/driver-20200508153502-1291/stdout closed: Stream closed
>>
>> 20/05/08 15:36:55 INFO ExternalShuffleBlockResolver: Application
>> app-20200508153654-11776 removed, cleanupLocalDirs = true
>>
>> 20/05/08 15:36:55 INFO Worker: Driver driver-20200508153502-1291 was
>> killed by user
>>
>> *20/05/08 15:43:06 WARN AbstractChannelHandlerContext: An exception
>> 'java.lang.OutOfMemoryError: Java heap space' [enable DEBUG level for full
>> stacktrace] was thrown by a user handler's exceptionCaught() method while
>> handling the following exception:*
>>
>> *java.lang.OutOfMemoryError: Java heap space*
>>
>> *20/05/08 15:43:23 ERROR SparkUncaughtExceptionHandler: Uncaught
>> exception in thread Thread[dispatcher-event-loop-6,5,main]*
>>
>> *java.lang.OutOfMemoryError: Java heap space*
>>
>> *20/05/08 15:43:17 WARN AbstractChannelHandlerContext: An exception
>> 'java.lang.OutOfMemoryError: Java heap space' [enable DEBUG level for full
>> stacktrace] was thrown by a user handler's exceptionCaught() method while
>> handling the following exception:*
>>
>> *java.lang.OutOfMemoryError: Java heap space*
>>
>> 20/05/08 15:43:33 INFO ExecutorRunner: Killing process!
>>
>> 20/05/08 15:43:33 INFO ExecutorRunner: Killing process!
>>
>> 20/05/08 15:43:33 INFO ExecutorRunner: Killing process!
>>
>> 20/05/08 15:43:33 INFO ShutdownHookManager: Shutdown hook called
>>
>> 20/05/08 15:43:33 INFO ShutdownHookManager: Deleting directory
>> /grid/1/spark/local/spark-e045e069-e126-4cff-9512-d36ad30ee922
>>
>>
>> On Fri, May 8, 2020 at 9:27 PM Jacek Laskowski  wrote:
>>
>>> Hi,
>>>
>>> It's been a while since I worked with Spark Standalone, but I'd check
>>> the logs of the workers. How do you spark-submit the app?
>>>
>>> Did you check the /grid/1/spark/work/driver-20200508153502-1291 directory?
>>>
>>> Pozdrawiam,
>>> Jacek Laskowski
>>> 
>>> https://about.me/JacekLaskowski
>>> "The Internals Of" Online Books <https://books.japila.pl/>
>>> Follow me on https://twitter.com/jaceklaskowski
>>>
>>> <https://twitter.com/jaceklaskowski>
>>>
>>>
>>> On Fri, May 8, 2020 at 2:32 PM Hrishikesh Mishra 
>>> wrote:
>>>
>>>> Thanks Jacek for the quick response.
>>>> Due to our system constraints, we can't move to Structured Streaming
>>>> now. But YARN can definitely be tried out.
>>>>
>>>> But my problem is that I'm not able to figure out where the issue is: the
>>>> Driver, an Executor, or the Worker. Even the exceptions are clueless.
>>>> Please see the exception below; I'm unable to spot the cause of the OOM.
>>>>
>>>> 20/05/08 15:36:55 INFO Worker: Asked to kill driver
>>>> driver-20200508153502-1291
>>>>
>>

Re: java.lang.OutOfMemoryError Spark Worker

2020-05-08 Thread Hrishikesh Mishra
We submit the Spark job through the spark-submit command, like the one below.


sudo /var/lib/pf-spark/bin/spark-submit \
--total-executor-cores 30 \
--driver-cores 2 \
--class com.hrishikesh.mishra.Main \
--master spark://XX.XX.XXX.19:6066  \
--deploy-mode cluster  \
--supervise http://XX.XX.XXX.19:90/jar/fk-runner-framework-1.0-SNAPSHOT.jar




We have a Python HTTP server where all the jars are hosted.
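
For illustration, a jar directory like this can be served with nothing more
than Python's built-in HTTP server; a minimal sketch, assuming the jars sit in
a local directory (the path is hypothetical, the port matches the URL above):

# Python 3.7+ for --directory; ports below 1024 normally require root
python3 -m http.server 90 --directory /path/to/jars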

The user killed the driver driver-20200508153502-1291, and it's visible in the
log as well, but this is not the problem. The OOM is separate from this.

20/05/08 15:36:55 INFO Worker: Asked to kill driver
driver-20200508153502-1291

20/05/08 15:36:55 INFO DriverRunner: Killing driver process!

20/05/08 15:36:55 INFO CommandUtils: Redirection to
/grid/1/spark/work/driver-20200508153502-1291/stderr closed: Stream closed

20/05/08 15:36:55 INFO CommandUtils: Redirection to
/grid/1/spark/work/driver-20200508153502-1291/stdout closed: Stream closed

20/05/08 15:36:55 INFO ExternalShuffleBlockResolver: Application
app-20200508153654-11776 removed, cleanupLocalDirs = true

20/05/08 15:36:55 INFO Worker: Driver driver-20200508153502-1291 was
killed by user

*20/05/08 15:43:06 WARN AbstractChannelHandlerContext: An exception
'java.lang.OutOfMemoryError: Java heap space' [enable DEBUG level for full
stacktrace] was thrown by a user handler's exceptionCaught() method while
handling the following exception:*

*java.lang.OutOfMemoryError: Java heap space*

*20/05/08 15:43:23 ERROR SparkUncaughtExceptionHandler: Uncaught exception
in thread Thread[dispatcher-event-loop-6,5,main]*

*java.lang.OutOfMemoryError: Java heap space*

*20/05/08 15:43:17 WARN AbstractChannelHandlerContext: An exception
'java.lang.OutOfMemoryError: Java heap space' [enable DEBUG level for full
stacktrace] was thrown by a user handler's exceptionCaught() method while
handling the following exception:*

*java.lang.OutOfMemoryError: Java heap space*

20/05/08 15:43:33 INFO ExecutorRunner: Killing process!

20/05/08 15:43:33 INFO ExecutorRunner: Killing process!

20/05/08 15:43:33 INFO ExecutorRunner: Killing process!

20/05/08 15:43:33 INFO ShutdownHookManager: Shutdown hook called

20/05/08 15:43:33 INFO ShutdownHookManager: Deleting directory
/grid/1/spark/local/spark-e045e069-e126-4cff-9512-d36ad30ee922


On Fri, May 8, 2020 at 9:27 PM Jacek Laskowski  wrote:

> Hi,
>
> It's been a while since I worked with Spark Standalone, but I'd check the
> logs of the workers. How do you spark-submit the app?
>
> Did you check the /grid/1/spark/work/driver-20200508153502-1291 directory?
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://about.me/JacekLaskowski
> "The Internals Of" Online Books <https://books.japila.pl/>
> Follow me on https://twitter.com/jaceklaskowski
>
> <https://twitter.com/jaceklaskowski>
>
>
> On Fri, May 8, 2020 at 2:32 PM Hrishikesh Mishra 
> wrote:
>
>> Thanks Jacek for the quick response.
>> Due to our system constraints, we can't move to Structured Streaming now.
>> But YARN can definitely be tried out.
>>
>> But my problem is that I'm not able to figure out where the issue is: the
>> Driver, an Executor, or the Worker. Even the exceptions are clueless. Please
>> see the exception below; I'm unable to spot the cause of the OOM.
>>
>> 20/05/08 15:36:55 INFO Worker: Asked to kill driver
>> driver-20200508153502-1291
>>
>> 20/05/08 15:36:55 INFO DriverRunner: Killing driver process!
>>
>> 20/05/08 15:36:55 INFO CommandUtils: Redirection to
>> /grid/1/spark/work/driver-20200508153502-1291/stderr closed: Stream closed
>>
>> 20/05/08 15:36:55 INFO CommandUtils: Redirection to
>> /grid/1/spark/work/driver-20200508153502-1291/stdout closed: Stream closed
>>
>> 20/05/08 15:36:55 INFO ExternalShuffleBlockResolver: Application
>> app-20200508153654-11776 removed, cleanupLocalDirs = true
>>
>> 20/05/08 15:36:55 INFO Worker: Driver driver-20200508153502-1291 was
>> killed by user
>>
>> *20/05/08 15:43:06 WARN AbstractChannelHandlerContext: An exception
>> 'java.lang.OutOfMemoryError: Java heap space' [enable DEBUG level for full
>> stacktrace] was thrown by a user handler's exceptionCaught() method while
>> handling the following exception:*
>>
>> *java.lang.OutOfMemoryError: Java heap space*
>>
>> *20/05/08 15:43:23 ERROR SparkUncaughtExceptionHandler: Uncaught
>> exception in thread Thread[dispatcher-event-loop-6,5,main]*
>>
>> *java.lang.OutOfMemoryError: Java heap space*
>>
>> *20/05/08 15:43:17 WARN AbstractChannelHandlerContext: An exception
>> 'java.lang.OutOfMemoryError: Java heap space' [enable DEBUG level for full
>> stacktrace] was thrown by a user handler's exceptionCaught() method while
>> handling the following exception:*
>>
>> *java.lang.OutOfMemoryError: Ja

Re: java.lang.OutOfMemoryError Spark Worker

2020-05-08 Thread Hrishikesh Mishra
Thanks Jacek for the quick response.
Due to our system constraints, we can't move to Structured Streaming now.
But YARN can definitely be tried out.

But my problem is that I'm not able to figure out where the issue is: the
Driver, an Executor, or the Worker. Even the exceptions are clueless. Please
see the exception below; I'm unable to spot the cause of the OOM.

20/05/08 15:36:55 INFO Worker: Asked to kill driver
driver-20200508153502-1291

20/05/08 15:36:55 INFO DriverRunner: Killing driver process!

20/05/08 15:36:55 INFO CommandUtils: Redirection to
/grid/1/spark/work/driver-20200508153502-1291/stderr closed: Stream closed

20/05/08 15:36:55 INFO CommandUtils: Redirection to
/grid/1/spark/work/driver-20200508153502-1291/stdout closed: Stream closed

20/05/08 15:36:55 INFO ExternalShuffleBlockResolver: Application
app-20200508153654-11776 removed, cleanupLocalDirs = true

20/05/08 15:36:55 INFO Worker: Driver driver-20200508153502-1291 was killed
by user

*20/05/08 15:43:06 WARN AbstractChannelHandlerContext: An exception
'java.lang.OutOfMemoryError: Java heap space' [enable DEBUG level for full
stacktrace] was thrown by a user handler's exceptionCaught() method while
handling the following exception:*

*java.lang.OutOfMemoryError: Java heap space*

*20/05/08 15:43:23 ERROR SparkUncaughtExceptionHandler: Uncaught exception
in thread Thread[dispatcher-event-loop-6,5,main]*

*java.lang.OutOfMemoryError: Java heap space*

*20/05/08 15:43:17 WARN AbstractChannelHandlerContext: An exception
'java.lang.OutOfMemoryError: Java heap space' [enable DEBUG level for full
stacktrace] was thrown by a user handler's exceptionCaught() method while
handling the following exception:*

*java.lang.OutOfMemoryError: Java heap space*

20/05/08 15:43:33 INFO ExecutorRunner: Killing process!

20/05/08 15:43:33 INFO ExecutorRunner: Killing process!

20/05/08 15:43:33 INFO ExecutorRunner: Killing process!

20/05/08 15:43:33 INFO ShutdownHookManager: Shutdown hook called

20/05/08 15:43:33 INFO ShutdownHookManager: Deleting directory
/grid/1/spark/local/spark-e045e069-e126-4cff-9512-d36ad30ee922
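
One way to narrow down which JVM is actually exhausting its heap (driver,
executor, or the worker daemon itself) is to make every JVM write a heap dump
when it hits an OutOfMemoryError; a hedged sketch, assuming spark-env.sh is
editable on the workers and /tmp has room for the dumps:

# conf/spark-env.sh: JVM options for the standalone Master/Worker daemons
export SPARK_DAEMON_JAVA_OPTS="-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/worker-oom.hprof"

# per-application JVMs, added to the spark-submit command
--conf spark.driver.extraJavaOptions="-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/driver-oom.hprof" \
--conf spark.executor.extraJavaOptions="-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/executor-oom.hprof" \

Whichever .hprof file shows up identifies the process that actually ran out of
heap.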




On Fri, May 8, 2020 at 5:14 PM Jacek Laskowski  wrote:

> Hi,
>
> Sorry for being perhaps too harsh, but when you asked "Am I missing
> something. " and I noticed this "Kafka Direct Stream" and "Spark Standalone
> Cluster. " I immediately thought "Yeah...please upgrade your Spark env to
> use Spark Structured Streaming at the very least and/or use YARN as the
> cluster manager".
>
> Another thought was that the user code (your code) could be leaking
> resources so Spark eventually reports heap-related errors that may not
> necessarily be Spark's.
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://about.me/JacekLaskowski
> "The Internals Of" Online Books <https://books.japila.pl/>
> Follow me on https://twitter.com/jaceklaskowski
>
> <https://twitter.com/jaceklaskowski>
>
>
> On Thu, May 7, 2020 at 1:12 PM Hrishikesh Mishra 
> wrote:
>
>> Hi
>>
>> I am getting an out of memory error in the worker log of streaming jobs
>> every couple of hours, after which the worker dies. There is no shuffle, no
>> aggregation, no caching in the job; it's just a transformation.
>> I'm not able to identify where the problem is, the driver or an executor.
>> And why does the worker die after the OOM? The streaming job should die
>> instead. Am I missing something?
>>
>> Driver Memory:  2g
>> Executor memory: 4g
>>
>> Spark Version:  2.4
>> Kafka Direct Stream
>> Spark Standalone Cluster.
>>
>>
>> 20/05/06 12:52:20 INFO SecurityManager: SecurityManager: authentication
>> disabled; ui acls disabled; users  with view permissions: Set(root); groups
>> with view permissions: Set(); users  with modify permissions: Set(root);
>> groups with modify permissions: Set()
>>
>> 20/05/06 12:53:03 ERROR SparkUncaughtExceptionHandler: Uncaught exception
>> in thread Thread[ExecutorRunner for app-20200506124717-10226/0,5,main]
>>
>> java.lang.OutOfMemoryError: Java heap space
>>
>> at org.apache.xerces.util.XMLStringBuffer.append(Unknown Source)
>>
>> at org.apache.xerces.impl.XMLEntityScanner.scanData(Unknown Source)
>>
>> at org.apache.xerces.impl.XMLScanner.scanComment(Unknown Source)
>>
>> at
>> org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanComment(Unknown
>> Source)
>>
>> at
>> org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown
>> Source)
>>
>> at
>> org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown
>> Source)
>>
>> at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
>>
>> at 

Re: java.lang.OutOfMemoryError Spark Worker

2020-05-08 Thread Hrishikesh Mishra
These errors are completely clueless. There is no clue as to why the OOM
exception is occurring.


20/05/08 15:36:55 INFO Worker: Asked to kill driver
driver-20200508153502-1291

20/05/08 15:36:55 INFO DriverRunner: Killing driver process!

20/05/08 15:36:55 INFO CommandUtils: Redirection to
/grid/1/spark/work/driver-20200508153502-1291/stderr closed: Stream closed

20/05/08 15:36:55 INFO CommandUtils: Redirection to
/grid/1/spark/work/driver-20200508153502-1291/stdout closed: Stream closed

20/05/08 15:36:55 INFO ExternalShuffleBlockResolver: Application
app-20200508153654-11776 removed, cleanupLocalDirs = true

20/05/08 15:36:55 INFO Worker: Driver driver-20200508153502-1291 was killed
by user

*20/05/08 15:43:06 WARN AbstractChannelHandlerContext: An exception
'java.lang.OutOfMemoryError: Java heap space' [enable DEBUG level for full
stacktrace] was thrown by a user handler's exceptionCaught() method while
handling the following exception:*

*java.lang.OutOfMemoryError: Java heap space*

*20/05/08 15:43:23 ERROR SparkUncaughtExceptionHandler: Uncaught exception
in thread Thread[dispatcher-event-loop-6,5,main]*

*java.lang.OutOfMemoryError: Java heap space*

*20/05/08 15:43:17 WARN AbstractChannelHandlerContext: An exception
'java.lang.OutOfMemoryError: Java heap space' [enable DEBUG level for full
stacktrace] was thrown by a user handler's exceptionCaught() method while
handling the following exception:*

*java.lang.OutOfMemoryError: Java heap space*

20/05/08 15:43:33 INFO ExecutorRunner: Killing process!

20/05/08 15:43:33 INFO ExecutorRunner: Killing process!

20/05/08 15:43:33 INFO ExecutorRunner: Killing process!

20/05/08 15:43:33 INFO ShutdownHookManager: Shutdown hook called

20/05/08 15:43:33 INFO ShutdownHookManager: Deleting directory
/grid/1/spark/local/spark-e045e069-e126-4cff-9512-d36ad30ee922


On Thu, May 7, 2020 at 10:16 PM Hrishikesh Mishra 
wrote:

> It's only happening for the Hadoop config. The exception traces are
> different each time it dies. And jobs run for a couple of hours, then the
> worker dies.
>
> Another Reason:
>
> *20/05/02 02:26:34 ERROR SparkUncaughtExceptionHandler: Uncaught exception
> in thread Thread[ExecutorRunner for app-20200501213234-9846/3,5,main]*
>
> *java.lang.OutOfMemoryError: Java heap space*
>
> * at org.apache.xerces.xni.XMLString.toString(Unknown Source)*
>
> at org.apache.xerces.parsers.AbstractDOMParser.characters(Unknown Source)
>
> at org.apache.xerces.xinclude.XIncludeHandler.characters(Unknown Source)
>
> at
> org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanContent(Unknown
> Source)
>
> at
> org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown
> Source)
>
> at
> org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown
> Source)
>
> at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
>
> at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
>
> at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
>
> at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
>
> at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
>
> at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:150)
>
> at org.apache.hadoop.conf.Configuration.parse(Configuration.java:2480)
>
> at org.apache.hadoop.conf.Configuration.parse(Configuration.java:2468)
>
> at
> org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:2539)
>
> at
> org.apache.hadoop.conf.Configuration.loadResources(Configuration.java:2492)
>
> at org.apache.hadoop.conf.Configuration.getProps(Configuration.java:2405)
>
> at org.apache.hadoop.conf.Configuration.set(Configuration.java:1143)
>
> at org.apache.hadoop.conf.Configuration.set(Configuration.java:1115)
>
> at
> org.apache.spark.deploy.SparkHadoopUtil$.org$apache$spark$deploy$SparkHadoopUtil$$appendS3AndSparkHadoopConfigurations(SparkHadoopUtil.scala:464)
>
> at
> org.apache.spark.deploy.SparkHadoopUtil$.newConfiguration(SparkHadoopUtil.scala:436)
>
> at
> org.apache.spark.deploy.SparkHadoopUtil.newConfiguration(SparkHadoopUtil.scala:114)
>
> at org.apache.spark.SecurityManager.<init>(SecurityManager.scala:114)
>
> at org.apache.spark.deploy.worker.ExecutorRunner.org
> $apache$spark$deploy$worker$ExecutorRunner$$fetchAndRunExecutor(ExecutorRunner.scala:149)
>
> at
> org.apache.spark.deploy.worker.ExecutorRunner$$anon$1.run(ExecutorRunner.scala:73)
>
> *20/05/02 02:26:37 ERROR SparkUncaughtExceptionHandler: Uncaught exception
> in thread Thread[dispatcher-event-loop-3,5,main]*
>
> *java.lang.OutOfMemoryError: Java heap space*
>
> * at java.lang.Class.newInstance(Class.java:411)*
>
> at
> sun.reflect.MethodAccessorGenerator$1.run(MethodAccessorGenerator.j

Re: java.lang.OutOfMemoryError Spark Worker

2020-05-07 Thread Hrishikesh Mishra
)
at
org.apache.spark.deploy.SparkHadoopUtil$.newConfiguration(SparkHadoopUtil.scala:436)
at
org.apache.spark.deploy.SparkHadoopUtil.newConfiguration(SparkHadoopUtil.scala:114)
at
org.apache.spark.deploy.worker.DriverRunner.downloadUserJar(DriverRunner.scala:160)
at
org.apache.spark.deploy.worker.DriverRunner.prepareAndRunDriver(DriverRunner.scala:173)
at
org.apache.spark.deploy.worker.DriverRunner$$anon$1.run(DriverRunner.scala:92)
*20/05/02 22:15:51 ERROR SparkUncaughtExceptionHandler: Uncaught exception
in thread Thread[dispatcher-event-loop-7,5,main]*
*java.lang.OutOfMemoryError: Java heap space*
* at org.apache.spark.deploy.worker.Worker.receive(Worker.scala:443)*
* at
org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:117)*
* at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205)*
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:101)
at
org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:221)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
20/05/02 22:16:05 INFO ExecutorRunner: Killing process!
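
One detail worth noting about the frames above: Worker.receive,
DriverRunner.downloadUserJar and ExecutorRunner all execute inside the
standalone Worker daemon's own JVM, whose heap is set by SPARK_DAEMON_MEMORY
(default 1g), not by SPARK_WORKER_MEMORY (which only limits what the worker
hands out to executors). A hedged sketch of raising it, assuming spark-env.sh
is used and the worker daemon is restarted afterwards:

# conf/spark-env.sh (illustrative value)
export SPARK_DAEMON_MEMORY=4g   # heap for the standalone Master/Worker daemon JVMs
# restart the worker daemon on each host for this to take effect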




On Thu, May 7, 2020 at 7:48 PM Jeff Evans 
wrote:

> You might want to double check your Hadoop config files.  From the stack
> trace it looks like this is happening when simply trying to load
> configuration (XML files).  Make sure they're well formed.
>
> On Thu, May 7, 2020 at 6:12 AM Hrishikesh Mishra 
> wrote:
>
>> Hi
>>
>> I am getting an out of memory error in the worker log of streaming jobs
>> every couple of hours, after which the worker dies. There is no shuffle, no
>> aggregation, no caching in the job; it's just a transformation.
>> I'm not able to identify where the problem is, the driver or an executor.
>> And why does the worker die after the OOM? The streaming job should die
>> instead. Am I missing something?
>>
>> Driver Memory:  2g
>> Executor memory: 4g
>>
>> Spark Version:  2.4
>> Kafka Direct Stream
>> Spark Standalone Cluster.
>>
>>
>> 20/05/06 12:52:20 INFO SecurityManager: SecurityManager: authentication
>> disabled; ui acls disabled; users  with view permissions: Set(root); groups
>> with view permissions: Set(); users  with modify permissions: Set(root);
>> groups with modify permissions: Set()
>>
>> 20/05/06 12:53:03 ERROR SparkUncaughtExceptionHandler: Uncaught exception
>> in thread Thread[ExecutorRunner for app-20200506124717-10226/0,5,main]
>>
>> java.lang.OutOfMemoryError: Java heap space
>>
>> at org.apache.xerces.util.XMLStringBuffer.append(Unknown Source)
>>
>> at org.apache.xerces.impl.XMLEntityScanner.scanData(Unknown Source)
>>
>> at org.apache.xerces.impl.XMLScanner.scanComment(Unknown Source)
>>
>> at
>> org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanComment(Unknown
>> Source)
>>
>> at
>> org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown
>> Source)
>>
>> at
>> org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown
>> Source)
>>
>> at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
>>
>> at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
>>
>> at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
>>
>> at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
>>
>> at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
>>
>> at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:150)
>>
>> at org.apache.hadoop.conf.Configuration.parse(Configuration.java:2480)
>>
>> at org.apache.hadoop.conf.Configuration.parse(Configuration.java:2468)
>>
>> at
>> org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:2539)
>>
>> at
>> org.apache.hadoop.conf.Configuration.loadResources(Configuration.java:2492)
>>
>> at org.apache.hadoop.conf.Configuration.getProps(Configuration.java:2405)
>>
>> at org.apache.hadoop.conf.Configuration.set(Configuration.java:1143)
>>
>> at org.apache.hadoop.conf.Configuration.set(Configuration.java:1115)
>>
>> at
>> org.apache.spark.deploy.SparkHadoopUtil$.org$apache$spark$deploy$SparkHadoopUtil$$appendS3AndSparkHadoopConfigurations(SparkHadoopUtil.scala:464)
>>
>> at
>> org.apache.spark.deploy.SparkHadoopUtil$.newConfiguration(SparkHadoopUtil.scala:436)
>>
>> at
>> org.apache.spark.deploy.SparkHadoopUtil.newConfiguration(SparkHadoopUtil.scala:114)
>>
>> at org.apache.spark.SecurityManager.<init>(SecurityManager.scala:114)
>>
>> at org.apache.spark.deploy.worker.ExecutorRunner.org
>> $apache$spark$deploy$worker$ExecutorRunner$$fetchAndRunExecutor(ExecutorRunner.scala:149)
>>
>> at
>> org.apache.spark.deploy.worker.ExecutorRunner$$anon$1.run(ExecutorRunner.scala:73)
>>
>> 20/05/06 12:53:38 INFO DriverRunner: Worker shutting down, killing driver
>> driver-20200505181719-1187
>>
>> 20/05/06 12:53:38 INFO DriverRunner: Killing driver process!
>>
>>
>>
>>
>> Regards
>> Hrishi
>>
>
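
A quick way to act on Jeff's suggestion above to double-check the Hadoop
config files is a well-formedness pass over them; a hedged sketch, assuming
the configs live under /etc/hadoop/conf (adjust to wherever HADOOP_CONF_DIR
points):

# xmllint exits non-zero and reports the offending line for malformed XML
for f in /etc/hadoop/conf/*.xml; do
  xmllint --noout "$f" || echo "malformed: $f"
done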


java.lang.OutOfMemoryError Spark Worker

2020-05-07 Thread Hrishikesh Mishra
Hi

I am getting an out of memory error in the worker log of streaming jobs every
couple of hours, after which the worker dies. There is no shuffle, no
aggregation, no caching in the job; it's just a transformation.
I'm not able to identify where the problem is, the driver or an executor. And
why does the worker die after the OOM? The streaming job should die instead.
Am I missing something?

Driver Memory:  2g
Executor memory: 4g

Spark Version:  2.4
Kafka Direct Stream
Spark Standalone Cluster.


20/05/06 12:52:20 INFO SecurityManager: SecurityManager: authentication
disabled; ui acls disabled; users  with view permissions: Set(root); groups
with view permissions: Set(); users  with modify permissions: Set(root);
groups with modify permissions: Set()

20/05/06 12:53:03 ERROR SparkUncaughtExceptionHandler: Uncaught exception
in thread Thread[ExecutorRunner for app-20200506124717-10226/0,5,main]

java.lang.OutOfMemoryError: Java heap space

at org.apache.xerces.util.XMLStringBuffer.append(Unknown Source)

at org.apache.xerces.impl.XMLEntityScanner.scanData(Unknown Source)

at org.apache.xerces.impl.XMLScanner.scanComment(Unknown Source)

at
org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanComment(Unknown
Source)

at
org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown
Source)

at
org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown
Source)

at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)

at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)

at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)

at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)

at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)

at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:150)

at org.apache.hadoop.conf.Configuration.parse(Configuration.java:2480)

at org.apache.hadoop.conf.Configuration.parse(Configuration.java:2468)

at
org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:2539)

at
org.apache.hadoop.conf.Configuration.loadResources(Configuration.java:2492)

at org.apache.hadoop.conf.Configuration.getProps(Configuration.java:2405)

at org.apache.hadoop.conf.Configuration.set(Configuration.java:1143)

at org.apache.hadoop.conf.Configuration.set(Configuration.java:1115)

at
org.apache.spark.deploy.SparkHadoopUtil$.org$apache$spark$deploy$SparkHadoopUtil$$appendS3AndSparkHadoopConfigurations(SparkHadoopUtil.scala:464)

at
org.apache.spark.deploy.SparkHadoopUtil$.newConfiguration(SparkHadoopUtil.scala:436)

at
org.apache.spark.deploy.SparkHadoopUtil.newConfiguration(SparkHadoopUtil.scala:114)

at org.apache.spark.SecurityManager.<init>(SecurityManager.scala:114)

at org.apache.spark.deploy.worker.ExecutorRunner.org
$apache$spark$deploy$worker$ExecutorRunner$$fetchAndRunExecutor(ExecutorRunner.scala:149)

at
org.apache.spark.deploy.worker.ExecutorRunner$$anon$1.run(ExecutorRunner.scala:73)

20/05/06 12:53:38 INFO DriverRunner: Worker shutting down, killing driver
driver-20200505181719-1187

20/05/06 12:53:38 INFO DriverRunner: Killing driver process!




Regards
Hrishi


Re: Spark Streaming on Compact Kafka topic - consumers 1 message per partition per batch

2020-04-08 Thread Hrishikesh Mishra
It seems I found the issue. The actual problem is something related to back
pressure. When I add either of these configs,
*spark.streaming.kafka.maxRatePerPartition* or
*spark.streaming.backpressure.initialRate* (the value of these configs is
100), it starts consuming one message per partition per batch. Not sure why
it's happening.
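
For context on why one record per partition per batch is surprising:
spark.streaming.kafka.maxRatePerPartition is a per-partition, per-second cap,
so with the values in this job the expected upper bound is far larger. A rough
calculation under those assumptions:

maxRatePerPartition     = 100 records/sec per partition
batch duration          = 5 sec
partitions              = 50
expected cap per batch  = 100 * 5 * 50 = 25,000 records
observed                = 50 records (1 per partition)

So the configured caps alone would not explain reading a single record per
partition.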


On Thu, Apr 2, 2020 at 8:48 AM Waleed Fateem 
wrote:

> Well this is interesting. Not sure if this is the expected behavior. The
> log messages you have referenced are actually printed out by the Kafka
> Consumer itself (org.apache.kafka.clients.consumer.internals.Fetcher).
>
> That log message belongs to a new feature added starting with Kafka 1.1:
> https://issues.apache.org/jira/browse/KAFKA-6397
>
> I'm assuming then that you're using Spark 2.4?
>
> From Kafka's perspective, when you do a describe on your
> demandIngestion.SLTarget topic, does that look okay? All partitions are
> available with a valid leader.
>
> The other thing I'm curious about, after you
> enabled spark.streaming.kafka.allowNonConsecutiveOffsets, did you try going
> back to the older group.id and do you see the same behavior? Was there a
> reason you chose to start reading again from the beginning by using a new
> consumer group rather than sticking to the same consumer group?
>
> In your application, are you manually committing offsets to Kafka?
>
> Regards,
>
> Waleed
>
> On Wed, Apr 1, 2020 at 1:31 AM Hrishikesh Mishra 
> wrote:
>
>> Hi
>>
>> Our Spark streaming job was working fine as expected (in terms of the number
>> of events to process in a batch). But for some reasons, we added compaction
>> on the Kafka topic and restarted the job. After the restart it was failing
>> for the reason below:
>>
>>
>> org.apache.spark.SparkException: Job aborted due to stage failure: Task
>> 16 in stage 2.0 failed 4 times, most recent failure: Lost task 16.3 in
>> stage 2.0 (TID 231, 10.34.29.38, executor 4):
>> java.lang.IllegalArgumentException: requirement failed: Got wrong record
>> for spark-executor-pc-nfr-loop-31-march-2020-4 demandIngestion.SLTarget-39
>> even after seeking to offset 106847 got offset 199066 instead. If this is a
>> compacted topic, consider enabling
>> spark.streaming.kafka.allowNonConsecutiveOffsets
>>   at scala.Predef$.require(Predef.scala:224)
>>   at
>> org.apache.spark.streaming.kafka010.InternalKafkaConsumer.get(KafkaDataConsumer.scala:146)
>>
>>
>>
>> So I added spark.streaming.kafka.allowNonConsecutiveOffsets: true to the
>> Spark config and changed the group name to consume from the beginning. Now
>> the problem is that it reads only one message per partition. So if a topic
>> has 50 partitions, it reads 50 messages per batch (batch duration is 5 sec).
>>
>> The topic has 1M records and the consumer has a huge lag.
>>
>>
>> Driver log which fetches 1 message per partition.
>>
>> 20/03/31 18:25:55 INFO Fetcher: [groupId=pc-nfr-loop-31-march-2020-4]
>> Resetting offset for partition demandIngestion.SLTarget-45 to offset 211951.
>> 20/03/31 18:26:00 INFO Fetcher: [groupId=pc-nfr-loop-31-march-2020-4]
>> Resetting offset for partition demandIngestion.SLTarget-45 to offset 211952.
>> 20/03/31 18:26:05 INFO Fetcher: [groupId=pc-nfr-loop-31-march-2020-4]
>> Resetting offset for partition demandIngestion.SLTarget-45 to offset 211953.
>> 20/03/31 18:26:10 INFO Fetcher: [groupId=pc-nfr-loop-31-march-2020-4]
>> Resetting offset for partition demandIngestion.SLTarget-45 to offset 211954.
>> 20/03/31 18:26:15 INFO Fetcher: [groupId=pc-nfr-loop-31-march-2020-4]
>> Resetting offset for partition demandIngestion.SLTarget-45 to offset 211955.
>> 20/03/31 18:26:20 INFO Fetcher: [groupId=pc-nfr-loop-31-march-2020-4]
>> Resetting offset for partition demandIngestion.SLTarget-45 to offset 211956.
>> 20/03/31 18:26:25 INFO Fetcher: [groupId=pc-nfr-loop-31-march-2020-4]
>> Resetting offset for partition demandIngestion.SLTarget-45 to offset 211957.
>> 20/03/31 18:26:30 INFO Fetcher: [groupId=pc-nfr-loop-31-march-2020-4]
>> Resetting offset for partition demandIngestion.SLTarget-45 to offset 211958.
>> 20/03/31 18:26:35 INFO Fetcher: [groupId=pc-nfr-loop-31-march-2020-4]
>> Resetting offset for partition demandIngestion.SLTarget-45 to offset 211959.
>> 20/03/31 18:26:40 INFO Fetcher: [groupId=pc-nfr-loop-31-march-2020-4]
>> Resetting offset for partition demandIngestion.SLTarget-45 to offset 211960.
>> 20/03/31 18:26:45 INFO Fetcher: [groupId=pc-nfr-loop-31-march-2020-4]
>> Resetting offset for partition demandIngestion.SLTarget-45 to offset 211961.
>> 20/03/31 18:26:50 INFO Fetcher: [groupId=pc-nfr-loop-

Spark Streaming on Compact Kafka topic - consumers 1 message per partition per batch

2020-04-01 Thread Hrishikesh Mishra
Hi

Our Spark streaming job was working fine as expected (in terms of the number
of events to process in a batch). But for some reasons, we added compaction on
the Kafka topic and restarted the job. After the restart it was failing for
the reason below:


org.apache.spark.SparkException: Job aborted due to stage failure: Task 16
in stage 2.0 failed 4 times, most recent failure: Lost task 16.3 in stage
2.0 (TID 231, 10.34.29.38, executor 4): java.lang.IllegalArgumentException:
requirement failed: Got wrong record for
spark-executor-pc-nfr-loop-31-march-2020-4 demandIngestion.SLTarget-39 even
after seeking to offset 106847 got offset 199066 instead. If this is a
compacted topic, consider enabling
spark.streaming.kafka.allowNonConsecutiveOffsets
  at scala.Predef$.require(Predef.scala:224)
  at
org.apache.spark.streaming.kafka010.InternalKafkaConsumer.get(KafkaDataConsumer.scala:146)



So I added spark.streaming.kafka.allowNonConsecutiveOffsets: true to the
Spark config and changed the group name to consume from the beginning. Now
the problem is that it reads only one message per partition. So if a topic
has 50 partitions, it reads 50 messages per batch (batch duration is 5 sec).

The topic has 1M records and the consumer has a huge lag.
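
To put a number on that lag, the consumer group can be inspected with Kafka's
own CLI; a hedged sketch, assuming the standard Kafka tools are on the path,
with an illustrative broker address and the group.id visible in the driver log
below:

# prints current offset, log-end offset and lag per partition for the group
kafka-consumer-groups.sh --bootstrap-server broker:9092 \
  --describe --group pc-nfr-loop-31-march-2020-4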


Driver log which fetches 1 message per partition.

20/03/31 18:25:55 INFO Fetcher: [groupId=pc-nfr-loop-31-march-2020-4]
Resetting offset for partition demandIngestion.SLTarget-45 to offset 211951.
20/03/31 18:26:00 INFO Fetcher: [groupId=pc-nfr-loop-31-march-2020-4]
Resetting offset for partition demandIngestion.SLTarget-45 to offset 211952.
20/03/31 18:26:05 INFO Fetcher: [groupId=pc-nfr-loop-31-march-2020-4]
Resetting offset for partition demandIngestion.SLTarget-45 to offset 211953.
20/03/31 18:26:10 INFO Fetcher: [groupId=pc-nfr-loop-31-march-2020-4]
Resetting offset for partition demandIngestion.SLTarget-45 to offset 211954.
20/03/31 18:26:15 INFO Fetcher: [groupId=pc-nfr-loop-31-march-2020-4]
Resetting offset for partition demandIngestion.SLTarget-45 to offset 211955.
20/03/31 18:26:20 INFO Fetcher: [groupId=pc-nfr-loop-31-march-2020-4]
Resetting offset for partition demandIngestion.SLTarget-45 to offset 211956.
20/03/31 18:26:25 INFO Fetcher: [groupId=pc-nfr-loop-31-march-2020-4]
Resetting offset for partition demandIngestion.SLTarget-45 to offset 211957.
20/03/31 18:26:30 INFO Fetcher: [groupId=pc-nfr-loop-31-march-2020-4]
Resetting offset for partition demandIngestion.SLTarget-45 to offset 211958.
20/03/31 18:26:35 INFO Fetcher: [groupId=pc-nfr-loop-31-march-2020-4]
Resetting offset for partition demandIngestion.SLTarget-45 to offset 211959.
20/03/31 18:26:40 INFO Fetcher: [groupId=pc-nfr-loop-31-march-2020-4]
Resetting offset for partition demandIngestion.SLTarget-45 to offset 211960.
20/03/31 18:26:45 INFO Fetcher: [groupId=pc-nfr-loop-31-march-2020-4]
Resetting offset for partition demandIngestion.SLTarget-45 to offset 211961.
20/03/31 18:26:50 INFO Fetcher: [groupId=pc-nfr-loop-31-march-2020-4]
Resetting offset for partition demandIngestion.SLTarget-45 to offset 211962.
20/03/31 18:26:55 INFO Fetcher: [groupId=pc-nfr-loop-31-march-2020-4]
Resetting offset for partition demandIngestion.SLTarget-45 to offset 211963.
20/03/31 18:27:00 INFO Fetcher: [groupId=pc-nfr-loop-31-march-2020-4]
Resetting offset for partition demandIngestion.SLTarget-45 to offset 211964.
20/03/31 18:27:05 INFO Fetcher: [groupId=pc-nfr-loop-31-march-2020-4]
Resetting offset for partition demandIngestion.SLTarget-45 to offset 211965.
20/03/31 18:27:10 INFO Fetcher: [groupId=pc-nfr-loop-31-march-2020-4]
Resetting offset for partition demandIngestion.SLTarget-45 to offset 211966.
20/03/31 18:27:15 INFO Fetcher: [groupId=pc-nfr-loop-31-march-2020-4]
Resetting offset for partition demandIngestion.SLTarget-45 to offset 211967.
20/03/31 18:27:20 INFO Fetcher: [groupId=pc-nfr-loop-31-march-2020-4]
Resetting offset for partition demandIngestion.SLTarget-45 to offset 211968.
20/03/31 18:27:25 INFO Fetcher: [groupId=pc-nfr-loop-31-march-2020-4]
Resetting offset for partition demandIngestion.SLTarget-45 to offset 211969.
20/03/31 18:27:30 INFO Fetcher: [groupId=pc-nfr-loop-31-march-2020-4]
Resetting offset for partition demandIngestion.SLTarget-45  to offset
211970.



Spark Config (batch.duration: 5, using Spark Streaming):

  spark.shuffle.service.enabled: "true"

  spark.streaming.backpressure.enabled: "true"

  spark.streaming.concurrentJobs: "1"

  spark.executor.extraJavaOptions: "-XX:+UseConcMarkSweepGC"

  spark.streaming.backpressure.pid.minRate: 1500

  spark.streaming.backpressure.initialRate: 100

  spark.streaming.kafka.allowNonConsecutiveOffsets: true



Is there any issue in my configuration, or is something special required for a
compacted Kafka topic that I'm missing?




Regards
Hrishi