DStream Spark 2.1.1 Streaming on EMR at scale - long running job fails after two hours

2017-07-26 Thread Mikhailau, Alex
Guys, I am trying hard to make a DStream API Spark Streaming job work on EMR. I have gotten it to run for a few hours before it eventually fails, at which point I start seeing out-of-memory exceptions in the aggregated “yarn logs” output. I am doing a JSON map and extraction of some fields

--jars from spark-submit on master on YARN don't get added properly to the executors - ClassNotFoundException

2017-08-09 Thread Mikhailau, Alex
I have log4j json layout jars added via spark-submit on EMR /usr/lib/spark/bin/spark-submit --deploy-mode cluster --master yarn --jars /home/hadoop/lib/jsonevent-layout-1.7.jar,/home/hadoop/lib/json-smart-1.1.1.jar --driver-java-options "-XX:+AlwaysPreTouch -XX:MaxPermSize=6G" --class com.mlbam

Re: --jars from spark-submit on master on YARN don't get added properly to the executors - ClassNotFoundException

2017-08-09 Thread Mikhailau, Alex
Wed, Aug 9, 2017 at 2:52 PM, Mikhailau, Alex wrote: > I have log4j json layout jars added via spark-submit on EMR > > > > /usr/lib/spark/bin/spark-submit --deploy-mode cluster --master yarn --jars > /home/hadoop/lib/jsonevent-layout-1.7.jar,/home/had

Referencing YARN application id, YARN container hostname, Executor ID and YARN attempt for jobs running on Spark EMR 5.7.0 in log statements?

2017-08-28 Thread Mikhailau, Alex
Does anyone have a working solution for logging the YARN application id, YARN container hostname, Executor ID and YARN attempt in log statements for jobs running on Spark EMR 5.7.0? Are there specific ENV variables available, or some other way of doing that? Thank you, Alex

Re: Referencing YARN application id, YARN container hostname, Executor ID and YARN attempt for jobs running on Spark EMR 5.7.0 in log statements?

2017-08-28 Thread Mikhailau, Alex
MDC way with spark or something other than to achieve this? Alex From: Vadim Semenov Date: Monday, August 28, 2017 at 5:18 PM To: "Mikhailau, Alex" Cc: "user@spark.apache.org" Subject: Re: Referencing YARN application id, YARN container hostname, Executor ID and YARN atte

Re: Referencing YARN application id, YARN container hostname, Executor ID and YARN attempt for jobs running on Spark EMR 5.7.0 in log statements?

2017-08-29 Thread Mikhailau, Alex
Would I use something like this to get to those VM arguments?

    val runtimeMxBean = ManagementFactory.getRuntimeMXBean
    val args = runtimeMxBean.getInputArguments
    val conf = Conf(args)
    etc.

From: Vadim Semenov Date: Tuesday, August 29, 2017 at 11:49 AM To: "Mikhailau, Alex"
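A minimal sketch of the idea asked about in this message: reading the JVM input arguments through `ManagementFactory.getRuntimeMXBean` and pulling out any `-Dkey=value` flags. The object name `JvmArgs`, the helper `systemPropertyFlags`, and the example key `app.id` are hypothetical and not taken from the thread. The sketch assumes the values of interest were passed to the JVM as `-D` flags, which the snippet above does not confirm. The `Conf(args)` call in the message is not reproduced because its definition is not shown.

```scala
import java.lang.management.ManagementFactory
import scala.collection.JavaConverters._

// Hypothetical helper object; the names here are illustrative only.
object JvmArgs {

  // JVM input arguments of the current process (flags such as -D... and -XX:...,
  // not the arguments passed to the main method).
  def currentJvmArgs: Seq[String] =
    ManagementFactory.getRuntimeMXBean.getInputArguments.asScala.toList

  // Extract -Dkey=value flags into a map; everything after the first '=' is the value.
  def systemPropertyFlags(args: Seq[String]): Map[String, String] =
    args.collect {
      case a if a.startsWith("-D") && a.contains("=") =>
        val (key, rest) = a.drop(2).span(_ != '=')
        key -> rest.drop(1) // drop the leading '='
    }.toMap
}
```

For example, `JvmArgs.systemPropertyFlags(JvmArgs.currentJvmArgs).get("app.id")` would return the value of a hypothetical `-Dapp.id=...` flag if one had been supplied to the JVM, and `None` otherwise.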

Cloudwatch metrics sink problem

2017-08-31 Thread Mikhailau, Alex
I am getting the following in the logs: Sink class org.apache.spark.metrics.sink.CloudwatchSink cannot be instantiated due to CloudwatchSink ClassNotFoundException. I am running this on EMR 5.7.0. Does anyone have experience adding this sink to an EMR cluster? Thanks, Alex

Spark 2.1.1 with Kinesis Receivers is failing to launch 50 active receivers with oversized cluster on EMR Yarn

2017-09-05 Thread Mikhailau, Alex
Guys, I have a Spark 2.1.1 job with Kinesis that is failing to launch 50 active receivers with an oversized cluster on EMR Yarn. It sometimes registers 16, sometimes 32, other times 48 receivers, but not all 50. Any help would be greatly appreciated. Kinesis stream shards = 500 YARN EMR CLus

spark metrics prefix in Graphite is duplicated

2017-09-06 Thread Mikhailau, Alex
Hi guys, When I set up my EMR cluster with Spark, I add "*.sink.graphite.prefix": "$env.$namespace.$team.$app" to metrics.properties. The cluster comes up with the correct metrics.properties. Then I simply add-step to EMR with spark-submit without any metrics namespace parameter. In my Graphite, Spar

How do I create a JIRA issue and associate it with a PR that I created for a bug in master?

2017-09-12 Thread Mikhailau, Alex
How do I create a JIRA issue and associate it with a PR that I created for a bug in master? https://github.com/apache/spark/pull/19210

Re-sharded kinesis stream starts generating warnings after kinesis shard numbers were doubled

2017-09-13 Thread Mikhailau, Alex
Has anyone seen the following warnings in the log after a kinesis stream has been re-sharded? com.amazonaws.services.kinesis.clientlibrary.lib.worker.ProcessTask WARN Cannot get the shard for this ProcessTask, so duplicate KPL user records in the event of resharding will not be dropped during d

Re: Re-sharded kinesis stream starts generating warnings after kinesis shard numbers were doubled

2017-10-04 Thread Mikhailau, Alex
-4454 With 2.2.0 -Alex From: "Mikhailau, Alex" Date: Wednesday, September 13, 2017 at 4:16 PM To: "user@spark.apache.org" Subject: Re-sharded kinesis stream starts generating warnings after kinesis shard numbers were doubled Has anyone seen the following warnings in the

Re: Re-sharded kinesis stream starts generating warnings after kinesis shard numbers were doubled

2017-10-04 Thread Mikhailau, Alex
Filed SPARK-22200 From: "Mikhailau, Alex" Date: Wednesday, October 4, 2017 at 10:43 AM To: "user@spark.apache.org" Subject: Re: Re-sharded kinesis stream starts generating warnings after kinesis shard numbers were doubled Just found the same exact issues in one of our l

Does Kinesis Connector for structured streaming auto-scale receivers if a cluster is using dynamic allocation and auto-scaling?

2018-02-01 Thread Mikhailau, Alex
Does Kinesis Connector for structured streaming auto-scale receivers if a cluster is using dynamic allocation and auto-scaling?