Re: Metrics Problem

2020-06-26 Thread Srinivas V
One option is to build your main jar with the metrics jar included, like a fat
jar.
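
For example, a minimal sbt-assembly setup for that kind of fat jar could look like this (plugin version and merge strategy are only an example; adjust to your build):

// project/plugins.sbt
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.15.0")

// build.sbt: keep Spark itself "provided" and bundle the metrics sink classes with the main app
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.4.4" % "provided"

assemblyMergeStrategy in assembly := {
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case _                             => MergeStrategy.first
}

Running sbt assembly then produces a single jar to submit, so the metrics classes travel with the application itself.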

On Sat, Jun 27, 2020 at 8:04 AM Bryan Jeffrey 
wrote:

> Srinivas,
>
> Thanks for the insight. I had not considered a dependency issue, as the
> metrics jar works well when applied on the driver. Perhaps my main jar
> includes the Hadoop dependencies but the metrics jar does not?
>
> I am confused, as the only Hadoop dependency also exists for the built-in
> metrics providers, which appear to work.
>
> Regards,
>
> Bryan
>
>
> --
> *From:* Srinivas V 
> *Sent:* Friday, June 26, 2020 9:47:52 PM
> *To:* Bryan Jeffrey 
> *Cc:* user 
> *Subject:* Re: Metrics Problem
>
> It should work when you give an HDFS path, as long as your jar exists at
> that path.
> Your error looks more like a security issue (Kerberos) or missing Hadoop
> dependencies, I think; your error says:
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation
>
> On Fri, Jun 26, 2020 at 8:44 PM Bryan Jeffrey 
> wrote:
>
> It may be helpful to note that I'm running in Yarn cluster mode.  My goal
> is to avoid having to manually distribute the JAR to all of the various
> nodes as this makes versioning deployments difficult.
>
> On Thu, Jun 25, 2020 at 5:32 PM Bryan Jeffrey 
> wrote:
>
> Hello.
>
> I am running Spark 2.4.4. I have implemented a custom metrics producer. It
> works well when I run locally, or specify the metrics producer only for the
> driver.  When I ask for executor metrics I run into ClassNotFoundExceptions
>
> *Is it possible to pass a metrics JAR via --jars?  If so what am I
> missing?*
>
> Deploy driver stats via:
> --jars hdfs:///custommetricsprovider.jar
> --conf
> spark.metrics.conf.driver.sink.metrics.class=org.apache.spark.mycustommetricssink
>
> However, when I pass the JAR with the metrics provider to executors via:
> --jars hdfs:///custommetricsprovider.jar
> --conf
> spark.metrics.conf.executor.sink.metrics.class=org.apache.spark.mycustommetricssink
>
> I get ClassNotFoundException:
>
> 20/06/25 21:19:35 ERROR MetricsSystem: Sink class
> org.apache.spark.custommetricssink cannot be instantiated
> Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1748)
> at
> org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:64)
> at
> org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:188)
> at
> org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:281)
> at
> org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
> Caused by: java.lang.ClassNotFoundException:
> org.apache.spark.custommetricssink
> at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
> at java.lang.Class.forName0(Native Method)
> at java.lang.Class.forName(Class.java:348)
> at org.apache.spark.util.Utils$.classForName(Utils.scala:238)
> at
> org.apache.spark.metrics.MetricsSystem$$anonfun$registerSinks$1.apply(MetricsSystem.scala:198)
> at
> org.apache.spark.metrics.MetricsSystem$$anonfun$registerSinks$1.apply(MetricsSystem.scala:194)
> at
> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:130)
> at
> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:130)
> at
> scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:236)
> at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
> at scala.collection.mutable.HashMap.foreach(HashMap.scala:130)
> at
> org.apache.spark.metrics.MetricsSystem.registerSinks(MetricsSystem.scala:194)
> at org.apache.spark.metrics.MetricsSystem.start(MetricsSystem.scala:102)
> at org.apache.spark.SparkEnv$.create(SparkEnv.scala:365)
> at org.apache.spark.SparkEnv$.createExecutorEnv(SparkEnv.scala:201)
> at
> org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:221)
> at
> org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:65)
> at
> org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:64)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
> ... 4 more
>
> Is it possible to pass a metrics JAR via --jars?  If so what am I missing?
>
> Thank you,
>
> Bryan
>
>


Re: [Structured spark streaming] How does the Cassandra connector readStream deal with deleted records

2020-06-26 Thread Russell Spitzer
The connector uses Java driver CQL requests under the hood, which means it
responds to the changing database the way a normal application would. This
means retries may return a different set of data than the original
request if the underlying database has changed.

On Fri, Jun 26, 2020, 9:42 PM Jungtaek Lim 
wrote:

> I'm not sure how it is implemented, but in general I wouldn't expect such
> behavior from connectors which read from storage in a non-streaming fashion.
> The query result may depend on "when" the records are fetched.
>
> If you need to reflect the changes in your query, you'll probably want to
> find a way to retrieve "change logs" from your external storage (or a way
> your system/product can produce change logs if your external storage
> doesn't support it), and adapt them to your query. There's a keyword you can
> google to read further: "Change Data Capture".
>
> Otherwise, you can apply the traditional approach: run a batch query
> periodically and replace the entire output.
>
> On Thu, Jun 25, 2020 at 1:26 PM Rahul Kumar 
> wrote:
>
>> Hello everyone,
>>
>> I was wondering how the Cassandra Spark connector deals with deleted/updated
>> records during a readStream operation. If a record was already fetched into
>> Spark memory, and it got updated or deleted in the database, does that get
>> reflected in a streaming join?
>>
>> Thanks,
>> Rahul
>>
>>
>>
>>


Re: [Structured spark streaming] How does the Cassandra connector readStream deal with deleted records

2020-06-26 Thread Jungtaek Lim
I'm not sure how it is implemented, but in general I wouldn't expect such
behavior from connectors which read from storage in a non-streaming fashion.
The query result may depend on "when" the records are fetched.

If you need to reflect the changes in your query, you'll probably want to
find a way to retrieve "change logs" from your external storage (or a way
your system/product can produce change logs if your external storage
doesn't support it), and adapt them to your query. There's a keyword you can
google to read further: "Change Data Capture".

Otherwise, you can apply the traditional approach: run a batch query
periodically and replace the entire output.
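
As a rough illustration of that batch approach with the Cassandra connector (keyspace, table, and output path are placeholders, not from this thread):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("cassandra-snapshot").getOrCreate()

// Requires the spark-cassandra-connector package on the classpath.
// Batch-read the current state of the table on each scheduled run...
val snapshot = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "my_ks", "table" -> "my_table"))
  .load()

// ...and replace the previous output wholesale.
snapshot.write.mode("overwrite").parquet("hdfs:///snapshots/my_table")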

On Thu, Jun 25, 2020 at 1:26 PM Rahul Kumar  wrote:

> Hello everyone,
>
> I was wondering how the Cassandra Spark connector deals with deleted/updated
> records during a readStream operation. If a record was already fetched into
> Spark memory, and it got updated or deleted in the database, does that get
> reflected in a streaming join?
>
> Thanks,
> Rahul
>
>
>
>


Re: Metrics Problem

2020-06-26 Thread Bryan Jeffrey
Srinivas,

Thanks for the insight. I had not considered a dependency issue, as the metrics
jar works well when applied on the driver. Perhaps my main jar includes the Hadoop
dependencies but the metrics jar does not?

I am confused, as the only Hadoop dependency also exists for the built-in
metrics providers, which appear to work.

Regards,

Bryan



From: Srinivas V 
Sent: Friday, June 26, 2020 9:47:52 PM
To: Bryan Jeffrey 
Cc: user 
Subject: Re: Metrics Problem

It should work when you give an HDFS path, as long as your jar exists at that
path.
Your error looks more like a security issue (Kerberos) or missing Hadoop
dependencies, I think; your error says:
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation

On Fri, Jun 26, 2020 at 8:44 PM Bryan Jeffrey <bryan.jeff...@gmail.com> wrote:
It may be helpful to note that I'm running in Yarn cluster mode.  My goal is to 
avoid having to manually distribute the JAR to all of the various nodes as this 
makes versioning deployments difficult.

On Thu, Jun 25, 2020 at 5:32 PM Bryan Jeffrey <bryan.jeff...@gmail.com> wrote:
Hello.

I am running Spark 2.4.4. I have implemented a custom metrics producer. It 
works well when I run locally, or specify the metrics producer only for the 
driver.  When I ask for executor metrics I run into ClassNotFoundExceptions

Is it possible to pass a metrics JAR via --jars?  If so what am I missing?

Deploy driver stats via:
--jars hdfs:///custommetricsprovider.jar
--conf 
spark.metrics.conf.driver.sink.metrics.class=org.apache.spark.mycustommetricssink

However, when I pass the JAR with the metrics provider to executors via:
--jars hdfs:///custommetricsprovider.jar
--conf 
spark.metrics.conf.executor.sink.metrics.class=org.apache.spark.mycustommetricssink

I get ClassNotFoundException:

20/06/25 21:19:35 ERROR MetricsSystem: Sink class 
org.apache.spark.custommetricssink cannot be instantiated
Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1748)
at 
org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:64)
at 
org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:188)
at 
org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:281)
at 
org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.custommetricssink
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.spark.util.Utils$.classForName(Utils.scala:238)
at 
org.apache.spark.metrics.MetricsSystem$$anonfun$registerSinks$1.apply(MetricsSystem.scala:198)
at 
org.apache.spark.metrics.MetricsSystem$$anonfun$registerSinks$1.apply(MetricsSystem.scala:194)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:130)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:130)
at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:236)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
at scala.collection.mutable.HashMap.foreach(HashMap.scala:130)
at org.apache.spark.metrics.MetricsSystem.registerSinks(MetricsSystem.scala:194)
at org.apache.spark.metrics.MetricsSystem.start(MetricsSystem.scala:102)
at org.apache.spark.SparkEnv$.create(SparkEnv.scala:365)
at org.apache.spark.SparkEnv$.createExecutorEnv(SparkEnv.scala:201)
at 
org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:221)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:65)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:64)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
... 4 more

Is it possible to pass a metrics JAR via --jars?  If so what am I missing?

Thank you,

Bryan
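
For reference, a custom executor sink like the one discussed above generally follows the shape of Spark's built-in metrics sinks; a rough sketch (the package, class name, and constructor signature here are assumptions patterned on the built-in Spark 2.4 sinks, not the actual code from this thread):

package org.apache.spark.metrics.sink

import java.util.Properties

import com.codahale.metrics.MetricRegistry
import org.apache.spark.SecurityManager

// Spark instantiates the sink by the fully qualified class name given in
// spark.metrics.conf.*.sink.*.class. The Sink trait is package-private, which is
// why custom sinks are placed under the org.apache.spark package.
class MyCustomMetricsSink(
    val property: Properties,
    val registry: MetricRegistry,
    securityMgr: SecurityManager) extends Sink {

  override def start(): Unit = {
    // start whatever reporter pushes `registry` to your backend
  }

  override def stop(): Unit = {
    // stop the reporter
  }

  override def report(): Unit = {
    // force a flush of the current metrics
  }
}

Whatever the packaging, this class has to be visible on the executor classpath at the point MetricsSystem starts (SparkEnv.createExecutorEnv in the trace above), which is where the ClassNotFoundException is thrown.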


Re: Metrics Problem

2020-06-26 Thread Srinivas V
It should work when you give an HDFS path, as long as your jar exists at
that path.
Your error looks more like a security issue (Kerberos) or missing Hadoop
dependencies, I think; your error says:
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation

On Fri, Jun 26, 2020 at 8:44 PM Bryan Jeffrey 
wrote:

> It may be helpful to note that I'm running in Yarn cluster mode.  My goal
> is to avoid having to manually distribute the JAR to all of the various
> nodes as this makes versioning deployments difficult.
>
> On Thu, Jun 25, 2020 at 5:32 PM Bryan Jeffrey 
> wrote:
>
>> Hello.
>>
>> I am running Spark 2.4.4. I have implemented a custom metrics producer.
>> It works well when I run locally, or specify the metrics producer only for
>> the driver.  When I ask for executor metrics I run into
>> ClassNotFoundExceptions
>>
>> *Is it possible to pass a metrics JAR via --jars?  If so what am I
>> missing?*
>>
>> Deploy driver stats via:
>> --jars hdfs:///custommetricsprovider.jar
>> --conf
>> spark.metrics.conf.driver.sink.metrics.class=org.apache.spark.mycustommetricssink
>>
>> However, when I pass the JAR with the metrics provider to executors via:
>> --jars hdfs:///custommetricsprovider.jar
>> --conf
>> spark.metrics.conf.executor.sink.metrics.class=org.apache.spark.mycustommetricssink
>>
>> I get ClassNotFoundException:
>>
>> 20/06/25 21:19:35 ERROR MetricsSystem: Sink class
>> org.apache.spark.custommetricssink cannot be instantiated
>> Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
>> at
>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1748)
>> at
>> org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:64)
>> at
>> org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:188)
>> at
>> org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:281)
>> at
>> org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
>> Caused by: java.lang.ClassNotFoundException:
>> org.apache.spark.custommetricssink
>> at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
>> at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
>> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
>> at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
>> at java.lang.Class.forName0(Native Method)
>> at java.lang.Class.forName(Class.java:348)
>> at org.apache.spark.util.Utils$.classForName(Utils.scala:238)
>> at
>> org.apache.spark.metrics.MetricsSystem$$anonfun$registerSinks$1.apply(MetricsSystem.scala:198)
>> at
>> org.apache.spark.metrics.MetricsSystem$$anonfun$registerSinks$1.apply(MetricsSystem.scala:194)
>> at
>> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:130)
>> at
>> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:130)
>> at
>> scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:236)
>> at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
>> at scala.collection.mutable.HashMap.foreach(HashMap.scala:130)
>> at
>> org.apache.spark.metrics.MetricsSystem.registerSinks(MetricsSystem.scala:194)
>> at org.apache.spark.metrics.MetricsSystem.start(MetricsSystem.scala:102)
>> at org.apache.spark.SparkEnv$.create(SparkEnv.scala:365)
>> at org.apache.spark.SparkEnv$.createExecutorEnv(SparkEnv.scala:201)
>> at
>> org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:221)
>> at
>> org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:65)
>> at
>> org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:64)
>> at java.security.AccessController.doPrivileged(Native Method)
>> at javax.security.auth.Subject.doAs(Subject.java:422)
>> at
>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
>> ... 4 more
>>
>> Is it possible to pass a metrics JAR via --jars?  If so what am I missing?
>>
>> Thank you,
>>
>> Bryan
>>
>


Re: Spark Structured Streaming: “earliest” as “startingOffsets” is not working

2020-06-26 Thread Srinivas V
Cool. Are you not using a watermark?
Also, is it possible to start listening to offsets from a specific date/time?

Regards
Srini

On Sat, Jun 27, 2020 at 6:12 AM Eric Beabes 
wrote:

> My apologies...  After I set the 'maxOffsetsPerTrigger' to a value such as
> '20' it started working. Hopefully this will help someone. Thanks.
>
> On Fri, Jun 26, 2020 at 2:12 PM Something Something <
> mailinglist...@gmail.com> wrote:
>
>> My Spark Structured Streaming job works fine when I set "startingOffsets"
>> to "latest". When I simply change it to "earliest" & specify a new "check
>> point directory", the job doesn't work. The states don't get timed out
>> after 10 minutes.
>>
>> While debugging I noticed that my 'state' logic is indeed getting
>> executed but states just don't time out - as they do when I use "latest".
>> Any reason why?
>>
>> Is this a known issue?
>>
>> *Note*: I've tried this under Spark 2.3 & 2.4
>>
>


Re: Spark Structured Streaming: “earliest” as “startingOffsets” is not working

2020-06-26 Thread Eric Beabes
My apologies...  After I set the 'maxOffsetsPerTrigger' to a value such as
'20' it started working. Hopefully this will help someone. Thanks.
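
For anyone hitting the same thing, the relevant Kafka source options look roughly like this (broker address and topic are placeholders; the option values mirror what is described above):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("earliest-offsets").getOrCreate()

// Requires the spark-sql-kafka-0-10 package on the classpath.
// Read from the beginning of the topic, but cap how many offsets each micro-batch
// consumes so the job is not handed the entire backlog in a single trigger.
val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")
  .option("subscribe", "my_topic")
  .option("startingOffsets", "earliest")
  .option("maxOffsetsPerTrigger", "20")
  .load()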

On Fri, Jun 26, 2020 at 2:12 PM Something Something <
mailinglist...@gmail.com> wrote:

> My Spark Structured Streaming job works fine when I set "startingOffsets"
> to "latest". When I simply change it to "earliest" & specify a new "check
> point directory", the job doesn't work. The states don't get timed out
> after 10 minutes.
>
> While debugging I noticed that my 'state' logic is indeed getting executed
> but states just don't time out - as they do when I use "latest". Any reason
> why?
>
> Is this a known issue?
>
> *Note*: I've tried this under Spark 2.3 & 2.4
>


Spark Structured Streaming: “earliest” as “startingOffsets” is not working

2020-06-26 Thread Something Something
My Spark Structured Streaming job works fine when I set "startingOffsets"
to "latest". When I simply change it to "earliest" & specify a new "check
point directory", the job doesn't work. The states don't get timed out
after 10 minutes.

While debugging I noticed that my 'state' logic is indeed getting executed
but states just don't time out - as they do when I use "latest". Any reason
why?

Is this a known issue?

*Note*: I've tried this under Spark 2.3 & 2.4


Re: apache-spark mongodb dataframe issue

2020-06-26 Thread Mannat Singh
Hi Jeff
Thanks for confirming the same.

I have also thought about reading every MongoDB document separately along
with its schema and then comparing it to the schemas of all the other
documents in the collection. For our huge database this is a horrible,
horrible approach, as you have already mentioned.

I am doing R&D on another approach and will post here if there is a
breakthrough.







Data Explosion and repartition before group bys

2020-06-26 Thread lsn24
Hi,

We have a use case where one record needs to be in two different
aggregations.

Say, for example, a credit card transaction "A" belongs to both the ATM and
cross-border transaction categories.

If I need to take the count of ATM transactions, I need to consider
transaction A. For the count of cross-border transactions, too, I need to
consider transaction A.

Since this has to run in parallel, we decided to go with data explosion, so
that transaction A can be aggregated twice.

Questions:
   1. Is data explosion the only way to address this?
   2. The data has skew, so it runs out of executor memory when we try to
aggregate. Repartitioning after the data explosion to address the data skew
is killing us.

What other ways can we address this problem?

Note: A transaction is marked as an ATM transaction or a cross-border
transaction by a boolean value.
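
For what it's worth, a minimal sketch of the explosion step described above might look like this (table and column names are assumptions, not from the original job):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("txn-category-counts").getOrCreate()

// Assumed input: a transactions table with boolean flags isAtm and isCrossBorder.
val transactions = spark.table("transactions")

// Emit one row per category a transaction belongs to, then aggregate per category,
// so a transaction flagged for both categories is counted in both.
val exploded = transactions
  .withColumn("category", explode(array(
    when(col("isAtm"), lit("ATM")),
    when(col("isCrossBorder"), lit("CROSS_BORDER")))))
  .filter(col("category").isNotNull)

val countsByCategory = exploded.groupBy("category").count()
countsByCategory.show()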







Re: Metrics Problem

2020-06-26 Thread Bryan Jeffrey
It may be helpful to note that I'm running in Yarn cluster mode.  My goal
is to avoid having to manually distribute the JAR to all of the various
nodes as this makes versioning deployments difficult.

On Thu, Jun 25, 2020 at 5:32 PM Bryan Jeffrey 
wrote:

> Hello.
>
> I am running Spark 2.4.4. I have implemented a custom metrics producer. It
> works well when I run locally, or specify the metrics producer only for the
> driver.  When I ask for executor metrics I run into ClassNotFoundExceptions
>
> *Is it possible to pass a metrics JAR via --jars?  If so what am I
> missing?*
>
> Deploy driver stats via:
> --jars hdfs:///custommetricsprovider.jar
> --conf
> spark.metrics.conf.driver.sink.metrics.class=org.apache.spark.mycustommetricssink
>
> However, when I pass the JAR with the metrics provider to executors via:
> --jars hdfs:///custommetricsprovider.jar
> --conf
> spark.metrics.conf.executor.sink.metrics.class=org.apache.spark.mycustommetricssink
>
> I get ClassNotFoundException:
>
> 20/06/25 21:19:35 ERROR MetricsSystem: Sink class
> org.apache.spark.custommetricssink cannot be instantiated
> Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1748)
> at
> org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:64)
> at
> org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:188)
> at
> org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:281)
> at
> org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
> Caused by: java.lang.ClassNotFoundException:
> org.apache.spark.custommetricssink
> at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
> at java.lang.Class.forName0(Native Method)
> at java.lang.Class.forName(Class.java:348)
> at org.apache.spark.util.Utils$.classForName(Utils.scala:238)
> at
> org.apache.spark.metrics.MetricsSystem$$anonfun$registerSinks$1.apply(MetricsSystem.scala:198)
> at
> org.apache.spark.metrics.MetricsSystem$$anonfun$registerSinks$1.apply(MetricsSystem.scala:194)
> at
> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:130)
> at
> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:130)
> at
> scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:236)
> at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
> at scala.collection.mutable.HashMap.foreach(HashMap.scala:130)
> at
> org.apache.spark.metrics.MetricsSystem.registerSinks(MetricsSystem.scala:194)
> at org.apache.spark.metrics.MetricsSystem.start(MetricsSystem.scala:102)
> at org.apache.spark.SparkEnv$.create(SparkEnv.scala:365)
> at org.apache.spark.SparkEnv$.createExecutorEnv(SparkEnv.scala:201)
> at
> org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:221)
> at
> org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:65)
> at
> org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:64)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
> ... 4 more
>
> Is it possible to pass a metrics JAR via --jars?  If so what am I missing?
>
> Thank you,
>
> Bryan
>


Re: Spark 3 pod template for the driver

2020-06-26 Thread Michel Sumbul
Hi Jorge,
If I set that in the spark-submit command it works, but I want it only in
the pod template file.

Best regards,
Michel

On Fri, Jun 26, 2020 at 14:01, Jorge Machado  wrote:

> Try to set spark.kubernetes.container.image
>
> On 26. Jun 2020, at 14:58, Michel Sumbul  wrote:
>
> Hi guys,
>
> I am trying to use Spark 3 on top of Kubernetes and to specify a pod
> template for the driver.
>
> Here is my pod manifest for the driver, and when I do a spark-submit with
> the option:
> --conf
> spark.kubernetes.driver.podTemplateFile=/data/k8s/podtemplate_driver3.yaml
>
> I got the error message that I need to specify an image, but it's in the
> manifest.
> Is my manifest file wrong? How should it look?
>
> Thanks for your help,
> Michel
>
> 
> The pod manifest:
>
> apiVersion: v1
> kind: Pod
> metadata:
>   name: mySpark3App
>   labels:
>     app: mySpark3App
>     customlabel/app-id: "1"
> spec:
>   securityContext:
>     runAsUser: 1000
>   volumes:
>     - name: "test-volume"
>       emptyDir: {}
>   containers:
>     - name: spark3driver
>       image: mydockerregistry.example.com/images/dev/spark3:latest
>       instances: 1
>       resources:
>         requests:
>           cpu: "1000m"
>           memory: "512Mi"
>         limits:
>           cpu: "1000m"
>           memory: "512Mi"
>       volumeMounts:
>         - name: "test-volume"
>           mountPath: "/tmp"
>
>
>


Re: Spark 3 pod template for the driver

2020-06-26 Thread Jorge Machado
Try to set spark.kubernetes.container.image
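
Something along these lines on the submit command, next to the pod template option (the image name is simply the one from the manifest below):

--conf spark.kubernetes.container.image=mydockerregistry.example.com/images/dev/spark3:latest
--conf spark.kubernetes.driver.podTemplateFile=/data/k8s/podtemplate_driver3.yaml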

> On 26. Jun 2020, at 14:58, Michel Sumbul  wrote:
> 
> Hi guys,
> 
> I am trying to use Spark 3 on top of Kubernetes and to specify a pod template
> for the driver.
> 
> Here is my pod manifest for the driver, and when I do a spark-submit with the
> option:
> --conf 
> spark.kubernetes.driver.podTemplateFile=/data/k8s/podtemplate_driver3.yaml
> 
> I got the error message that I need to specify an image, but it's in the
> manifest.
> Is my manifest file wrong? How should it look?
> 
> Thanks for your help,
> Michel
> 
> 
> The pod manifest:
> 
> apiVersion: v1
> kind: Pod
> metadata:
>   name: mySpark3App
>   labels:
>     app: mySpark3App
>     customlabel/app-id: "1"
> spec:
>   securityContext:
>     runAsUser: 1000
>   volumes:
>     - name: "test-volume"
>       emptyDir: {}
>   containers:
>     - name: spark3driver
>       image: mydockerregistry.example.com/images/dev/spark3:latest
>       instances: 1
>       resources:
>         requests:
>           cpu: "1000m"
>           memory: "512Mi"
>         limits:
>           cpu: "1000m"
>           memory: "512Mi"
>       volumeMounts:
>         - name: "test-volume"
>           mountPath: "/tmp"



Spark 3 pod template for the driver

2020-06-26 Thread Michel Sumbul
Hi guys,

I am trying to use Spark 3 on top of Kubernetes and to specify a pod template
for the driver.

Here is my pod manifest for the driver, and when I do a spark-submit with the
option:
--conf
spark.kubernetes.driver.podTemplateFile=/data/k8s/podtemplate_driver3.yaml

I got the error message that I need to specify an image, but it's in the
manifest.
Is my manifest file wrong? How should it look?

Thanks for your help,
Michel


The pod manifest:

apiVersion: v1
kind: Pod
metadata:
  name: mySpark3App
  labels:
    app: mySpark3App
    customlabel/app-id: "1"
spec:
  securityContext:
    runAsUser: 1000
  volumes:
    - name: "test-volume"
      emptyDir: {}
  containers:
    - name: spark3driver
      image: mydockerregistry.example.com/images/dev/spark3:latest
      instances: 1
      resources:
        requests:
          cpu: "1000m"
          memory: "512Mi"
        limits:
          cpu: "1000m"
          memory: "512Mi"
      volumeMounts:
        - name: "test-volume"
          mountPath: "/tmp"