Re: Monitor Spark Applications
Hi Raman,

The banzaicloud jar can also cover the JMX exports.

Thanks,
Alex

On Fri, Sep 13, 2019 at 8:46 AM raman gugnani wrote:

> Hi Alex,
>
> Thanks, will check this out.
>
> Can it be done directly, since Spark also exposes the metrics over JMX? My
> one doubt here is how to assign fixed JMX ports to the driver and
> executors.
>
> @Alex,
> Is there any difference between fetching data via JMX and using the
> banzaicloud jar?
>
> On Fri, 13 Sep 2019 at 10:47, Alex Landa wrote:
>
>> Hi,
>> We are starting to use https://github.com/banzaicloud/spark-metrics .
>> Keep in mind that their solution targets Spark on K8s; to make it work
>> for Spark on YARN you have to copy the dependencies of spark-metrics
>> into the Spark jars folder on all the Spark machines (took me a while to
>> figure out).
>>
>> Thanks,
>> Alex
>>
>> On Fri, Sep 13, 2019 at 7:58 AM raman gugnani wrote:
>>
>>> Hi Team,
>>>
>>> I am new to Spark. I am using Spark on Hortonworks Data Platform with
>>> Amazon EC2 machines. I am running Spark in cluster mode with YARN.
>>>
>>> I need to monitor individual JVMs and other Spark metrics with
>>> *prometheus*.
>>>
>>> Can anyone suggest a solution for this?
>>>
>>> --
>>> Raman Gugnani
>>
>
> --
> Raman Gugnani
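[Editor's note: on the fixed-JMX-ports question, a hedged sketch of what the JMX route can look like. The sink class and the JDK flags below are standard Spark/JDK names, but the port value is a placeholder, and pinning a fixed executor port generally only works with a single executor per host (a second executor on the same machine would fail to bind):]

```
# metrics.properties -- enable Spark's built-in JMX sink for all instances
*.sink.jmx.class=org.apache.spark.metrics.sink.JmxSink

# spark-defaults.conf -- expose the driver JVM over JMX on a fixed port
# (one logical line; port 9010 is an illustrative placeholder)
spark.driver.extraJavaOptions  -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=9010 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false
```

[Scraping this into Prometheus would additionally need something like the Prometheus JMX exporter on each JVM, which is part of why a dedicated sink can be simpler.]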
Re: Monitor Spark Applications
Hi,
We are starting to use https://github.com/banzaicloud/spark-metrics .
Keep in mind that their solution targets Spark on K8s; to make it work for
Spark on YARN you have to copy the dependencies of spark-metrics into the
Spark jars folder on all the Spark machines (took me a while to figure out).

Thanks,
Alex

On Fri, Sep 13, 2019 at 7:58 AM raman gugnani wrote:

> Hi Team,
>
> I am new to Spark. I am using Spark on Hortonworks Data Platform with
> Amazon EC2 machines. I am running Spark in cluster mode with YARN.
>
> I need to monitor individual JVMs and other Spark metrics with
> *prometheus*.
>
> Can anyone suggest a solution for this?
>
> --
> Raman Gugnani
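[Editor's note: for context, wiring up the banzaicloud sink happens in Spark's metrics.properties. A minimal sketch, assuming a Prometheus Pushgateway is reachable; the property names follow the spark-metrics README at the time of writing and should be verified against the version actually deployed:]

```
# metrics.properties -- report all metrics to a Prometheus Pushgateway
*.sink.prometheus.class=com.banzaicloud.spark.metrics.sink.PrometheusSink
*.sink.prometheus.pushgateway-address-protocol=http
# pushgateway.example.com is a placeholder for your Pushgateway host
*.sink.prometheus.pushgateway-address=pushgateway.example.com:9091
*.sink.prometheus.period=10
```

[As Alex notes above, on YARN the sink's jars (and their dependencies) must also be present in the Spark jars directory on every node.]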
Re: How to combine all rows into a single row in DataFrame
Hi,

It sounds similar to what we do in our application. We don't serialize every
row; instead we first group the rows into the wanted representation and then
apply protobuf serialization using map and a lambda. I suggest not
serializing the entire DataFrame into a single protobuf message, since it
may cause OOM errors.

Thanks,
Alex

On Mon, Aug 19, 2019 at 11:24 PM Rishikesh Gawade wrote:

> Hi All,
> I have been trying to serialize a dataframe in protobuf format. So far I
> have been able to serialize every row of the dataframe by using the map
> function, with the serialization logic inside the lambda. The resulting
> dataframe consists of rows in serialized format (1 row = 1 serialized
> message).
> I wish to form a single protobuf-serialized message for this dataframe,
> and in order to do that I need to combine all the serialized rows using
> some custom logic, very similar to the one used in the map operation.
> I am assuming this would be possible by using the reduce operation on the
> dataframe; however, I am unaware of how to go about it.
> Any suggestions/approaches would be much appreciated.
>
> Thanks,
> Rishikesh
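[Editor's note: a minimal sketch of the map-then-reduce pattern discussed in this thread. `serializeRow` and `mergeMessages` are hypothetical placeholders for the protobuf-specific logic; `reduce` combines partial results across partitions, so the merge function must be associative, and, as Alex warns, the single combined result lands in driver memory:]

```scala
import org.apache.spark.sql.{Dataset, Row, SparkSession}

object CombineRowsSketch {
  // Hypothetical placeholders for the protobuf-specific logic.
  def serializeRow(row: Row): Array[Byte] = ???                       // 1 row -> 1 message
  def mergeMessages(a: Array[Byte], b: Array[Byte]): Array[Byte] = ???

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local").getOrCreate()
    import spark.implicits._

    val df = spark.range(10).toDF("id")

    // Step 1: serialize each row independently (what the thread already does).
    val serialized: Dataset[Array[Byte]] = df.map(serializeRow _)

    // Step 2: fold the serialized rows into one message with custom logic.
    // The final result is materialized on the driver -- the OOM risk lives here.
    val combined: Array[Byte] = serialized.reduce(mergeMessages _)
  }
}
```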
Re: Spark Standalone - Failing to pass extra java options to the driver in cluster mode
Thanks Jungtaek Lim,
I upgraded the cluster to 2.4.3 and it worked fine.

Thanks,
Alex

On Mon, Aug 19, 2019 at 10:01 PM Jungtaek Lim wrote:

> Hi Alex,
>
> You seem to have hit SPARK-26606 [1], which has been fixed in 2.4.1.
> Could you try it out with the latest version?
>
> Thanks,
> Jungtaek Lim (HeartSaVioR)
>
> 1. https://issues.apache.org/jira/browse/SPARK-26606
>
> On Tue, Aug 20, 2019 at 3:43 AM Alex Landa wrote:
>
>> Hi,
>>
>> We are using Spark Standalone 2.4.0 in production and deploying our
>> Scala app in cluster mode.
>> I saw that extra Java options passed to the driver don't actually get
>> through. A submit example:
>>
>> *spark-submit --deploy-mode cluster --master spark://:7077
>> --driver-memory 512mb --conf
>> "spark.driver.extraJavaOptions=-XX:+HeapDumpOnOutOfMemoryError" --class
>> App app.jar*
>>
>> This doesn't pass *-XX:+HeapDumpOnOutOfMemoryError* as a JVM argument,
>> but instead passes
>> *-Dspark.driver.extraJavaOptions=-XX:+HeapDumpOnOutOfMemoryError*
>> I created a test app for it:
>>
>> val spark = SparkSession.builder()
>>   .master("local")
>>   .appName("testApp").getOrCreate()
>> import spark.implicits._
>>
>> // get a RuntimeMXBean reference
>> val runtimeMxBean = ManagementFactory.getRuntimeMXBean
>>
>> // get the JVM's input arguments as a list of strings
>> val listOfArguments = runtimeMxBean.getInputArguments
>>
>> // print the arguments
>> listOfArguments.asScala.foreach(a => println(s"ARG: $a"))
>>
>> In client mode I get:
>> ARG: -XX:+HeapDumpOnOutOfMemoryError
>> while in cluster mode I get:
>> ARG: -Dspark.driver.extraJavaOptions=-XX:+HeapDumpOnOutOfMemoryError
>>
>> I would appreciate your help working around this issue.
>> Thanks,
>> Alex
>
> --
> Name : Jungtaek Lim
> Blog : http://medium.com/@heartsavior
> Twitter : http://twitter.com/heartsavior
> LinkedIn : http://www.linkedin.com/in/heartsavior
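[Editor's note: to catch this class of regression early after future upgrades, one option (a sketch, not from the thread) is to assert the expected flag at driver startup, using the same RuntimeMXBean trick as Alex's test app:]

```scala
import java.lang.management.ManagementFactory
import scala.collection.JavaConverters._

object JvmArgCheck {
  // True if the running JVM was started with exactly this argument.
  def hasJvmArg(arg: String): Boolean =
    ManagementFactory.getRuntimeMXBean.getInputArguments.asScala.contains(arg)

  def main(args: Array[String]): Unit = {
    val expected = "-XX:+HeapDumpOnOutOfMemoryError"
    // Fail fast at startup instead of discovering at the first OOM
    // that no heap dump was written.
    require(hasJvmArg(expected), s"Driver JVM is missing expected option: $expected")
  }
}
```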
Spark Standalone - Failing to pass extra java options to the driver in cluster mode
Hi,

We are using Spark Standalone 2.4.0 in production and deploying our Scala
app in cluster mode.
I saw that extra Java options passed to the driver don't actually get
through. A submit example:

*spark-submit --deploy-mode cluster --master spark://:7077
--driver-memory 512mb --conf
"spark.driver.extraJavaOptions=-XX:+HeapDumpOnOutOfMemoryError" --class
App app.jar*

This doesn't pass *-XX:+HeapDumpOnOutOfMemoryError* as a JVM argument, but
instead passes
*-Dspark.driver.extraJavaOptions=-XX:+HeapDumpOnOutOfMemoryError*
I created a test app for it:

val spark = SparkSession.builder()
  .master("local")
  .appName("testApp").getOrCreate()
import spark.implicits._

// get a RuntimeMXBean reference
val runtimeMxBean = ManagementFactory.getRuntimeMXBean

// get the JVM's input arguments as a list of strings
val listOfArguments = runtimeMxBean.getInputArguments

// print the arguments
listOfArguments.asScala.foreach(a => println(s"ARG: $a"))

In client mode I get:
ARG: -XX:+HeapDumpOnOutOfMemoryError
while in cluster mode I get:
ARG: -Dspark.driver.extraJavaOptions=-XX:+HeapDumpOnOutOfMemoryError

I would appreciate your help working around this issue.
Thanks,
Alex
Re: Long-Running Spark application doesn't clean old shuffle data correctly
Hi Keith,

I don't think we keep such references, but we do experience exceptions
during job execution that we catch and retry (timeouts/network issues from
different data sources). Can they affect RDD cleanup?

Thanks,
Alex

On Sun, Jul 21, 2019 at 10:49 PM Keith Chapman wrote:

> Hi Alex,
>
> Shuffle files in Spark are deleted when the object holding a reference to
> the shuffle file on disk goes out of scope (is garbage collected by the
> JVM). Could it be the case that you are keeping these objects alive?
>
> Regards,
> Keith.
>
> http://keith-chapman.com
>
> On Sun, Jul 21, 2019 at 12:19 AM Alex Landa wrote:
>
>> Thanks,
>> I looked into these options; the cleaner periodic interval is set to 30
>> min by default.
>> The blocking option for shuffle,
>> *spark.cleaner.referenceTracking.blocking.shuffle*, is set to false by
>> default.
>> What are the implications of setting it to true?
>> Will it make the driver slower?
>>
>> Thanks,
>> Alex
>>
>> On Sun, Jul 21, 2019 at 9:06 AM Prathmesh Ranaut Gmail <
>> prathmesh.ran...@gmail.com> wrote:
>>
>>> This is the job of ContextCleaner. There are a few properties you can
>>> tweak to see if they help:
>>> spark.cleaner.periodicGC.interval
>>> spark.cleaner.referenceTracking
>>> spark.cleaner.referenceTracking.blocking.shuffle
>>>
>>> Regards
>>> Prathmesh Ranaut
>>>
>>> On Jul 21, 2019, at 11:31 AM, Alex Landa wrote:
>>>
>>> Hi,
>>>
>>> We are running a long-running Spark application (which executes lots of
>>> quick jobs using our scheduler) on a Spark standalone cluster 2.4.0.
>>> We see that old shuffle files (a week old, for example) are not deleted
>>> during the execution of the application, which leads to
>>> out-of-disk-space errors on the executors.
>>> If we re-deploy the application, the Spark cluster takes care of the
>>> cleaning and deletes the old shuffle data (since we have
>>> /-Dspark.worker.cleanup.enabled=true/ in the worker config).
>>> I don't want to re-deploy our app every week or two, but to be able to
>>> configure Spark to clean old shuffle data (as it should).
>>>
>>> How can I configure Spark to delete old shuffle data during the
>>> lifetime of the application (not after)?
>>>
>>> Thanks,
>>> Alex
Re: Long-Running Spark application doesn't clean old shuffle data correctly
Thanks,
I looked into these options; the cleaner periodic interval is set to 30 min
by default.
The blocking option for shuffle,
*spark.cleaner.referenceTracking.blocking.shuffle*, is set to false by
default.
What are the implications of setting it to true?
Will it make the driver slower?

Thanks,
Alex

On Sun, Jul 21, 2019 at 9:06 AM Prathmesh Ranaut Gmail <
prathmesh.ran...@gmail.com> wrote:

> This is the job of ContextCleaner. There are a few properties you can
> tweak to see if they help:
> spark.cleaner.periodicGC.interval
> spark.cleaner.referenceTracking
> spark.cleaner.referenceTracking.blocking.shuffle
>
> Regards
> Prathmesh Ranaut
>
> On Jul 21, 2019, at 11:31 AM, Alex Landa wrote:
>
> Hi,
>
> We are running a long-running Spark application (which executes lots of
> quick jobs using our scheduler) on a Spark standalone cluster 2.4.0.
> We see that old shuffle files (a week old, for example) are not deleted
> during the execution of the application, which leads to out-of-disk-space
> errors on the executors.
> If we re-deploy the application, the Spark cluster takes care of the
> cleaning and deletes the old shuffle data (since we have
> /-Dspark.worker.cleanup.enabled=true/ in the worker config).
> I don't want to re-deploy our app every week or two, but to be able to
> configure Spark to clean old shuffle data (as it should).
>
> How can I configure Spark to delete old shuffle data during the lifetime
> of the application (not after)?
>
> Thanks,
> Alex
Long-Running Spark application doesn't clean old shuffle data correctly
Hi,

We are running a long-running Spark application (which executes lots of
quick jobs using our scheduler) on a Spark standalone cluster 2.4.0.
We see that old shuffle files (a week old, for example) are not deleted
during the execution of the application, which leads to out-of-disk-space
errors on the executors.
If we re-deploy the application, the Spark cluster takes care of the
cleaning and deletes the old shuffle data (since we have
/-Dspark.worker.cleanup.enabled=true/ in the worker config).
I don't want to re-deploy our app every week or two, but to be able to
configure Spark to clean old shuffle data (as it should).

How can I configure Spark to delete old shuffle data during the lifetime of
the application (not after)?

Thanks,
Alex
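[Editor's note: for reference, the ContextCleaner properties named in this thread would be set in spark-defaults.conf along the lines below. The first two values are the defaults mentioned in the thread; the blocking flag is flipped to true as the experiment Alex asks about, not as a recommendation:]

```
# spark-defaults.conf (sketch)
spark.cleaner.periodicGC.interval                 30min
spark.cleaner.referenceTracking                   true
spark.cleaner.referenceTracking.blocking.shuffle  true
```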