You can simply follow these http://spark.apache.org/docs/1.2.0/tuning.html
Thanks Best Regards On Sun, Feb 22, 2015 at 1:14 AM, java8964 <java8...@hotmail.com> wrote: > Can someone share some ideas about how to tune the GC time? > > Thanks > > ------------------------------ > From: java8...@hotmail.com > To: user@spark.apache.org > Subject: Spark performance tuning > Date: Fri, 20 Feb 2015 16:04:23 -0500 > > > Hi, > > I am new to Spark, and I am trying to test the Spark SQL performance vs > Hive. I setup a standalone box, with 24 cores and 64G memory. > > We have one SQL in mind to test. Here is the basically setup on this one > box for the SQL we are trying to run: > > 1) Dataset 1, 6.6G AVRO file with snappy compression, which contains nest > structure of 3 array of struct in AVRO > 2) Dataset2, 5G AVRO file with snappy compression > 3) Dataset3, 2.3M AVRO file with snappy compression. > > The basic structure of the query is like this: > > > (select > xxx > from > dataset1 lateral view outer explode(struct1) lateral view outer > explode(struct2) > where xxxxx ) > left outer join > ( > select xxxx from dataset2 lateral view explode(xxx) where xxxx > ) > on xxxx > left outer join > ( > select xxx from dataset3 where xxxx) > on xxxxx > > So overall what it does is 2 outer explode on dataset1, left outer join > with explode of dataset2, then finally left outer join with dataset 3. > > On this standalone box, I installed Hadoop 2.2 and Hive 0.12, and Spark > 1.2.0. > > Baseline, the above query can finish around 50 minutes in Hive 12, with 6 > mappers and 3 reducers, each with 1G max heap, in 3 rounds of MR jobs. > > This is a very expensive query running in our production, of course with > much bigger data set, every day. Now I want to see how fast Spark can do > for the same query. > > I am using the following settings, based on my understanding of Spark, for > a fair test between it and Hive: > > export SPARK_WORKER_MEMORY=32g > export SPARK_DRIVER_MEMORY=2g > --executor-memory 9g > --total-executor-cores 9 > > I am trying to run the one executor with 9 cores and max 9G heap, to make > Spark use almost same resource we gave to the MapReduce. > Here is the result without any additional configuration changes, running > under Spark 1.2.0, using HiveContext in Spark SQL, to run the exactly same > query: > > The Spark SQL generated 5 stage of tasks, shown below: > 4 collect at SparkPlan.scala:84 +details 2015/02/20 10:48:46 *26 s* > 200/200 > 3 mapPartitions at Exchange.scala:64 +details 2015/02/20 10:32:07 *16 > min* 200/200 1112.3 MB > 2 mapPartitions at Exchange.scala:64 +details 2015/02/20 10:22:06 *9 > min* 40/40 4.7 GB 22.2 GB > 1 mapPartitions at Exchange.scala:64 +details 2015/02/20 10:22:06 *1.9* > min 50/50 6.2 GB 2.8 GB > 0 mapPartitions at Exchange.scala:64 +details 2015/02/20 10:22:06 *6 s* > 2/2 2.3 MB 156.6 KB > > So the wall time of whole query is 26s + 16m + 9m + 2m + 6s, around 28 > minutes. > > It is about 56% of originally time, not bad. But I want to know any tuning > of Spark can make it even faster. > > For stage 2 and 3, I observed that GC time is more and more expensive. > Especially in stage 3, shown below: > > For stage 3: > Metric Min 25th percentile Median 75th percentile Max > Duration 20 s 30 s 35 s 39 s > 2.4 min > GC Time 9 s 17 s 20 s 25 s > 2.2 min > Shuffle Write 4.7 MB 4.9 MB 5.2 MB 6.1 MB > 8.3 MB > > So in median, the GC time took overall 20s/35s = 57% of time. > > First change I made is to add the following line in the spark-default.conf: > spark.serializer org.apache.spark.serializer.KryoSerializer > > My assumption is that using kryoSerializer, instead of default java > serialize, will lower the memory footprint, should lower the GC pressure > during runtime. I know the I changed the correct spark-default.conf, > because if I were add "spark.executor.extraJavaOptions -verbose:gc > -XX:+PrintGCDetails -XX:+PrintGCTimeStamps" in the same file, I will see > the GC usage in the stdout file. Of course, in this test, I didn't add > that, as I want to only make one change a time. > The result is almost the same, as using standard java serialize. The wall > time is still 28 minutes, and in stage 3, the GC still took around 50 to > 60% of time, almost same result within "min, median to max" in stage 3, > without any noticeable performance gain. > > Next, based on my understanding, and for this test, I think the > default spark.storage.memoryFraction is too high for this query, as there > is no reason to reserve so much memory for caching data, Because we don't > reuse any dataset in this one query. So I add this at the end of > spark-shell command "--conf spark.storage.memoryFraction=0.3", as I want to > just reserve half of the memory for caching data vs first time. Of course, > this time, I rollback the first change of "KryoSerializer". > > The result looks like almost the same. The whole query finished around 28s > + 14m + 9.6m + 1.9m + 6s = 27 minutes. > > It looks like that Spark is faster than Hive, but is there any steps I can > make it even faster? Why using "KryoSerializer" makes no difference? If I > want to use the same resource as now, anything I can do to speed it up > more, especially lower the GC time? > > Thanks > > Yong > >