Raising spark.shuffle.sort.bypassMergeThreshold might help. You could also try switching the shuffle manager from sort to hash. You can see more configuration options here: <https://spark.apache.org/docs/latest/configuration.html#shuffle-behavior>.
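As a concrete illustration (the values below are examples only, not recommendations — the right threshold depends on your number of reduce partitions), these could go in spark-defaults.conf or be passed with --conf:

```
# Option 1: stay on the sort shuffle manager, but raise the bypass threshold.
# The merge-sort path is skipped when a shuffle has fewer reduce partitions
# than this value; the default is 200. 400 here is purely illustrative.
spark.shuffle.sort.bypassMergeThreshold   400

# Option 2: switch the shuffle manager from sort to hash instead
# (in which case the threshold above does not apply).
# spark.shuffle.manager   hash
```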
Thanks
Best Regards

On Fri, Jul 24, 2015 at 3:33 AM, Mohit Jaggi <mohitja...@gmail.com> wrote:

> Hi There,
> I am testing Spark DataFrame and haven't been able to get my code to finish
> due to what I suspect are GC issues. My guess is that GC interferes with
> heartbeating and executors are detected as failed. The data is ~50 numeric
> columns, ~100 million rows in a CSV file.
> We are doing a groupBy using one of the columns and trying to calculate
> the average of each of the other columns. The groupBy key has about 250k
> unique values.
> It seems that Spark is creating a lot of temp objects (see jmap output
> below) while calculating the average, which I am surprised to see. Why
> doesn't it use the same temp variable? Am I missing something? Do I need to
> specify a config flag to enable code generation and not do this?
>
> Mohit.
>
> [xxxxx app-20150723142604-0002]$ jmap -histo 12209
>
>  num     #instances         #bytes  class name
> ----------------------------------------------
>    1:     258615458     8275694656  scala.collection.immutable.$colon$colon
>    2:     103435856     7447381632  org.apache.spark.sql.catalyst.expressions.Cast
>    3:     103435856     4964921088  org.apache.spark.sql.catalyst.expressions.Coalesce
>    4:       1158643     4257400112  [B
>    5:      51717929     4137434320  org.apache.spark.sql.catalyst.expressions.SumFunction
>    6:      51717928     3723690816  org.apache.spark.sql.catalyst.expressions.Add
>    7:      51717929     2896204024  org.apache.spark.sql.catalyst.expressions.CountFunction
>    8:      51717928     2896203968  org.apache.spark.sql.catalyst.expressions.MutableLiteral
>    9:      51717928     2482460544  org.apache.spark.sql.catalyst.expressions.Literal
>   10:      51803728     1243289472  java.lang.Double
>   11:      51717755     1241226120  org.apache.spark.sql.catalyst.expressions.Cast$$anonfun$castToDouble$5
>   12:        975810      850906320  [Lorg.apache.spark.sql.catalyst.expressions.AggregateFunction;
>   13:      51717754      827484064  org.apache.spark.sql.catalyst.expressions.Cast$$anonfun$org$apache$spark$sql$catalyst$expressions$Cast$$cast$1
>   14:        982451       47157648  java.util.HashMap$Entry
>   15:        981132       34981720  [Ljava.lang.Object;
>   16:       1049984       25199616  org.apache.spark.sql.types.UTF8String
>   17:        978296       23479104  org.apache.spark.sql.catalyst.expressions.GenericRow
>   18:        117166       15944560  <methodKlass>
>   19:        117166       14986224  <constMethodKlass>
>   20:          1567       12891952  [Ljava.util.HashMap$Entry;
>   21:          9103       10249728  <constantPoolKlass>
>   22:          9103        9278592  <instanceKlassKlass>
>   23:          5072        5691320  [I
>   24:          7281        5335040  <constantPoolCacheKlass>
>   25:         46420        4769600  [C
>   26:        105984        3391488  io.netty.buffer.PoolThreadCache$MemoryRegionCache$Entry