Re: Re: Low throughput and effect of GC in SparkSql GROUP BY

2015-05-21 Thread Pramod Biligiri
Hi Zhang, no, my data is not compressed. I'm trying to minimize the load on the CPU. The GC time reduced for me after enabling codegen. Pramod

Re: Re: Low throughput and effect of GC in SparkSql GROUP BY

2015-05-21 Thread zhangxiongfei
Hi Pramod, is your data compressed? I encountered a similar problem; however, after turning codegen on, the GC time was still very long. The input to my map task is an LZO file of about 100 MB. My query is "select ip, count(*) as c from stage_bitauto_adclick_d group by ip sort by c limit 10".
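For reference, a minimal sketch of running that query from the Spark 1.x shell, assuming an existing HiveContext/SQLContext named sqlContext and that stage_bitauto_adclick_d is a registered table as described above (SORT BY likely needs the Hive dialect):

    // Sketch only: sqlContext and the table are assumed to exist as described in the message above.
    val top = sqlContext.sql(
      """SELECT ip, count(*) AS c
        |FROM stage_bitauto_adclick_d
        |GROUP BY ip
        |SORT BY c
        |LIMIT 10""".stripMargin)
    top.collect().foreach(println)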

Re: Low throughput and effect of GC in SparkSql GROUP BY

2015-05-20 Thread Reynold Xin
Yup, it is a different path. It runs GeneratedAggregate.
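One way to confirm which aggregation path is used is to print the physical plan; a rough sketch for Spark 1.3/1.4, where the table name "logs" is a placeholder for illustration:

    // Sketch: assumes an existing sqlContext; "logs" is a placeholder table name.
    val df = sqlContext.sql("SELECT ip, count(*) AS c FROM logs GROUP BY ip")
    df.explain(true)  // prints the logical and physical plans
    // With spark.sql.codegen=true the physical plan should show GeneratedAggregate
    // rather than the hash-map based Aggregate operator discussed earlier in the thread.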

Re: Low throughput and effect of GC in SparkSql GROUP BY

2015-05-20 Thread Pramod Biligiri
I hadn't turned on codegen. I enabled it and ran it again, and it is running 4-5 times faster now! :) Since my log statements no longer appear, I presume the code path is quite different from the earlier hashmap-related code in Aggregates.scala? Pramod

Re: Low throughput and effect of GC in SparkSql GROUP BY

2015-05-20 Thread Reynold Xin
Does this turn codegen on? I think the performance is fairly different when codegen is turned on. For 1.5, we are investigating having codegen on by default, so users get much better performance out of the box.
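For anyone following along, the setting in question in Spark 1.3/1.4 is spark.sql.codegen, which was off by default at the time; a hedged sketch of the two usual ways to set it:

    // Sketch for Spark 1.3/1.4, where code generation was opt-in.
    sqlContext.setConf("spark.sql.codegen", "true")
    // ...or equivalently through a SQL statement:
    sqlContext.sql("SET spark.sql.codegen=true")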

Low throughput and effect of GC in SparkSql GROUP BY

2015-05-20 Thread Pramod Biligiri
Hi, Somewhat similar to Daniel Mescheder's mail yesterday on SparkSql, I have a data point regarding the performance of Group By, indicating there's excessive GC and it's impacting the throughput. I want to know if the new memory manager for aggregations (https://github.com/apache/spark/pull/5725/)
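A rough sketch of how the GC overhead can be surfaced while reproducing this (per-task GC time also shows up in the Spark UI); the app name is only an example, and the JVM flags are standard HotSpot GC-logging options:

    // Sketch: enable GC logging on executors while running the GROUP BY workload.
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val conf = new SparkConf()
      .setAppName("groupby-gc-check")               // illustrative name
      // Executor stderr will then contain the GC log for each JVM.
      .set("spark.executor.extraJavaOptions",
           "-XX:+PrintGCDetails -XX:+PrintGCTimeStamps")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    sqlContext.setConf("spark.sql.codegen", "true")  // compare GC with codegen on and off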