Hi Zhang,
No, my data is not compressed. I'm trying to minimize the load on the CPU.
The GC time went down for me after enabling codegen.
Pramod
On Thu, May 21, 2015 at 3:43 AM, zhangxiongfei wrote:
Hi Pramod,
Is your data compressed? I encountered a similar problem; however, after turning codegen on, the GC time was still very long. The size of the input data for my map task is about a 100 MB LZO file.
My query is "select ip, count(*) as c from stage_bitauto_adclick_d group by ip
sort by c limit 10".
Yup, it is a different path. It runs GeneratedAggregate.
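As a side note, before codegen becomes the default it is opt-in via a SQL conf. A minimal sketch of enabling it and confirming which aggregation path runs, assuming a Spark 1.3/1.4 `SQLContext` named `sqlContext` and an illustrative registered table `t` (not the actual names from this thread):

```scala
// Sketch, assuming Spark 1.3/1.4 with an existing SQLContext `sqlContext`
// and a registered table `t`; the table name is illustrative.
sqlContext.setConf("spark.sql.codegen", "true")

val df = sqlContext.sql("select key, count(*) as c from t group by key")

// With codegen enabled, the physical plan printed by explain() should show
// GeneratedAggregate rather than the hash-map based Aggregate operator.
df.explain()
```
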
On Wed, May 20, 2015 at 11:43 PM, Pramod Biligiri wrote:
I hadn't turned on codegen. I enabled it and ran it again, and it is running
4-5 times faster now! :)
Since my log statements are no longer appearing, I presume the code path is
quite different from the earlier hashmap-related stuff in Aggregates.scala?
Pramod
On Wed, May 20, 2015 at 9:18 PM, Reyn
Does this turn codegen on? I think the performance is fairly different when
codegen is turned on.
For 1.5, we are investigating having codegen on by default, so users get
much better performance out of the box.
On Wed, May 20, 2015 at 5:24 PM, Pramod Biligiri wrote:
Hi,
Somewhat similar to Daniel Mescheder's mail yesterday on Spark SQL, I have a
data point regarding the performance of Group By, indicating that there is
excessive GC and it's impacting throughput. I want to know if the new
memory manager for aggregations (https://github.com/apache/spark/pull/5725/)