From: Sathi Chowdhury
Date: Thursday, 5 September 2019 at 8:10 PM
To: Himali Patel, "user@spark.apache.org"
Subject: Re: Tune Hive query launched through Spark YARN job.
What I can immediately think of is: since you are using IN in the WHERE clause
for a series of timestamps, you could
consider breaking them up. For each epoch timestamp, you can load your results
into an intermediate staging table and then do a final aggregate from that
table, keeping the GROUP BY. A sketch of this follows.
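A minimal sketch of that staged approach in Spark SQL (Scala), assuming Hive
support is enabled and a decomposable aggregate like SUM. The names here are
hypothetical placeholders (an events source table with epoch_ts, key, and
value columns, and a pre-created events_stage staging table with key and
partial_total columns); adapt them to your actual schema:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("staged-aggregate")
      .enableHiveSupport()
      .getOrCreate()

    // Placeholder values; in practice this is the list of epoch
    // timestamps you currently pass to the IN (...) clause.
    val epochTimestamps = Seq(1567641600L, 1567645200L, 1567648800L)

    // One pass per timestamp: each pass scans and shuffles a much
    // smaller slice of the data and writes a partial aggregate.
    epochTimestamps.foreach { ts =>
      spark.sql(
        s"""INSERT INTO events_stage
           |SELECT key, SUM(value) AS partial_total
           |FROM events
           |WHERE epoch_ts = $ts
           |GROUP BY key""".stripMargin)
    }

    // Final aggregate combines the per-timestamp partials from the
    // staging table, keeping the same GROUP BY.
    spark.sql(
      """SELECT key, SUM(partial_total) AS total
        |FROM events_stage
        |GROUP BY key""".stripMargin)
      .show()

The idea is to trade one very large shuffle for several small per-timestamp
shuffles plus a cheap final aggregate over already-reduced data; whether it
helps in your case depends on how much each GROUP BY pass shrinks the slice.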
Hello all,
We have one use case where we are aggregating billions of rows, and it does a
huge shuffle.
Example:
As per the ‘Job’ tab on the YARN UI, when the input size is around 350 GB, the
shuffle size is more than 3 TB. This pushes non-DFS usage beyond the warning
limit and thus affects the entire cluster.
It seems we need