Timothy, why are you writing application logs to HDFS? If you want to
analyze these logs later, you can write them to local storage on your slave
nodes and rotate those files to a suitable location afterwards. If they are
only going to be useful for debugging the application, you can always remove
them.
One thing that we do on our datasets is:
1. Take 'n' random samples of equal size.
2. Check whether the distribution is heavily skewed for one key in your
samples. The way we define "heavy skewness" is: the mean is more than one
standard deviation away from the median.
In your case, you can drop this column.
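The two steps above could be sketched in plain Python roughly as follows. The function name, the default sample sizes, and the majority-vote across samples are illustrative assumptions, not part of the original suggestion:

```python
import random
import statistics

def is_heavily_skewed(values, n_samples=5, sample_size=100):
    """Sketch of the heuristic from the thread: take n random samples
    of equal size and call the distribution 'heavily skewed' when the
    sample mean lies more than one standard deviation from the sample
    median. Names and defaults are illustrative."""
    skewed_votes = 0
    for _ in range(n_samples):
        sample = random.sample(values, min(sample_size, len(values)))
        mean = statistics.mean(sample)
        median = statistics.median(sample)
        stdev = statistics.stdev(sample)
        if abs(mean - median) > stdev:
            skewed_votes += 1
    # flag the column if a majority of the samples look skewed
    return skewed_votes > n_samples // 2
```

In practice you would run this per key (e.g. on the per-key row counts) and decide from there whether the column is worth keeping.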
Can you try running your query with a static literal for the date filter?
(join_date >= SOME 2 MONTH OLD DATE). I cannot think of any reason why this
query should create more than 60 tasks.
On 12 Feb 2018 6:26 am, "amit kumar singh" wrote:
> Hi
> create table emp as select *
Hi Junfeng,
You should be able to do this with the window aggregation functions lead or
lag:
https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-functions.html#lead
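In PySpark these correspond to `lag(col, 1).over(window)` and `lead(col, 1).over(window)` from `pyspark.sql.functions`. As a minimal plain-Python sketch of what those window functions compute (the row layout and field names here are made up for illustration):

```python
def with_lag_lead(rows, key):
    """For each row (assumed already in window order), attach the
    previous (lag) and next (lead) value of `key`, defaulting to None
    at the edges -- the same semantics as SQL's lag/lead with offset 1."""
    out = []
    for i, row in enumerate(rows):
        enriched = dict(row)
        enriched["lag_" + key] = rows[i - 1][key] if i > 0 else None
        enriched["lead_" + key] = rows[i + 1][key] if i + 1 < len(rows) else None
        out.append(enriched)
    return out
```

For example, `with_lag_lead([{"ts": 1}, {"ts": 2}, {"ts": 3}], "ts")` gives each row its neighbours' `ts` values, with `None` on the first and last rows.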
Thanks,
Dev
On Mon, Aug 27, 2018 at 7:08 AM JF Chen wrote:
> Thanks Sonal.
> For example, I have data as following:
>