Re: How do I deal with ever growing application log

2017-03-05 Thread devjyoti patra
Timothy, why are you writing application logs to HDFS? If you want to analyze these logs later, you can write them to local storage on your slave nodes and later rotate those files to a suitable location. If they are only going to be useful for debugging the application, you can always remove them.
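One common way to get size-bounded, rotating logs on the worker nodes is through Spark's log4j configuration. The snippet below is a sketch of a `log4j.properties` using a `RollingFileAppender`; the file path (which assumes YARN's container log directory) and the size/backup limits are illustrative values, not settings from the thread:

```properties
# Sketch: rotate executor logs on local disk instead of letting one file grow.
log4j.rootLogger=INFO, rolling
log4j.appender.rolling=org.apache.log4j.RollingFileAppender
# ${spark.yarn.app.container.log.dir} is YARN's per-container log dir;
# adjust for your cluster manager.
log4j.appender.rolling.File=${spark.yarn.app.container.log.dir}/spark.log
log4j.appender.rolling.MaxFileSize=50MB
log4j.appender.rolling.MaxBackupIndex=5
log4j.appender.rolling.layout=org.apache.log4j.PatternLayout
log4j.appender.rolling.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
```

Ship this via `--files log4j.properties` plus the matching `-Dlog4j.configuration` JVM options, then move or delete the rotated files with whatever external tooling suits you.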

Re: Fastest way to drop useless columns

2018-05-31 Thread devjyoti patra
One thing that we do on our datasets is:
1. Take 'n' random samples of equal size.
2. Check whether the distribution is heavily skewed towards one key in your samples. The way we define "heavy skewness" is: the mean is more than one standard deviation away from the median.
In that case, you can drop the column.
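The heuristic above can be sketched in a few lines of NumPy. The function name, the `threshold` parameter, and the sample data are mine, not from the thread:

```python
import numpy as np

def is_heavily_skewed(sample, threshold=1.0):
    """Thread's heuristic: a distribution is 'heavily skewed' when the mean
    lies more than (threshold x standard deviation) from the median.
    Name and threshold parameter are illustrative additions."""
    sample = np.asarray(sample)
    return abs(sample.mean() - np.median(sample)) > threshold * sample.std()

# 'n' equal-size random samples, per step 1 (made-up data).
rng = np.random.default_rng(0)
samples = [rng.exponential(1.0, 10_000) for _ in range(5)]
print([bool(is_heavily_skewed(s, threshold=0.2)) for s in samples])
```

One caveat worth noting: for any distribution, |mean − median| never exceeds one standard deviation, so the literal one-sigma cutoff can essentially never fire; in practice a fractional `threshold` (exposed above as a parameter, which is my own adjustment) is needed.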

Re: optimize hive query to move a subset of data from one partition table to another table

2018-02-12 Thread devjyoti patra
Can you try running your query with a static literal for the date filter? (join_date >= SOME 2 MONTH OLD DATE). I cannot think of any reason why this query should create more than 60 tasks.

On 12 Feb 2018 6:26 am, "amit kumar singh" wrote:
> Hi create table emp as select *
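One way to do this is to compute the cutoff once on the driver and inline it into the SQL as a fixed literal, rather than evaluating a date expression inside the filter. A hedged sketch; the table and column names (`emp`, `emp_recent`, `join_date`) follow the thread, and the ~60-day approximation of "2 months" is mine:

```python
from datetime import date, timedelta

# Compute a fixed "roughly two months ago" cutoff once, on the driver.
cutoff = (date.today() - timedelta(days=60)).isoformat()

# Inline it as a static literal in the query string.
query = f"""
CREATE TABLE emp_recent AS
SELECT * FROM emp
WHERE join_date >= '{cutoff}'
"""
print(query)
```

You would then run `query` through `spark.sql(...)` or your Hive client; with a constant predicate the planner can prune partitions up front instead of re-evaluating the expression.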

Re: How to deal with context dependent computing?

2018-08-27 Thread devjyoti patra
Hi Junfeng, You should be able to do this with the window aggregation functions lead or lag: https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-functions.html#lead

Thanks,
Dev

On Mon, Aug 27, 2018 at 7:08 AM JF Chen wrote:
> Thanks Sonal.
> For example, I have data as following:
>
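To make the lead/lag semantics concrete, here is a plain-Python illustration of what these window functions compute over one ordered partition. In Spark itself you would use `pyspark.sql.functions.lag`/`lead` over a `Window` ordered by your timestamp column; the helper functions and data below are mine:

```python
def lag(rows, offset=1, default=None):
    """Value from `offset` rows earlier in the ordered partition."""
    return [default] * offset + rows[:-offset] if offset else rows[:]

def lead(rows, offset=1, default=None):
    """Value from `offset` rows later in the ordered partition."""
    return rows[offset:] + [default] * offset if offset else rows[:]

values = [10, 20, 30, 40]   # one partition, already ordered
print(lag(values))   # [None, 10, 20, 30]
print(lead(values))  # [20, 30, 40, None]
```

Pairing each row with its lagged (or led) neighbour is what lets you express "this row's value depends on the previous row's context" without a self-join.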