Re: Fastest way to drop useless columns
One thing that we do on our datasets is:

1. Take 'n' random samples of equal size.
2. Check whether the distribution is heavily skewed towards one key in your samples. We define "heavy skewness" as the mean being more than one standard deviation away from the median.

If it is, then in your case you can drop that column.

On Thu, 31 May 2018, 14:55 , wrote:
> I believe this only works when we need to drop duplicate ROWS
>
> Here I want to drop cols which contains one unique value
Re: Fastest way to drop useless columns
I believe this only works when we need to drop duplicate ROWS.

Here I want to drop columns which contain only one unique value.

On 2018-05-31 11:16, Divya Gehlot wrote:
> you can try dropduplicate function
> https://github.com/spirom/LearningSpark/blob/master/src/main/scala/dataframe/DropDuplicates.scala
Re: Fastest way to drop useless columns
You can try the dropDuplicates function:

https://github.com/spirom/LearningSpark/blob/master/src/main/scala/dataframe/DropDuplicates.scala

On 31 May 2018 at 16:34, wrote:
> And I want to find the fastest way to drop some useless cols for me, i.e.
> cols containing only an unique value !
Re: Fastest way to drop useless columns
Hi Julien,

One quick and easy-to-implement idea is to use sampling on your dataset, i.e., sample a large enough subset of your data and test whether some columns contain only a single unique value in that sample. Any column that shows more than one value in a sample can be kept right away. Repeat the process a few times and then do the full test on the surviving columns, i.e., those that still look constant. If your dataset is stored in Parquet, this will allow you to load only a subset of it.

Best,
Anastasios

On Thu, May 31, 2018 at 10:34 AM, wrote:
> And I want to find the fastest way to drop some useless cols for me, i.e.
> cols containing only an unique value !

--
Anastasios Zouzias
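The sampling idea above could look roughly like the PySpark sketch below. It is only an illustration under assumptions, not a definitive implementation: the names `columns_with_one_value` and `drop_constant_columns` and the parameters `fraction`, `rounds` and `seed` are hypothetical and do not come from this thread. It assumes plain column names (no dots) and uses countDistinct, which ignores nulls, so a column holding nulls plus one other value is treated as constant here.

```python
from typing import Dict, List


def columns_with_one_value(distinct_counts: Dict[str, int]) -> List[str]:
    """Keep the columns whose distinct count is at most 1 (still possibly constant)."""
    return [name for name, n in distinct_counts.items() if n <= 1]


def drop_constant_columns(df, fraction=0.01, rounds=3, seed=42):
    """Drop the columns of a Spark DataFrame `df` that hold a single distinct value.

    `fraction`, `rounds` and `seed` are illustrative defaults (assumptions).
    """
    # Imported here so the helper above stays usable without PySpark installed.
    from pyspark.sql import functions as F

    def distinct_counts(frame, names):
        # One aggregation job computes the distinct count of every candidate column.
        row = frame.agg(*[F.countDistinct(F.col(c)).alias(c) for c in names]).first()
        return row.asDict()

    candidates = list(df.columns)
    for i in range(rounds):
        if not candidates:
            break
        # Selecting only the candidates lets a columnar source such as Parquet
        # skip the columns that were already ruled out.
        sample = df.select(*candidates).sample(False, fraction, seed + i)
        candidates = columns_with_one_value(distinct_counts(sample, candidates))

    if candidates:
        # Full test, restricted to the columns that survived every sample.
        full = distinct_counts(df.select(*candidates), candidates)
        candidates = columns_with_one_value(full)
    return df.drop(*candidates)
```

Hypothetical usage: `cleaned = drop_constant_columns(spark.read.parquet(path))`, where `spark` and `path` are placeholders. The sampling rounds can only rule columns out; the final full pass is what confirms that a surviving column really is constant.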
Fastest way to drop useless columns
Hi there!

I have a potentially large dataset (in terms of both number of rows and number of columns).

I want to find the fastest way to drop the columns that are useless to me, i.e. columns containing only one unique value.

What do you think I could do to make this as fast as possible using Spark?

I already have a solution using distinct().count() or approxCountDistinct(). But these may not be the best choice, as they require going through all the data, even if the first 2 values tested for a column are already different (in which case I know that I can keep the column).

Thanks for your ideas!

Julien

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
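The early-exit idea in the question (stop scanning once every column has shown two different values) can be sketched as below. This is a sketch under assumptions, not a Spark built-in: `constant_columns` is a hypothetical helper, and feeding it `df.toLocalIterator()` pulls rows to the driver one partition at a time, which trades parallelism for the ability to stop early.

```python
from typing import Iterable, List, Sequence


def constant_columns(rows: Iterable[Sequence], columns: List[str]) -> List[str]:
    """Return the columns whose value never changes across `rows`.

    Stops reading `rows` as soon as every column has shown two different values.
    """
    it = iter(rows)
    try:
        first = next(it)
    except StopIteration:
        return []  # no rows at all: do not report anything as constant

    # Indexes of the columns that still look constant.
    candidates = set(range(len(columns)))
    for row in it:
        # None == None is True, so an all-null column counts as constant here.
        candidates = {i for i in candidates if row[i] == first[i]}
        if not candidates:
            break  # early exit: nothing left that could be constant
    return [columns[i] for i in sorted(candidates)]


# Hypothetical Spark usage (assumes a DataFrame named `df`):
#   useless = constant_columns(df.toLocalIterator(), df.columns)
#   df = df.drop(*useless)
```

If some columns really are constant, the loop still has to read every row to confirm it, so the gain is largest when most columns vary.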