Hi Julien,

One quick and easy-to-implement idea is to use sampling on your dataset: sample a large enough subset of your data and test whether any columns still contain only a single value. Repeat the process a few times, and then do the full test only on the surviving columns. This will also allow you to load just a subset of your dataset if it is stored in Parquet.
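In rough (untested) Spark/Scala code, the idea could look something like the sketch below -- the function name, the 1% sample fraction, and the three repetitions are all just placeholder choices:

import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.functions.{approx_count_distinct, col, countDistinct}

// Sketch only: prune candidate constant columns on cheap samples,
// then confirm exactly on the full data. Fraction and repetition
// count are arbitrary placeholders, not tuned values.
def dropConstantColumns(df: DataFrame): DataFrame = {
  // Start with every column as a candidate "constant" column.
  var candidates: Seq[String] = df.columns.toSeq

  // Cheap passes over small samples: any column showing >= 2 distinct
  // values in a sample is definitely not constant and leaves the list.
  for (_ <- 1 to 3 if candidates.nonEmpty) {
    val sample = df.sample(withReplacement = false, fraction = 0.01)
    val counts: Row = sample
      .select(candidates.map(c => approx_count_distinct(col(c)).alias(c)): _*)
      .head()
    candidates = candidates.filter(c => counts.getAs[Long](c) <= 1L)
  }

  // Exact pass, but only over the few surviving candidates.
  val constant: Seq[String] =
    if (candidates.isEmpty) Seq.empty
    else {
      val exact = df
        .select(candidates.map(c => countDistinct(col(c)).alias(c)): _*)
        .head()
      candidates.filter(c => exact.getAs[Long](c) <= 1L)
    }

  df.drop(constant: _*)
}

The point is that the sample passes are cheap and should eliminate almost all non-constant columns early, so the exact countDistinct pass only has to touch the handful of survivors.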
Best,
Anastasios

On Thu, May 31, 2018 at 10:34 AM, <julio.ces...@free.fr> wrote:
> Hi there!
>
> I have a potentially large dataset (in terms of both rows and columns),
> and I want to find the fastest way to drop the columns that are useless
> to me, i.e. columns containing only a single unique value.
>
> I'd like to know what you think I could do to make this as fast as
> possible using Spark.
>
> I already have a solution using distinct().count() or
> approxCountDistinct(), but they may not be the best choice, as they
> require going through all the data, even when the first two values
> tested for a column are already different (in which case I know I can
> keep the column).
>
> Thanks for your ideas!
>
> Julien
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org

--
Anastasios Zouzias <a...@zurich.ibm.com>