Hi Richard, should these intermediate statistics be calculated from the result of the calculation, or during the aggregation itself? If they can be derived from the resulting DataFrame, why not cache (persist) that result right after computing it? You can then aggregate the statistics from the cached DataFrame. The statistics pass reads the cached data instead of recomputing the result, so it shouldn't hurt performance too much.
Regards,
--
  Bedrytski Aliaksandr
  sp...@bedryt.ski

On Wed, Aug 24, 2016, at 16:42, Richard Siebeling wrote:
> Hi,
>
> what is the best way to calculate intermediate column statistics like
> the number of empty values and the number of distinct values of each
> column in a dataset when aggregating or filtering data, next to the
> actual result of the aggregate or the filtered data?
>
> We are developing an application in which the user can slice-and-dice
> through the data and we would like to, next to the actual resulting
> data, get column statistics of each column in the resulting dataset.
> We prefer to calculate the column statistics on the same pass over the
> data as the actual aggregation or filtering, is that possible?
>
> We could sacrifice a little bit of performance (but not too much),
> that's why we prefer one pass...
>
> Is this possible in standard Spark or would this mean
> modifying the source a little bit and recompiling? Is that
> feasible / wise to do?
>
> thanks in advance,
> Richard