Re: Creating a memory-efficient AggregateFunction to calculate Median

2021-12-15 Thread Nicholas Chammas
Thanks for the suggestions. I suppose I should share a bit more about what I tried/learned, so others who come later can understand why a memory-efficient, exact median is not in Spark. Spark's own ApproximatePercentile also uses QuantileSummaries internally

Re: Creating a memory-efficient AggregateFunction to calculate Median

2021-12-15 Thread Fitch, Simeon
Nicholas, This may or may not be much help, but in RasterFrames we have an approximate quantiles Expression computed against Tiles (2d geospatial arrays) which makes use of `org.apache.spark.sql.catalyst.util.QuantileSummaries` to do the hard work. So perhaps a directionally correct example of

Re: Creating a memory-efficient AggregateFunction to calculate Median

2021-12-15 Thread Sean Owen
Parquet and ORC already have the necessary stats to make this fast too, but that only helps if you want the median of sorted data as stored on disk, rather than the general case. Not sure you can do better than roughly what a sort entails if you want the exact median. On Wed, Dec 15, 2021, 8:56 AM Pol

Re: Creating a memory-efficient AggregateFunction to calculate Median

2021-12-15 Thread Pol Santamaria
Correct me if I am wrong, but if the dataset were indexed by the given column, you could get the median without reading the whole dataset, shuffling, and so on. Disclaimer: I work at Qbeast. So the issue is more about the data format and the possibility of pushing the operation down to the data source.
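
To make the idea concrete, here is a minimal sketch in plain Python, not Spark or Qbeast code. It assumes a hypothetical layout in which the column is stored globally sorted and split into blocks, with each block's row count available as metadata. The names `locate`, `indexed_median`, and `read_block` are made up for illustration.

```python
from bisect import bisect_right
from itertools import accumulate

# Hypothetical layout: the column is globally sorted and split into blocks;
# each block's row count is known from metadata without reading the block.
blocks = [[1, 3, 4], [7, 8], [10, 12, 15, 20]]
row_counts = [len(b) for b in blocks]  # stands in for index/footer metadata


def locate(rank, counts):
    """Map a global 0-based rank to (block index, offset within that block)."""
    ends = list(accumulate(counts))   # cumulative row count at each block end
    i = bisect_right(ends, rank)      # first block whose end exceeds the rank
    start = ends[i - 1] if i else 0
    return i, rank - start


def indexed_median(read_block, counts):
    """Exact median that reads only the block(s) holding the middle row(s)."""
    n = sum(counts)
    if n == 0:
        raise ValueError("median of an empty column is undefined")
    values = []
    for rank in ((n - 1) // 2, n // 2):  # the same rank twice when n is odd
        i, offset = locate(rank, counts)
        values.append(read_block(i)[offset])
    return (values[0] + values[1]) / 2


blocks_read = []


def read_block(i):
    """Simulated block read that records which blocks were touched."""
    blocks_read.append(i)
    return blocks[i]


median = indexed_median(read_block, row_counts)
print(median, sorted(set(blocks_read)))
```

Only the block containing the middle row is read; the other blocks contribute nothing but their row counts. This works only because the data is already sorted on that column, which is the caveat raised elsewhere in the thread.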

Re: Creating a memory-efficient AggregateFunction to calculate Median

2021-12-13 Thread Nicholas Chammas
Yeah, I think approximate percentile is good enough most of the time. I don't have a specific need for a precise median. I was interested in implementing it more as a Catalyst learning exercise, but it turns out I picked a bad learning exercise to solve. :) On Mon, Dec 13, 2021 at 9:46 PM

Re: Creating a memory-efficient AggregateFunction to calculate Median

2021-12-13 Thread Reynold Xin
tl;dr: there's no easy way to implement aggregate expressions that'd require multiple passes over the data. It is simply not something that's supported, and doing so would be very costly. Would you be OK using approximate percentile? That's relatively cheap. On Mon, Dec 13, 2021 at 6:43 PM,

Re: Creating a memory-efficient AggregateFunction to calculate Median

2021-12-13 Thread Nicholas Chammas
No takers here? :) I can see now why a median function is not available in most data processing systems. It's pretty annoying to implement! On Thu, Dec 9, 2021 at 9:25 PM Nicholas Chammas wrote: > I'm trying to create a new aggregate function. It's my first time working > with Catalyst, so

Creating a memory-efficient AggregateFunction to calculate Median

2021-12-09 Thread Nicholas Chammas
I'm trying to create a new aggregate function. It's my first time working with Catalyst, so it's exciting, but I'm also in a bit over my head. My goal is to create a function to calculate the median. As a very simple solution, I could just