Re: Creating a memory-efficient AggregateFunction to calculate Median

2021-12-15 Thread Nicholas Chammas

Thanks for the suggestions. I suppose I should share a bit more about what I tried/learned, so others who come later can understand why a memory-efficient, exact median is not in Spark. Spark's own ApproximatePercentile also uses QuantileSummaries internally

Re: Creating a memory-efficient AggregateFunction to calculate Median

2021-12-15 Thread Fitch, Simeon

Nicholas, This may or may not be much help, but in RasterFrames we have an approximate quantiles Expression computed against Tiles (2d geospatial arrays) which makes use of `org.apache.spark.sql.catalyst.util.QuantileSummaries` to do the hard work. So perhaps a directionally correct example of doi

Re: Creating a memory-efficient AggregateFunction to calculate Median

2021-12-15 Thread Sean Owen

Parquet and ORC already have the necessary stats to make this fast too, but that only helps if you want the median of data that is sorted as stored on disk, rather than the general case. Not sure you can do better than roughly what a sort entails if you want the exact median. On Wed, Dec 15, 2021, 8:56 AM Pol Sa

Re: Creating a memory-efficient AggregateFunction to calculate Median

2021-12-15 Thread Pol Santamaria

Correct me if I am wrong, but if the dataset were indexed by the given column, you could get the median without reading the whole dataset, shuffling, and so on. (Disclaimer: I work at Qbeast.) So the issue is more about the data format and the possibility of pushing the operation down to the data source.

Re: Creating a memory-efficient AggregateFunction to calculate Median

2021-12-13 Thread Nicholas Chammas

Yeah, I think approximate percentile is good enough most of the time. I don't have a specific need for a precise median. I was interested in implementing it more as a Catalyst learning exercise, but it turns out I picked a bad learning exercise to solve. :) On Mon, Dec 13, 2021 at 9:46 PM Reynold

Re: Creating a memory-efficient AggregateFunction to calculate Median

2021-12-13 Thread Reynold Xin

tl;dr: there's no easy way to implement aggregate expressions that'd require multiple passes over the data. It is simply not something that's supported, and doing so would be very costly. Would you be OK using approximate percentile? That's relatively cheap. On Mon, Dec 13, 2021 at 6:43 PM, Nichola
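As a rough illustration of the single-pass constraint described above, here is a minimal sketch in plain Python. It is not Spark or Catalyst code: the class `ExactMedianBuffer` and its `update`/`merge`/`evaluate` methods are hypothetical names chosen only to mimic the general shape of an aggregation buffer. The point it tries to show is that when each row is seen only once, an exact median has to retain every value until the end, so the buffer grows with the input.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ExactMedianBuffer:
    """Hypothetical aggregation buffer for illustration; not a Spark class."""

    values: List[float] = field(default_factory=list)

    def update(self, x: float) -> None:
        # Each input row is seen exactly once, so it has to be kept.
        self.values.append(x)

    def merge(self, other: "ExactMedianBuffer") -> None:
        # Combining partial results from two partitions concatenates them.
        self.values.extend(other.values)

    def evaluate(self) -> float:
        # The sort runs only at the end, over every buffered value.
        if not self.values:
            raise ValueError("median of an empty group is undefined")
        ordered = sorted(self.values)
        n = len(ordered)
        mid = n // 2
        if n % 2 == 1:
            return ordered[mid]
        return (ordered[mid - 1] + ordered[mid]) / 2


# Two "partitions" aggregated separately, then merged.
left, right = ExactMedianBuffer(), ExactMedianBuffer()
for x in [5.0, 1.0, 9.0]:
    left.update(x)
for x in [3.0, 7.0]:
    right.update(x)
left.merge(right)
result = left.evaluate()
```

In this sketch the merged buffer holds all five input values before `evaluate` can answer, which is the memory cost that an approximate percentile avoids by keeping only a bounded summary.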

Re: Creating a memory-efficient AggregateFunction to calculate Median

2021-12-13 Thread Nicholas Chammas

No takers here? :) I can see now why a median function is not available in most data processing systems. It's pretty annoying to implement! On Thu, Dec 9, 2021 at 9:25 PM Nicholas Chammas wrote: > I'm trying to create a new aggregate function. It's my first time working > with Catalyst, so it's

Creating a memory-efficient AggregateFunction to calculate Median

2021-12-09 Thread Nicholas Chammas

I'm trying to create a new aggregate function. It's my first time working with Catalyst, so it's exciting, but I'm also in a bit over my head. My goal is to create a function to calculate the median. As a very simple solution, I could just defi
