Re: [MISC] Should we add .github/FUNDING.yml

2021-12-15 Thread Maciej
Makes sense. Thanks!

On 12/15/21 21:36, Jungtaek Lim wrote:
> If ASF wants to do it, INFRA could probably deal with it for entire
> projects, like the ASF code of conduct recently being exposed on the
> right side of all the ASF GitHub repos.
>
> On Wed, Dec 15, 2021 at 11:49 PM Sean Owen wrote:
> >

Re: [MISC] Should we add .github/FUNDING.yml

2021-12-15 Thread Jungtaek Lim
If ASF wants to do it, INFRA could probably deal with it for entire projects, like the ASF code of conduct recently being exposed on the right side of all the ASF GitHub repos.

On Wed, Dec 15, 2021 at 11:49 PM Sean Owen wrote:
> It might imply that this is a way to fund Spark alone, and it isn't.
>

Re: Creating a memory-efficient AggregateFunction to calculate Median

2021-12-15 Thread Nicholas Chammas
Thanks for the suggestions. I suppose I should share a bit more about what I tried/learned, so others who come later can understand why a memory-efficient, exact median is not in Spark. Spark's own ApproximatePercentile also uses QuantileSummaries internally
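To make the memory trade-off concrete, here is a minimal sketch of a bounded-memory median estimate. It is an illustration only: it uses plain reservoir sampling (Algorithm R) as a stand-in, which is not the algorithm behind `QuantileSummaries`, and the names `approx_median`, `capacity`, and `seed` are hypothetical and not part of any Spark API.

```python
import random
import statistics
from typing import Iterable, List


def approx_median(values: Iterable[float], capacity: int = 1001, seed: int = 0) -> float:
    """Estimate the median from a fixed-size uniform sample of the input.

    Assumption: reservoir sampling (Algorithm R) stands in for a real
    quantile sketch; memory stays at `capacity` items however long the
    input is, at the price of an approximate answer.
    """
    rng = random.Random(seed)
    reservoir: List[float] = []
    for i, value in enumerate(values):
        if i < capacity:
            # Fill the reservoir with the first `capacity` items.
            reservoir.append(value)
        else:
            # Keep the new item with probability capacity / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < capacity:
                reservoir[j] = value
    if not reservoir:
        raise ValueError("approx_median requires at least one value")
    return statistics.median(reservoir)
```

When the input has no more than `capacity` items the result is exact; beyond that, the error shrinks as `capacity` grows, but no fixed-size sample can guarantee the exact median.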

Re: Creating a memory-efficient AggregateFunction to calculate Median

2021-12-15 Thread Fitch, Simeon
Nicholas, this may or may not be much help, but in RasterFrames we have an approximate quantiles Expression computed against Tiles (2D geospatial arrays) which makes use of `org.apache.spark.sql.catalyst.util.QuantileSummaries` to do the hard work. So perhaps a directionally correct example of

Re: Creating a memory-efficient AggregateFunction to calculate Median

2021-12-15 Thread Sean Owen
Parquet and ORC already have the necessary stats to make this fast too, but that only helps if you want the median of sorted data as stored on disk, rather than the general case. Not sure you can do better than roughly what a sort entails if you want the exact median.

On Wed, Dec 15, 2021, 8:56 AM Pol
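As a rough illustration of "what a sort entails", the sketch below computes an exact median on a single machine by sorting a full copy of the data. The function name `exact_median` is hypothetical; this is plain Python rather than Spark code, and it only shows why the exact answer costs O(n log n) time and O(n) memory.

```python
from typing import Sequence


def exact_median(values: Sequence[float]) -> float:
    """Return the exact median by sorting a full copy of the input."""
    if len(values) == 0:
        raise ValueError("exact_median requires at least one value")
    ordered = sorted(values)  # O(n log n) time, O(n) extra memory
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: the middle element is the median.
        return ordered[mid]
    # Even count: average the two middle elements.
    return (ordered[mid - 1] + ordered[mid]) / 2
```

For example, `exact_median([4, 1, 3, 2])` averages the two middle values of the sorted list `[1, 2, 3, 4]`.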

Re: Creating a memory-efficient AggregateFunction to calculate Median

2021-12-15 Thread Pol Santamaria
Correct me if I am wrong, but if the dataset were indexed by the given column, you could get the median without reading the whole dataset, shuffling, and so on. (Disclaimer: I work at Qbeast.) So the issue is more about the data format and the possibility of pushing the operation down to the data source.

Re: [MISC] Should we add .github/FUNDING.yml

2021-12-15 Thread Sean Owen
It might imply that this is a way to fund Spark alone, and it isn't. Probably no big deal either way, but maybe not worth it. It won't be a mystery how to find and fund the ASF for the few orgs that want to, as compared to a small project.

On Wed, Dec 15, 2021, 8:34 AM Maciej wrote:
> Hi All,
>
>

[MISC] Should we add .github/FUNDING.yml

2021-12-15 Thread Maciej
Hi All,

Just wondering ‒ would it make sense to add .github/FUNDING.yml with a custom link pointing to one (or both) of these:

* https://www.apache.org/foundation/sponsorship.html
* https://www.apache.org/foundation/contributing.html

--
Best regards,
Maciej Szymkiewicz
Web: