I was thinking something similar to what Micah says. At TimescaleDB they do something like this. They use SQL. They separate this kind of function into three parts:
- An aggregation function that returns a structure of type tdigest.
- An accessor function that calculates one or more percentiles from that structure.
- A rollup function that combines several tdigest structures into one.

They use this same scheme for several aggregation functions, such as tdigest, hyperloglog and uddsketch. They have even moved typical SQL aggregation functions like avg or mean to this scheme, through the stats_agg structure, because it simplifies some things in SQL. If something similar could be done with Arrow it would be great.

https://github.com/timescale/timescaledb-toolkit/blob/main/docs/tdigest.md

From: Micah Kornfield <[email protected]>
Sent: Monday, March 28, 2022 6:58
To: [email protected]
Subject: Re: [Python] pyarrow.compute.tdigest return class

Having the option of keeping the t-digest separate could be useful. For instance, Google's SQL dialect allows for tracking some sketch data structures separately [1].

[1] https://cloud.google.com/bigquery/docs/reference/standard-sql/hll_functions

On Mon, Mar 21, 2022 at 7:42 PM Yibo Cai <[email protected]> wrote:

Do you mean you want to call pyarrow.compute.tdigest on different inputs over time, and continuously merge the results into one tdigest?

pyarrow.compute.tdigest (a Python wrapper of the C++ kernel) is an aggregate kernel that consumes an input array and outputs the requested quantiles. It is not suited to returning the internal tdigest structure (and how would one make use of that structure?).

The C++ tdigest utility (not the kernel) does support merging tdigests [1]. Is it possible to use the tdigest utility directly?

[1] https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/tdigest.h#L79

Yibo

From: [email protected]
Sent: Monday, March 21, 2022 10:06 PM
To: [email protected]
Subject: [Python] pyarrow.compute.tdigest return class

Hello everyone,

Is there any way for the pyarrow.compute.tdigest function to return a TDigest structure in such a way that it can be merged?

I have a use case where I would like to store time series percentile distributions. The pyarrow function tdigest is very fast but the output is numbers and these cannot be aggregated.

I have tried using TDigest (https://github.com/CamDavidsonPilon/tdigest) but it is very slow.

Thank you very much.
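
For illustration, the aggregate / accessor / rollup split described in the first message of this thread could be sketched in plain Python roughly as follows. This is only a sketch under assumptions: the names Sketch, aggregate, quantile and rollup are hypothetical and are not part of pyarrow or timescaledb-toolkit, and the summary is a naive list of weighted centroids, not a real t-digest. It is only meant to show why keeping the intermediate structure makes the results mergeable.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Tuple

Centroid = Tuple[float, float]  # (mean, weight)


@dataclass
class Sketch:
    """Hypothetical mergeable summary: centroids sorted by mean."""
    max_centroids: int = 100
    centroids: List[Centroid] = field(default_factory=list)


def _compress(points: List[Centroid], max_centroids: int) -> List[Centroid]:
    """Sort centroids, then merge adjacent pairs until few enough remain."""
    if max_centroids < 1:
        raise ValueError("max_centroids must be at least 1")
    pts = sorted(points)
    while len(pts) > max_centroids:
        merged = []
        for i in range(0, len(pts) - 1, 2):
            (m1, w1), (m2, w2) = pts[i], pts[i + 1]
            w = w1 + w2
            merged.append(((m1 * w1 + m2 * w2) / w, w))
        if len(pts) % 2:  # odd count: carry the last centroid over unchanged
            merged.append(pts[-1])
        pts = merged
    return pts


def aggregate(values: Iterable[float], max_centroids: int = 100) -> Sketch:
    """Aggregation step: build a summary structure from raw values."""
    pts = [(float(v), 1.0) for v in values]
    return Sketch(max_centroids, _compress(pts, max_centroids))


def rollup(sketches: Iterable[Sketch]) -> Sketch:
    """Rollup step: combine several summaries into one."""
    sketches = list(sketches)
    if not sketches:
        raise ValueError("need at least one sketch")
    limit = min(s.max_centroids for s in sketches)
    pts = [c for s in sketches for c in s.centroids]
    return Sketch(limit, _compress(pts, limit))


def quantile(sketch: Sketch, q: float) -> float:
    """Accessor step: approximate the q-quantile from a summary."""
    if not sketch.centroids:
        raise ValueError("empty sketch")
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    target = q * sum(w for _, w in sketch.centroids)
    running = 0.0
    for mean, weight in sketch.centroids:
        running += weight
        if running >= target:
            return mean
    return sketch.centroids[-1][0]


# Two batches summarized separately, merged later, then queried.
first_batch = aggregate(range(1, 501))
second_batch = aggregate(range(501, 1001))
combined = rollup([first_batch, second_batch])
median = quantile(combined, 0.5)  # approximate median of all 1000 values
```

Only the Sketch objects would need to be stored per time bucket; percentiles over any longer window are then computed by rolling up the stored sketches instead of re-reading the raw values.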
