I was thinking of something similar to what Micah describes.

TimescaleDB does something like this in SQL. It splits this kind of function
into three parts:

- an aggregation function that returns a structure of type tdigest

- an access function that calculates one or more percentiles from that structure

- a rollup function that combines several tdigest structures into one
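
The three-part split above can be sketched in plain Python. This is only an illustration of the pattern, under loud assumptions: the summary here is a fixed-bin histogram standing in for a real t-digest, and the names HistSketch, sketch_agg, rollup and approx_percentile are hypothetical. None of them are TimescaleDB or Arrow APIs.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class HistSketch:
    """Toy mergeable summary: fixed-width bins over [lo, hi). NOT a t-digest."""
    lo: float
    hi: float
    counts: List[int]


def sketch_agg(values: Iterable[float], lo: float = 0.0, hi: float = 100.0,
               bins: int = 100) -> HistSketch:
    """Aggregation step: consume raw values and return the summary structure."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        # Clamp out-of-range values into the first or last bin.
        i = min(max(int((v - lo) / width), 0), bins - 1)
        counts[i] += 1
    return HistSketch(lo, hi, counts)


def rollup(sketches: Iterable[HistSketch]) -> HistSketch:
    """Rollup step: combine several summaries into one."""
    sketches = list(sketches)
    first = sketches[0]
    merged = [0] * len(first.counts)
    for s in sketches:
        if (s.lo, s.hi, len(s.counts)) != (first.lo, first.hi, len(first.counts)):
            raise ValueError("sketches must share the same bin layout")
        for i, c in enumerate(s.counts):
            merged[i] += c
    return HistSketch(first.lo, first.hi, merged)


def approx_percentile(sketch: HistSketch, q: float) -> float:
    """Access step: estimate the q-quantile (0 <= q <= 1) from a summary."""
    total = sum(sketch.counts)
    if total == 0:
        raise ValueError("empty sketch")
    target = q * total
    width = (sketch.hi - sketch.lo) / len(sketch.counts)
    seen = 0
    for i, c in enumerate(sketch.counts):
        if c and seen + c >= target:
            # Interpolate linearly inside the bin that holds the target rank.
            frac = (target - seen) / c
            return sketch.lo + (i + frac) * width
        seen += c
    return sketch.hi


# Usage: summarise two batches separately, store them, and merge later.
monday = sketch_agg(range(0, 50))
tuesday = sketch_agg(range(50, 100))
week = rollup([monday, tuesday])
median_estimate = approx_percentile(week, 0.5)
```

The point of the split is that only the summaries need to be stored per time bucket; percentiles for any coarser bucket can then be derived from a rollup without going back to the raw data.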

 

They use this same scheme for several aggregation functions, such as tdigest,
hyperloglog and uddsketch. They have even moved typical SQL aggregation
functions like avg or mean to this scheme, through the stats_agg structure,
because it simplifies some things in SQL.

 

If something similar could be done with Arrow, it would be great.

 

https://github.com/timescale/timescaledb-toolkit/blob/main/docs/tdigest.md

 


From: Micah Kornfield <[email protected]>
Sent: Monday, March 28, 2022 6:58
To: [email protected]
Subject: Re: [Python] pyarrow.compute.tdigest return class

 

Having the option of keeping the t-digest separate could be useful. For
instance, Google's SQL dialect allows for tracking some sketch data structures
separately [1].

 

 

[1] https://cloud.google.com/bigquery/docs/reference/standard-sql/hll_functions

 

On Mon, Mar 21, 2022 at 7:42 PM Yibo Cai <[email protected]> wrote:

Do you mean you want to call pyarrow.compute.tdigest on different inputs over
time, and continuously merge the results into one tdigest?

 

pyarrow.compute.tdigest (a Python wrapper of the C++ kernel) is an aggregate
kernel that consumes an input array and outputs the requested quantiles. It is
not suited to returning the internal tdigest structure (and how would one make
use of the tdigest structure?).

 

The C++ tdigest utility (not the kernel) does support merging tdigests. [1]

Is it possible to use the tdigest utility directly?

 

[1] https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/tdigest.h#L79

 

Yibo

 

From: [email protected] <[email protected]>
Sent: Monday, March 21, 2022 10:06 PM
To: [email protected]
Subject: [Python] pyarrow.compute.tdigest return class

 

Hello everyone,

 

Is there any way for the pyarrow.compute.tdigest function to return a TDigest 
structure in such a way that it can be merged?

 

I have a use case where I would like to store time series percentile
distributions. The pyarrow tdigest function is very fast, but its output is
plain numbers (the quantile values), and these cannot be aggregated.
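
As a small illustration of why already-computed percentiles cannot be aggregated, here is a sketch using only Python's standard library. The two batches are made-up values; the point is only that neither a plain nor a size-weighted average of the per-batch medians recovers the median of the combined data.

```python
import statistics

# Two hypothetical batches of a time series (values are made up).
batch_a = [1, 2, 3]
batch_b = [10, 20, 30, 40, 50, 60, 70]

# Per-batch medians: this is all that is left once only the numbers are stored.
median_a = statistics.median(batch_a)
median_b = statistics.median(batch_b)

# Two tempting ways to "merge" the stored numbers.
naive = (median_a + median_b) / 2
weighted = (median_a * len(batch_a) + median_b * len(batch_b)) / (
    len(batch_a) + len(batch_b)
)

# The median of the combined raw data, which needs the raw values
# (or a mergeable summary of them).
true_median = statistics.median(batch_a + batch_b)

print(naive, weighted, true_median)
```

Both merged figures differ from the true median, which is why a mergeable intermediate structure is needed rather than the final quantile values.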

 

I have tried using TDigest (https://github.com/CamDavidsonPilon/tdigest) but it 
is very slow.

 

Thank you very much.

 

