Re: [HACKERS] PATCH: Extending the HyperLogLog API a bit
Robert Haas wrote: > On Mon, Jan 11, 2016 at 2:22 PM, Alvaro Herrera >wrote: > > Tomas Vondra wrote: > >> Attached is v2 of the patch, adding the comments. > > > > Looks pretty reasonable to me. I'm not sure we want to push this ahead > > of the bloom filter stuff, but it looks ready to commit otherwise. > > I'd say go ahead and push it. Done. -- Álvaro Herrerahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: Extending the HyperLogLog API a bit
On Tue, Jan 19, 2016 at 2:03 PM, Tomas Vondrawrote: > > FWIW I've been considering adding APPROX_COUNT_DISTINCT() aggregate, > similarly to what other databases (e.g. Vertica) have built-in. Now, that > would not require the merge too, but we're currently baking support for > 'combine' functions, and that's exactly what merge does. > > So why not just fix the bug? You can go from the sparse representation to the dense representation, so in principle you can merge two of our HLL states, if you are then satisfied with having a new representation for the merged state. I don't have time right now to do a full analysis of whether or not it's possible to just fix the bug without doing all that, but I think it might not be. I think we benefit from the simplicity of the sparse representation. So, in the absence of a clear justification for retaining mergeHyperLogLog(), ripping it out seems best. I also think that an expanded set of functionality would be required for your APPROX_COUNT_DISTINCT() patch anyway, including support for multiple representations (perhaps this isn't documented in your APPROX_COUNT_DISTINCT(), but complete implementations seem to switch from sparse to full at a certain point). So, ripping out mergeHyperLogLog() doesn't really make that effort any more difficult. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: Extending the HyperLogLog API a bit
On 01/19/2016 10:54 PM, Peter Geoghegan wrote: On Tue, Jan 19, 2016 at 9:37 AM, Alvaro Herrerawrote: Our transcript seems to predate that bugfix commit, so I assume we need to apply this to our copy too. Sadly, Hideaki-san commit message isn't very descriptive. Fortunately, the function mergeHyperLogLog() in our hyperloglog.c currently has no callers. I don't really know how HyperLogLog works, so maybe we can't or shouldn't apply the patch because of how the hash stuff is used. I think that Hideaki's confusion comes from whether or not this HLL state is a sparse or dense/full representation. The distinction is explained in the README for postgresql-hll: https://github.com/aggregateknowledge/postgresql-hll postgresql-hll has no support for merging HLLs that are sparse: https://github.com/aggregateknowledge/postgresql-hll/blob/master/hll.c#L1888 Can't we just tear mergeHyperLogLog() out? FWIW I've been considering adding APPROX_COUNT_DISTINCT() aggregate, similarly to what other databases (e.g. Vertica) have built-in. Now, that would not require the merge too, but we're currently baking support for 'combine' functions, and that's exactly what merge does. So why not just fix the bug? -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: Extending the HyperLogLog API a bit
On Tue, Jan 19, 2016 at 2:22 PM, Peter Geogheganwrote: > I don't have time right now to do a full analysis of whether or not it's > possible to just fix the bug without doing all that, but I think it > might not be. IOW: I think that Hideaki's bug fix might itself be wrong (although I might be wrong about that; no time to make sure right now). Note that Hideaki's implementation was not the only one I looked at when working on this code. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: Extending the HyperLogLog API a bit
On Tue, Jan 19, 2016 at 9:37 AM, Alvaro Herrerawrote: > Our transcript seems to predate that bugfix commit, so I assume we need > to apply this to our copy too. Sadly, Hideaki-san commit message isn't > very descriptive. Fortunately, the function mergeHyperLogLog() in our hyperloglog.c currently has no callers. > I don't really know how HyperLogLog works, so maybe we can't or > shouldn't apply the patch because of how the hash stuff is used. I think that Hideaki's confusion comes from whether or not this HLL state is a sparse or dense/full representation. The distinction is explained in the README for postgresql-hll: https://github.com/aggregateknowledge/postgresql-hll postgresql-hll has no support for merging HLLs that are sparse: https://github.com/aggregateknowledge/postgresql-hll/blob/master/hll.c#L1888 Can't we just tear mergeHyperLogLog() out? -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: Extending the HyperLogLog API a bit
While going over this patch I noticed this commit in HyperLogLog's upstream: https://github.com/hideo55/cpp-HyperLogLog/commit/b8cb5e7b856928af15e9535b4b1650f493ba453f In the first hunk which is what we care about, the author is doing bit-or of both hash values instead of taking the max value, which is what we were doing. Our transcript seems to predate that bugfix commit, so I assume we need to apply this to our copy too. Sadly, Hideaki-san commit message isn't very descriptive. I don't really know how HyperLogLog works, so maybe we can't or shouldn't apply the patch because of how the hash stuff is used. -- Álvaro Herrerahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: Extending the HyperLogLog API a bit
On Mon, Jan 11, 2016 at 2:22 PM, Alvaro Herrerawrote: > Tomas Vondra wrote: >> Attached is v2 of the patch, adding the comments. > > Looks pretty reasonable to me. I'm not sure we want to push this ahead > of the bloom filter stuff, but it looks ready to commit otherwise. I'd say go ahead and push it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: Extending the HyperLogLog API a bit
Tomas Vondra wrote: > Attached is v2 of the patch, adding the comments. Looks pretty reasonable to me. I'm not sure we want to push this ahead of the bloom filter stuff, but it looks ready to commit otherwise. -- Álvaro Herrerahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: Extending the HyperLogLog API a bit
On 01/01/2016 12:08 AM, Peter Geoghegan wrote: On Thu, Dec 31, 2015 at 12:48 PM, Tomas Vondrawrote: 1) initHyperLogLogError(hyperLogLogState *cState, double error) Instead of specifying bwidth (essentially the number of bits used for addressing in the counter), this allows specifying the expected error rate for the counter, which is error_rate = 1.04 / sqrt(2^bwidth) So for 5% we get bwidth=5, and so on. This makes the API a bit easier the use, because there are pretty much no comments about the meaning of bwidth, and the existing callers simply use 10 without details. Fair, but you didn't add any better comments! 2) freeHyperLogLog(hyperLogLogState *cState) I think it's a good idea to provide function "undoing" what init does, i.e. freeing the internal memory etc. Currently that's trivial to do, but perhaps we'll make the structure more complicated in the future (albeit that might be unlikely). Seems reasonable. Attached is v2 of the patch, adding the comments. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services >From 6cf109c466b60ecf62c27ce04757da51a9726729 Mon Sep 17 00:00:00 2001 From: Tomas Vondra Date: Sat, 9 Jan 2016 20:45:03 +0100 Subject: [PATCH] extend HyperLogLog API Extends the HyperLogLog API with two new functions. One allows specifying desired error rate when initializing the counter, the other one frees the internal counter state. --- src/backend/lib/hyperloglog.c | 45 +++ src/include/lib/hyperloglog.h | 2 ++ 2 files changed, 47 insertions(+) diff --git a/src/backend/lib/hyperloglog.c b/src/backend/lib/hyperloglog.c index 094bc09..2dca070 100644 --- a/src/backend/lib/hyperloglog.c +++ b/src/backend/lib/hyperloglog.c @@ -108,6 +108,51 @@ initHyperLogLog(hyperLogLogState *cState, uint8 bwidth) } /* + * Initialize HyperLogLog track state + * + * Instead of specifying bwidth (number of bits used for addressing the + * register), this method allows sizing the counter for particular error + * rate using a simple formula from the paper: + * + * e = 1.04 / sqrt(m) + * + * where 'm' is the number of registers, i.e. (2^bwidth). The method + * finds the lowest bwidth with 'e' below the requested error rate, and + * the uses it to initialize the counter. + * + * As bwidth has to be between 4 and 16, the worst error possible rate + * is ~25% (bwidth=4) and 0.4% (bwidth=16). + */ +void +initHyperLogLogError(hyperLogLogState *cState, double error) +{ + uint8 bwidth = 4; + + while (bwidth < 16) + { + double m = (Size)1 << bwidth; + if (1.04 / sqrt(m) < error) + break; + bwidth++; + } + + initHyperLogLog(cState, bwidth); +} + +/* + * Free HyperLogLog track state + * + * Releases the internal state, but not the state itself (in case it's + * not allocated by palloc). + */ +void +freeHyperLogLog(hyperLogLogState *cState) +{ + Assert(cState->hashesArr != NULL); + pfree(cState->hashesArr); +} + +/* * Adds element to the estimator, from caller-supplied hash. * * It is critical that the hash value passed be an actual hash value, typically diff --git a/src/include/lib/hyperloglog.h b/src/include/lib/hyperloglog.h index 5a1d4d3..b999b30 100644 --- a/src/include/lib/hyperloglog.h +++ b/src/include/lib/hyperloglog.h @@ -60,8 +60,10 @@ typedef struct hyperLogLogState } hyperLogLogState; extern void initHyperLogLog(hyperLogLogState *cState, uint8 bwidth); +extern void initHyperLogLogError(hyperLogLogState *cState, double error); extern void addHyperLogLog(hyperLogLogState *cState, uint32 hash); extern double estimateHyperLogLog(hyperLogLogState *cState); extern void mergeHyperLogLog(hyperLogLogState *cState, const hyperLogLogState *oState); +extern void freeHyperLogLog(hyperLogLogState *cState); #endif /* HYPERLOGLOG_H */ -- 2.1.0 -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: Extending the HyperLogLog API a bit
Meh, of course I forgot to actually attach the patch. -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services >From 8f207693faa65e65e8a1e5e894c2ad96ad1f3cea Mon Sep 17 00:00:00 2001 From: Tomas VondraDate: Mon, 28 Dec 2015 14:20:17 +0100 Subject: [PATCH 1/2] extend the HyperLogLog API a bit by adding two more methods - initHyperLogLogError (initialize the counter for error rate) - freeHyperLogLog (release the memory allocated for counter state) --- src/backend/lib/hyperloglog.c | 29 + src/include/lib/hyperloglog.h | 2 ++ 2 files changed, 31 insertions(+) diff --git a/src/backend/lib/hyperloglog.c b/src/backend/lib/hyperloglog.c index 718afb8..2949a8d 100644 --- a/src/backend/lib/hyperloglog.c +++ b/src/backend/lib/hyperloglog.c @@ -108,6 +108,35 @@ initHyperLogLog(hyperLogLogState *cState, uint8 bwidth) } /* + * Initialize HyperLogLog track state + */ +void +initHyperLogLogError(hyperLogLogState *cState, double error) +{ + uint8 bwidth = 4; + + while (bwidth < 16) + { + double m = (Size)1 << bwidth; + if (1.04 / sqrt(m) < error) + break; + bwidth++; + } + + initHyperLogLog(cState, bwidth); +} + +/* + * Free HyperLogLog track state + */ +void +freeHyperLogLog(hyperLogLogState *cState) +{ + Assert(cState->hashesArr != NULL); + pfree(cState->hashesArr); +} + +/* * Adds element to the estimator, from caller-supplied hash. * * It is critical that the hash value passed be an actual hash value, typically diff --git a/src/include/lib/hyperloglog.h b/src/include/lib/hyperloglog.h index fd8280c..004490a 100644 --- a/src/include/lib/hyperloglog.h +++ b/src/include/lib/hyperloglog.h @@ -60,8 +60,10 @@ typedef struct hyperLogLogState } hyperLogLogState; extern void initHyperLogLog(hyperLogLogState *cState, uint8 bwidth); +extern void initHyperLogLogError(hyperLogLogState *cState, double error); extern void addHyperLogLog(hyperLogLogState *cState, uint32 hash); extern double estimateHyperLogLog(hyperLogLogState *cState); extern void mergeHyperLogLog(hyperLogLogState *cState, const hyperLogLogState *oState); +extern void freeHyperLogLog(hyperLogLogState *cState); #endif /* HYPERLOGLOG_H */ -- 2.1.0 -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: Extending the HyperLogLog API a bit
On Thu, Dec 31, 2015 at 12:48 PM, Tomas Vondrawrote: > 1) initHyperLogLogError(hyperLogLogState *cState, double error) > >Instead of specifying bwidth (essentially the number of bits used >for addressing in the counter), this allows specifying the expected >error rate for the counter, which is > > error_rate = 1.04 / sqrt(2^bwidth) > >So for 5% we get bwidth=5, and so on. This makes the API a bit easier >the use, because there are pretty much no comments about the meaning >of bwidth, and the existing callers simply use 10 without details. Fair, but you didn't add any better comments! > 2) freeHyperLogLog(hyperLogLogState *cState) > >I think it's a good idea to provide function "undoing" what init >does, i.e. freeing the internal memory etc. Currently that's trivial >to do, but perhaps we'll make the structure more complicated in the >future (albeit that might be unlikely). Seems reasonable. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers