[jira] [Commented] (PHOENIX-418) Support approximate COUNT DISTINCT
[ https://issues.apache.org/jira/browse/PHOENIX-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16148425#comment-16148425 ] Lars Hofhansl commented on PHOENIX-418: --- You guys kick a**. :) > Support approximate COUNT DISTINCT > -- > > Key: PHOENIX-418 > URL: https://issues.apache.org/jira/browse/PHOENIX-418 > Project: Phoenix > Issue Type: New Feature >Reporter: James Taylor >Assignee: Ethan Wang >Priority: Blocker > Labels: gsoc2016 > Fix For: 4.12.0 > > Attachments: PHOENIX-418-v1.patch, PHOENIX-418-v2.patch, > PHOENIX-418-v3.patch, PHOENIX-418-v4.patch, PHOENIX-418-v5.patch, > PHOENIX-418-v6.patch, PHOENIX-418-v6-pom.patch > > > Support an "approximation" of count distinct to prevent having to hold on to > all distinct values (since this will not scale well when the number of > distinct values is huge). The Apache Drill folks have had some interesting > discussions on this > [here](http://mail-archives.apache.org/mod_mbox/incubator-drill-dev/201306.mbox/%3CJIRA.12650169.1369931282407.88049.1370645900553%40arcas%3E). > They recommend using [Welford's > method](http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance_Online_algorithm). > I'm open to having a config option that uses exact versus approximate. I > don't have experience implementing an approximate implementation, so I'm not > sure how much state is required to keep on the server and return to the > client (other than realizing it'd be much less that returning all distinct > values and their counts). > Update: > Syntax of using approximate count distinct as: > select APPROX_COUNT_DISTINCT(name) from person > select APPROX_COUNT_DISTINCT(address||name) from person > It is equivalent of Select COUNT(DISTINCT ID) from person. Implemented using > hyperloglog, see discuss below. > Source code patch link below, co-authorred with [~swapna] > https://git-wip-us.apache.org/repos/asf?p=phoenix.git;a=commitdiff;h=d6381afc3af976ccdbb874d4458ea17b1e8a1d32 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (PHOENIX-418) Support approximate COUNT DISTINCT
[ https://issues.apache.org/jira/browse/PHOENIX-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16148175#comment-16148175 ] Hadoop QA commented on PHOENIX-418: --- {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12884517/PHOENIX-418-v6-pom.patch against master branch at commit e6c1f01c5f7d5c9996017714efe90202a95355b2. ATTACHMENT ID: 12884517 {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+0 tests included{color}. The patch appears to be a documentation, build, or dev patch that doesn't require tests. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:red}-1 javadoc{color}. The javadoc tool appears to have generated 62 warning messages. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 lineLengths{color}. The patch does not introduce lines longer than 100 {color:red}-1 core tests{color}. The patch failed these unit tests: ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.index.MutableIndexFailureIT ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.GroupByIT ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.TableSnapshotReadsMapReduceIT Test results: https://builds.apache.org/job/PreCommit-PHOENIX-Build/1325//testReport/ Javadoc warnings: https://builds.apache.org/job/PreCommit-PHOENIX-Build/1325//artifact/patchprocess/patchJavadocWarnings.txt Console output: https://builds.apache.org/job/PreCommit-PHOENIX-Build/1325//console This message is automatically generated. > Support approximate COUNT DISTINCT > -- > > Key: PHOENIX-418 > URL: https://issues.apache.org/jira/browse/PHOENIX-418 > Project: Phoenix > Issue Type: New Feature >Reporter: James Taylor >Assignee: Ethan Wang >Priority: Blocker > Labels: gsoc2016 > Fix For: 4.12.0 > > Attachments: PHOENIX-418-v1.patch, PHOENIX-418-v2.patch, > PHOENIX-418-v3.patch, PHOENIX-418-v4.patch, PHOENIX-418-v5.patch, > PHOENIX-418-v6.patch, PHOENIX-418-v6-pom.patch > > > Support an "approximation" of count distinct to prevent having to hold on to > all distinct values (since this will not scale well when the number of > distinct values is huge). The Apache Drill folks have had some interesting > discussions on this > [here](http://mail-archives.apache.org/mod_mbox/incubator-drill-dev/201306.mbox/%3CJIRA.12650169.1369931282407.88049.1370645900553%40arcas%3E). > They recommend using [Welford's > method](http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance_Online_algorithm). > I'm open to having a config option that uses exact versus approximate. I > don't have experience implementing an approximate implementation, so I'm not > sure how much state is required to keep on the server and return to the > client (other than realizing it'd be much less that returning all distinct > values and their counts). > Update: > Syntax of using approximate count distinct as: > select APPROX_COUNT_DISTINCT(name) from person > select APPROX_COUNT_DISTINCT(address||name) from person > It is equivalent of Select COUNT(DISTINCT ID) from person. Implemented using > hyperloglog, see discuss below. > Source code patch link below, co-authorred with [~swapna] > https://git-wip-us.apache.org/repos/asf?p=phoenix.git;a=commitdiff;h=d6381afc3af976ccdbb874d4458ea17b1e8a1d32 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (PHOENIX-418) Support approximate COUNT DISTINCT
[ https://issues.apache.org/jira/browse/PHOENIX-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16147867#comment-16147867 ] James Taylor commented on PHOENIX-418: -- +1. Thanks for the quick turnaround, [~aertoria] and for the help [~mujtabachohan]. > Support approximate COUNT DISTINCT > -- > > Key: PHOENIX-418 > URL: https://issues.apache.org/jira/browse/PHOENIX-418 > Project: Phoenix > Issue Type: New Feature >Reporter: James Taylor >Assignee: Ethan Wang >Priority: Blocker > Labels: gsoc2016 > Fix For: 4.12.0 > > Attachments: PHOENIX-418-v1.patch, PHOENIX-418-v2.patch, > PHOENIX-418-v3.patch, PHOENIX-418-v4.patch, PHOENIX-418-v5.patch, > PHOENIX-418-v6.patch, PHOENIX-418-v6-pom.patch > > > Support an "approximation" of count distinct to prevent having to hold on to > all distinct values (since this will not scale well when the number of > distinct values is huge). The Apache Drill folks have had some interesting > discussions on this > [here](http://mail-archives.apache.org/mod_mbox/incubator-drill-dev/201306.mbox/%3CJIRA.12650169.1369931282407.88049.1370645900553%40arcas%3E). > They recommend using [Welford's > method](http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance_Online_algorithm). > I'm open to having a config option that uses exact versus approximate. I > don't have experience implementing an approximate implementation, so I'm not > sure how much state is required to keep on the server and return to the > client (other than realizing it'd be much less that returning all distinct > values and their counts). > Update: > Syntax of using approximate count distinct as: > select APPROX_COUNT_DISTINCT(name) from person > select APPROX_COUNT_DISTINCT(address||name) from person > It is equivalent of Select COUNT(DISTINCT ID) from person. Implemented using > hyperloglog, see discuss below. > Source code patch link below, co-authorred with [~swapna] > https://git-wip-us.apache.org/repos/asf?p=phoenix.git;a=commitdiff;h=d6381afc3af976ccdbb874d4458ea17b1e8a1d32 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (PHOENIX-418) Support approximate COUNT DISTINCT
[ https://issues.apache.org/jira/browse/PHOENIX-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16147721#comment-16147721 ] James Taylor commented on PHOENIX-418: -- By updating the assembly of the phoenix server jar in phoenix-assembly, you can get the clearspring classes bundled in the jar so that copy isn't necessary. > Support approximate COUNT DISTINCT > -- > > Key: PHOENIX-418 > URL: https://issues.apache.org/jira/browse/PHOENIX-418 > Project: Phoenix > Issue Type: New Feature >Reporter: James Taylor >Assignee: Ethan Wang >Priority: Blocker > Labels: gsoc2016 > Fix For: 4.12.0 > > Attachments: PHOENIX-418-v1.patch, PHOENIX-418-v2.patch, > PHOENIX-418-v3.patch, PHOENIX-418-v4.patch, PHOENIX-418-v5.patch, > PHOENIX-418-v6.patch > > > Support an "approximation" of count distinct to prevent having to hold on to > all distinct values (since this will not scale well when the number of > distinct values is huge). The Apache Drill folks have had some interesting > discussions on this > [here](http://mail-archives.apache.org/mod_mbox/incubator-drill-dev/201306.mbox/%3CJIRA.12650169.1369931282407.88049.1370645900553%40arcas%3E). > They recommend using [Welford's > method](http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance_Online_algorithm). > I'm open to having a config option that uses exact versus approximate. I > don't have experience implementing an approximate implementation, so I'm not > sure how much state is required to keep on the server and return to the > client (other than realizing it'd be much less that returning all distinct > values and their counts). > Update: > Syntax of using approximate count distinct as: > select APPROX_COUNT_DISTINCT(name) from person > select APPROX_COUNT_DISTINCT(address||name) from person > It is equivalent of Select COUNT(DISTINCT ID) from person. Implemented using > hyperloglog, see discuss below. > Source code patch link below, co-authorred with [~swapna] > https://git-wip-us.apache.org/repos/asf?p=phoenix.git;a=commitdiff;h=d6381afc3af976ccdbb874d4458ea17b1e8a1d32 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (PHOENIX-418) Support approximate COUNT DISTINCT
[ https://issues.apache.org/jira/browse/PHOENIX-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16147704#comment-16147704 ] Ethan Wang commented on PHOENIX-418: I see. I forgot to mention that, for me the local test to work, I have to copy both the compiled phoenix-server.jar as well as a stream.jar to hbase lib folder. I will go prepare another patch > Support approximate COUNT DISTINCT > -- > > Key: PHOENIX-418 > URL: https://issues.apache.org/jira/browse/PHOENIX-418 > Project: Phoenix > Issue Type: New Feature >Reporter: James Taylor >Assignee: Ethan Wang >Priority: Blocker > Labels: gsoc2016 > Fix For: 4.12.0 > > Attachments: PHOENIX-418-v1.patch, PHOENIX-418-v2.patch, > PHOENIX-418-v3.patch, PHOENIX-418-v4.patch, PHOENIX-418-v5.patch, > PHOENIX-418-v6.patch > > > Support an "approximation" of count distinct to prevent having to hold on to > all distinct values (since this will not scale well when the number of > distinct values is huge). The Apache Drill folks have had some interesting > discussions on this > [here](http://mail-archives.apache.org/mod_mbox/incubator-drill-dev/201306.mbox/%3CJIRA.12650169.1369931282407.88049.1370645900553%40arcas%3E). > They recommend using [Welford's > method](http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance_Online_algorithm). > I'm open to having a config option that uses exact versus approximate. I > don't have experience implementing an approximate implementation, so I'm not > sure how much state is required to keep on the server and return to the > client (other than realizing it'd be much less that returning all distinct > values and their counts). > Update: > Syntax of using approximate count distinct as: > select APPROX_COUNT_DISTINCT(name) from person > select APPROX_COUNT_DISTINCT(address||name) from person > It is equivalent of Select COUNT(DISTINCT ID) from person. Implemented using > hyperloglog, see discuss below. > Source code patch link below, co-authorred with [~swapna] > https://git-wip-us.apache.org/repos/asf?p=phoenix.git;a=commitdiff;h=d6381afc3af976ccdbb874d4458ea17b1e8a1d32 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (PHOENIX-418) Support approximate COUNT DISTINCT
[ https://issues.apache.org/jira/browse/PHOENIX-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16147690#comment-16147690 ] James Taylor commented on PHOENIX-418: -- [~aertoria] - please see comment above. Need to get the clearspring library into the server jar. This is controlled by phoenix-assembly. [~mujtabachohan], [~samarthjain], or [~tdsilva] can help if you run into any issues. > Support approximate COUNT DISTINCT > -- > > Key: PHOENIX-418 > URL: https://issues.apache.org/jira/browse/PHOENIX-418 > Project: Phoenix > Issue Type: New Feature >Reporter: James Taylor >Assignee: Ethan Wang >Priority: Blocker > Labels: gsoc2016 > Fix For: 4.12.0 > > Attachments: PHOENIX-418-v1.patch, PHOENIX-418-v2.patch, > PHOENIX-418-v3.patch, PHOENIX-418-v4.patch, PHOENIX-418-v5.patch, > PHOENIX-418-v6.patch > > > Support an "approximation" of count distinct to prevent having to hold on to > all distinct values (since this will not scale well when the number of > distinct values is huge). The Apache Drill folks have had some interesting > discussions on this > [here](http://mail-archives.apache.org/mod_mbox/incubator-drill-dev/201306.mbox/%3CJIRA.12650169.1369931282407.88049.1370645900553%40arcas%3E). > They recommend using [Welford's > method](http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance_Online_algorithm). > I'm open to having a config option that uses exact versus approximate. I > don't have experience implementing an approximate implementation, so I'm not > sure how much state is required to keep on the server and return to the > client (other than realizing it'd be much less that returning all distinct > values and their counts). > Update: > Syntax of using approximate count distinct as: > select APPROX_COUNT_DISTINCT(name) from person > select APPROX_COUNT_DISTINCT(address||name) from person > It is equivalent of Select COUNT(DISTINCT ID) from person. Implemented using > hyperloglog, see discuss below. > Source code patch link below, co-authorred with [~swapna] > https://git-wip-us.apache.org/repos/asf?p=phoenix.git;a=commitdiff;h=d6381afc3af976ccdbb874d4458ea17b1e8a1d32 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (PHOENIX-418) Support approximate COUNT DISTINCT
[ https://issues.apache.org/jira/browse/PHOENIX-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16146707#comment-16146707 ] Lars Hofhansl commented on PHOENIX-418: --- Shouldn't the clearspring library be part of the server jar? I just did a clean build from scratch, and indeed APPROX_COUNT_DISTINCT fails with {code} ... Caused by: java.lang.NoClassDefFoundError: com/clearspring/analytics/stream/cardinality/HyperLogLogPlus ... {code} thrown at the server. > Support approximate COUNT DISTINCT > -- > > Key: PHOENIX-418 > URL: https://issues.apache.org/jira/browse/PHOENIX-418 > Project: Phoenix > Issue Type: Task >Reporter: James Taylor >Assignee: Ethan Wang > Labels: gsoc2016 > Attachments: PHOENIX-418-v1.patch, PHOENIX-418-v2.patch, > PHOENIX-418-v3.patch, PHOENIX-418-v4.patch, PHOENIX-418-v5.patch, > PHOENIX-418-v6.patch > > > Support an "approximation" of count distinct to prevent having to hold on to > all distinct values (since this will not scale well when the number of > distinct values is huge). The Apache Drill folks have had some interesting > discussions on this > [here](http://mail-archives.apache.org/mod_mbox/incubator-drill-dev/201306.mbox/%3CJIRA.12650169.1369931282407.88049.1370645900553%40arcas%3E). > They recommend using [Welford's > method](http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance_Online_algorithm). > I'm open to having a config option that uses exact versus approximate. I > don't have experience implementing an approximate implementation, so I'm not > sure how much state is required to keep on the server and return to the > client (other than realizing it'd be much less that returning all distinct > values and their counts). > Update: > Syntax of using approximate count distinct as: > select APPROX_COUNT_DISTINCT(name) from person > select APPROX_COUNT_DISTINCT(address||name) from person > It is equivalent of Select COUNT(DISTINCT ID) from person. Implemented using > hyperloglog, see discuss below. > Source code patch link below, co-authorred with [~swapna] > https://git-wip-us.apache.org/repos/asf?p=phoenix.git;a=commitdiff;h=d6381afc3af976ccdbb874d4458ea17b1e8a1d32 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (PHOENIX-418) Support approximate COUNT DISTINCT
[ https://issues.apache.org/jira/browse/PHOENIX-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16144645#comment-16144645 ] Hudson commented on PHOENIX-418: FAILURE: Integrated in Jenkins build Phoenix-master #1753 (See [https://builds.apache.org/job/Phoenix-master/1753/]) PHOENIX-418 Support approximate COUNT DISTINCT (Ethan Wang) (jtaylor: rev d6381afc3af976ccdbb874d4458ea17b1e8a1d32) * (add) phoenix-core/src/main/java/org/apache/phoenix/expression/function/DistinctCountHyperLogLogAggregateFunction.java * (add) phoenix-core/src/it/java/org/apache/phoenix/end2end/CountDistinctApproximateHyperLogLogIT.java * (edit) dev/release_files/NOTICE * (edit) phoenix-core/pom.xml * (edit) phoenix-core/src/main/java/org/apache/phoenix/expression/ExpressionType.java * (add) phoenix-core/src/main/java/org/apache/phoenix/parse/DistinctCountHyperLogLogAggregateParseNode.java > Support approximate COUNT DISTINCT > -- > > Key: PHOENIX-418 > URL: https://issues.apache.org/jira/browse/PHOENIX-418 > Project: Phoenix > Issue Type: Task >Reporter: James Taylor >Assignee: Ethan Wang > Labels: gsoc2016 > Attachments: PHOENIX-418-v1.patch, PHOENIX-418-v2.patch, > PHOENIX-418-v3.patch, PHOENIX-418-v4.patch, PHOENIX-418-v5.patch, > PHOENIX-418-v6.patch > > > Support an "approximation" of count distinct to prevent having to hold on to > all distinct values (since this will not scale well when the number of > distinct values is huge). The Apache Drill folks have had some interesting > discussions on this > [here](http://mail-archives.apache.org/mod_mbox/incubator-drill-dev/201306.mbox/%3CJIRA.12650169.1369931282407.88049.1370645900553%40arcas%3E). > They recommend using [Welford's > method](http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance_Online_algorithm). > I'm open to having a config option that uses exact versus approximate. I > don't have experience implementing an approximate implementation, so I'm not > sure how much state is required to keep on the server and return to the > client (other than realizing it'd be much less that returning all distinct > values and their counts). > Update: > Syntax of using approximate count distinct as: > select APPROX_COUNT_DISTINCT(name) from person > select APPROX_COUNT_DISTINCT(address||name) from person > It is equivalent of Select COUNT(DISTINCT ID) from person. Implemented using > hyperloglog, see discuss below. > Merged patch link below, co-authorred with [~swapna] > https://git-wip-us.apache.org/repos/asf?p=phoenix.git;a=commitdiff;h=d6381afc3af976ccdbb874d4458ea17b1e8a1d32 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (PHOENIX-418) Support approximate COUNT DISTINCT
[ https://issues.apache.org/jira/browse/PHOENIX-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16144482#comment-16144482 ] James Taylor commented on PHOENIX-418: -- Do you know what the difference are, [~mujtabachohan] between "ubuntu-eu2" versus "H4"? Different JVMs? > Support approximate COUNT DISTINCT > -- > > Key: PHOENIX-418 > URL: https://issues.apache.org/jira/browse/PHOENIX-418 > Project: Phoenix > Issue Type: Task >Reporter: James Taylor >Assignee: Ethan Wang > Labels: gsoc2016 > Attachments: PHOENIX-418-v1.patch, PHOENIX-418-v2.patch, > PHOENIX-418-v3.patch, PHOENIX-418-v4.patch, PHOENIX-418-v5.patch, > PHOENIX-418-v6.patch > > > Support an "approximation" of count distinct to prevent having to hold on to > all distinct values (since this will not scale well when the number of > distinct values is huge). The Apache Drill folks have had some interesting > discussions on this > [here](http://mail-archives.apache.org/mod_mbox/incubator-drill-dev/201306.mbox/%3CJIRA.12650169.1369931282407.88049.1370645900553%40arcas%3E). > They recommend using [Welford's > method](http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance_Online_algorithm). > I'm open to having a config option that uses exact versus approximate. I > don't have experience implementing an approximate implementation, so I'm not > sure how much state is required to keep on the server and return to the > client (other than realizing it'd be much less that returning all distinct > values and their counts). -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (PHOENIX-418) Support approximate COUNT DISTINCT
[ https://issues.apache.org/jira/browse/PHOENIX-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16144459#comment-16144459 ] James Taylor commented on PHOENIX-418: -- Thanks, [~aertoria]. I've pushed your last patch to 4.x and master branches. Nice work! > Support approximate COUNT DISTINCT > -- > > Key: PHOENIX-418 > URL: https://issues.apache.org/jira/browse/PHOENIX-418 > Project: Phoenix > Issue Type: Task >Reporter: James Taylor >Assignee: Ethan Wang > Labels: gsoc2016 > Attachments: PHOENIX-418-v1.patch, PHOENIX-418-v2.patch, > PHOENIX-418-v3.patch, PHOENIX-418-v4.patch, PHOENIX-418-v5.patch, > PHOENIX-418-v6.patch > > > Support an "approximation" of count distinct to prevent having to hold on to > all distinct values (since this will not scale well when the number of > distinct values is huge). The Apache Drill folks have had some interesting > discussions on this > [here](http://mail-archives.apache.org/mod_mbox/incubator-drill-dev/201306.mbox/%3CJIRA.12650169.1369931282407.88049.1370645900553%40arcas%3E). > They recommend using [Welford's > method](http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance_Online_algorithm). > I'm open to having a config option that uses exact versus approximate. I > don't have experience implementing an approximate implementation, so I'm not > sure how much state is required to keep on the server and return to the > client (other than realizing it'd be much less that returning all distinct > values and their counts). -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (PHOENIX-418) Support approximate COUNT DISTINCT
[ https://issues.apache.org/jira/browse/PHOENIX-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16144380#comment-16144380 ] Ethan Wang commented on PHOENIX-418: Changes has been made to pom for using 2.9.5. The flaky IT tests went away with this batch (v6). [~jamestaylor] During the investigation, one thing observed is that both success runs happens when the Jenkins node "ubuntu-eu2" pick it up (versus "H4" during the weekend). I'm not sure if this is related. [~samarthjain] > Support approximate COUNT DISTINCT > -- > > Key: PHOENIX-418 > URL: https://issues.apache.org/jira/browse/PHOENIX-418 > Project: Phoenix > Issue Type: Task >Reporter: James Taylor >Assignee: Ethan Wang > Labels: gsoc2016 > Attachments: PHOENIX-418-v1.patch, PHOENIX-418-v2.patch, > PHOENIX-418-v3.patch, PHOENIX-418-v4.patch, PHOENIX-418-v5.patch, > PHOENIX-418-v6.patch > > > Support an "approximation" of count distinct to prevent having to hold on to > all distinct values (since this will not scale well when the number of > distinct values is huge). The Apache Drill folks have had some interesting > discussions on this > [here](http://mail-archives.apache.org/mod_mbox/incubator-drill-dev/201306.mbox/%3CJIRA.12650169.1369931282407.88049.1370645900553%40arcas%3E). > They recommend using [Welford's > method](http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance_Online_algorithm). > I'm open to having a config option that uses exact versus approximate. I > don't have experience implementing an approximate implementation, so I'm not > sure how much state is required to keep on the server and return to the > client (other than realizing it'd be much less that returning all distinct > values and their counts). -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (PHOENIX-418) Support approximate COUNT DISTINCT
[ https://issues.apache.org/jira/browse/PHOENIX-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16144378#comment-16144378 ] Hadoop QA commented on PHOENIX-418: --- {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12884107/PHOENIX-418-v6.patch against master branch at commit 435441ea8ba336e1967b03cf84f1868c5ef14790. ATTACHMENT ID: 12884107 {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:red}-1 javadoc{color}. The javadoc tool appears to have generated 56 warning messages. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 lineLengths{color}. The patch introduces the following lines longer than 100: + String query = "SELECT APPROX_COUNT_DISTINCT(a.i1||a.i2||b.i2) FROM " + tableName + " a, " + tableName + final private void prepareTableWithValues(final Connection conn, final int nRows) throws Exception { + final PreparedStatement stmt = conn.prepareStatement("upsert into " + tableName + " VALUES (?, ?)"); +@BuiltInFunction(name=DistinctCountHyperLogLogAggregateFunction.NAME, nodeClass=DistinctCountHyperLogLogAggregateParseNode.class, args= {@Argument()} ) +public DistinctCountHyperLogLogAggregateFunction(List childExpressions, CountAggregateFunction delegate){ + private HyperLogLogPlus hll = new HyperLogLogPlus(DistinctCountHyperLogLogAggregateFunction.NormalSetPrecision, DistinctCountHyperLogLogAggregateFunction.SparseSetPrecision); + private HyperLogLogPlus hll = new HyperLogLogPlus(DistinctCountHyperLogLogAggregateFunction.NormalSetPrecision, DistinctCountHyperLogLogAggregateFunction.SparseSetPrecision); + protected final ImmutableBytesWritable valueByteArray = new ImmutableBytesWritable(ByteUtil.EMPTY_BYTE_ARRAY); +public DistinctCountHyperLogLogAggregateParseNode(String name, List children, BuiltInFunctionInfo info) { +public FunctionExpression create(List children, StatementContext context) throws SQLException { {color:red}-1 core tests{color}. The patch failed these unit tests: ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.QueryIT Test results: https://builds.apache.org/job/PreCommit-PHOENIX-Build/1314//testReport/ Javadoc warnings: https://builds.apache.org/job/PreCommit-PHOENIX-Build/1314//artifact/patchprocess/patchJavadocWarnings.txt Console output: https://builds.apache.org/job/PreCommit-PHOENIX-Build/1314//console This message is automatically generated. > Support approximate COUNT DISTINCT > -- > > Key: PHOENIX-418 > URL: https://issues.apache.org/jira/browse/PHOENIX-418 > Project: Phoenix > Issue Type: Task >Reporter: James Taylor >Assignee: Ethan Wang > Labels: gsoc2016 > Attachments: PHOENIX-418-v1.patch, PHOENIX-418-v2.patch, > PHOENIX-418-v3.patch, PHOENIX-418-v4.patch, PHOENIX-418-v5.patch, > PHOENIX-418-v6.patch > > > Support an "approximation" of count distinct to prevent having to hold on to > all distinct values (since this will not scale well when the number of > distinct values is huge). The Apache Drill folks have had some interesting > discussions on this > [here](http://mail-archives.apache.org/mod_mbox/incubator-drill-dev/201306.mbox/%3CJIRA.12650169.1369931282407.88049.1370645900553%40arcas%3E). > They recommend using [Welford's > method](http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance_Online_algorithm). > I'm open to having a config option that uses exact versus approximate. I > don't have experience implementing an approximate implementation, so I'm not > sure how much state is required to keep on the server and return to the > client (other than realizing it'd be much less that returning all distinct > values and their counts). -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (PHOENIX-418) Support approximate COUNT DISTINCT
[ https://issues.apache.org/jira/browse/PHOENIX-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16144178#comment-16144178 ] James Taylor commented on PHOENIX-418: -- [~aertoria]. Thanks for the updated patch. Please update the phoenix-core/pom.xml to use the 2.9.5 version of com.clearspring.analytics:stream instead of 2.7.0 if there's no reason not to go with the latest. Any idea why the test runs are so flaky? I'm going to try running them locally too, but it seems like more than the usual suspects failing. > Support approximate COUNT DISTINCT > -- > > Key: PHOENIX-418 > URL: https://issues.apache.org/jira/browse/PHOENIX-418 > Project: Phoenix > Issue Type: Task >Reporter: James Taylor >Assignee: Ethan Wang > Labels: gsoc2016 > Attachments: PHOENIX-418-v1.patch, PHOENIX-418-v2.patch, > PHOENIX-418-v3.patch, PHOENIX-418-v4.patch, PHOENIX-418-v5.patch > > > Support an "approximation" of count distinct to prevent having to hold on to > all distinct values (since this will not scale well when the number of > distinct values is huge). The Apache Drill folks have had some interesting > discussions on this > [here](http://mail-archives.apache.org/mod_mbox/incubator-drill-dev/201306.mbox/%3CJIRA.12650169.1369931282407.88049.1370645900553%40arcas%3E). > They recommend using [Welford's > method](http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance_Online_algorithm). > I'm open to having a config option that uses exact versus approximate. I > don't have experience implementing an approximate implementation, so I'm not > sure how much state is required to keep on the server and return to the > client (other than realizing it'd be much less that returning all distinct > values and their counts). -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (PHOENIX-418) Support approximate COUNT DISTINCT
[ https://issues.apache.org/jira/browse/PHOENIX-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16143482#comment-16143482 ] Hadoop QA commented on PHOENIX-418: --- {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12883988/PHOENIX-418-v5.patch against master branch at commit 435441ea8ba336e1967b03cf84f1868c5ef14790. ATTACHMENT ID: 12883988 {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:red}-1 javadoc{color}. The javadoc tool appears to have generated 56 warning messages. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 lineLengths{color}. The patch introduces the following lines longer than 100: + String query = "SELECT APPROX_COUNT_DISTINCT(a.i1||a.i2||b.i2) FROM " + tableName + " a, " + tableName + final private void prepareTableWithValues(final Connection conn, final int nRows) throws Exception { + final PreparedStatement stmt = conn.prepareStatement("upsert into " + tableName + " VALUES (?, ?)"); +@BuiltInFunction(name=DistinctCountHyperLogLogAggregateFunction.NAME, nodeClass=DistinctCountHyperLogLogAggregateParseNode.class, args= {@Argument()} ) +public DistinctCountHyperLogLogAggregateFunction(List childExpressions, CountAggregateFunction delegate){ + private HyperLogLogPlus hll = new HyperLogLogPlus(DistinctCountHyperLogLogAggregateFunction.NormalSetPrecision, DistinctCountHyperLogLogAggregateFunction.SparseSetPrecision); + private HyperLogLogPlus hll = new HyperLogLogPlus(DistinctCountHyperLogLogAggregateFunction.NormalSetPrecision, DistinctCountHyperLogLogAggregateFunction.SparseSetPrecision); + protected final ImmutableBytesWritable valueByteArray = new ImmutableBytesWritable(ByteUtil.EMPTY_BYTE_ARRAY); +public DistinctCountHyperLogLogAggregateParseNode(String name, List children, BuiltInFunctionInfo info) { +public FunctionExpression create(List children, StatementContext context) throws SQLException { {color:red}-1 core tests{color}. The patch failed these unit tests: ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.salted.SaltedTableVarLengthRowKeyIT ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.DerivedTableIT ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.SequenceIT ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.CustomEntityDataIT ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.TransactionalViewIT ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.index.MutableIndexReplicationIT ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.CreateTableIT ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.OrderByIT ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.FirstValuesFunctionIT ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.StatementHintsIT ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.RowValueConstructorIT ./phoenix-core/target/failsafe-reports/TEST-org.apache.hadoop.hbase.regionserver.wal.WALRecoveryRegionPostOpenIT ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.PowerFunctionEnd2EndIT ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.InListIT ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.IsNullIT ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.index.txn.RollbackIT ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.StatsCollectorIT ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.DynamicUpsertIT ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.QueryExecWithoutSCNIT ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.SubqueryUsingSortMergeJoinIT ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.RenewLeaseIT ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.index.ChildViewsUseParentViewIndexIT ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.TopNIT ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.EvaluationOfORIT ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.CsvBulkLoadToolIT
[jira] [Commented] (PHOENIX-418) Support approximate COUNT DISTINCT
[ https://issues.apache.org/jira/browse/PHOENIX-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16142964#comment-16142964 ] Hadoop QA commented on PHOENIX-418: --- {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12883953/PHOENIX-418-v9.patch against master branch at commit 435441ea8ba336e1967b03cf84f1868c5ef14790. ATTACHMENT ID: 12883953 {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:red}-1 javadoc{color}. The javadoc tool appears to have generated 56 warning messages. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 lineLengths{color}. The patch introduces the following lines longer than 100: + String query = "SELECT APPROX_COUNT_DISTINCT(a.i1||a.i2||b.i2) FROM " + tableName + " a, " + tableName + final private void prepareTableWithValues(final Connection conn, final int nRows) throws Exception { + final PreparedStatement stmt = conn.prepareStatement("upsert into " + tableName + " VALUES (?, ?)"); +@BuiltInFunction(name=DistinctCountHyperLogLogAggregateFunction.NAME, nodeClass=DistinctCountHyperLogLogAggregateParseNode.class, args= {@Argument()} ) +public DistinctCountHyperLogLogAggregateFunction(List childExpressions, CountAggregateFunction delegate){ + private HyperLogLogPlus hll = new HyperLogLogPlus(DistinctCountHyperLogLogAggregateFunction.NormalSetPrecision, DistinctCountHyperLogLogAggregateFunction.SparseSetPrecision); + private HyperLogLogPlus hll = new HyperLogLogPlus(DistinctCountHyperLogLogAggregateFunction.NormalSetPrecision, DistinctCountHyperLogLogAggregateFunction.SparseSetPrecision); + protected final ImmutableBytesWritable valueByteArray = new ImmutableBytesWritable(ByteUtil.EMPTY_BYTE_ARRAY); +public DistinctCountHyperLogLogAggregateParseNode(String name, List children, BuiltInFunctionInfo info) { +public FunctionExpression create(List children, StatementContext context) throws SQLException { {color:red}-1 core tests{color}. The patch failed these unit tests: ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.salted.SaltedTableVarLengthRowKeyIT ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.DerivedTableIT ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.SequenceIT ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.CustomEntityDataIT ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.TransactionalViewIT ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.index.MutableIndexReplicationIT ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.CreateTableIT ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.OrderByIT ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.FirstValuesFunctionIT ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.StatementHintsIT ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.RowValueConstructorIT ./phoenix-core/target/failsafe-reports/TEST-org.apache.hadoop.hbase.regionserver.wal.WALRecoveryRegionPostOpenIT ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.PowerFunctionEnd2EndIT ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.InListIT ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.IsNullIT ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.index.txn.RollbackIT ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.StatsCollectorIT ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.DynamicUpsertIT ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.QueryExecWithoutSCNIT ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.SubqueryUsingSortMergeJoinIT ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.RenewLeaseIT ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.index.ChildViewsUseParentViewIndexIT ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.TopNIT ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.EvaluationOfORIT ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.CsvBulkLoadToolIT
[jira] [Commented] (PHOENIX-418) Support approximate COUNT DISTINCT
[ https://issues.apache.org/jira/browse/PHOENIX-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16142533#comment-16142533 ] Hadoop QA commented on PHOENIX-418: --- {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12883785/PHOENIX-418-v7.patch against master branch at commit 435441ea8ba336e1967b03cf84f1868c5ef14790. ATTACHMENT ID: 12883785 {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:red}-1 javadoc{color}. The javadoc tool appears to have generated 56 warning messages. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 lineLengths{color}. The patch introduces the following lines longer than 100: + String query = "SELECT APPROX_COUNT_DISTINCT(a.i1||a.i2||b.i2) FROM " + tableName + " a, " + tableName + final private void prepareTableWithValues(final Connection conn, final int nRows) throws Exception { + final PreparedStatement stmt = conn.prepareStatement("upsert into " + tableName + " VALUES (?, ?)"); +@BuiltInFunction(name=DistinctCountHyperLogLogAggregateFunction.NAME, nodeClass=DistinctCountHyperLogLogAggregateParseNode.class, args= {@Argument()} ) +public DistinctCountHyperLogLogAggregateFunction(List childExpressions, CountAggregateFunction delegate){ + private HyperLogLogPlus hll = new HyperLogLogPlus(DistinctCountHyperLogLogAggregateFunction.NormalSetPrecision, DistinctCountHyperLogLogAggregateFunction.SparseSetPrecision); + private HyperLogLogPlus hll = new HyperLogLogPlus(DistinctCountHyperLogLogAggregateFunction.NormalSetPrecision, DistinctCountHyperLogLogAggregateFunction.SparseSetPrecision); + protected final ImmutableBytesWritable valueByteArray = new ImmutableBytesWritable(ByteUtil.EMPTY_BYTE_ARRAY); +public DistinctCountHyperLogLogAggregateParseNode(String name, List children, BuiltInFunctionInfo info) { +public FunctionExpression create(List children, StatementContext context) throws SQLException { {color:red}-1 core tests{color}. The patch failed these unit tests: ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.index.MutableIndexFailureIT Test results: https://builds.apache.org/job/PreCommit-PHOENIX-Build/1305//testReport/ Javadoc warnings: https://builds.apache.org/job/PreCommit-PHOENIX-Build/1305//artifact/patchprocess/patchJavadocWarnings.txt Console output: https://builds.apache.org/job/PreCommit-PHOENIX-Build/1305//console This message is automatically generated. > Support approximate COUNT DISTINCT > -- > > Key: PHOENIX-418 > URL: https://issues.apache.org/jira/browse/PHOENIX-418 > Project: Phoenix > Issue Type: Task >Reporter: James Taylor >Assignee: Ethan Wang > Labels: gsoc2016 > Attachments: PHOENIX-418-v1.patch, PHOENIX-418-v2.patch, > PHOENIX-418-v3.patch, PHOENIX-418-v4.patch, PHOENIX-418-v5.patch, > PHOENIX-418-v6.patch, PHOENIX-418-v7.patch > > > Support an "approximation" of count distinct to prevent having to hold on to > all distinct values (since this will not scale well when the number of > distinct values is huge). The Apache Drill folks have had some interesting > discussions on this > [here](http://mail-archives.apache.org/mod_mbox/incubator-drill-dev/201306.mbox/%3CJIRA.12650169.1369931282407.88049.1370645900553%40arcas%3E). > They recommend using [Welford's > method](http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance_Online_algorithm). > I'm open to having a config option that uses exact versus approximate. I > don't have experience implementing an approximate implementation, so I'm not > sure how much state is required to keep on the server and return to the > client (other than realizing it'd be much less that returning all distinct > values and their counts). -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (PHOENIX-418) Support approximate COUNT DISTINCT
[ https://issues.apache.org/jira/browse/PHOENIX-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16141767#comment-16141767 ] Josh Elser commented on PHOENIX-418: bq. So we should add that text to our NOTICE file, right? Correct, specifically to {{dev/release_files/NOTICE}} only (not the top-level NOTICE file as that is for our src-release which would not actually contain this JAR). > Support approximate COUNT DISTINCT > -- > > Key: PHOENIX-418 > URL: https://issues.apache.org/jira/browse/PHOENIX-418 > Project: Phoenix > Issue Type: Task >Reporter: James Taylor >Assignee: Ethan Wang > Labels: gsoc2016 > Attachments: PHOENIX-418-v1.patch, PHOENIX-418-v2.patch, > PHOENIX-418-v3.patch, PHOENIX-418-v4.patch, PHOENIX-418-v5.patch, > PHOENIX-418-v6.patch > > > Support an "approximation" of count distinct to prevent having to hold on to > all distinct values (since this will not scale well when the number of > distinct values is huge). The Apache Drill folks have had some interesting > discussions on this > [here](http://mail-archives.apache.org/mod_mbox/incubator-drill-dev/201306.mbox/%3CJIRA.12650169.1369931282407.88049.1370645900553%40arcas%3E). > They recommend using [Welford's > method](http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance_Online_algorithm). > I'm open to having a config option that uses exact versus approximate. I > don't have experience implementing an approximate implementation, so I'm not > sure how much state is required to keep on the server and return to the > client (other than realizing it'd be much less that returning all distinct > values and their counts). -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (PHOENIX-418) Support approximate COUNT DISTINCT
[ https://issues.apache.org/jira/browse/PHOENIX-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16141430#comment-16141430 ] Hadoop QA commented on PHOENIX-418: --- {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12883652/PHOENIX-418-v6.patch against master branch at commit dd008f4fe4e67836a6fb362843c9618e4e518182. ATTACHMENT ID: 12883652 {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:red}-1 javadoc{color}. The javadoc tool appears to have generated 56 warning messages. {color:red}-1 release audit{color}. The applied patch generated 2 release audit warnings (more than the master's current 0 warnings). {color:red}-1 lineLengths{color}. The patch introduces the following lines longer than 100: + String query = "SELECT APPROX_COUNT_DISTINCT(a.i1||a.i2||b.i2) FROM " + tableName + " a, " + tableName + final private void prepareTableWithValues(final Connection conn, final int nRows) throws Exception { + final PreparedStatement stmt = conn.prepareStatement("upsert into " + tableName + " VALUES (?, ?)"); +@BuiltInFunction(name=DistinctCountHyperLogLogAggregateFunction.NAME, nodeClass=DistinctCountHyperLogLogAggregateParseNode.class, args= {@Argument()} ) +public DistinctCountHyperLogLogAggregateFunction(List childExpressions, CountAggregateFunction delegate){ + private HyperLogLogPlus hll = new HyperLogLogPlus(DistinctCountHyperLogLogAggregateFunction.NormalSetPrecision, DistinctCountHyperLogLogAggregateFunction.SparseSetPrecision); + private HyperLogLogPlus hll = new HyperLogLogPlus(DistinctCountHyperLogLogAggregateFunction.NormalSetPrecision, DistinctCountHyperLogLogAggregateFunction.SparseSetPrecision); + protected final ImmutableBytesWritable valueByteArray = new ImmutableBytesWritable(ByteUtil.EMPTY_BYTE_ARRAY); +public DistinctCountHyperLogLogAggregateParseNode(String name, List children, BuiltInFunctionInfo info) { +public FunctionExpression create(List children, StatementContext context) throws SQLException { {color:red}-1 core tests{color}. The patch failed these unit tests: ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.salted.SaltedTableVarLengthRowKeyIT ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.DerivedTableIT ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.SequenceIT ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.CustomEntityDataIT ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.TransactionalViewIT ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.index.MutableIndexReplicationIT ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.CreateTableIT ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.OrderByIT ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.FirstValuesFunctionIT ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.StatementHintsIT ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.RowValueConstructorIT ./phoenix-core/target/failsafe-reports/TEST-org.apache.hadoop.hbase.regionserver.wal.WALRecoveryRegionPostOpenIT ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.PowerFunctionEnd2EndIT ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.InListIT ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.IsNullIT ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.index.txn.RollbackIT ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.StatsCollectorIT ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.DynamicUpsertIT ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.QueryExecWithoutSCNIT ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.SubqueryUsingSortMergeJoinIT ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.RenewLeaseIT ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.index.ChildViewsUseParentViewIndexIT ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.TopNIT ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.EvaluationOfORIT
[jira] [Commented] (PHOENIX-418) Support approximate COUNT DISTINCT
[ https://issues.apache.org/jira/browse/PHOENIX-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16141219#comment-16141219 ] James Taylor commented on PHOENIX-418: -- Thanks for the info, [~elserj]. So we should add that text to our NOTICE file, right? > Support approximate COUNT DISTINCT > -- > > Key: PHOENIX-418 > URL: https://issues.apache.org/jira/browse/PHOENIX-418 > Project: Phoenix > Issue Type: Task >Reporter: James Taylor >Assignee: Ethan Wang > Labels: gsoc2016 > Attachments: PHOENIX-418-v1.patch, PHOENIX-418-v2.patch, > PHOENIX-418-v3.patch, PHOENIX-418-v4.patch, PHOENIX-418-v5.patch, > PHOENIX-418-v6.patch > > > Support an "approximation" of count distinct to prevent having to hold on to > all distinct values (since this will not scale well when the number of > distinct values is huge). The Apache Drill folks have had some interesting > discussions on this > [here](http://mail-archives.apache.org/mod_mbox/incubator-drill-dev/201306.mbox/%3CJIRA.12650169.1369931282407.88049.1370645900553%40arcas%3E). > They recommend using [Welford's > method](http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance_Online_algorithm). > I'm open to having a config option that uses exact versus approximate. I > don't have experience implementing an approximate implementation, so I'm not > sure how much state is required to keep on the server and return to the > client (other than realizing it'd be much less that returning all distinct > values and their counts). -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (PHOENIX-418) Support approximate COUNT DISTINCT
[ https://issues.apache.org/jira/browse/PHOENIX-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16141126#comment-16141126 ] Ethan Wang commented on PHOENIX-418: Thanks [~elserj]. 2.9.5 should be equally just fine. > Support approximate COUNT DISTINCT > -- > > Key: PHOENIX-418 > URL: https://issues.apache.org/jira/browse/PHOENIX-418 > Project: Phoenix > Issue Type: Task >Reporter: James Taylor >Assignee: Ethan Wang > Labels: gsoc2016 > Attachments: PHOENIX-418-v1.patch, PHOENIX-418-v2.patch, > PHOENIX-418-v3.patch, PHOENIX-418-v4.patch, PHOENIX-418-v5.patch, > PHOENIX-418-v6.patch > > > Support an "approximation" of count distinct to prevent having to hold on to > all distinct values (since this will not scale well when the number of > distinct values is huge). The Apache Drill folks have had some interesting > discussions on this > [here](http://mail-archives.apache.org/mod_mbox/incubator-drill-dev/201306.mbox/%3CJIRA.12650169.1369931282407.88049.1370645900553%40arcas%3E). > They recommend using [Welford's > method](http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance_Online_algorithm). > I'm open to having a config option that uses exact versus approximate. I > don't have experience implementing an approximate implementation, so I'm not > sure how much state is required to keep on the server and return to the > client (other than realizing it'd be much less that returning all distinct > values and their counts). -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (PHOENIX-418) Support approximate COUNT DISTINCT
[ https://issues.apache.org/jira/browse/PHOENIX-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16141100#comment-16141100 ] Josh Elser commented on PHOENIX-418: [~jamestaylor], thanks for thinking about the licensing! Turns out, there is something that we should propagate in the binary tarball https://github.com/addthis/stream-lib/blob/master/NOTICE.txt Specifically, we should include text, something like: {noformat} stream-lib Copyright 2016 AddThis This product includes software developed by AddThis. {noformat} Aside, it appears that there is a 2.9.5 version of {{com.clearspring.analytics:stream}} in maven-central. Is there a reason for us to use 2.7.0, specifically? Neat little change, Ethan. I look forward to playing around with the feature :) > Support approximate COUNT DISTINCT > -- > > Key: PHOENIX-418 > URL: https://issues.apache.org/jira/browse/PHOENIX-418 > Project: Phoenix > Issue Type: Task >Reporter: James Taylor >Assignee: Ethan Wang > Labels: gsoc2016 > Attachments: PHOENIX-418-v1.patch, PHOENIX-418-v2.patch, > PHOENIX-418-v3.patch, PHOENIX-418-v4.patch, PHOENIX-418-v5.patch, > PHOENIX-418-v6.patch > > > Support an "approximation" of count distinct to prevent having to hold on to > all distinct values (since this will not scale well when the number of > distinct values is huge). The Apache Drill folks have had some interesting > discussions on this > [here](http://mail-archives.apache.org/mod_mbox/incubator-drill-dev/201306.mbox/%3CJIRA.12650169.1369931282407.88049.1370645900553%40arcas%3E). > They recommend using [Welford's > method](http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance_Online_algorithm). > I'm open to having a config option that uses exact versus approximate. I > don't have experience implementing an approximate implementation, so I'm not > sure how much state is required to keep on the server and return to the > client (other than realizing it'd be much less that returning all distinct > values and their counts). -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (PHOENIX-418) Support approximate COUNT DISTINCT
[ https://issues.apache.org/jira/browse/PHOENIX-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16141019#comment-16141019 ] James Taylor commented on PHOENIX-418: -- One more thing: are any changes required to phoenix-assembly for the new jar, [~elserj] or [~mujtabachohan]? Might want to double check the license dependencies too, Josh. They looked ok to me. > Support approximate COUNT DISTINCT > -- > > Key: PHOENIX-418 > URL: https://issues.apache.org/jira/browse/PHOENIX-418 > Project: Phoenix > Issue Type: Task >Reporter: James Taylor >Assignee: Ethan Wang > Labels: gsoc2016 > Attachments: PHOENIX-418-v1.patch, PHOENIX-418-v2.patch, > PHOENIX-418-v3.patch, PHOENIX-418-v4.patch, PHOENIX-418-v5.patch > > > Support an "approximation" of count distinct to prevent having to hold on to > all distinct values (since this will not scale well when the number of > distinct values is huge). The Apache Drill folks have had some interesting > discussions on this > [here](http://mail-archives.apache.org/mod_mbox/incubator-drill-dev/201306.mbox/%3CJIRA.12650169.1369931282407.88049.1370645900553%40arcas%3E). > They recommend using [Welford's > method](http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance_Online_algorithm). > I'm open to having a config option that uses exact versus approximate. I > don't have experience implementing an approximate implementation, so I'm not > sure how much state is required to keep on the server and return to the > client (other than realizing it'd be much less that returning all distinct > values and their counts). -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (PHOENIX-418) Support approximate COUNT DISTINCT
[ https://issues.apache.org/jira/browse/PHOENIX-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16141013#comment-16141013 ] James Taylor commented on PHOENIX-418: -- +1. I'll remove CountDistinctApproximateHyperLogLogStatsEnabledIT as I don't think we need that. Your local test runs look ok, [~aertoria]? > Support approximate COUNT DISTINCT > -- > > Key: PHOENIX-418 > URL: https://issues.apache.org/jira/browse/PHOENIX-418 > Project: Phoenix > Issue Type: Task >Reporter: James Taylor >Assignee: Ethan Wang > Labels: gsoc2016 > Attachments: PHOENIX-418-v1.patch, PHOENIX-418-v2.patch, > PHOENIX-418-v3.patch, PHOENIX-418-v4.patch, PHOENIX-418-v5.patch > > > Support an "approximation" of count distinct to prevent having to hold on to > all distinct values (since this will not scale well when the number of > distinct values is huge). The Apache Drill folks have had some interesting > discussions on this > [here](http://mail-archives.apache.org/mod_mbox/incubator-drill-dev/201306.mbox/%3CJIRA.12650169.1369931282407.88049.1370645900553%40arcas%3E). > They recommend using [Welford's > method](http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance_Online_algorithm). > I'm open to having a config option that uses exact versus approximate. I > don't have experience implementing an approximate implementation, so I'm not > sure how much state is required to keep on the server and return to the > client (other than realizing it'd be much less that returning all distinct > values and their counts). -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (PHOENIX-418) Support approximate COUNT DISTINCT
[ https://issues.apache.org/jira/browse/PHOENIX-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16140890#comment-16140890 ] Hadoop QA commented on PHOENIX-418: --- {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12883574/PHOENIX-418-v5.patch against master branch at commit dd008f4fe4e67836a6fb362843c9618e4e518182. ATTACHMENT ID: 12883574 {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:red}-1 javadoc{color}. The javadoc tool appears to have generated 56 warning messages. {color:red}-1 release audit{color}. The applied patch generated 2 release audit warnings (more than the master's current 0 warnings). {color:red}-1 lineLengths{color}. The patch introduces the following lines longer than 100: +String tableName = initATableValues(null, tenantId, getDefaultSplits(tenantId), (Date)null, null, getUrl(), null); +String tableName = initATableValues(null, tenantId, getDefaultSplits(tenantId), (Date)null, null, getUrl(), null); +String tableName = initATableValues(null, tenantId, getDefaultSplits(tenantId), (Date)null, null, getUrl(), null); +String tableName = initATableValues(null, tenantId, getDefaultSplits(tenantId), (Date)null, null, getUrl(), null); +String query = "SELECT APPROX_COUNT_DISTINCT(ORGANIZATION_ID||A_STRING||B_STRING) FROM " + tableName; +String tableName = initATableValues(null, tenantId, getDefaultSplits(tenantId), (Date)null, null, getUrl(), null); +String query = "SELECT APPROX_COUNT_DISTINCT(a.ORGANIZATION_ID||a.A_STRING||b.ENTITY_ID) FROM " + + tableName + " a, "+ tableName + " b where a.ORGANIZATION_ID=b.ORGANIZATION_ID and a.ENTITY_ID = b.ENTITY_ID"; +String tableName = initATableValues(null, tenantId, getDefaultSplits(tenantId), (Date)null, null, getUrl(), null); +String query = "explain SELECT APPROX_COUNT_DISTINCT(ORGANIZATION_ID||ENTITY_ID) FROM " + tableName; {color:red}-1 core tests{color}. The patch failed these unit tests: ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.salted.SaltedTableVarLengthRowKeyIT ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.DerivedTableIT ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.SequenceIT ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.CustomEntityDataIT ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.TransactionalViewIT ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.index.MutableIndexReplicationIT ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.CreateTableIT ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.OrderByIT ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.FirstValuesFunctionIT ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.StatementHintsIT ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.RowValueConstructorIT ./phoenix-core/target/failsafe-reports/TEST-org.apache.hadoop.hbase.regionserver.wal.WALRecoveryRegionPostOpenIT ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.PowerFunctionEnd2EndIT ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.InListIT ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.IsNullIT ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.index.txn.RollbackIT ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.StatsCollectorIT ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.DynamicUpsertIT ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.QueryExecWithoutSCNIT ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.SubqueryUsingSortMergeJoinIT ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.RenewLeaseIT ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.index.ChildViewsUseParentViewIndexIT ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.TopNIT ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.EvaluationOfORIT ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.CsvBulkLoadToolIT
[jira] [Commented] (PHOENIX-418) Support approximate COUNT DISTINCT
[ https://issues.apache.org/jira/browse/PHOENIX-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16140664#comment-16140664 ] James Taylor commented on PHOENIX-418: -- bq. I think it also makes sense to have another test derived from ParallelStatsEnabledIT Never mind about this - I was thinking of TABLESAMPLE, but you've already done this. > Support approximate COUNT DISTINCT > -- > > Key: PHOENIX-418 > URL: https://issues.apache.org/jira/browse/PHOENIX-418 > Project: Phoenix > Issue Type: Task >Reporter: James Taylor >Assignee: Ethan Wang > Labels: gsoc2016 > Attachments: PHOENIX-418-v1.patch, PHOENIX-418-v2.patch, > PHOENIX-418-v3.patch, PHOENIX-418-v4.patch, PHOENIX-418-v5.patch > > > Support an "approximation" of count distinct to prevent having to hold on to > all distinct values (since this will not scale well when the number of > distinct values is huge). The Apache Drill folks have had some interesting > discussions on this > [here](http://mail-archives.apache.org/mod_mbox/incubator-drill-dev/201306.mbox/%3CJIRA.12650169.1369931282407.88049.1370645900553%40arcas%3E). > They recommend using [Welford's > method](http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance_Online_algorithm). > I'm open to having a config option that uses exact versus approximate. I > don't have experience implementing an approximate implementation, so I'm not > sure how much state is required to keep on the server and return to the > client (other than realizing it'd be much less that returning all distinct > values and their counts). -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (PHOENIX-418) Support approximate COUNT DISTINCT
[ https://issues.apache.org/jira/browse/PHOENIX-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16137402#comment-16137402 ] James Taylor commented on PHOENIX-418: -- Thanks for the revised patch, [~aertoria]. Looks very good. A couple of minor things: - derive your test from ParallelStatsDisabledIT instead of BaseUniqueNamesOwnClusterIT and remove the setup method which you won't need. The advantage of ParallelStatsDisabledIT tests are that they don't need to each spin up a new mini cluster when they run so are overall test run time stays lower. {code} +public class CountDistinctApproximateHyperLogLogIT extends BaseUniqueNamesOwnClusterIT { +@BeforeClass +public static void doSetup() throws Exception { +Mapprops = Maps.newHashMapWithExpectedSize(3); +setUpTestDriver(new ReadOnlyProps(props.entrySet().iterator())); +} + {code} - I think it also makes sense to have another test derived from ParallelStatsEnabledIT. This base test class is configured to collect statistics. In this way, you can get more test coverage. You can likely run the exact same tests, but in this case you'll have guideposts in place (because stats will be collected). Make sure to call TestUtil.analyzeTable(connection, fullTableName) prior to running your TABLESAMPLE queries. You'll get more rows back, since you'll have guideposts in addition to region boundaries. - minor nit, extra semicolon here: {code} + DistinctCountHyperLogLogAggregateFunction(DistinctCountHyperLogLogAggregateFunction.class);; {code} - Instead of always copying the underlying byte buffer, use ByteUtil.copyKeyBytesIfNecessary(ImmutableBytesWritable ptr) instead which only copies when necessary: + @Override + public boolean evaluate(Tuple tuple, ImmutableBytesWritable ptr) { + try { + valueByteArray.set(hll.getBytes(), 0, hll.getBytes().length); + ptr.set(valueByteArray.copyBytes()); {code} > Support approximate COUNT DISTINCT > -- > > Key: PHOENIX-418 > URL: https://issues.apache.org/jira/browse/PHOENIX-418 > Project: Phoenix > Issue Type: Task >Reporter: James Taylor >Assignee: Ethan Wang > Labels: gsoc2016 > Attachments: PHOENIX-418-v1.patch, PHOENIX-418-v2.patch, > PHOENIX-418-v3.patch, PHOENIX-418-v4.patch > > > Support an "approximation" of count distinct to prevent having to hold on to > all distinct values (since this will not scale well when the number of > distinct values is huge). The Apache Drill folks have had some interesting > discussions on this > [here](http://mail-archives.apache.org/mod_mbox/incubator-drill-dev/201306.mbox/%3CJIRA.12650169.1369931282407.88049.1370645900553%40arcas%3E). > They recommend using [Welford's > method](http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance_Online_algorithm). > I'm open to having a config option that uses exact versus approximate. I > don't have experience implementing an approximate implementation, so I'm not > sure how much state is required to keep on the server and return to the > client (other than realizing it'd be much less that returning all distinct > values and their counts). -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (PHOENIX-418) Support approximate COUNT DISTINCT
[ https://issues.apache.org/jira/browse/PHOENIX-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16115982#comment-16115982 ] Hadoop QA commented on PHOENIX-418: --- {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12880580/PHOENIX-418-v4.patch against master branch at commit bf3c13b6d169e90e06c54a338a5931163a32d743. ATTACHMENT ID: 12880580 {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:red}-1 javadoc{color}. The javadoc tool appears to have generated 54 warning messages. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 lineLengths{color}. The patch introduces the following lines longer than 100: +String tableName = initATableValues(null, tenantId, getDefaultSplits(tenantId), (Date)null, null, getUrl(), null); +String tableName = initATableValues(null, tenantId, getDefaultSplits(tenantId), (Date)null, null, getUrl(), null); +String tableName = initATableValues(null, tenantId, getDefaultSplits(tenantId), (Date)null, null, getUrl(), null); +String tableName = initATableValues(null, tenantId, getDefaultSplits(tenantId), (Date)null, null, getUrl(), null); +String query = "SELECT APPROX_COUNT_DISTINCT(ORGANIZATION_ID||A_STRING||B_STRING) FROM " + tableName; +String tableName = initATableValues(null, tenantId, getDefaultSplits(tenantId), (Date)null, null, getUrl(), null); +String query = "SELECT APPROX_COUNT_DISTINCT(a.ORGANIZATION_ID||a.A_STRING||b.ENTITY_ID) FROM " + + tableName + " a, "+ tableName + " b where a.ORGANIZATION_ID=b.ORGANIZATION_ID and a.ENTITY_ID = b.ENTITY_ID"; +String tableName = initATableValues(null, tenantId, getDefaultSplits(tenantId), (Date)null, null, getUrl(), null); +String query = "explain SELECT APPROX_COUNT_DISTINCT(ORGANIZATION_ID||ENTITY_ID) FROM " + tableName; {color:red}-1 core tests{color}. The patch failed these unit tests: ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.execute.PartialCommitIT ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.ProductMetricsIT ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.ScanQueryIT ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.QueryDatabaseMetaDataIT ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.NotQueryIT ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.ClientTimeArithmeticQueryIT Test results: https://builds.apache.org/job/PreCommit-PHOENIX-Build/1246//testReport/ Javadoc warnings: https://builds.apache.org/job/PreCommit-PHOENIX-Build/1246//artifact/patchprocess/patchJavadocWarnings.txt Console output: https://builds.apache.org/job/PreCommit-PHOENIX-Build/1246//console This message is automatically generated. > Support approximate COUNT DISTINCT > -- > > Key: PHOENIX-418 > URL: https://issues.apache.org/jira/browse/PHOENIX-418 > Project: Phoenix > Issue Type: Task >Reporter: James Taylor >Assignee: Ethan Wang > Labels: gsoc2016 > Attachments: PHOENIX-418-v1.patch, PHOENIX-418-v2.patch, > PHOENIX-418-v3.patch, PHOENIX-418-v4.patch > > > Support an "approximation" of count distinct to prevent having to hold on to > all distinct values (since this will not scale well when the number of > distinct values is huge). The Apache Drill folks have had some interesting > discussions on this > [here](http://mail-archives.apache.org/mod_mbox/incubator-drill-dev/201306.mbox/%3CJIRA.12650169.1369931282407.88049.1370645900553%40arcas%3E). > They recommend using [Welford's > method](http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance_Online_algorithm). > I'm open to having a config option that uses exact versus approximate. I > don't have experience implementing an approximate implementation, so I'm not > sure how much state is required to keep on the server and return to the > client (other than realizing it'd be much less that returning all distinct > values and their counts). -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (PHOENIX-418) Support approximate COUNT DISTINCT
[ https://issues.apache.org/jira/browse/PHOENIX-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16115574#comment-16115574 ] Hadoop QA commented on PHOENIX-418: --- {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12880542/PHOENIX-418-v3.patch against master branch at commit bf3c13b6d169e90e06c54a338a5931163a32d743. ATTACHMENT ID: 12880542 {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:red}-1 javadoc{color}. The javadoc tool appears to have generated 54 warning messages. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 lineLengths{color}. The patch introduces the following lines longer than 100: +String tableName = initATableValues(null, tenantId, getDefaultSplits(tenantId), (Date)null, null, getUrl(), null); +String tableName = initATableValues(null, tenantId, getDefaultSplits(tenantId), (Date)null, null, getUrl(), null); +String tableName = initATableValues(null, tenantId, getDefaultSplits(tenantId), (Date)null, null, getUrl(), null); +String tableName = initATableValues(null, tenantId, getDefaultSplits(tenantId), (Date)null, null, getUrl(), null); +String query = "SELECT APPROX_COUNT_DISTINCT(ORGANIZATION_ID||A_STRING||B_STRING) FROM " + tableName; +String tableName = initATableValues(null, tenantId, getDefaultSplits(tenantId), (Date)null, null, getUrl(), null); +String query = "SELECT APPROX_COUNT_DISTINCT(a.ORGANIZATION_ID||a.A_STRING||b.ENTITY_ID) FROM " + + tableName + " a, "+ tableName + " b where a.ORGANIZATION_ID=b.ORGANIZATION_ID and a.ENTITY_ID = b.ENTITY_ID"; +String tableName = initATableValues(null, tenantId, getDefaultSplits(tenantId), (Date)null, null, getUrl(), null); +String query = "explain SELECT APPROX_COUNT_DISTINCT(ORGANIZATION_ID||ENTITY_ID) FROM " + tableName; {color:red}-1 core tests{color}. The patch failed these unit tests: ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.CreateTableIT ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.ScanQueryIT ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.ClientTimeArithmeticQueryIT Test results: https://builds.apache.org/job/PreCommit-PHOENIX-Build/1245//testReport/ Javadoc warnings: https://builds.apache.org/job/PreCommit-PHOENIX-Build/1245//artifact/patchprocess/patchJavadocWarnings.txt Console output: https://builds.apache.org/job/PreCommit-PHOENIX-Build/1245//console This message is automatically generated. > Support approximate COUNT DISTINCT > -- > > Key: PHOENIX-418 > URL: https://issues.apache.org/jira/browse/PHOENIX-418 > Project: Phoenix > Issue Type: Task >Reporter: James Taylor >Assignee: Ethan Wang > Labels: gsoc2016 > Attachments: PHOENIX-418-v1.patch, PHOENIX-418-v2.patch, > PHOENIX-418-v3.patch > > > Support an "approximation" of count distinct to prevent having to hold on to > all distinct values (since this will not scale well when the number of > distinct values is huge). The Apache Drill folks have had some interesting > discussions on this > [here](http://mail-archives.apache.org/mod_mbox/incubator-drill-dev/201306.mbox/%3CJIRA.12650169.1369931282407.88049.1370645900553%40arcas%3E). > They recommend using [Welford's > method](http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance_Online_algorithm). > I'm open to having a config option that uses exact versus approximate. I > don't have experience implementing an approximate implementation, so I'm not > sure how much state is required to keep on the server and return to the > client (other than realizing it'd be much less that returning all distinct > values and their counts). -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (PHOENIX-418) Support approximate COUNT DISTINCT
[ https://issues.apache.org/jira/browse/PHOENIX-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16115513#comment-16115513 ] James Taylor commented on PHOENIX-418: -- Thanks for the patch, [~aertoria]. The transitive dependencies of licenses look fine. Here's some feedback: - You should be able to derive DistinctCountHyperLogLogAggregateFunction from DistinctCountAggregateFunction and simply override the getName, newClientAggregator and newServerAggregator methods. The behavior for things like COUNT(DISTINCT 123)) should be the same as APPROXIMATE_COUNT_DISTINCT(123), returning either 0 or 1. The getDataType and evaluate method will be the same as DistinctCountAggregateFunction: the return type should always be of type PLong and the evaluate method will have the same special cases when a constant is used as the argument. {code} + +@BuiltInFunction(name=DistinctCountHyperLogLogAggregateFunction.NAME, nodeClass=DistinctCountHyperLogLogAggregateParseNode.class, args= {@Argument()} ) +public class DistinctCountHyperLogLogAggregateFunction extends DelegateConstantToCountAggregateFunction { +public static final String NAME = "APPROX_COUNT_DISTINCT"; + +public DistinctCountHyperLogLogAggregateFunction() { +} + +public DistinctCountHyperLogLogAggregateFunction(List childExpressions){ +super(childExpressions, null); +} + +public DistinctCountHyperLogLogAggregateFunction(List childExpressions, CountAggregateFunction delegate){ +super(childExpressions, delegate); +} + + +private Aggregator newAggregatorServer(final PDataType type, SortOrder sortOrder, ImmutableBytesWritable ptr) { + return new HyperLogLogServerAggregator(sortOrder, ptr) { +@Override +protected PDataType getInputDataType() { + return type; +} + }; +} {code} - Minor nit, but if you include an @since tag, make the release the *next* release (i.e. the release this function will appear in), not the current release: {code} + * @since 4.11 {code} > Support approximate COUNT DISTINCT > -- > > Key: PHOENIX-418 > URL: https://issues.apache.org/jira/browse/PHOENIX-418 > Project: Phoenix > Issue Type: Task >Reporter: James Taylor >Assignee: Ethan Wang > Labels: gsoc2016 > Attachments: PHOENIX-418-v1.patch, PHOENIX-418-v2.patch, > PHOENIX-418-v3.patch > > > Support an "approximation" of count distinct to prevent having to hold on to > all distinct values (since this will not scale well when the number of > distinct values is huge). The Apache Drill folks have had some interesting > discussions on this > [here](http://mail-archives.apache.org/mod_mbox/incubator-drill-dev/201306.mbox/%3CJIRA.12650169.1369931282407.88049.1370645900553%40arcas%3E). > They recommend using [Welford's > method](http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance_Online_algorithm). > I'm open to having a config option that uses exact versus approximate. I > don't have experience implementing an approximate implementation, so I'm not > sure how much state is required to keep on the server and return to the > client (other than realizing it'd be much less that returning all distinct > values and their counts). -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (PHOENIX-418) Support approximate COUNT DISTINCT
[ https://issues.apache.org/jira/browse/PHOENIX-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16114818#comment-16114818 ] James Taylor commented on PHOENIX-418: -- Thanks for the patch, [~aertoria]. This is a good new feature. Please revert the Phoenix grammar changes as they aren't needed since the new feature is surfaced as a new built-in aggregate function. > Support approximate COUNT DISTINCT > -- > > Key: PHOENIX-418 > URL: https://issues.apache.org/jira/browse/PHOENIX-418 > Project: Phoenix > Issue Type: Task >Reporter: James Taylor >Assignee: Ethan Wang > Labels: gsoc2016 > Attachments: PHOENIX-418-v1.patch, PHOENIX-418-v2.patch > > > Support an "approximation" of count distinct to prevent having to hold on to > all distinct values (since this will not scale well when the number of > distinct values is huge). The Apache Drill folks have had some interesting > discussions on this > [here](http://mail-archives.apache.org/mod_mbox/incubator-drill-dev/201306.mbox/%3CJIRA.12650169.1369931282407.88049.1370645900553%40arcas%3E). > They recommend using [Welford's > method](http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance_Online_algorithm). > I'm open to having a config option that uses exact versus approximate. I > don't have experience implementing an approximate implementation, so I'm not > sure how much state is required to keep on the server and return to the > client (other than realizing it'd be much less that returning all distinct > values and their counts). -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (PHOENIX-418) Support approximate COUNT DISTINCT
[ https://issues.apache.org/jira/browse/PHOENIX-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16114526#comment-16114526 ] Hadoop QA commented on PHOENIX-418: --- {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12880344/PHOENIX-418-v2.patch against master branch at commit 0bd43f536a13609b35629308c415aebc800d6799. ATTACHMENT ID: 12880344 {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:red}-1 javadoc{color}. The javadoc tool appears to have generated 54 warning messages. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 lineLengths{color}. The patch introduces the following lines longer than 100: +String tableName = initATableValues(null, tenantId, getDefaultSplits(tenantId), (Date)null, null, getUrl(), null); +String tableName = initATableValues(null, tenantId, getDefaultSplits(tenantId), (Date)null, null, getUrl(), null); +String query = "SELECT APPROX_COUNT_DISTINCT(ORGANIZATION_ID||A_STRING||B_STRING) FROM " + tableName; +String tableName = initATableValues(null, tenantId, getDefaultSplits(tenantId), (Date)null, null, getUrl(), null); +String query = "SELECT APPROX_COUNT_DISTINCT(a.ORGANIZATION_ID||a.A_STRING||b.ENTITY_ID) FROM " + + tableName + " a, "+ tableName + " b where a.ORGANIZATION_ID=b.ORGANIZATION_ID and a.ENTITY_ID = b.ENTITY_ID"; +{ ParseContext context = contextStack.peek(); $ret = factory.select(u, order, l, approximate, o, getBindCount(), context.isAggregate()); } +@BuiltInFunction(name=DistinctCountHyperLogLogAggregateFunction.NAME, nodeClass=DistinctCountHyperLogLogAggregateParseNode.class, args= {@Argument()} ) +public class DistinctCountHyperLogLogAggregateFunction extends DelegateConstantToCountAggregateFunction { +public DistinctCountHyperLogLogAggregateFunction(List childExpressions, CountAggregateFunction delegate){ {color:red}-1 core tests{color}. The patch failed these unit tests: ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.salted.SaltedTableVarLengthRowKeyIT ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.ConcurrentMutationsIT ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.NotQueryIT Test results: https://builds.apache.org/job/PreCommit-PHOENIX-Build/1243//testReport/ Javadoc warnings: https://builds.apache.org/job/PreCommit-PHOENIX-Build/1243//artifact/patchprocess/patchJavadocWarnings.txt Console output: https://builds.apache.org/job/PreCommit-PHOENIX-Build/1243//console This message is automatically generated. > Support approximate COUNT DISTINCT > -- > > Key: PHOENIX-418 > URL: https://issues.apache.org/jira/browse/PHOENIX-418 > Project: Phoenix > Issue Type: Task >Reporter: James Taylor >Assignee: Ethan Wang > Labels: gsoc2016 > Attachments: PHOENIX-418-v1.patch, PHOENIX-418-v2.patch > > > Support an "approximation" of count distinct to prevent having to hold on to > all distinct values (since this will not scale well when the number of > distinct values is huge). The Apache Drill folks have had some interesting > discussions on this > [here](http://mail-archives.apache.org/mod_mbox/incubator-drill-dev/201306.mbox/%3CJIRA.12650169.1369931282407.88049.1370645900553%40arcas%3E). > They recommend using [Welford's > method](http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance_Online_algorithm). > I'm open to having a config option that uses exact versus approximate. I > don't have experience implementing an approximate implementation, so I'm not > sure how much state is required to keep on the server and return to the > client (other than realizing it'd be much less that returning all distinct > values and their counts). -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (PHOENIX-418) Support approximate COUNT DISTINCT
[ https://issues.apache.org/jira/browse/PHOENIX-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16113913#comment-16113913 ] Ethan Wang commented on PHOENIX-418: Thanks [~jamestaylor] [~sergey.soldatov]. My bad. I go prepare another one. > Support approximate COUNT DISTINCT > -- > > Key: PHOENIX-418 > URL: https://issues.apache.org/jira/browse/PHOENIX-418 > Project: Phoenix > Issue Type: Task >Reporter: James Taylor >Assignee: Ethan Wang > Labels: gsoc2016 > Attachments: PHOENIX-418-v1.patch, PHOENIX-418-v2.patch > > > Support an "approximation" of count distinct to prevent having to hold on to > all distinct values (since this will not scale well when the number of > distinct values is huge). The Apache Drill folks have had some interesting > discussions on this > [here](http://mail-archives.apache.org/mod_mbox/incubator-drill-dev/201306.mbox/%3CJIRA.12650169.1369931282407.88049.1370645900553%40arcas%3E). > They recommend using [Welford's > method](http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance_Online_algorithm). > I'm open to having a config option that uses exact versus approximate. I > don't have experience implementing an approximate implementation, so I'm not > sure how much state is required to keep on the server and return to the > client (other than realizing it'd be much less that returning all distinct > values and their counts). -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (PHOENIX-418) Support approximate COUNT DISTINCT
[ https://issues.apache.org/jira/browse/PHOENIX-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16112203#comment-16112203 ] Sergey Soldatov commented on PHOENIX-418: - [~aertoria] if you have several others commits on top of yours, you may get a single commit by command: {code} git format-patch -k -1 {code} Also make sure that it can be applied on top of the master branch (git am ). If it's not, rebase it before submitting to the JIRA. You also may just resolve the conflicts during am and get a new patch with format-patch. > Support approximate COUNT DISTINCT > -- > > Key: PHOENIX-418 > URL: https://issues.apache.org/jira/browse/PHOENIX-418 > Project: Phoenix > Issue Type: Task >Reporter: James Taylor >Assignee: Ethan Wang > Labels: gsoc2016 > Attachments: PHOENIX-418-v1.patch > > > Support an "approximation" of count distinct to prevent having to hold on to > all distinct values (since this will not scale well when the number of > distinct values is huge). The Apache Drill folks have had some interesting > discussions on this > [here](http://mail-archives.apache.org/mod_mbox/incubator-drill-dev/201306.mbox/%3CJIRA.12650169.1369931282407.88049.1370645900553%40arcas%3E). > They recommend using [Welford's > method](http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance_Online_algorithm). > I'm open to having a config option that uses exact versus approximate. I > don't have experience implementing an approximate implementation, so I'm not > sure how much state is required to keep on the server and return to the > client (other than realizing it'd be much less that returning all distinct > values and their counts). -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (PHOENIX-418) Support approximate COUNT DISTINCT
[ https://issues.apache.org/jira/browse/PHOENIX-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16112172#comment-16112172 ] James Taylor commented on PHOENIX-418: -- Patch doesn't look right - it includes commits outside of yours. Try generating with the following command after committing it to your local repo: {code} git format-patch --stdout HEAD^ > PHOENIX-{NUMBER}.patch {code} > Support approximate COUNT DISTINCT > -- > > Key: PHOENIX-418 > URL: https://issues.apache.org/jira/browse/PHOENIX-418 > Project: Phoenix > Issue Type: Task >Reporter: James Taylor >Assignee: Ethan Wang > Labels: gsoc2016 > Attachments: PHOENIX-418-v1.patch > > > Support an "approximation" of count distinct to prevent having to hold on to > all distinct values (since this will not scale well when the number of > distinct values is huge). The Apache Drill folks have had some interesting > discussions on this > [here](http://mail-archives.apache.org/mod_mbox/incubator-drill-dev/201306.mbox/%3CJIRA.12650169.1369931282407.88049.1370645900553%40arcas%3E). > They recommend using [Welford's > method](http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance_Online_algorithm). > I'm open to having a config option that uses exact versus approximate. I > don't have experience implementing an approximate implementation, so I'm not > sure how much state is required to keep on the server and return to the > client (other than realizing it'd be much less that returning all distinct > values and their counts). -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (PHOENIX-418) Support approximate COUNT DISTINCT
[ https://issues.apache.org/jira/browse/PHOENIX-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16110419#comment-16110419 ] Hadoop QA commented on PHOENIX-418: --- {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12879968/PHOENIX-418-v1.patch against master branch at commit 5e33dc12bc088bd0008d89f0a5cd7d5c368efa25. ATTACHMENT ID: 12879968 {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 4 new or modified tests. {color:red}-1 patch{color}. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-PHOENIX-Build/1242//console This message is automatically generated. > Support approximate COUNT DISTINCT > -- > > Key: PHOENIX-418 > URL: https://issues.apache.org/jira/browse/PHOENIX-418 > Project: Phoenix > Issue Type: Task >Reporter: James Taylor >Assignee: Ethan Wang > Labels: gsoc2016 > Attachments: PHOENIX-418-v1.patch > > > Support an "approximation" of count distinct to prevent having to hold on to > all distinct values (since this will not scale well when the number of > distinct values is huge). The Apache Drill folks have had some interesting > discussions on this > [here](http://mail-archives.apache.org/mod_mbox/incubator-drill-dev/201306.mbox/%3CJIRA.12650169.1369931282407.88049.1370645900553%40arcas%3E). > They recommend using [Welford's > method](http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance_Online_algorithm). > I'm open to having a config option that uses exact versus approximate. I > don't have experience implementing an approximate implementation, so I'm not > sure how much state is required to keep on the server and return to the > client (other than realizing it'd be much less that returning all distinct > values and their counts). -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (PHOENIX-418) Support approximate COUNT DISTINCT
[ https://issues.apache.org/jira/browse/PHOENIX-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16108353#comment-16108353 ] James Taylor commented on PHOENIX-418: -- Let's just keep it simple and support APPROX_COUNT_DISTINCT. That way we don't need any grammar changes. > Support approximate COUNT DISTINCT > -- > > Key: PHOENIX-418 > URL: https://issues.apache.org/jira/browse/PHOENIX-418 > Project: Phoenix > Issue Type: Task >Reporter: James Taylor >Assignee: Ethan Wang > Labels: gsoc2016 > > Support an "approximation" of count distinct to prevent having to hold on to > all distinct values (since this will not scale well when the number of > distinct values is huge). The Apache Drill folks have had some interesting > discussions on this > [here](http://mail-archives.apache.org/mod_mbox/incubator-drill-dev/201306.mbox/%3CJIRA.12650169.1369931282407.88049.1370645900553%40arcas%3E). > They recommend using [Welford's > method](http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance_Online_algorithm). > I'm open to having a config option that uses exact versus approximate. I > don't have experience implementing an approximate implementation, so I'm not > sure how much state is required to keep on the server and return to the > client (other than realizing it'd be much less that returning all distinct > values and their counts). -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (PHOENIX-418) Support approximate COUNT DISTINCT
[ https://issues.apache.org/jira/browse/PHOENIX-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16108332#comment-16108332 ] Ethan Wang commented on PHOENIX-418: +1 on {quote}select count(distinct name) from person APPROXIMATE () select count(distinct name) from person APPROXIMATE (ALGORITHM 'hll') select count(distinct name) from person APPROXIMATE (ALGORITHM 'ABC' WITHIN 10 PERCENT) select count(distinct name) APPROXIMATE () from person select count(distinct name) APPROXIMATE (ALGORITHM 'hll') from person select count(distinct name) APPROXIMATE (ALGORITHM 'ABC' WITHIN 10 PERCENT) from person {quote} The current patch that is in progress basically align with this suggestion, supporting both APPROX_COUNT_DISTINCT(xxx) and select count(name) from person APPROXIMATE ('hll') Also, for hyperLogLog algorithm, suggesting PHOENIX to be as similar as Druid(CALCITE-1588), in that: 1, not including accuracy option for user to specify. 2, hard coded the two parameters during HLL initialization. (i.e., the precision value for the normal set and the precision value for the sparse set) > Support approximate COUNT DISTINCT > -- > > Key: PHOENIX-418 > URL: https://issues.apache.org/jira/browse/PHOENIX-418 > Project: Phoenix > Issue Type: Task >Reporter: James Taylor >Assignee: Ethan Wang > Labels: gsoc2016 > > Support an "approximation" of count distinct to prevent having to hold on to > all distinct values (since this will not scale well when the number of > distinct values is huge). The Apache Drill folks have had some interesting > discussions on this > [here](http://mail-archives.apache.org/mod_mbox/incubator-drill-dev/201306.mbox/%3CJIRA.12650169.1369931282407.88049.1370645900553%40arcas%3E). > They recommend using [Welford's > method](http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance_Online_algorithm). > I'm open to having a config option that uses exact versus approximate. I > don't have experience implementing an approximate implementation, so I'm not > sure how much state is required to keep on the server and return to the > client (other than realizing it'd be much less that returning all distinct > values and their counts). -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (PHOENIX-418) Support approximate COUNT DISTINCT
[ https://issues.apache.org/jira/browse/PHOENIX-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16107832#comment-16107832 ] Julian Hyde commented on PHOENIX-418: - In CALCITE-1588 [~gian] points out that Oracle, BigQuery and MemSQL support {{APPROX_COUNT_DISTINCT}}. I also see it in VoltDB. A quick survey of other databases: * Vertica has {{APPROXIMATE_COUNT_DISTINCT}} * Redshift has {{[ APPROXIMATE ] COUNT ( [ DISTINCT | ALL ] * | expression )}}. * In PostgreSQL you can bolt on your own hyperloglog function, but there doesn't seem to have a unified approach. * I don't see anything in DB2 or MySQL I think that is a sufficient de facto standard to support {{APPROX_COUNT_DISTINCT}} in Calcite. Also I think Calcite should support an APPROXIMATE clause. Allowed both as a clause in the SELECT statement, and also after the aggregate function (but before the OVER clause, if present). Algorithm and any parameters go inside parentheses. Examples: {code} select count(distinct name) from person APPROXIMATE () select count(distinct name) from person APPROXIMATE (ALGORITHM 'hll') select count(distinct name) from person APPROXIMATE (ALGORITHM 'ABC' WITHIN 10 PERCENT) select count(distinct name) APPROXIMATE () from person select count(distinct name) APPROXIMATE (ALGORITHM 'hll') from person select count(distinct name) APPROXIMATE (ALGORITHM 'ABC' WITHIN 10 PERCENT) from person {code} For now, I think Phoenix should support APPROX_COUNT_DISTINCT. We could add support for the APPROXIMATE clause later. > Support approximate COUNT DISTINCT > -- > > Key: PHOENIX-418 > URL: https://issues.apache.org/jira/browse/PHOENIX-418 > Project: Phoenix > Issue Type: Task >Reporter: James Taylor >Assignee: Ethan Wang > Labels: gsoc2016 > > Support an "approximation" of count distinct to prevent having to hold on to > all distinct values (since this will not scale well when the number of > distinct values is huge). The Apache Drill folks have had some interesting > discussions on this > [here](http://mail-archives.apache.org/mod_mbox/incubator-drill-dev/201306.mbox/%3CJIRA.12650169.1369931282407.88049.1370645900553%40arcas%3E). > They recommend using [Welford's > method](http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance_Online_algorithm). > I'm open to having a config option that uses exact versus approximate. I > don't have experience implementing an approximate implementation, so I'm not > sure how much state is required to keep on the server and return to the > client (other than realizing it'd be much less that returning all distinct > values and their counts). -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (PHOENIX-418) Support approximate COUNT DISTINCT
[ https://issues.apache.org/jira/browse/PHOENIX-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16106864#comment-16106864 ] Ethan Wang commented on PHOENIX-418: Regarding the syntax of approximate discount. Carry on from the discussion from PHOENIX-3390, purposing the syntax to be Original cardinality count function: select count(distinct name) from person With approximate: select count(distinct name) from person APPROXIMATE select count(distinct name) from person APPROXIMATE 'hll' select count(distinct name) from person APPROXIMATE 'algorithm ABC' (WITHIN 10 PERCENT) > Support approximate COUNT DISTINCT > -- > > Key: PHOENIX-418 > URL: https://issues.apache.org/jira/browse/PHOENIX-418 > Project: Phoenix > Issue Type: Task >Reporter: James Taylor >Assignee: Ethan Wang > Labels: gsoc2016 > > Support an "approximation" of count distinct to prevent having to hold on to > all distinct values (since this will not scale well when the number of > distinct values is huge). The Apache Drill folks have had some interesting > discussions on this > [here](http://mail-archives.apache.org/mod_mbox/incubator-drill-dev/201306.mbox/%3CJIRA.12650169.1369931282407.88049.1370645900553%40arcas%3E). > They recommend using [Welford's > method](http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance_Online_algorithm). > I'm open to having a config option that uses exact versus approximate. I > don't have experience implementing an approximate implementation, so I'm not > sure how much state is required to keep on the server and return to the > client (other than realizing it'd be much less that returning all distinct > values and their counts). -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (PHOENIX-418) Support approximate COUNT DISTINCT
[ https://issues.apache.org/jira/browse/PHOENIX-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16052447#comment-16052447 ] Ethan Wang commented on PHOENIX-418: What is the current state of this jira? @maghamravikiran [~maghamraviki...@gmail.com] > Support approximate COUNT DISTINCT > -- > > Key: PHOENIX-418 > URL: https://issues.apache.org/jira/browse/PHOENIX-418 > Project: Phoenix > Issue Type: Task >Reporter: James Taylor >Assignee: maghamravikiran > Labels: gsoc2016 > > Support an "approximation" of count distinct to prevent having to hold on to > all distinct values (since this will not scale well when the number of > distinct values is huge). The Apache Drill folks have had some interesting > discussions on this > [here](http://mail-archives.apache.org/mod_mbox/incubator-drill-dev/201306.mbox/%3CJIRA.12650169.1369931282407.88049.1370645900553%40arcas%3E). > They recommend using [Welford's > method](http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance_Online_algorithm). > I'm open to having a config option that uses exact versus approximate. I > don't have experience implementing an approximate implementation, so I'm not > sure how much state is required to keep on the server and return to the > client (other than realizing it'd be much less that returning all distinct > values and their counts). -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (PHOENIX-418) Support approximate COUNT DISTINCT
[ https://issues.apache.org/jira/browse/PHOENIX-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16052446#comment-16052446 ] Ethan Wang commented on PHOENIX-418: Reply to [~pctony]: also in Blink DB, they have similar approximate aggregate feature which they use a sql clause. Example would be: `select count(id) From table ERROR 0.1 CONFIDENCE 95% > Support approximate COUNT DISTINCT > -- > > Key: PHOENIX-418 > URL: https://issues.apache.org/jira/browse/PHOENIX-418 > Project: Phoenix > Issue Type: Task >Reporter: James Taylor >Assignee: maghamravikiran > Labels: gsoc2016 > > Support an "approximation" of count distinct to prevent having to hold on to > all distinct values (since this will not scale well when the number of > distinct values is huge). The Apache Drill folks have had some interesting > discussions on this > [here](http://mail-archives.apache.org/mod_mbox/incubator-drill-dev/201306.mbox/%3CJIRA.12650169.1369931282407.88049.1370645900553%40arcas%3E). > They recommend using [Welford's > method](http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance_Online_algorithm). > I'm open to having a config option that uses exact versus approximate. I > don't have experience implementing an approximate implementation, so I'm not > sure how much state is required to keep on the server and return to the > client (other than realizing it'd be much less that returning all distinct > values and their counts). -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (PHOENIX-418) Support approximate COUNT DISTINCT
[ https://issues.apache.org/jira/browse/PHOENIX-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15191489#comment-15191489 ] Sergey Soldatov commented on PHOENIX-418: - In Hive there is a 3rd party UDF implemented HyperLogLog algo. Can it be done in Phoenix in the same way? > Support approximate COUNT DISTINCT > -- > > Key: PHOENIX-418 > URL: https://issues.apache.org/jira/browse/PHOENIX-418 > Project: Phoenix > Issue Type: Task >Reporter: James Taylor >Assignee: maghamravikiran > Labels: gsoc2016 > > Support an "approximation" of count distinct to prevent having to hold on to > all distinct values (since this will not scale well when the number of > distinct values is huge). The Apache Drill folks have had some interesting > discussions on this > [here](http://mail-archives.apache.org/mod_mbox/incubator-drill-dev/201306.mbox/%3CJIRA.12650169.1369931282407.88049.1370645900553%40arcas%3E). > They recommend using [Welford's > method](http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance_Online_algorithm). > I'm open to having a config option that uses exact versus approximate. I > don't have experience implementing an approximate implementation, so I'm not > sure how much state is required to keep on the server and return to the > client (other than realizing it'd be much less that returning all distinct > values and their counts). -- This message was sent by Atlassian JIRA (v6.3.4#6332)