[jira] [Commented] (PHOENIX-418) Support approximate COUNT DISTINCT

2017-08-30 Thread Lars Hofhansl (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16148425#comment-16148425
 ] 

Lars Hofhansl commented on PHOENIX-418:
---

You guys kick a**. :)

> Support approximate COUNT DISTINCT
> --
>
> Key: PHOENIX-418
> URL: https://issues.apache.org/jira/browse/PHOENIX-418
> Project: Phoenix
>  Issue Type: New Feature
>Reporter: James Taylor
>Assignee: Ethan Wang
>Priority: Blocker
>  Labels: gsoc2016
> Fix For: 4.12.0
>
> Attachments: PHOENIX-418-v1.patch, PHOENIX-418-v2.patch, 
> PHOENIX-418-v3.patch, PHOENIX-418-v4.patch, PHOENIX-418-v5.patch, 
> PHOENIX-418-v6.patch, PHOENIX-418-v6-pom.patch
>
>
> Support an "approximation" of count distinct to prevent having to hold on to 
> all distinct values (since this will not scale well when the number of 
> distinct values is huge). The Apache Drill folks have had some interesting 
> discussions on this 
> [here](http://mail-archives.apache.org/mod_mbox/incubator-drill-dev/201306.mbox/%3CJIRA.12650169.1369931282407.88049.1370645900553%40arcas%3E).
>  They recommend using  [Welford's 
> method](http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance_Online_algorithm).
>  I'm open to having a config option that uses exact versus approximate. I 
> don't have experience implementing an approximate implementation, so I'm not 
> sure how much state is required to keep on the server and return to the 
> client (other than realizing it'd be much less that returning all distinct 
> values and their counts).
> Update:
> Syntax of using approximate count distinct as:
> select APPROX_COUNT_DISTINCT(name) from person
> select APPROX_COUNT_DISTINCT(address||name) from person
> It is equivalent of  Select COUNT(DISTINCT ID) from person. Implemented using 
> hyperloglog, see discuss below.
> Source code patch link below, co-authorred with [~swapna]
> https://git-wip-us.apache.org/repos/asf?p=phoenix.git;a=commitdiff;h=d6381afc3af976ccdbb874d4458ea17b1e8a1d32



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (PHOENIX-418) Support approximate COUNT DISTINCT

2017-08-30 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16148175#comment-16148175
 ] 

Hadoop QA commented on PHOENIX-418:
---

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12884517/PHOENIX-418-v6-pom.patch
  against master branch at commit e6c1f01c5f7d5c9996017714efe90202a95355b2.
  ATTACHMENT ID: 12884517

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+0 tests included{color}.  The patch appears to be a 
documentation, build,
or dev patch that doesn't require tests.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:red}-1 javadoc{color}.  The javadoc tool appears to have generated 
62 warning messages.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 lineLengths{color}.  The patch does not introduce lines 
longer than 100

 {color:red}-1 core tests{color}.  The patch failed these unit tests:
 
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.index.MutableIndexFailureIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.GroupByIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.TableSnapshotReadsMapReduceIT

Test results: 
https://builds.apache.org/job/PreCommit-PHOENIX-Build/1325//testReport/
Javadoc warnings: 
https://builds.apache.org/job/PreCommit-PHOENIX-Build/1325//artifact/patchprocess/patchJavadocWarnings.txt
Console output: 
https://builds.apache.org/job/PreCommit-PHOENIX-Build/1325//console

This message is automatically generated.

> Support approximate COUNT DISTINCT
> --
>
> Key: PHOENIX-418
> URL: https://issues.apache.org/jira/browse/PHOENIX-418
> Project: Phoenix
>  Issue Type: New Feature
>Reporter: James Taylor
>Assignee: Ethan Wang
>Priority: Blocker
>  Labels: gsoc2016
> Fix For: 4.12.0
>
> Attachments: PHOENIX-418-v1.patch, PHOENIX-418-v2.patch, 
> PHOENIX-418-v3.patch, PHOENIX-418-v4.patch, PHOENIX-418-v5.patch, 
> PHOENIX-418-v6.patch, PHOENIX-418-v6-pom.patch
>
>
> Support an "approximation" of count distinct to prevent having to hold on to 
> all distinct values (since this will not scale well when the number of 
> distinct values is huge). The Apache Drill folks have had some interesting 
> discussions on this 
> [here](http://mail-archives.apache.org/mod_mbox/incubator-drill-dev/201306.mbox/%3CJIRA.12650169.1369931282407.88049.1370645900553%40arcas%3E).
>  They recommend using  [Welford's 
> method](http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance_Online_algorithm).
>  I'm open to having a config option that uses exact versus approximate. I 
> don't have experience implementing an approximate implementation, so I'm not 
> sure how much state is required to keep on the server and return to the 
> client (other than realizing it'd be much less that returning all distinct 
> values and their counts).
> Update:
> Syntax of using approximate count distinct as:
> select APPROX_COUNT_DISTINCT(name) from person
> select APPROX_COUNT_DISTINCT(address||name) from person
> It is equivalent of  Select COUNT(DISTINCT ID) from person. Implemented using 
> hyperloglog, see discuss below.
> Source code patch link below, co-authorred with [~swapna]
> https://git-wip-us.apache.org/repos/asf?p=phoenix.git;a=commitdiff;h=d6381afc3af976ccdbb874d4458ea17b1e8a1d32



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (PHOENIX-418) Support approximate COUNT DISTINCT

2017-08-30 Thread James Taylor (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16147867#comment-16147867
 ] 

James Taylor commented on PHOENIX-418:
--

+1. Thanks for the quick turnaround, [~aertoria] and for the help 
[~mujtabachohan].

> Support approximate COUNT DISTINCT
> --
>
> Key: PHOENIX-418
> URL: https://issues.apache.org/jira/browse/PHOENIX-418
> Project: Phoenix
>  Issue Type: New Feature
>Reporter: James Taylor
>Assignee: Ethan Wang
>Priority: Blocker
>  Labels: gsoc2016
> Fix For: 4.12.0
>
> Attachments: PHOENIX-418-v1.patch, PHOENIX-418-v2.patch, 
> PHOENIX-418-v3.patch, PHOENIX-418-v4.patch, PHOENIX-418-v5.patch, 
> PHOENIX-418-v6.patch, PHOENIX-418-v6-pom.patch
>
>
> Support an "approximation" of count distinct to prevent having to hold on to 
> all distinct values (since this will not scale well when the number of 
> distinct values is huge). The Apache Drill folks have had some interesting 
> discussions on this 
> [here](http://mail-archives.apache.org/mod_mbox/incubator-drill-dev/201306.mbox/%3CJIRA.12650169.1369931282407.88049.1370645900553%40arcas%3E).
>  They recommend using  [Welford's 
> method](http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance_Online_algorithm).
>  I'm open to having a config option that uses exact versus approximate. I 
> don't have experience implementing an approximate implementation, so I'm not 
> sure how much state is required to keep on the server and return to the 
> client (other than realizing it'd be much less that returning all distinct 
> values and their counts).
> Update:
> Syntax of using approximate count distinct as:
> select APPROX_COUNT_DISTINCT(name) from person
> select APPROX_COUNT_DISTINCT(address||name) from person
> It is equivalent of  Select COUNT(DISTINCT ID) from person. Implemented using 
> hyperloglog, see discuss below.
> Source code patch link below, co-authorred with [~swapna]
> https://git-wip-us.apache.org/repos/asf?p=phoenix.git;a=commitdiff;h=d6381afc3af976ccdbb874d4458ea17b1e8a1d32



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (PHOENIX-418) Support approximate COUNT DISTINCT

2017-08-30 Thread James Taylor (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16147721#comment-16147721
 ] 

James Taylor commented on PHOENIX-418:
--

By updating the assembly of the phoenix server jar in phoenix-assembly, you can 
get the clearspring classes bundled in the jar so that copy isn't necessary.

> Support approximate COUNT DISTINCT
> --
>
> Key: PHOENIX-418
> URL: https://issues.apache.org/jira/browse/PHOENIX-418
> Project: Phoenix
>  Issue Type: New Feature
>Reporter: James Taylor
>Assignee: Ethan Wang
>Priority: Blocker
>  Labels: gsoc2016
> Fix For: 4.12.0
>
> Attachments: PHOENIX-418-v1.patch, PHOENIX-418-v2.patch, 
> PHOENIX-418-v3.patch, PHOENIX-418-v4.patch, PHOENIX-418-v5.patch, 
> PHOENIX-418-v6.patch
>
>
> Support an "approximation" of count distinct to prevent having to hold on to 
> all distinct values (since this will not scale well when the number of 
> distinct values is huge). The Apache Drill folks have had some interesting 
> discussions on this 
> [here](http://mail-archives.apache.org/mod_mbox/incubator-drill-dev/201306.mbox/%3CJIRA.12650169.1369931282407.88049.1370645900553%40arcas%3E).
>  They recommend using  [Welford's 
> method](http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance_Online_algorithm).
>  I'm open to having a config option that uses exact versus approximate. I 
> don't have experience implementing an approximate implementation, so I'm not 
> sure how much state is required to keep on the server and return to the 
> client (other than realizing it'd be much less that returning all distinct 
> values and their counts).
> Update:
> Syntax of using approximate count distinct as:
> select APPROX_COUNT_DISTINCT(name) from person
> select APPROX_COUNT_DISTINCT(address||name) from person
> It is equivalent of  Select COUNT(DISTINCT ID) from person. Implemented using 
> hyperloglog, see discuss below.
> Source code patch link below, co-authorred with [~swapna]
> https://git-wip-us.apache.org/repos/asf?p=phoenix.git;a=commitdiff;h=d6381afc3af976ccdbb874d4458ea17b1e8a1d32



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (PHOENIX-418) Support approximate COUNT DISTINCT

2017-08-30 Thread Ethan Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16147704#comment-16147704
 ] 

Ethan Wang commented on PHOENIX-418:


I see. I forgot to mention that, for me the local test to work, I have to copy 
both the compiled phoenix-server.jar as well as a stream.jar to hbase lib 
folder. I will go prepare another patch

> Support approximate COUNT DISTINCT
> --
>
> Key: PHOENIX-418
> URL: https://issues.apache.org/jira/browse/PHOENIX-418
> Project: Phoenix
>  Issue Type: New Feature
>Reporter: James Taylor
>Assignee: Ethan Wang
>Priority: Blocker
>  Labels: gsoc2016
> Fix For: 4.12.0
>
> Attachments: PHOENIX-418-v1.patch, PHOENIX-418-v2.patch, 
> PHOENIX-418-v3.patch, PHOENIX-418-v4.patch, PHOENIX-418-v5.patch, 
> PHOENIX-418-v6.patch
>
>
> Support an "approximation" of count distinct to prevent having to hold on to 
> all distinct values (since this will not scale well when the number of 
> distinct values is huge). The Apache Drill folks have had some interesting 
> discussions on this 
> [here](http://mail-archives.apache.org/mod_mbox/incubator-drill-dev/201306.mbox/%3CJIRA.12650169.1369931282407.88049.1370645900553%40arcas%3E).
>  They recommend using  [Welford's 
> method](http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance_Online_algorithm).
>  I'm open to having a config option that uses exact versus approximate. I 
> don't have experience implementing an approximate implementation, so I'm not 
> sure how much state is required to keep on the server and return to the 
> client (other than realizing it'd be much less that returning all distinct 
> values and their counts).
> Update:
> Syntax of using approximate count distinct as:
> select APPROX_COUNT_DISTINCT(name) from person
> select APPROX_COUNT_DISTINCT(address||name) from person
> It is equivalent of  Select COUNT(DISTINCT ID) from person. Implemented using 
> hyperloglog, see discuss below.
> Source code patch link below, co-authorred with [~swapna]
> https://git-wip-us.apache.org/repos/asf?p=phoenix.git;a=commitdiff;h=d6381afc3af976ccdbb874d4458ea17b1e8a1d32



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (PHOENIX-418) Support approximate COUNT DISTINCT

2017-08-30 Thread James Taylor (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16147690#comment-16147690
 ] 

James Taylor commented on PHOENIX-418:
--

[~aertoria] - please see comment above. Need to get the clearspring library 
into the server jar. This is controlled by phoenix-assembly. [~mujtabachohan], 
[~samarthjain], or [~tdsilva] can help if you run into any issues.

> Support approximate COUNT DISTINCT
> --
>
> Key: PHOENIX-418
> URL: https://issues.apache.org/jira/browse/PHOENIX-418
> Project: Phoenix
>  Issue Type: New Feature
>Reporter: James Taylor
>Assignee: Ethan Wang
>Priority: Blocker
>  Labels: gsoc2016
> Fix For: 4.12.0
>
> Attachments: PHOENIX-418-v1.patch, PHOENIX-418-v2.patch, 
> PHOENIX-418-v3.patch, PHOENIX-418-v4.patch, PHOENIX-418-v5.patch, 
> PHOENIX-418-v6.patch
>
>
> Support an "approximation" of count distinct to prevent having to hold on to 
> all distinct values (since this will not scale well when the number of 
> distinct values is huge). The Apache Drill folks have had some interesting 
> discussions on this 
> [here](http://mail-archives.apache.org/mod_mbox/incubator-drill-dev/201306.mbox/%3CJIRA.12650169.1369931282407.88049.1370645900553%40arcas%3E).
>  They recommend using  [Welford's 
> method](http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance_Online_algorithm).
>  I'm open to having a config option that uses exact versus approximate. I 
> don't have experience implementing an approximate implementation, so I'm not 
> sure how much state is required to keep on the server and return to the 
> client (other than realizing it'd be much less that returning all distinct 
> values and their counts).
> Update:
> Syntax of using approximate count distinct as:
> select APPROX_COUNT_DISTINCT(name) from person
> select APPROX_COUNT_DISTINCT(address||name) from person
> It is equivalent of  Select COUNT(DISTINCT ID) from person. Implemented using 
> hyperloglog, see discuss below.
> Source code patch link below, co-authorred with [~swapna]
> https://git-wip-us.apache.org/repos/asf?p=phoenix.git;a=commitdiff;h=d6381afc3af976ccdbb874d4458ea17b1e8a1d32



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (PHOENIX-418) Support approximate COUNT DISTINCT

2017-08-30 Thread Lars Hofhansl (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16146707#comment-16146707
 ] 

Lars Hofhansl commented on PHOENIX-418:
---

Shouldn't the clearspring library be part of the server jar? I just did a clean 
build from scratch, and indeed APPROX_COUNT_DISTINCT fails with
{code}
...
Caused by: java.lang.NoClassDefFoundError: 
com/clearspring/analytics/stream/cardinality/HyperLogLogPlus
...
{code}
thrown at the server.


> Support approximate COUNT DISTINCT
> --
>
> Key: PHOENIX-418
> URL: https://issues.apache.org/jira/browse/PHOENIX-418
> Project: Phoenix
>  Issue Type: Task
>Reporter: James Taylor
>Assignee: Ethan Wang
>  Labels: gsoc2016
> Attachments: PHOENIX-418-v1.patch, PHOENIX-418-v2.patch, 
> PHOENIX-418-v3.patch, PHOENIX-418-v4.patch, PHOENIX-418-v5.patch, 
> PHOENIX-418-v6.patch
>
>
> Support an "approximation" of count distinct to prevent having to hold on to 
> all distinct values (since this will not scale well when the number of 
> distinct values is huge). The Apache Drill folks have had some interesting 
> discussions on this 
> [here](http://mail-archives.apache.org/mod_mbox/incubator-drill-dev/201306.mbox/%3CJIRA.12650169.1369931282407.88049.1370645900553%40arcas%3E).
>  They recommend using  [Welford's 
> method](http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance_Online_algorithm).
>  I'm open to having a config option that uses exact versus approximate. I 
> don't have experience implementing an approximate implementation, so I'm not 
> sure how much state is required to keep on the server and return to the 
> client (other than realizing it'd be much less that returning all distinct 
> values and their counts).
> Update:
> Syntax of using approximate count distinct as:
> select APPROX_COUNT_DISTINCT(name) from person
> select APPROX_COUNT_DISTINCT(address||name) from person
> It is equivalent of  Select COUNT(DISTINCT ID) from person. Implemented using 
> hyperloglog, see discuss below.
> Source code patch link below, co-authorred with [~swapna]
> https://git-wip-us.apache.org/repos/asf?p=phoenix.git;a=commitdiff;h=d6381afc3af976ccdbb874d4458ea17b1e8a1d32



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (PHOENIX-418) Support approximate COUNT DISTINCT

2017-08-28 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16144645#comment-16144645
 ] 

Hudson commented on PHOENIX-418:


FAILURE: Integrated in Jenkins build Phoenix-master #1753 (See 
[https://builds.apache.org/job/Phoenix-master/1753/])
PHOENIX-418 Support approximate COUNT DISTINCT (Ethan Wang) (jtaylor: rev 
d6381afc3af976ccdbb874d4458ea17b1e8a1d32)
* (add) 
phoenix-core/src/main/java/org/apache/phoenix/expression/function/DistinctCountHyperLogLogAggregateFunction.java
* (add) 
phoenix-core/src/it/java/org/apache/phoenix/end2end/CountDistinctApproximateHyperLogLogIT.java
* (edit) dev/release_files/NOTICE
* (edit) phoenix-core/pom.xml
* (edit) 
phoenix-core/src/main/java/org/apache/phoenix/expression/ExpressionType.java
* (add) 
phoenix-core/src/main/java/org/apache/phoenix/parse/DistinctCountHyperLogLogAggregateParseNode.java


> Support approximate COUNT DISTINCT
> --
>
> Key: PHOENIX-418
> URL: https://issues.apache.org/jira/browse/PHOENIX-418
> Project: Phoenix
>  Issue Type: Task
>Reporter: James Taylor
>Assignee: Ethan Wang
>  Labels: gsoc2016
> Attachments: PHOENIX-418-v1.patch, PHOENIX-418-v2.patch, 
> PHOENIX-418-v3.patch, PHOENIX-418-v4.patch, PHOENIX-418-v5.patch, 
> PHOENIX-418-v6.patch
>
>
> Support an "approximation" of count distinct to prevent having to hold on to 
> all distinct values (since this will not scale well when the number of 
> distinct values is huge). The Apache Drill folks have had some interesting 
> discussions on this 
> [here](http://mail-archives.apache.org/mod_mbox/incubator-drill-dev/201306.mbox/%3CJIRA.12650169.1369931282407.88049.1370645900553%40arcas%3E).
>  They recommend using  [Welford's 
> method](http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance_Online_algorithm).
>  I'm open to having a config option that uses exact versus approximate. I 
> don't have experience implementing an approximate implementation, so I'm not 
> sure how much state is required to keep on the server and return to the 
> client (other than realizing it'd be much less that returning all distinct 
> values and their counts).
> Update:
> Syntax of using approximate count distinct as:
> select APPROX_COUNT_DISTINCT(name) from person
> select APPROX_COUNT_DISTINCT(address||name) from person
> It is equivalent of  Select COUNT(DISTINCT ID) from person. Implemented using 
> hyperloglog, see discuss below.
> Merged patch link below, co-authorred with [~swapna]
> https://git-wip-us.apache.org/repos/asf?p=phoenix.git;a=commitdiff;h=d6381afc3af976ccdbb874d4458ea17b1e8a1d32



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (PHOENIX-418) Support approximate COUNT DISTINCT

2017-08-28 Thread James Taylor (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16144482#comment-16144482
 ] 

James Taylor commented on PHOENIX-418:
--

Do you know what the difference are, [~mujtabachohan] between "ubuntu-eu2" 
versus "H4"? Different JVMs?

> Support approximate COUNT DISTINCT
> --
>
> Key: PHOENIX-418
> URL: https://issues.apache.org/jira/browse/PHOENIX-418
> Project: Phoenix
>  Issue Type: Task
>Reporter: James Taylor
>Assignee: Ethan Wang
>  Labels: gsoc2016
> Attachments: PHOENIX-418-v1.patch, PHOENIX-418-v2.patch, 
> PHOENIX-418-v3.patch, PHOENIX-418-v4.patch, PHOENIX-418-v5.patch, 
> PHOENIX-418-v6.patch
>
>
> Support an "approximation" of count distinct to prevent having to hold on to 
> all distinct values (since this will not scale well when the number of 
> distinct values is huge). The Apache Drill folks have had some interesting 
> discussions on this 
> [here](http://mail-archives.apache.org/mod_mbox/incubator-drill-dev/201306.mbox/%3CJIRA.12650169.1369931282407.88049.1370645900553%40arcas%3E).
>  They recommend using  [Welford's 
> method](http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance_Online_algorithm).
>  I'm open to having a config option that uses exact versus approximate. I 
> don't have experience implementing an approximate implementation, so I'm not 
> sure how much state is required to keep on the server and return to the 
> client (other than realizing it'd be much less that returning all distinct 
> values and their counts).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (PHOENIX-418) Support approximate COUNT DISTINCT

2017-08-28 Thread James Taylor (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16144459#comment-16144459
 ] 

James Taylor commented on PHOENIX-418:
--

Thanks, [~aertoria]. I've pushed your last patch to 4.x and master branches. 
Nice work!

> Support approximate COUNT DISTINCT
> --
>
> Key: PHOENIX-418
> URL: https://issues.apache.org/jira/browse/PHOENIX-418
> Project: Phoenix
>  Issue Type: Task
>Reporter: James Taylor
>Assignee: Ethan Wang
>  Labels: gsoc2016
> Attachments: PHOENIX-418-v1.patch, PHOENIX-418-v2.patch, 
> PHOENIX-418-v3.patch, PHOENIX-418-v4.patch, PHOENIX-418-v5.patch, 
> PHOENIX-418-v6.patch
>
>
> Support an "approximation" of count distinct to prevent having to hold on to 
> all distinct values (since this will not scale well when the number of 
> distinct values is huge). The Apache Drill folks have had some interesting 
> discussions on this 
> [here](http://mail-archives.apache.org/mod_mbox/incubator-drill-dev/201306.mbox/%3CJIRA.12650169.1369931282407.88049.1370645900553%40arcas%3E).
>  They recommend using  [Welford's 
> method](http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance_Online_algorithm).
>  I'm open to having a config option that uses exact versus approximate. I 
> don't have experience implementing an approximate implementation, so I'm not 
> sure how much state is required to keep on the server and return to the 
> client (other than realizing it'd be much less that returning all distinct 
> values and their counts).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (PHOENIX-418) Support approximate COUNT DISTINCT

2017-08-28 Thread Ethan Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16144380#comment-16144380
 ] 

Ethan Wang commented on PHOENIX-418:


Changes has been made to pom for using 2.9.5. The flaky IT tests went away with 
this batch (v6).  
[~jamestaylor]

During the investigation, one thing observed is that both success runs happens 
when the Jenkins node "ubuntu-eu2" pick it up (versus "H4" during the weekend). 
I'm not sure if this is related. [~samarthjain]

> Support approximate COUNT DISTINCT
> --
>
> Key: PHOENIX-418
> URL: https://issues.apache.org/jira/browse/PHOENIX-418
> Project: Phoenix
>  Issue Type: Task
>Reporter: James Taylor
>Assignee: Ethan Wang
>  Labels: gsoc2016
> Attachments: PHOENIX-418-v1.patch, PHOENIX-418-v2.patch, 
> PHOENIX-418-v3.patch, PHOENIX-418-v4.patch, PHOENIX-418-v5.patch, 
> PHOENIX-418-v6.patch
>
>
> Support an "approximation" of count distinct to prevent having to hold on to 
> all distinct values (since this will not scale well when the number of 
> distinct values is huge). The Apache Drill folks have had some interesting 
> discussions on this 
> [here](http://mail-archives.apache.org/mod_mbox/incubator-drill-dev/201306.mbox/%3CJIRA.12650169.1369931282407.88049.1370645900553%40arcas%3E).
>  They recommend using  [Welford's 
> method](http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance_Online_algorithm).
>  I'm open to having a config option that uses exact versus approximate. I 
> don't have experience implementing an approximate implementation, so I'm not 
> sure how much state is required to keep on the server and return to the 
> client (other than realizing it'd be much less that returning all distinct 
> values and their counts).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (PHOENIX-418) Support approximate COUNT DISTINCT

2017-08-28 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16144378#comment-16144378
 ] 

Hadoop QA commented on PHOENIX-418:
---

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12884107/PHOENIX-418-v6.patch
  against master branch at commit 435441ea8ba336e1967b03cf84f1868c5ef14790.
  ATTACHMENT ID: 12884107

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:red}-1 javadoc{color}.  The javadoc tool appears to have generated 
56 warning messages.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 lineLengths{color}.  The patch introduces the following lines 
longer than 100:
+   String query = "SELECT APPROX_COUNT_DISTINCT(a.i1||a.i2||b.i2) 
FROM " + tableName + " a, " + tableName
+   final private void prepareTableWithValues(final Connection conn, final 
int nRows) throws Exception {
+   final PreparedStatement stmt = conn.prepareStatement("upsert 
into " + tableName + " VALUES (?, ?)");
+@BuiltInFunction(name=DistinctCountHyperLogLogAggregateFunction.NAME, 
nodeClass=DistinctCountHyperLogLogAggregateParseNode.class, args= {@Argument()} 
)
+public DistinctCountHyperLogLogAggregateFunction(List 
childExpressions, CountAggregateFunction delegate){
+   private HyperLogLogPlus hll = new 
HyperLogLogPlus(DistinctCountHyperLogLogAggregateFunction.NormalSetPrecision, 
DistinctCountHyperLogLogAggregateFunction.SparseSetPrecision);
+   private HyperLogLogPlus hll = new 
HyperLogLogPlus(DistinctCountHyperLogLogAggregateFunction.NormalSetPrecision, 
DistinctCountHyperLogLogAggregateFunction.SparseSetPrecision);
+   protected final ImmutableBytesWritable valueByteArray = new 
ImmutableBytesWritable(ByteUtil.EMPTY_BYTE_ARRAY);
+public DistinctCountHyperLogLogAggregateParseNode(String name, 
List children, BuiltInFunctionInfo info) {
+public FunctionExpression create(List children, 
StatementContext context) throws SQLException {

 {color:red}-1 core tests{color}.  The patch failed these unit tests:
 
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.QueryIT

Test results: 
https://builds.apache.org/job/PreCommit-PHOENIX-Build/1314//testReport/
Javadoc warnings: 
https://builds.apache.org/job/PreCommit-PHOENIX-Build/1314//artifact/patchprocess/patchJavadocWarnings.txt
Console output: 
https://builds.apache.org/job/PreCommit-PHOENIX-Build/1314//console

This message is automatically generated.

> Support approximate COUNT DISTINCT
> --
>
> Key: PHOENIX-418
> URL: https://issues.apache.org/jira/browse/PHOENIX-418
> Project: Phoenix
>  Issue Type: Task
>Reporter: James Taylor
>Assignee: Ethan Wang
>  Labels: gsoc2016
> Attachments: PHOENIX-418-v1.patch, PHOENIX-418-v2.patch, 
> PHOENIX-418-v3.patch, PHOENIX-418-v4.patch, PHOENIX-418-v5.patch, 
> PHOENIX-418-v6.patch
>
>
> Support an "approximation" of count distinct to prevent having to hold on to 
> all distinct values (since this will not scale well when the number of 
> distinct values is huge). The Apache Drill folks have had some interesting 
> discussions on this 
> [here](http://mail-archives.apache.org/mod_mbox/incubator-drill-dev/201306.mbox/%3CJIRA.12650169.1369931282407.88049.1370645900553%40arcas%3E).
>  They recommend using  [Welford's 
> method](http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance_Online_algorithm).
>  I'm open to having a config option that uses exact versus approximate. I 
> don't have experience implementing an approximate implementation, so I'm not 
> sure how much state is required to keep on the server and return to the 
> client (other than realizing it'd be much less that returning all distinct 
> values and their counts).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (PHOENIX-418) Support approximate COUNT DISTINCT

2017-08-28 Thread James Taylor (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16144178#comment-16144178
 ] 

James Taylor commented on PHOENIX-418:
--

[~aertoria]. Thanks for the updated patch. Please update the 
phoenix-core/pom.xml to use the 2.9.5 version of 
com.clearspring.analytics:stream instead of 2.7.0 if there's no reason not to 
go with the latest.

Any idea why the test runs are so flaky? I'm going to try running them locally 
too, but it seems like more than the usual suspects failing.

> Support approximate COUNT DISTINCT
> --
>
> Key: PHOENIX-418
> URL: https://issues.apache.org/jira/browse/PHOENIX-418
> Project: Phoenix
>  Issue Type: Task
>Reporter: James Taylor
>Assignee: Ethan Wang
>  Labels: gsoc2016
> Attachments: PHOENIX-418-v1.patch, PHOENIX-418-v2.patch, 
> PHOENIX-418-v3.patch, PHOENIX-418-v4.patch, PHOENIX-418-v5.patch
>
>
> Support an "approximation" of count distinct to prevent having to hold on to 
> all distinct values (since this will not scale well when the number of 
> distinct values is huge). The Apache Drill folks have had some interesting 
> discussions on this 
> [here](http://mail-archives.apache.org/mod_mbox/incubator-drill-dev/201306.mbox/%3CJIRA.12650169.1369931282407.88049.1370645900553%40arcas%3E).
>  They recommend using  [Welford's 
> method](http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance_Online_algorithm).
>  I'm open to having a config option that uses exact versus approximate. I 
> don't have experience implementing an approximate implementation, so I'm not 
> sure how much state is required to keep on the server and return to the 
> client (other than realizing it'd be much less that returning all distinct 
> values and their counts).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (PHOENIX-418) Support approximate COUNT DISTINCT

2017-08-28 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16143482#comment-16143482
 ] 

Hadoop QA commented on PHOENIX-418:
---

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12883988/PHOENIX-418-v5.patch
  against master branch at commit 435441ea8ba336e1967b03cf84f1868c5ef14790.
  ATTACHMENT ID: 12883988

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:red}-1 javadoc{color}.  The javadoc tool appears to have generated 
56 warning messages.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 lineLengths{color}.  The patch introduces the following lines 
longer than 100:
+   String query = "SELECT APPROX_COUNT_DISTINCT(a.i1||a.i2||b.i2) 
FROM " + tableName + " a, " + tableName
+   final private void prepareTableWithValues(final Connection conn, final 
int nRows) throws Exception {
+   final PreparedStatement stmt = conn.prepareStatement("upsert 
into " + tableName + " VALUES (?, ?)");
+@BuiltInFunction(name=DistinctCountHyperLogLogAggregateFunction.NAME, 
nodeClass=DistinctCountHyperLogLogAggregateParseNode.class, args= {@Argument()} 
)
+public DistinctCountHyperLogLogAggregateFunction(List 
childExpressions, CountAggregateFunction delegate){
+   private HyperLogLogPlus hll = new 
HyperLogLogPlus(DistinctCountHyperLogLogAggregateFunction.NormalSetPrecision, 
DistinctCountHyperLogLogAggregateFunction.SparseSetPrecision);
+   private HyperLogLogPlus hll = new 
HyperLogLogPlus(DistinctCountHyperLogLogAggregateFunction.NormalSetPrecision, 
DistinctCountHyperLogLogAggregateFunction.SparseSetPrecision);
+   protected final ImmutableBytesWritable valueByteArray = new 
ImmutableBytesWritable(ByteUtil.EMPTY_BYTE_ARRAY);
+public DistinctCountHyperLogLogAggregateParseNode(String name, 
List children, BuiltInFunctionInfo info) {
+public FunctionExpression create(List children, 
StatementContext context) throws SQLException {

 {color:red}-1 core tests{color}.  The patch failed these unit tests:
 
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.salted.SaltedTableVarLengthRowKeyIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.DerivedTableIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.SequenceIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.CustomEntityDataIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.TransactionalViewIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.index.MutableIndexReplicationIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.CreateTableIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.OrderByIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.FirstValuesFunctionIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.StatementHintsIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.RowValueConstructorIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.hadoop.hbase.regionserver.wal.WALRecoveryRegionPostOpenIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.PowerFunctionEnd2EndIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.InListIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.IsNullIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.index.txn.RollbackIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.StatsCollectorIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.DynamicUpsertIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.QueryExecWithoutSCNIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.SubqueryUsingSortMergeJoinIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.RenewLeaseIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.index.ChildViewsUseParentViewIndexIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.TopNIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.EvaluationOfORIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.CsvBulkLoadToolIT

[jira] [Commented] (PHOENIX-418) Support approximate COUNT DISTINCT

2017-08-26 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16142964#comment-16142964
 ] 

Hadoop QA commented on PHOENIX-418:
---

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12883953/PHOENIX-418-v9.patch
  against master branch at commit 435441ea8ba336e1967b03cf84f1868c5ef14790.
  ATTACHMENT ID: 12883953

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:red}-1 javadoc{color}.  The javadoc tool appears to have generated 
56 warning messages.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 lineLengths{color}.  The patch introduces the following lines 
longer than 100:
+   String query = "SELECT APPROX_COUNT_DISTINCT(a.i1||a.i2||b.i2) 
FROM " + tableName + " a, " + tableName
+   final private void prepareTableWithValues(final Connection conn, final 
int nRows) throws Exception {
+   final PreparedStatement stmt = conn.prepareStatement("upsert 
into " + tableName + " VALUES (?, ?)");
+@BuiltInFunction(name=DistinctCountHyperLogLogAggregateFunction.NAME, 
nodeClass=DistinctCountHyperLogLogAggregateParseNode.class, args= {@Argument()} 
)
+public DistinctCountHyperLogLogAggregateFunction(List 
childExpressions, CountAggregateFunction delegate){
+   private HyperLogLogPlus hll = new 
HyperLogLogPlus(DistinctCountHyperLogLogAggregateFunction.NormalSetPrecision, 
DistinctCountHyperLogLogAggregateFunction.SparseSetPrecision);
+   private HyperLogLogPlus hll = new 
HyperLogLogPlus(DistinctCountHyperLogLogAggregateFunction.NormalSetPrecision, 
DistinctCountHyperLogLogAggregateFunction.SparseSetPrecision);
+   protected final ImmutableBytesWritable valueByteArray = new 
ImmutableBytesWritable(ByteUtil.EMPTY_BYTE_ARRAY);
+public DistinctCountHyperLogLogAggregateParseNode(String name, 
List children, BuiltInFunctionInfo info) {
+public FunctionExpression create(List children, 
StatementContext context) throws SQLException {

 {color:red}-1 core tests{color}.  The patch failed these unit tests:
 
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.salted.SaltedTableVarLengthRowKeyIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.DerivedTableIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.SequenceIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.CustomEntityDataIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.TransactionalViewIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.index.MutableIndexReplicationIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.CreateTableIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.OrderByIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.FirstValuesFunctionIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.StatementHintsIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.RowValueConstructorIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.hadoop.hbase.regionserver.wal.WALRecoveryRegionPostOpenIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.PowerFunctionEnd2EndIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.InListIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.IsNullIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.index.txn.RollbackIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.StatsCollectorIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.DynamicUpsertIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.QueryExecWithoutSCNIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.SubqueryUsingSortMergeJoinIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.RenewLeaseIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.index.ChildViewsUseParentViewIndexIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.TopNIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.EvaluationOfORIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.CsvBulkLoadToolIT

[jira] [Commented] (PHOENIX-418) Support approximate COUNT DISTINCT

2017-08-25 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16142533#comment-16142533
 ] 

Hadoop QA commented on PHOENIX-418:
---

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12883785/PHOENIX-418-v7.patch
  against master branch at commit 435441ea8ba336e1967b03cf84f1868c5ef14790.
  ATTACHMENT ID: 12883785

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:red}-1 javadoc{color}.  The javadoc tool appears to have generated 
56 warning messages.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 lineLengths{color}.  The patch introduces the following lines 
longer than 100:
+   String query = "SELECT APPROX_COUNT_DISTINCT(a.i1||a.i2||b.i2) 
FROM " + tableName + " a, " + tableName
+   final private void prepareTableWithValues(final Connection conn, final 
int nRows) throws Exception {
+   final PreparedStatement stmt = conn.prepareStatement("upsert 
into " + tableName + " VALUES (?, ?)");
+@BuiltInFunction(name=DistinctCountHyperLogLogAggregateFunction.NAME, 
nodeClass=DistinctCountHyperLogLogAggregateParseNode.class, args= {@Argument()} 
)
+public DistinctCountHyperLogLogAggregateFunction(List 
childExpressions, CountAggregateFunction delegate){
+   private HyperLogLogPlus hll = new 
HyperLogLogPlus(DistinctCountHyperLogLogAggregateFunction.NormalSetPrecision, 
DistinctCountHyperLogLogAggregateFunction.SparseSetPrecision);
+   private HyperLogLogPlus hll = new 
HyperLogLogPlus(DistinctCountHyperLogLogAggregateFunction.NormalSetPrecision, 
DistinctCountHyperLogLogAggregateFunction.SparseSetPrecision);
+   protected final ImmutableBytesWritable valueByteArray = new 
ImmutableBytesWritable(ByteUtil.EMPTY_BYTE_ARRAY);
+public DistinctCountHyperLogLogAggregateParseNode(String name, 
List children, BuiltInFunctionInfo info) {
+public FunctionExpression create(List children, 
StatementContext context) throws SQLException {

 {color:red}-1 core tests{color}.  The patch failed these unit tests:
 
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.index.MutableIndexFailureIT

Test results: 
https://builds.apache.org/job/PreCommit-PHOENIX-Build/1305//testReport/
Javadoc warnings: 
https://builds.apache.org/job/PreCommit-PHOENIX-Build/1305//artifact/patchprocess/patchJavadocWarnings.txt
Console output: 
https://builds.apache.org/job/PreCommit-PHOENIX-Build/1305//console

This message is automatically generated.

> Support approximate COUNT DISTINCT
> --
>
> Key: PHOENIX-418
> URL: https://issues.apache.org/jira/browse/PHOENIX-418
> Project: Phoenix
>  Issue Type: Task
>Reporter: James Taylor
>Assignee: Ethan Wang
>  Labels: gsoc2016
> Attachments: PHOENIX-418-v1.patch, PHOENIX-418-v2.patch, 
> PHOENIX-418-v3.patch, PHOENIX-418-v4.patch, PHOENIX-418-v5.patch, 
> PHOENIX-418-v6.patch, PHOENIX-418-v7.patch
>
>
> Support an "approximation" of count distinct to prevent having to hold on to 
> all distinct values (since this will not scale well when the number of 
> distinct values is huge). The Apache Drill folks have had some interesting 
> discussions on this 
> [here](http://mail-archives.apache.org/mod_mbox/incubator-drill-dev/201306.mbox/%3CJIRA.12650169.1369931282407.88049.1370645900553%40arcas%3E).
>  They recommend using  [Welford's 
> method](http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance_Online_algorithm).
>  I'm open to having a config option that uses exact versus approximate. I 
> don't have experience implementing an approximate implementation, so I'm not 
> sure how much state is required to keep on the server and return to the 
> client (other than realizing it'd be much less that returning all distinct 
> values and their counts).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (PHOENIX-418) Support approximate COUNT DISTINCT

2017-08-25 Thread Josh Elser (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16141767#comment-16141767
 ] 

Josh Elser commented on PHOENIX-418:


bq. So we should add that text to our NOTICE file, right?

Correct, specifically to {{dev/release_files/NOTICE}} only (not the top-level 
NOTICE file as that is for our src-release which would not actually contain 
this JAR).

> Support approximate COUNT DISTINCT
> --
>
> Key: PHOENIX-418
> URL: https://issues.apache.org/jira/browse/PHOENIX-418
> Project: Phoenix
>  Issue Type: Task
>Reporter: James Taylor
>Assignee: Ethan Wang
>  Labels: gsoc2016
> Attachments: PHOENIX-418-v1.patch, PHOENIX-418-v2.patch, 
> PHOENIX-418-v3.patch, PHOENIX-418-v4.patch, PHOENIX-418-v5.patch, 
> PHOENIX-418-v6.patch
>
>
> Support an "approximation" of count distinct to prevent having to hold on to 
> all distinct values (since this will not scale well when the number of 
> distinct values is huge). The Apache Drill folks have had some interesting 
> discussions on this 
> [here](http://mail-archives.apache.org/mod_mbox/incubator-drill-dev/201306.mbox/%3CJIRA.12650169.1369931282407.88049.1370645900553%40arcas%3E).
>  They recommend using  [Welford's 
> method](http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance_Online_algorithm).
>  I'm open to having a config option that uses exact versus approximate. I 
> don't have experience implementing an approximate implementation, so I'm not 
> sure how much state is required to keep on the server and return to the 
> client (other than realizing it'd be much less that returning all distinct 
> values and their counts).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (PHOENIX-418) Support approximate COUNT DISTINCT

2017-08-25 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16141430#comment-16141430
 ] 

Hadoop QA commented on PHOENIX-418:
---

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12883652/PHOENIX-418-v6.patch
  against master branch at commit dd008f4fe4e67836a6fb362843c9618e4e518182.
  ATTACHMENT ID: 12883652

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:red}-1 javadoc{color}.  The javadoc tool appears to have generated 
56 warning messages.

{color:red}-1 release audit{color}.  The applied patch generated 2 release 
audit warnings (more than the master's current 0 warnings).

{color:red}-1 lineLengths{color}.  The patch introduces the following lines 
longer than 100:
+   String query = "SELECT APPROX_COUNT_DISTINCT(a.i1||a.i2||b.i2) 
FROM " + tableName + " a, " + tableName
+   final private void prepareTableWithValues(final Connection conn, final 
int nRows) throws Exception {
+   final PreparedStatement stmt = conn.prepareStatement("upsert 
into " + tableName + " VALUES (?, ?)");
+@BuiltInFunction(name=DistinctCountHyperLogLogAggregateFunction.NAME, 
nodeClass=DistinctCountHyperLogLogAggregateParseNode.class, args= {@Argument()} 
)
+public DistinctCountHyperLogLogAggregateFunction(List 
childExpressions, CountAggregateFunction delegate){
+   private HyperLogLogPlus hll = new 
HyperLogLogPlus(DistinctCountHyperLogLogAggregateFunction.NormalSetPrecision, 
DistinctCountHyperLogLogAggregateFunction.SparseSetPrecision);
+   private HyperLogLogPlus hll = new 
HyperLogLogPlus(DistinctCountHyperLogLogAggregateFunction.NormalSetPrecision, 
DistinctCountHyperLogLogAggregateFunction.SparseSetPrecision);
+   protected final ImmutableBytesWritable valueByteArray = new 
ImmutableBytesWritable(ByteUtil.EMPTY_BYTE_ARRAY);
+public DistinctCountHyperLogLogAggregateParseNode(String name, 
List children, BuiltInFunctionInfo info) {
+public FunctionExpression create(List children, 
StatementContext context) throws SQLException {

 {color:red}-1 core tests{color}.  The patch failed these unit tests:
 
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.salted.SaltedTableVarLengthRowKeyIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.DerivedTableIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.SequenceIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.CustomEntityDataIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.TransactionalViewIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.index.MutableIndexReplicationIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.CreateTableIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.OrderByIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.FirstValuesFunctionIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.StatementHintsIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.RowValueConstructorIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.hadoop.hbase.regionserver.wal.WALRecoveryRegionPostOpenIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.PowerFunctionEnd2EndIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.InListIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.IsNullIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.index.txn.RollbackIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.StatsCollectorIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.DynamicUpsertIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.QueryExecWithoutSCNIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.SubqueryUsingSortMergeJoinIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.RenewLeaseIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.index.ChildViewsUseParentViewIndexIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.TopNIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.EvaluationOfORIT

[jira] [Commented] (PHOENIX-418) Support approximate COUNT DISTINCT

2017-08-24 Thread James Taylor (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16141219#comment-16141219
 ] 

James Taylor commented on PHOENIX-418:
--

Thanks for the info, [~elserj]. So we should add that text to our NOTICE file, 
right?

> Support approximate COUNT DISTINCT
> --
>
> Key: PHOENIX-418
> URL: https://issues.apache.org/jira/browse/PHOENIX-418
> Project: Phoenix
>  Issue Type: Task
>Reporter: James Taylor
>Assignee: Ethan Wang
>  Labels: gsoc2016
> Attachments: PHOENIX-418-v1.patch, PHOENIX-418-v2.patch, 
> PHOENIX-418-v3.patch, PHOENIX-418-v4.patch, PHOENIX-418-v5.patch, 
> PHOENIX-418-v6.patch
>
>
> Support an "approximation" of count distinct to prevent having to hold on to 
> all distinct values (since this will not scale well when the number of 
> distinct values is huge). The Apache Drill folks have had some interesting 
> discussions on this 
> [here](http://mail-archives.apache.org/mod_mbox/incubator-drill-dev/201306.mbox/%3CJIRA.12650169.1369931282407.88049.1370645900553%40arcas%3E).
>  They recommend using  [Welford's 
> method](http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance_Online_algorithm).
>  I'm open to having a config option that uses exact versus approximate. I 
> don't have experience implementing an approximate implementation, so I'm not 
> sure how much state is required to keep on the server and return to the 
> client (other than realizing it'd be much less that returning all distinct 
> values and their counts).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (PHOENIX-418) Support approximate COUNT DISTINCT

2017-08-24 Thread Ethan Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16141126#comment-16141126
 ] 

Ethan Wang commented on PHOENIX-418:


Thanks [~elserj].
2.9.5 should be equally just fine.

> Support approximate COUNT DISTINCT
> --
>
> Key: PHOENIX-418
> URL: https://issues.apache.org/jira/browse/PHOENIX-418
> Project: Phoenix
>  Issue Type: Task
>Reporter: James Taylor
>Assignee: Ethan Wang
>  Labels: gsoc2016
> Attachments: PHOENIX-418-v1.patch, PHOENIX-418-v2.patch, 
> PHOENIX-418-v3.patch, PHOENIX-418-v4.patch, PHOENIX-418-v5.patch, 
> PHOENIX-418-v6.patch
>
>
> Support an "approximation" of count distinct to prevent having to hold on to 
> all distinct values (since this will not scale well when the number of 
> distinct values is huge). The Apache Drill folks have had some interesting 
> discussions on this 
> [here](http://mail-archives.apache.org/mod_mbox/incubator-drill-dev/201306.mbox/%3CJIRA.12650169.1369931282407.88049.1370645900553%40arcas%3E).
>  They recommend using  [Welford's 
> method](http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance_Online_algorithm).
>  I'm open to having a config option that uses exact versus approximate. I 
> don't have experience implementing an approximate implementation, so I'm not 
> sure how much state is required to keep on the server and return to the 
> client (other than realizing it'd be much less that returning all distinct 
> values and their counts).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (PHOENIX-418) Support approximate COUNT DISTINCT

2017-08-24 Thread Josh Elser (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16141100#comment-16141100
 ] 

Josh Elser commented on PHOENIX-418:


[~jamestaylor], thanks for thinking about the licensing! Turns out, there is 
something that we should propagate in the binary tarball

https://github.com/addthis/stream-lib/blob/master/NOTICE.txt

Specifically, we should include text, something like:

{noformat}
stream-lib
Copyright 2016 AddThis

This product includes software developed by AddThis.
{noformat}

Aside, it appears that there is a 2.9.5 version of 
{{com.clearspring.analytics:stream}} in maven-central. Is there a reason for us 
to use 2.7.0, specifically?

Neat little change, Ethan. I look forward to playing around with the feature :)

> Support approximate COUNT DISTINCT
> --
>
> Key: PHOENIX-418
> URL: https://issues.apache.org/jira/browse/PHOENIX-418
> Project: Phoenix
>  Issue Type: Task
>Reporter: James Taylor
>Assignee: Ethan Wang
>  Labels: gsoc2016
> Attachments: PHOENIX-418-v1.patch, PHOENIX-418-v2.patch, 
> PHOENIX-418-v3.patch, PHOENIX-418-v4.patch, PHOENIX-418-v5.patch, 
> PHOENIX-418-v6.patch
>
>
> Support an "approximation" of count distinct to prevent having to hold on to 
> all distinct values (since this will not scale well when the number of 
> distinct values is huge). The Apache Drill folks have had some interesting 
> discussions on this 
> [here](http://mail-archives.apache.org/mod_mbox/incubator-drill-dev/201306.mbox/%3CJIRA.12650169.1369931282407.88049.1370645900553%40arcas%3E).
>  They recommend using  [Welford's 
> method](http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance_Online_algorithm).
>  I'm open to having a config option that uses exact versus approximate. I 
> don't have experience implementing an approximate implementation, so I'm not 
> sure how much state is required to keep on the server and return to the 
> client (other than realizing it'd be much less that returning all distinct 
> values and their counts).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (PHOENIX-418) Support approximate COUNT DISTINCT

2017-08-24 Thread James Taylor (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16141019#comment-16141019
 ] 

James Taylor commented on PHOENIX-418:
--

One more thing: are any changes required to phoenix-assembly for the new jar, 
[~elserj] or [~mujtabachohan]? Might want to double check the license 
dependencies too, Josh. They looked ok to me.

> Support approximate COUNT DISTINCT
> --
>
> Key: PHOENIX-418
> URL: https://issues.apache.org/jira/browse/PHOENIX-418
> Project: Phoenix
>  Issue Type: Task
>Reporter: James Taylor
>Assignee: Ethan Wang
>  Labels: gsoc2016
> Attachments: PHOENIX-418-v1.patch, PHOENIX-418-v2.patch, 
> PHOENIX-418-v3.patch, PHOENIX-418-v4.patch, PHOENIX-418-v5.patch
>
>
> Support an "approximation" of count distinct to prevent having to hold on to 
> all distinct values (since this will not scale well when the number of 
> distinct values is huge). The Apache Drill folks have had some interesting 
> discussions on this 
> [here](http://mail-archives.apache.org/mod_mbox/incubator-drill-dev/201306.mbox/%3CJIRA.12650169.1369931282407.88049.1370645900553%40arcas%3E).
>  They recommend using  [Welford's 
> method](http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance_Online_algorithm).
>  I'm open to having a config option that uses exact versus approximate. I 
> don't have experience implementing an approximate implementation, so I'm not 
> sure how much state is required to keep on the server and return to the 
> client (other than realizing it'd be much less that returning all distinct 
> values and their counts).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (PHOENIX-418) Support approximate COUNT DISTINCT

2017-08-24 Thread James Taylor (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16141013#comment-16141013
 ] 

James Taylor commented on PHOENIX-418:
--

+1. I'll remove CountDistinctApproximateHyperLogLogStatsEnabledIT as I don't 
think we need that. Your local test runs look ok, [~aertoria]?

> Support approximate COUNT DISTINCT
> --
>
> Key: PHOENIX-418
> URL: https://issues.apache.org/jira/browse/PHOENIX-418
> Project: Phoenix
>  Issue Type: Task
>Reporter: James Taylor
>Assignee: Ethan Wang
>  Labels: gsoc2016
> Attachments: PHOENIX-418-v1.patch, PHOENIX-418-v2.patch, 
> PHOENIX-418-v3.patch, PHOENIX-418-v4.patch, PHOENIX-418-v5.patch
>
>
> Support an "approximation" of count distinct to prevent having to hold on to 
> all distinct values (since this will not scale well when the number of 
> distinct values is huge). The Apache Drill folks have had some interesting 
> discussions on this 
> [here](http://mail-archives.apache.org/mod_mbox/incubator-drill-dev/201306.mbox/%3CJIRA.12650169.1369931282407.88049.1370645900553%40arcas%3E).
>  They recommend using  [Welford's 
> method](http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance_Online_algorithm).
>  I'm open to having a config option that uses exact versus approximate. I 
> don't have experience implementing an approximate implementation, so I'm not 
> sure how much state is required to keep on the server and return to the 
> client (other than realizing it'd be much less that returning all distinct 
> values and their counts).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (PHOENIX-418) Support approximate COUNT DISTINCT

2017-08-24 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16140890#comment-16140890
 ] 

Hadoop QA commented on PHOENIX-418:
---

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12883574/PHOENIX-418-v5.patch
  against master branch at commit dd008f4fe4e67836a6fb362843c9618e4e518182.
  ATTACHMENT ID: 12883574

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:red}-1 javadoc{color}.  The javadoc tool appears to have generated 
56 warning messages.

{color:red}-1 release audit{color}.  The applied patch generated 2 release 
audit warnings (more than the master's current 0 warnings).

{color:red}-1 lineLengths{color}.  The patch introduces the following lines 
longer than 100:
+String tableName = initATableValues(null, tenantId, 
getDefaultSplits(tenantId), (Date)null, null, getUrl(), null);
+String tableName = initATableValues(null, tenantId, 
getDefaultSplits(tenantId), (Date)null, null, getUrl(), null);
+String tableName = initATableValues(null, tenantId, 
getDefaultSplits(tenantId), (Date)null, null, getUrl(), null);
+String tableName = initATableValues(null, tenantId, 
getDefaultSplits(tenantId), (Date)null, null, getUrl(), null);
+String query = "SELECT 
APPROX_COUNT_DISTINCT(ORGANIZATION_ID||A_STRING||B_STRING) FROM " + tableName;
+String tableName = initATableValues(null, tenantId, 
getDefaultSplits(tenantId), (Date)null, null, getUrl(), null);
+String query = "SELECT 
APPROX_COUNT_DISTINCT(a.ORGANIZATION_ID||a.A_STRING||b.ENTITY_ID) FROM "
+   + tableName + " a, "+ tableName + " b where 
a.ORGANIZATION_ID=b.ORGANIZATION_ID and a.ENTITY_ID = b.ENTITY_ID";
+String tableName = initATableValues(null, tenantId, 
getDefaultSplits(tenantId), (Date)null, null, getUrl(), null);
+String query = "explain SELECT 
APPROX_COUNT_DISTINCT(ORGANIZATION_ID||ENTITY_ID) FROM " + tableName;

 {color:red}-1 core tests{color}.  The patch failed these unit tests:
 
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.salted.SaltedTableVarLengthRowKeyIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.DerivedTableIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.SequenceIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.CustomEntityDataIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.TransactionalViewIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.index.MutableIndexReplicationIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.CreateTableIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.OrderByIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.FirstValuesFunctionIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.StatementHintsIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.RowValueConstructorIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.hadoop.hbase.regionserver.wal.WALRecoveryRegionPostOpenIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.PowerFunctionEnd2EndIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.InListIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.IsNullIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.index.txn.RollbackIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.StatsCollectorIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.DynamicUpsertIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.QueryExecWithoutSCNIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.SubqueryUsingSortMergeJoinIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.RenewLeaseIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.index.ChildViewsUseParentViewIndexIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.TopNIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.EvaluationOfORIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.CsvBulkLoadToolIT

[jira] [Commented] (PHOENIX-418) Support approximate COUNT DISTINCT

2017-08-24 Thread James Taylor (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16140664#comment-16140664
 ] 

James Taylor commented on PHOENIX-418:
--

bq. I think it also makes sense to have another test derived from 
ParallelStatsEnabledIT
Never mind about this - I was thinking of TABLESAMPLE, but you've already done 
this.

> Support approximate COUNT DISTINCT
> --
>
> Key: PHOENIX-418
> URL: https://issues.apache.org/jira/browse/PHOENIX-418
> Project: Phoenix
>  Issue Type: Task
>Reporter: James Taylor
>Assignee: Ethan Wang
>  Labels: gsoc2016
> Attachments: PHOENIX-418-v1.patch, PHOENIX-418-v2.patch, 
> PHOENIX-418-v3.patch, PHOENIX-418-v4.patch, PHOENIX-418-v5.patch
>
>
> Support an "approximation" of count distinct to prevent having to hold on to 
> all distinct values (since this will not scale well when the number of 
> distinct values is huge). The Apache Drill folks have had some interesting 
> discussions on this 
> [here](http://mail-archives.apache.org/mod_mbox/incubator-drill-dev/201306.mbox/%3CJIRA.12650169.1369931282407.88049.1370645900553%40arcas%3E).
>  They recommend using  [Welford's 
> method](http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance_Online_algorithm).
>  I'm open to having a config option that uses exact versus approximate. I 
> don't have experience implementing an approximate implementation, so I'm not 
> sure how much state is required to keep on the server and return to the 
> client (other than realizing it'd be much less that returning all distinct 
> values and their counts).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (PHOENIX-418) Support approximate COUNT DISTINCT

2017-08-22 Thread James Taylor (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16137402#comment-16137402
 ] 

James Taylor commented on PHOENIX-418:
--

Thanks for the revised patch, [~aertoria]. Looks very good. A couple of minor 
things:
- derive your test from ParallelStatsDisabledIT instead of 
BaseUniqueNamesOwnClusterIT and remove the setup method which you won't need. 
The advantage of ParallelStatsDisabledIT tests are that they don't need to each 
spin up a new mini cluster when they run so are overall test run time stays 
lower.
{code}
+public class CountDistinctApproximateHyperLogLogIT extends 
BaseUniqueNamesOwnClusterIT {
+@BeforeClass
+public static void doSetup() throws Exception {
+Map props = Maps.newHashMapWithExpectedSize(3);
+setUpTestDriver(new ReadOnlyProps(props.entrySet().iterator()));
+}
+
{code}
- I think it also makes sense to have another test derived from 
ParallelStatsEnabledIT. This base test class is configured to collect 
statistics. In this way, you can get more test coverage. You can likely run the 
exact same tests, but in this case you'll have guideposts in place (because 
stats will be collected). Make sure to call TestUtil.analyzeTable(connection, 
fullTableName) prior to running your TABLESAMPLE queries. You'll get more rows 
back, since you'll have guideposts in addition to region boundaries.
- minor nit, extra semicolon here:
{code}
+
DistinctCountHyperLogLogAggregateFunction(DistinctCountHyperLogLogAggregateFunction.class);;
{code}
- Instead of always copying the underlying byte buffer, use 
ByteUtil.copyKeyBytesIfNecessary(ImmutableBytesWritable ptr) instead which only 
copies when necessary:
+   @Override
+   public boolean evaluate(Tuple tuple, ImmutableBytesWritable ptr) {  
+   try {
+   valueByteArray.set(hll.getBytes(), 0, 
hll.getBytes().length);
+   ptr.set(valueByteArray.copyBytes());
{code}

> Support approximate COUNT DISTINCT
> --
>
> Key: PHOENIX-418
> URL: https://issues.apache.org/jira/browse/PHOENIX-418
> Project: Phoenix
>  Issue Type: Task
>Reporter: James Taylor
>Assignee: Ethan Wang
>  Labels: gsoc2016
> Attachments: PHOENIX-418-v1.patch, PHOENIX-418-v2.patch, 
> PHOENIX-418-v3.patch, PHOENIX-418-v4.patch
>
>
> Support an "approximation" of count distinct to prevent having to hold on to 
> all distinct values (since this will not scale well when the number of 
> distinct values is huge). The Apache Drill folks have had some interesting 
> discussions on this 
> [here](http://mail-archives.apache.org/mod_mbox/incubator-drill-dev/201306.mbox/%3CJIRA.12650169.1369931282407.88049.1370645900553%40arcas%3E).
>  They recommend using  [Welford's 
> method](http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance_Online_algorithm).
>  I'm open to having a config option that uses exact versus approximate. I 
> don't have experience implementing an approximate implementation, so I'm not 
> sure how much state is required to keep on the server and return to the 
> client (other than realizing it'd be much less that returning all distinct 
> values and their counts).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (PHOENIX-418) Support approximate COUNT DISTINCT

2017-08-06 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16115982#comment-16115982
 ] 

Hadoop QA commented on PHOENIX-418:
---

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12880580/PHOENIX-418-v4.patch
  against master branch at commit bf3c13b6d169e90e06c54a338a5931163a32d743.
  ATTACHMENT ID: 12880580

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:red}-1 javadoc{color}.  The javadoc tool appears to have generated 
54 warning messages.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 lineLengths{color}.  The patch introduces the following lines 
longer than 100:
+String tableName = initATableValues(null, tenantId, 
getDefaultSplits(tenantId), (Date)null, null, getUrl(), null);
+String tableName = initATableValues(null, tenantId, 
getDefaultSplits(tenantId), (Date)null, null, getUrl(), null);
+String tableName = initATableValues(null, tenantId, 
getDefaultSplits(tenantId), (Date)null, null, getUrl(), null);
+String tableName = initATableValues(null, tenantId, 
getDefaultSplits(tenantId), (Date)null, null, getUrl(), null);
+String query = "SELECT 
APPROX_COUNT_DISTINCT(ORGANIZATION_ID||A_STRING||B_STRING) FROM " + tableName;
+String tableName = initATableValues(null, tenantId, 
getDefaultSplits(tenantId), (Date)null, null, getUrl(), null);
+String query = "SELECT 
APPROX_COUNT_DISTINCT(a.ORGANIZATION_ID||a.A_STRING||b.ENTITY_ID) FROM " 
+   + tableName + " a, "+ tableName + " b where 
a.ORGANIZATION_ID=b.ORGANIZATION_ID and a.ENTITY_ID = b.ENTITY_ID";
+String tableName = initATableValues(null, tenantId, 
getDefaultSplits(tenantId), (Date)null, null, getUrl(), null);
+String query = "explain SELECT 
APPROX_COUNT_DISTINCT(ORGANIZATION_ID||ENTITY_ID) FROM " + tableName;

 {color:red}-1 core tests{color}.  The patch failed these unit tests:
 
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.execute.PartialCommitIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.ProductMetricsIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.ScanQueryIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.QueryDatabaseMetaDataIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.NotQueryIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.ClientTimeArithmeticQueryIT

Test results: 
https://builds.apache.org/job/PreCommit-PHOENIX-Build/1246//testReport/
Javadoc warnings: 
https://builds.apache.org/job/PreCommit-PHOENIX-Build/1246//artifact/patchprocess/patchJavadocWarnings.txt
Console output: 
https://builds.apache.org/job/PreCommit-PHOENIX-Build/1246//console

This message is automatically generated.

> Support approximate COUNT DISTINCT
> --
>
> Key: PHOENIX-418
> URL: https://issues.apache.org/jira/browse/PHOENIX-418
> Project: Phoenix
>  Issue Type: Task
>Reporter: James Taylor
>Assignee: Ethan Wang
>  Labels: gsoc2016
> Attachments: PHOENIX-418-v1.patch, PHOENIX-418-v2.patch, 
> PHOENIX-418-v3.patch, PHOENIX-418-v4.patch
>
>
> Support an "approximation" of count distinct to prevent having to hold on to 
> all distinct values (since this will not scale well when the number of 
> distinct values is huge). The Apache Drill folks have had some interesting 
> discussions on this 
> [here](http://mail-archives.apache.org/mod_mbox/incubator-drill-dev/201306.mbox/%3CJIRA.12650169.1369931282407.88049.1370645900553%40arcas%3E).
>  They recommend using  [Welford's 
> method](http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance_Online_algorithm).
>  I'm open to having a config option that uses exact versus approximate. I 
> don't have experience implementing an approximate implementation, so I'm not 
> sure how much state is required to keep on the server and return to the 
> client (other than realizing it'd be much less that returning all distinct 
> values and their counts).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (PHOENIX-418) Support approximate COUNT DISTINCT

2017-08-05 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16115574#comment-16115574
 ] 

Hadoop QA commented on PHOENIX-418:
---

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12880542/PHOENIX-418-v3.patch
  against master branch at commit bf3c13b6d169e90e06c54a338a5931163a32d743.
  ATTACHMENT ID: 12880542

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:red}-1 javadoc{color}.  The javadoc tool appears to have generated 
54 warning messages.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 lineLengths{color}.  The patch introduces the following lines 
longer than 100:
+String tableName = initATableValues(null, tenantId, 
getDefaultSplits(tenantId), (Date)null, null, getUrl(), null);
+String tableName = initATableValues(null, tenantId, 
getDefaultSplits(tenantId), (Date)null, null, getUrl(), null);
+String tableName = initATableValues(null, tenantId, 
getDefaultSplits(tenantId), (Date)null, null, getUrl(), null);
+String tableName = initATableValues(null, tenantId, 
getDefaultSplits(tenantId), (Date)null, null, getUrl(), null);
+String query = "SELECT 
APPROX_COUNT_DISTINCT(ORGANIZATION_ID||A_STRING||B_STRING) FROM " + tableName;
+String tableName = initATableValues(null, tenantId, 
getDefaultSplits(tenantId), (Date)null, null, getUrl(), null);
+String query = "SELECT 
APPROX_COUNT_DISTINCT(a.ORGANIZATION_ID||a.A_STRING||b.ENTITY_ID) FROM " 
+   + tableName + " a, "+ tableName + " b where 
a.ORGANIZATION_ID=b.ORGANIZATION_ID and a.ENTITY_ID = b.ENTITY_ID";
+String tableName = initATableValues(null, tenantId, 
getDefaultSplits(tenantId), (Date)null, null, getUrl(), null);
+String query = "explain SELECT 
APPROX_COUNT_DISTINCT(ORGANIZATION_ID||ENTITY_ID) FROM " + tableName;

 {color:red}-1 core tests{color}.  The patch failed these unit tests:
 
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.CreateTableIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.ScanQueryIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.ClientTimeArithmeticQueryIT

Test results: 
https://builds.apache.org/job/PreCommit-PHOENIX-Build/1245//testReport/
Javadoc warnings: 
https://builds.apache.org/job/PreCommit-PHOENIX-Build/1245//artifact/patchprocess/patchJavadocWarnings.txt
Console output: 
https://builds.apache.org/job/PreCommit-PHOENIX-Build/1245//console

This message is automatically generated.

> Support approximate COUNT DISTINCT
> --
>
> Key: PHOENIX-418
> URL: https://issues.apache.org/jira/browse/PHOENIX-418
> Project: Phoenix
>  Issue Type: Task
>Reporter: James Taylor
>Assignee: Ethan Wang
>  Labels: gsoc2016
> Attachments: PHOENIX-418-v1.patch, PHOENIX-418-v2.patch, 
> PHOENIX-418-v3.patch
>
>
> Support an "approximation" of count distinct to prevent having to hold on to 
> all distinct values (since this will not scale well when the number of 
> distinct values is huge). The Apache Drill folks have had some interesting 
> discussions on this 
> [here](http://mail-archives.apache.org/mod_mbox/incubator-drill-dev/201306.mbox/%3CJIRA.12650169.1369931282407.88049.1370645900553%40arcas%3E).
>  They recommend using  [Welford's 
> method](http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance_Online_algorithm).
>  I'm open to having a config option that uses exact versus approximate. I 
> don't have experience implementing an approximate implementation, so I'm not 
> sure how much state is required to keep on the server and return to the 
> client (other than realizing it'd be much less that returning all distinct 
> values and their counts).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (PHOENIX-418) Support approximate COUNT DISTINCT

2017-08-05 Thread James Taylor (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16115513#comment-16115513
 ] 

James Taylor commented on PHOENIX-418:
--

Thanks for the patch, [~aertoria]. The transitive dependencies of licenses look 
fine. Here's some feedback:
- You should be able to derive DistinctCountHyperLogLogAggregateFunction from 
DistinctCountAggregateFunction and simply override the getName, 
newClientAggregator and newServerAggregator methods. The behavior for things 
like COUNT(DISTINCT 123)) should be the same as 
APPROXIMATE_COUNT_DISTINCT(123), returning either 0 or 1. The getDataType and 
evaluate method will be the same as DistinctCountAggregateFunction: the return 
type should always be of type PLong and the evaluate method will have the same 
special cases when a constant is used as the argument.
{code}
+
+@BuiltInFunction(name=DistinctCountHyperLogLogAggregateFunction.NAME, 
nodeClass=DistinctCountHyperLogLogAggregateParseNode.class, args= {@Argument()} 
)
+public class DistinctCountHyperLogLogAggregateFunction extends 
DelegateConstantToCountAggregateFunction {
+public static final String NAME = "APPROX_COUNT_DISTINCT";
+
+public DistinctCountHyperLogLogAggregateFunction() {
+}
+
+public DistinctCountHyperLogLogAggregateFunction(List 
childExpressions){
+super(childExpressions, null);
+}
+
+public DistinctCountHyperLogLogAggregateFunction(List 
childExpressions, CountAggregateFunction delegate){
+super(childExpressions, delegate);
+}
+
+
+private Aggregator newAggregatorServer(final PDataType type, SortOrder 
sortOrder, ImmutableBytesWritable ptr) {
+   return new HyperLogLogServerAggregator(sortOrder, ptr) {
+@Override
+protected PDataType getInputDataType() {
+  return type;
+}
+  };
+}
{code}
- Minor nit, but if you include an @since tag, make the release the *next* 
release (i.e. the release this function will appear in), not the current 
release:
{code}
+ * @since 4.11
{code}

> Support approximate COUNT DISTINCT
> --
>
> Key: PHOENIX-418
> URL: https://issues.apache.org/jira/browse/PHOENIX-418
> Project: Phoenix
>  Issue Type: Task
>Reporter: James Taylor
>Assignee: Ethan Wang
>  Labels: gsoc2016
> Attachments: PHOENIX-418-v1.patch, PHOENIX-418-v2.patch, 
> PHOENIX-418-v3.patch
>
>
> Support an "approximation" of count distinct to prevent having to hold on to 
> all distinct values (since this will not scale well when the number of 
> distinct values is huge). The Apache Drill folks have had some interesting 
> discussions on this 
> [here](http://mail-archives.apache.org/mod_mbox/incubator-drill-dev/201306.mbox/%3CJIRA.12650169.1369931282407.88049.1370645900553%40arcas%3E).
>  They recommend using  [Welford's 
> method](http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance_Online_algorithm).
>  I'm open to having a config option that uses exact versus approximate. I 
> don't have experience implementing an approximate implementation, so I'm not 
> sure how much state is required to keep on the server and return to the 
> client (other than realizing it'd be much less that returning all distinct 
> values and their counts).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (PHOENIX-418) Support approximate COUNT DISTINCT

2017-08-04 Thread James Taylor (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16114818#comment-16114818
 ] 

James Taylor commented on PHOENIX-418:
--

Thanks for the patch, [~aertoria]. This is a good new feature. Please revert 
the Phoenix grammar changes as they aren't needed since the new feature is 
surfaced as a new built-in aggregate function.

> Support approximate COUNT DISTINCT
> --
>
> Key: PHOENIX-418
> URL: https://issues.apache.org/jira/browse/PHOENIX-418
> Project: Phoenix
>  Issue Type: Task
>Reporter: James Taylor
>Assignee: Ethan Wang
>  Labels: gsoc2016
> Attachments: PHOENIX-418-v1.patch, PHOENIX-418-v2.patch
>
>
> Support an "approximation" of count distinct to prevent having to hold on to 
> all distinct values (since this will not scale well when the number of 
> distinct values is huge). The Apache Drill folks have had some interesting 
> discussions on this 
> [here](http://mail-archives.apache.org/mod_mbox/incubator-drill-dev/201306.mbox/%3CJIRA.12650169.1369931282407.88049.1370645900553%40arcas%3E).
>  They recommend using  [Welford's 
> method](http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance_Online_algorithm).
>  I'm open to having a config option that uses exact versus approximate. I 
> don't have experience implementing an approximate implementation, so I'm not 
> sure how much state is required to keep on the server and return to the 
> client (other than realizing it'd be much less that returning all distinct 
> values and their counts).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (PHOENIX-418) Support approximate COUNT DISTINCT

2017-08-04 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16114526#comment-16114526
 ] 

Hadoop QA commented on PHOENIX-418:
---

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12880344/PHOENIX-418-v2.patch
  against master branch at commit 0bd43f536a13609b35629308c415aebc800d6799.
  ATTACHMENT ID: 12880344

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:red}-1 javadoc{color}.  The javadoc tool appears to have generated 
54 warning messages.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 lineLengths{color}.  The patch introduces the following lines 
longer than 100:
+String tableName = initATableValues(null, tenantId, 
getDefaultSplits(tenantId), (Date)null, null, getUrl(), null);
+String tableName = initATableValues(null, tenantId, 
getDefaultSplits(tenantId), (Date)null, null, getUrl(), null);
+String query = "SELECT 
APPROX_COUNT_DISTINCT(ORGANIZATION_ID||A_STRING||B_STRING) FROM " + tableName;
+String tableName = initATableValues(null, tenantId, 
getDefaultSplits(tenantId), (Date)null, null, getUrl(), null);
+String query = "SELECT 
APPROX_COUNT_DISTINCT(a.ORGANIZATION_ID||a.A_STRING||b.ENTITY_ID) FROM " 
+   + tableName + " a, "+ tableName + " b where 
a.ORGANIZATION_ID=b.ORGANIZATION_ID and a.ENTITY_ID = b.ENTITY_ID";
+{ ParseContext context = contextStack.peek(); $ret = factory.select(u, 
order, l, approximate, o, getBindCount(), context.isAggregate()); }
+@BuiltInFunction(name=DistinctCountHyperLogLogAggregateFunction.NAME, 
nodeClass=DistinctCountHyperLogLogAggregateParseNode.class, args= {@Argument()} 
)
+public class DistinctCountHyperLogLogAggregateFunction extends 
DelegateConstantToCountAggregateFunction {
+public DistinctCountHyperLogLogAggregateFunction(List 
childExpressions, CountAggregateFunction delegate){

 {color:red}-1 core tests{color}.  The patch failed these unit tests:
 
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.salted.SaltedTableVarLengthRowKeyIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.ConcurrentMutationsIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.NotQueryIT

Test results: 
https://builds.apache.org/job/PreCommit-PHOENIX-Build/1243//testReport/
Javadoc warnings: 
https://builds.apache.org/job/PreCommit-PHOENIX-Build/1243//artifact/patchprocess/patchJavadocWarnings.txt
Console output: 
https://builds.apache.org/job/PreCommit-PHOENIX-Build/1243//console

This message is automatically generated.

> Support approximate COUNT DISTINCT
> --
>
> Key: PHOENIX-418
> URL: https://issues.apache.org/jira/browse/PHOENIX-418
> Project: Phoenix
>  Issue Type: Task
>Reporter: James Taylor
>Assignee: Ethan Wang
>  Labels: gsoc2016
> Attachments: PHOENIX-418-v1.patch, PHOENIX-418-v2.patch
>
>
> Support an "approximation" of count distinct to prevent having to hold on to 
> all distinct values (since this will not scale well when the number of 
> distinct values is huge). The Apache Drill folks have had some interesting 
> discussions on this 
> [here](http://mail-archives.apache.org/mod_mbox/incubator-drill-dev/201306.mbox/%3CJIRA.12650169.1369931282407.88049.1370645900553%40arcas%3E).
>  They recommend using  [Welford's 
> method](http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance_Online_algorithm).
>  I'm open to having a config option that uses exact versus approximate. I 
> don't have experience implementing an approximate implementation, so I'm not 
> sure how much state is required to keep on the server and return to the 
> client (other than realizing it'd be much less that returning all distinct 
> values and their counts).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (PHOENIX-418) Support approximate COUNT DISTINCT

2017-08-03 Thread Ethan Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16113913#comment-16113913
 ] 

Ethan Wang commented on PHOENIX-418:


Thanks [~jamestaylor] [~sergey.soldatov]. My bad. I go prepare another one.

> Support approximate COUNT DISTINCT
> --
>
> Key: PHOENIX-418
> URL: https://issues.apache.org/jira/browse/PHOENIX-418
> Project: Phoenix
>  Issue Type: Task
>Reporter: James Taylor
>Assignee: Ethan Wang
>  Labels: gsoc2016
> Attachments: PHOENIX-418-v1.patch, PHOENIX-418-v2.patch
>
>
> Support an "approximation" of count distinct to prevent having to hold on to 
> all distinct values (since this will not scale well when the number of 
> distinct values is huge). The Apache Drill folks have had some interesting 
> discussions on this 
> [here](http://mail-archives.apache.org/mod_mbox/incubator-drill-dev/201306.mbox/%3CJIRA.12650169.1369931282407.88049.1370645900553%40arcas%3E).
>  They recommend using  [Welford's 
> method](http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance_Online_algorithm).
>  I'm open to having a config option that uses exact versus approximate. I 
> don't have experience implementing an approximate implementation, so I'm not 
> sure how much state is required to keep on the server and return to the 
> client (other than realizing it'd be much less that returning all distinct 
> values and their counts).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (PHOENIX-418) Support approximate COUNT DISTINCT

2017-08-02 Thread Sergey Soldatov (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16112203#comment-16112203
 ] 

Sergey Soldatov commented on PHOENIX-418:
-

[~aertoria] if you have several others commits on top of yours, you may get a 
single commit by command:
{code}
git format-patch -k -1 
{code}
Also make sure that it can be applied on top of the master branch (git am 
). If it's not, rebase it before submitting to the JIRA. You also 
may just resolve the conflicts during am and get a new patch with format-patch.

> Support approximate COUNT DISTINCT
> --
>
> Key: PHOENIX-418
> URL: https://issues.apache.org/jira/browse/PHOENIX-418
> Project: Phoenix
>  Issue Type: Task
>Reporter: James Taylor
>Assignee: Ethan Wang
>  Labels: gsoc2016
> Attachments: PHOENIX-418-v1.patch
>
>
> Support an "approximation" of count distinct to prevent having to hold on to 
> all distinct values (since this will not scale well when the number of 
> distinct values is huge). The Apache Drill folks have had some interesting 
> discussions on this 
> [here](http://mail-archives.apache.org/mod_mbox/incubator-drill-dev/201306.mbox/%3CJIRA.12650169.1369931282407.88049.1370645900553%40arcas%3E).
>  They recommend using  [Welford's 
> method](http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance_Online_algorithm).
>  I'm open to having a config option that uses exact versus approximate. I 
> don't have experience implementing an approximate implementation, so I'm not 
> sure how much state is required to keep on the server and return to the 
> client (other than realizing it'd be much less that returning all distinct 
> values and their counts).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (PHOENIX-418) Support approximate COUNT DISTINCT

2017-08-02 Thread James Taylor (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16112172#comment-16112172
 ] 

James Taylor commented on PHOENIX-418:
--

Patch doesn't look right - it includes commits outside of yours. Try generating 
with the following command after committing it to your local repo:
{code}
git format-patch --stdout HEAD^ > PHOENIX-{NUMBER}.patch
{code}

> Support approximate COUNT DISTINCT
> --
>
> Key: PHOENIX-418
> URL: https://issues.apache.org/jira/browse/PHOENIX-418
> Project: Phoenix
>  Issue Type: Task
>Reporter: James Taylor
>Assignee: Ethan Wang
>  Labels: gsoc2016
> Attachments: PHOENIX-418-v1.patch
>
>
> Support an "approximation" of count distinct to prevent having to hold on to 
> all distinct values (since this will not scale well when the number of 
> distinct values is huge). The Apache Drill folks have had some interesting 
> discussions on this 
> [here](http://mail-archives.apache.org/mod_mbox/incubator-drill-dev/201306.mbox/%3CJIRA.12650169.1369931282407.88049.1370645900553%40arcas%3E).
>  They recommend using  [Welford's 
> method](http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance_Online_algorithm).
>  I'm open to having a config option that uses exact versus approximate. I 
> don't have experience implementing an approximate implementation, so I'm not 
> sure how much state is required to keep on the server and return to the 
> client (other than realizing it'd be much less that returning all distinct 
> values and their counts).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (PHOENIX-418) Support approximate COUNT DISTINCT

2017-08-02 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16110419#comment-16110419
 ] 

Hadoop QA commented on PHOENIX-418:
---

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12879968/PHOENIX-418-v1.patch
  against master branch at commit 5e33dc12bc088bd0008d89f0a5cd7d5c368efa25.
  ATTACHMENT ID: 12879968

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 4 new 
or modified tests.

{color:red}-1 patch{color}.  The patch command could not apply the patch.

Console output: 
https://builds.apache.org/job/PreCommit-PHOENIX-Build/1242//console

This message is automatically generated.

> Support approximate COUNT DISTINCT
> --
>
> Key: PHOENIX-418
> URL: https://issues.apache.org/jira/browse/PHOENIX-418
> Project: Phoenix
>  Issue Type: Task
>Reporter: James Taylor
>Assignee: Ethan Wang
>  Labels: gsoc2016
> Attachments: PHOENIX-418-v1.patch
>
>
> Support an "approximation" of count distinct to prevent having to hold on to 
> all distinct values (since this will not scale well when the number of 
> distinct values is huge). The Apache Drill folks have had some interesting 
> discussions on this 
> [here](http://mail-archives.apache.org/mod_mbox/incubator-drill-dev/201306.mbox/%3CJIRA.12650169.1369931282407.88049.1370645900553%40arcas%3E).
>  They recommend using  [Welford's 
> method](http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance_Online_algorithm).
>  I'm open to having a config option that uses exact versus approximate. I 
> don't have experience implementing an approximate implementation, so I'm not 
> sure how much state is required to keep on the server and return to the 
> client (other than realizing it'd be much less that returning all distinct 
> values and their counts).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (PHOENIX-418) Support approximate COUNT DISTINCT

2017-07-31 Thread James Taylor (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16108353#comment-16108353
 ] 

James Taylor commented on PHOENIX-418:
--

Let's just keep it simple and support APPROX_COUNT_DISTINCT. That way we don't 
need any grammar changes.

> Support approximate COUNT DISTINCT
> --
>
> Key: PHOENIX-418
> URL: https://issues.apache.org/jira/browse/PHOENIX-418
> Project: Phoenix
>  Issue Type: Task
>Reporter: James Taylor
>Assignee: Ethan Wang
>  Labels: gsoc2016
>
> Support an "approximation" of count distinct to prevent having to hold on to 
> all distinct values (since this will not scale well when the number of 
> distinct values is huge). The Apache Drill folks have had some interesting 
> discussions on this 
> [here](http://mail-archives.apache.org/mod_mbox/incubator-drill-dev/201306.mbox/%3CJIRA.12650169.1369931282407.88049.1370645900553%40arcas%3E).
>  They recommend using  [Welford's 
> method](http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance_Online_algorithm).
>  I'm open to having a config option that uses exact versus approximate. I 
> don't have experience implementing an approximate implementation, so I'm not 
> sure how much state is required to keep on the server and return to the 
> client (other than realizing it'd be much less that returning all distinct 
> values and their counts).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (PHOENIX-418) Support approximate COUNT DISTINCT

2017-07-31 Thread Ethan Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16108332#comment-16108332
 ] 

Ethan Wang commented on PHOENIX-418:


+1 on 
{quote}select count(distinct name)
from person
APPROXIMATE ()

select count(distinct name)
from person
APPROXIMATE (ALGORITHM 'hll')

select count(distinct name)
from person
APPROXIMATE (ALGORITHM 'ABC' WITHIN 10 PERCENT)

select count(distinct name) APPROXIMATE ()
from person

select count(distinct name) APPROXIMATE (ALGORITHM 'hll')
from person

select count(distinct name)
  APPROXIMATE (ALGORITHM 'ABC' WITHIN 10 PERCENT)
from person {quote}


The current patch that is in progress basically align with this suggestion, 
supporting both 
APPROX_COUNT_DISTINCT(xxx) 
and 
select count(name) from person APPROXIMATE ('hll')

Also, for hyperLogLog algorithm, suggesting PHOENIX to be as similar as 
Druid(CALCITE-1588), in that:
1, not including accuracy option for user to specify.
2, hard coded the two parameters during HLL initialization. (i.e., the 
precision value for the normal set and the precision value for the sparse set)


> Support approximate COUNT DISTINCT
> --
>
> Key: PHOENIX-418
> URL: https://issues.apache.org/jira/browse/PHOENIX-418
> Project: Phoenix
>  Issue Type: Task
>Reporter: James Taylor
>Assignee: Ethan Wang
>  Labels: gsoc2016
>
> Support an "approximation" of count distinct to prevent having to hold on to 
> all distinct values (since this will not scale well when the number of 
> distinct values is huge). The Apache Drill folks have had some interesting 
> discussions on this 
> [here](http://mail-archives.apache.org/mod_mbox/incubator-drill-dev/201306.mbox/%3CJIRA.12650169.1369931282407.88049.1370645900553%40arcas%3E).
>  They recommend using  [Welford's 
> method](http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance_Online_algorithm).
>  I'm open to having a config option that uses exact versus approximate. I 
> don't have experience implementing an approximate implementation, so I'm not 
> sure how much state is required to keep on the server and return to the 
> client (other than realizing it'd be much less that returning all distinct 
> values and their counts).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (PHOENIX-418) Support approximate COUNT DISTINCT

2017-07-31 Thread Julian Hyde (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16107832#comment-16107832
 ] 

Julian Hyde commented on PHOENIX-418:
-

In CALCITE-1588 [~gian] points out that Oracle, BigQuery and MemSQL support 
{{APPROX_COUNT_DISTINCT}}. I also see it in VoltDB.

A quick survey of other databases:
* Vertica has {{APPROXIMATE_COUNT_DISTINCT}}
* Redshift has {{[ APPROXIMATE ] COUNT ( [ DISTINCT | ALL ] * | expression )}}.
* In PostgreSQL you can bolt on your own hyperloglog function, but there 
doesn't seem to have a unified approach.
* I don't see anything in DB2 or MySQL

I think that is a sufficient de facto standard to support 
{{APPROX_COUNT_DISTINCT}} in Calcite.

Also I think Calcite should support an APPROXIMATE clause. Allowed both as a 
clause in the SELECT statement, and also after the aggregate function (but 
before the OVER clause, if present). Algorithm and any parameters go inside 
parentheses. Examples:

{code}
select count(distinct name)
from person
APPROXIMATE ()

select count(distinct name)
from person
APPROXIMATE (ALGORITHM 'hll')

select count(distinct name)
from person
APPROXIMATE (ALGORITHM 'ABC' WITHIN 10 PERCENT)

select count(distinct name) APPROXIMATE ()
from person

select count(distinct name) APPROXIMATE (ALGORITHM 'hll')
from person

select count(distinct name)
  APPROXIMATE (ALGORITHM 'ABC' WITHIN 10 PERCENT)
from person 
{code}

For now, I think Phoenix should support APPROX_COUNT_DISTINCT. We could add 
support for the APPROXIMATE clause later.

> Support approximate COUNT DISTINCT
> --
>
> Key: PHOENIX-418
> URL: https://issues.apache.org/jira/browse/PHOENIX-418
> Project: Phoenix
>  Issue Type: Task
>Reporter: James Taylor
>Assignee: Ethan Wang
>  Labels: gsoc2016
>
> Support an "approximation" of count distinct to prevent having to hold on to 
> all distinct values (since this will not scale well when the number of 
> distinct values is huge). The Apache Drill folks have had some interesting 
> discussions on this 
> [here](http://mail-archives.apache.org/mod_mbox/incubator-drill-dev/201306.mbox/%3CJIRA.12650169.1369931282407.88049.1370645900553%40arcas%3E).
>  They recommend using  [Welford's 
> method](http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance_Online_algorithm).
>  I'm open to having a config option that uses exact versus approximate. I 
> don't have experience implementing an approximate implementation, so I'm not 
> sure how much state is required to keep on the server and return to the 
> client (other than realizing it'd be much less that returning all distinct 
> values and their counts).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (PHOENIX-418) Support approximate COUNT DISTINCT

2017-07-31 Thread Ethan Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16106864#comment-16106864
 ] 

Ethan Wang commented on PHOENIX-418:


Regarding the syntax of approximate discount.  Carry on from the discussion 
from PHOENIX-3390, purposing the syntax to be

Original cardinality count function:
select count(distinct name) from person

With approximate:
select count(distinct name) from person APPROXIMATE 
select count(distinct name) from person APPROXIMATE 'hll' 
select count(distinct name) from person APPROXIMATE 'algorithm ABC' (WITHIN 10 
PERCENT)


> Support approximate COUNT DISTINCT
> --
>
> Key: PHOENIX-418
> URL: https://issues.apache.org/jira/browse/PHOENIX-418
> Project: Phoenix
>  Issue Type: Task
>Reporter: James Taylor
>Assignee: Ethan Wang
>  Labels: gsoc2016
>
> Support an "approximation" of count distinct to prevent having to hold on to 
> all distinct values (since this will not scale well when the number of 
> distinct values is huge). The Apache Drill folks have had some interesting 
> discussions on this 
> [here](http://mail-archives.apache.org/mod_mbox/incubator-drill-dev/201306.mbox/%3CJIRA.12650169.1369931282407.88049.1370645900553%40arcas%3E).
>  They recommend using  [Welford's 
> method](http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance_Online_algorithm).
>  I'm open to having a config option that uses exact versus approximate. I 
> don't have experience implementing an approximate implementation, so I'm not 
> sure how much state is required to keep on the server and return to the 
> client (other than realizing it'd be much less that returning all distinct 
> values and their counts).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (PHOENIX-418) Support approximate COUNT DISTINCT

2017-06-16 Thread Ethan Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16052447#comment-16052447
 ] 

Ethan Wang commented on PHOENIX-418:


What is the current state of this jira?
@maghamravikiran
[~maghamraviki...@gmail.com]

> Support approximate COUNT DISTINCT
> --
>
> Key: PHOENIX-418
> URL: https://issues.apache.org/jira/browse/PHOENIX-418
> Project: Phoenix
>  Issue Type: Task
>Reporter: James Taylor
>Assignee: maghamravikiran
>  Labels: gsoc2016
>
> Support an "approximation" of count distinct to prevent having to hold on to 
> all distinct values (since this will not scale well when the number of 
> distinct values is huge). The Apache Drill folks have had some interesting 
> discussions on this 
> [here](http://mail-archives.apache.org/mod_mbox/incubator-drill-dev/201306.mbox/%3CJIRA.12650169.1369931282407.88049.1370645900553%40arcas%3E).
>  They recommend using  [Welford's 
> method](http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance_Online_algorithm).
>  I'm open to having a config option that uses exact versus approximate. I 
> don't have experience implementing an approximate implementation, so I'm not 
> sure how much state is required to keep on the server and return to the 
> client (other than realizing it'd be much less that returning all distinct 
> values and their counts).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (PHOENIX-418) Support approximate COUNT DISTINCT

2017-06-16 Thread Ethan Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16052446#comment-16052446
 ] 

Ethan Wang commented on PHOENIX-418:


Reply to [~pctony]: also in Blink DB, they have similar approximate aggregate 
feature which they use a sql clause. Example would be:
`select count(id) From table ERROR 0.1 CONFIDENCE 95%

> Support approximate COUNT DISTINCT
> --
>
> Key: PHOENIX-418
> URL: https://issues.apache.org/jira/browse/PHOENIX-418
> Project: Phoenix
>  Issue Type: Task
>Reporter: James Taylor
>Assignee: maghamravikiran
>  Labels: gsoc2016
>
> Support an "approximation" of count distinct to prevent having to hold on to 
> all distinct values (since this will not scale well when the number of 
> distinct values is huge). The Apache Drill folks have had some interesting 
> discussions on this 
> [here](http://mail-archives.apache.org/mod_mbox/incubator-drill-dev/201306.mbox/%3CJIRA.12650169.1369931282407.88049.1370645900553%40arcas%3E).
>  They recommend using  [Welford's 
> method](http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance_Online_algorithm).
>  I'm open to having a config option that uses exact versus approximate. I 
> don't have experience implementing an approximate implementation, so I'm not 
> sure how much state is required to keep on the server and return to the 
> client (other than realizing it'd be much less that returning all distinct 
> values and their counts).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (PHOENIX-418) Support approximate COUNT DISTINCT

2016-03-11 Thread Sergey Soldatov (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15191489#comment-15191489
 ] 

Sergey Soldatov commented on PHOENIX-418:
-

In Hive there is a 3rd party UDF implemented HyperLogLog algo. Can it be done 
in Phoenix in the same way?

> Support approximate COUNT DISTINCT
> --
>
> Key: PHOENIX-418
> URL: https://issues.apache.org/jira/browse/PHOENIX-418
> Project: Phoenix
>  Issue Type: Task
>Reporter: James Taylor
>Assignee: maghamravikiran
>  Labels: gsoc2016
>
> Support an "approximation" of count distinct to prevent having to hold on to 
> all distinct values (since this will not scale well when the number of 
> distinct values is huge). The Apache Drill folks have had some interesting 
> discussions on this 
> [here](http://mail-archives.apache.org/mod_mbox/incubator-drill-dev/201306.mbox/%3CJIRA.12650169.1369931282407.88049.1370645900553%40arcas%3E).
>  They recommend using  [Welford's 
> method](http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance_Online_algorithm).
>  I'm open to having a config option that uses exact versus approximate. I 
> don't have experience implementing an approximate implementation, so I'm not 
> sure how much state is required to keep on the server and return to the 
> client (other than realizing it'd be much less that returning all distinct 
> values and their counts).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)