[jira] [Comment Edited] (PHOENIX-3999) Optimize inner joins as SKIP-SCAN-JOIN when possible

2017-08-28 Thread Maryann Xue (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-3999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16144780#comment-16144780
 ] 

Maryann Xue edited comment on PHOENIX-3999 at 8/29/17 5:53 AM:
---

The reason a hash cache is used in this query is that "b.BATCH_SEQUENCE_NUM" appears in the SELECT clause, so we have to perform the actual join operation for the referenced fields (in this case, only one field) from the RHS. In some sense the join is still driven by the client side through the skip-scan filter; we just cannot omit the join operation itself as we would for semi-joins. So my point is that, for this query, the best has already been done.
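
For reference, here is a minimal sketch (using the schemas and UPSERT SELECT from the issue description below; the EXISTS rewrite is only illustrative, not a claim about what the optimizer currently produces) of why the RHS reference matters:

{code}
-- Inner join: b.BATCH_SEQUENCE_NUM is projected from the RHS, so the join
-- must actually be performed (hence the hash cache), even though the scan
-- of ITEMS can still be driven by a skip-scan filter on BATCH_ID.
SELECT i.ITEM_TYPE, b.BATCH_SEQUENCE_NUM, i.ITEM_ID, i.ITEM_VALUE
FROM ITEMS i, COMPLETED_BATCHES b
WHERE b.BATCH_ID = i.BATCH_ID
  AND b.BATCH_SEQUENCE_NUM > 1000 AND b.BATCH_SEQUENCE_NUM < 2000;

-- Semi-join form: no RHS columns are referenced, so the join operation
-- itself can be omitted once the skip-scan filter on BATCH_ID is applied.
SELECT i.ITEM_TYPE, i.ITEM_ID, i.ITEM_VALUE
FROM ITEMS i
WHERE EXISTS (
    SELECT 1 FROM COMPLETED_BATCHES b
    WHERE b.BATCH_ID = i.BATCH_ID
      AND b.BATCH_SEQUENCE_NUM > 1000 AND b.BATCH_SEQUENCE_NUM < 2000
);
{code}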


was (Author: maryannxue):
The reason a hash cache is used in this query is that "b.BATCH_SEQUENCE_NUM" appears in the SELECT clause, so we have to perform the actual join operation for the referenced fields (in this case, only one field) from the RHS. In some sense the join is still driven by the client side through the skip-scan filter; we just cannot omit the join operation itself as we would for semi-joins.

> Optimize inner joins as SKIP-SCAN-JOIN when possible
> 
>
> Key: PHOENIX-3999
> URL: https://issues.apache.org/jira/browse/PHOENIX-3999
> Project: Phoenix
>  Issue Type: Bug
>Reporter: James Taylor
>
> Semi joins on the leading part of the primary key end up doing batches of point queries (as opposed to a broadcast hash join); however, inner joins do not.
> Here's a set of example schemas for which a skip scan is executed on the inner query:
> {code}
> CREATE TABLE COMPLETED_BATCHES (
> BATCH_SEQUENCE_NUM BIGINT NOT NULL,
> BATCH_ID   BIGINT NOT NULL,
> CONSTRAINT PK PRIMARY KEY
> (
> BATCH_SEQUENCE_NUM,
> BATCH_ID
> )
> );
> CREATE TABLE ITEMS (
>BATCH_ID BIGINT NOT NULL,
>ITEM_ID BIGINT NOT NULL,
>ITEM_TYPE BIGINT,
>ITEM_VALUE VARCHAR,
>CONSTRAINT PK PRIMARY KEY
>(
> BATCH_ID,
> ITEM_ID
>)
> );
> CREATE TABLE COMPLETED_ITEMS (
>ITEM_TYPE  BIGINT NOT NULL,
>BATCH_SEQUENCE_NUM BIGINT NOT NULL,
>ITEM_IDBIGINT NOT NULL,
>ITEM_VALUE VARCHAR,
>CONSTRAINT PK PRIMARY KEY
>(
>   ITEM_TYPE,
>   BATCH_SEQUENCE_NUM,  
>   ITEM_ID
>)
> );
> {code}
> The explain plan of these indicates that a dynamic filter will be performed, like this:
> {code}
> UPSERT SELECT
> CLIENT PARALLEL 1-WAY FULL SCAN OVER ITEMS
> SKIP-SCAN-JOIN TABLE 0
> CLIENT PARALLEL 1-WAY RANGE SCAN OVER COMPLETED_BATCHES [1] - [2]
> SERVER FILTER BY FIRST KEY ONLY
> SERVER AGGREGATE INTO DISTINCT ROWS BY [BATCH_ID]
> CLIENT MERGE SORT
> DYNAMIC SERVER FILTER BY I.BATCH_ID IN ($8.$9)
> {code}
> We should also be able to leverage this optimization when an inner join is 
> used such as this:
> {code}
> UPSERT INTO COMPLETED_ITEMS (ITEM_TYPE, BATCH_SEQUENCE_NUM, ITEM_ID, 
> ITEM_VALUE)
>SELECT i.ITEM_TYPE, b.BATCH_SEQUENCE_NUM, i.ITEM_ID, i.ITEM_VALUE   
>FROM  ITEMS i, COMPLETED_BATCHES b
>WHERE b.BATCH_ID = i.BATCH_ID AND  
>b.BATCH_SEQUENCE_NUM > 1000 AND b.BATCH_SEQUENCE_NUM < 2000;
> {code}
> A complete unit test looks like this:
> {code}
> @Test
> public void testNestedLoopJoin() throws Exception {
> try (Connection conn = DriverManager.getConnection(getUrl())) {
> String t1="COMPLETED_BATCHES";
> String ddl1 = "CREATE TABLE " + t1 + " (\n" + 
> "BATCH_SEQUENCE_NUM BIGINT NOT NULL,\n" + 
> "BATCH_ID   BIGINT NOT NULL,\n" + 
> "CONSTRAINT PK PRIMARY KEY\n" + 
> "(\n" + 
> "BATCH_SEQUENCE_NUM,\n" + 
> "BATCH_ID\n" + 
> ")\n" + 
> ")" + 
> "";
> conn.createStatement().execute(ddl1);
> 
> String t2="ITEMS";
> String ddl2 = "CREATE TABLE " + t2 + " (\n" + 
> "   BATCH_ID BIGINT NOT NULL,\n" + 
> "   ITEM_ID BIGINT NOT NULL,\n" + 
> "   ITEM_TYPE BIGINT,\n" + 
> "   ITEM_VALUE VARCHAR,\n" + 
> "   CONSTRAINT PK PRIMARY KEY\n" + 
> "   (\n" + 
> "BATCH_ID,\n" + 
> "ITEM_ID\n" + 
> "   )\n" + 
> ")";
> conn.createStatement().execute(ddl2);
> String t3="COMPLETED_ITEMS";
> String ddl3 = "CREATE TABLE " + t3 + "(\n" + 
> "   ITEM_TYPE  BIGINT NOT NULL,\n" + 
> "   BATCH_SEQUENCE_NUM BIGINT NOT NULL,\n" + 
>   

[jira] [Commented] (PHOENIX-3999) Optimize inner joins as SKIP-SCAN-JOIN when possible

2017-08-28 Thread Maryann Xue (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-3999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16144780#comment-16144780
 ] 

Maryann Xue commented on PHOENIX-3999:
--

The reason a hash cache is used in this query is that "b.BATCH_SEQUENCE_NUM" appears in the SELECT clause, so we have to perform the actual join operation for the referenced fields (in this case, only one field) from the RHS. In some sense the join is still driven by the client side through the skip-scan filter; we just cannot omit the join operation itself as we would for semi-joins.

> Optimize inner joins as SKIP-SCAN-JOIN when possible
> 
>
> Key: PHOENIX-3999
> URL: https://issues.apache.org/jira/browse/PHOENIX-3999
> Project: Phoenix
>  Issue Type: Bug
>Reporter: James Taylor
>
> Semi joins on the leading part of the primary key end up doing batches of point queries (as opposed to a broadcast hash join); however, inner joins do not.
> Here's a set of example schemas for which a skip scan is executed on the inner query:
> {code}
> CREATE TABLE COMPLETED_BATCHES (
> BATCH_SEQUENCE_NUM BIGINT NOT NULL,
> BATCH_ID   BIGINT NOT NULL,
> CONSTRAINT PK PRIMARY KEY
> (
> BATCH_SEQUENCE_NUM,
> BATCH_ID
> )
> );
> CREATE TABLE ITEMS (
>BATCH_ID BIGINT NOT NULL,
>ITEM_ID BIGINT NOT NULL,
>ITEM_TYPE BIGINT,
>ITEM_VALUE VARCHAR,
>CONSTRAINT PK PRIMARY KEY
>(
> BATCH_ID,
> ITEM_ID
>)
> );
> CREATE TABLE COMPLETED_ITEMS (
>ITEM_TYPE  BIGINT NOT NULL,
>BATCH_SEQUENCE_NUM BIGINT NOT NULL,
>ITEM_IDBIGINT NOT NULL,
>ITEM_VALUE VARCHAR,
>CONSTRAINT PK PRIMARY KEY
>(
>   ITEM_TYPE,
>   BATCH_SEQUENCE_NUM,  
>   ITEM_ID
>)
> );
> {code}
> The explain plan of these indicates that a dynamic filter will be performed, like this:
> {code}
> UPSERT SELECT
> CLIENT PARALLEL 1-WAY FULL SCAN OVER ITEMS
> SKIP-SCAN-JOIN TABLE 0
> CLIENT PARALLEL 1-WAY RANGE SCAN OVER COMPLETED_BATCHES [1] - [2]
> SERVER FILTER BY FIRST KEY ONLY
> SERVER AGGREGATE INTO DISTINCT ROWS BY [BATCH_ID]
> CLIENT MERGE SORT
> DYNAMIC SERVER FILTER BY I.BATCH_ID IN ($8.$9)
> {code}
> We should also be able to leverage this optimization when an inner join is 
> used such as this:
> {code}
> UPSERT INTO COMPLETED_ITEMS (ITEM_TYPE, BATCH_SEQUENCE_NUM, ITEM_ID, 
> ITEM_VALUE)
>SELECT i.ITEM_TYPE, b.BATCH_SEQUENCE_NUM, i.ITEM_ID, i.ITEM_VALUE   
>FROM  ITEMS i, COMPLETED_BATCHES b
>WHERE b.BATCH_ID = i.BATCH_ID AND  
>b.BATCH_SEQUENCE_NUM > 1000 AND b.BATCH_SEQUENCE_NUM < 2000;
> {code}
> A complete unit test looks like this:
> {code}
> @Test
> public void testNestedLoopJoin() throws Exception {
> try (Connection conn = DriverManager.getConnection(getUrl())) {
> String t1="COMPLETED_BATCHES";
> String ddl1 = "CREATE TABLE " + t1 + " (\n" + 
> "BATCH_SEQUENCE_NUM BIGINT NOT NULL,\n" + 
> "BATCH_ID   BIGINT NOT NULL,\n" + 
> "CONSTRAINT PK PRIMARY KEY\n" + 
> "(\n" + 
> "BATCH_SEQUENCE_NUM,\n" + 
> "BATCH_ID\n" + 
> ")\n" + 
> ")" + 
> "";
> conn.createStatement().execute(ddl1);
> 
> String t2="ITEMS";
> String ddl2 = "CREATE TABLE " + t2 + " (\n" + 
> "   BATCH_ID BIGINT NOT NULL,\n" + 
> "   ITEM_ID BIGINT NOT NULL,\n" + 
> "   ITEM_TYPE BIGINT,\n" + 
> "   ITEM_VALUE VARCHAR,\n" + 
> "   CONSTRAINT PK PRIMARY KEY\n" + 
> "   (\n" + 
> "BATCH_ID,\n" + 
> "ITEM_ID\n" + 
> "   )\n" + 
> ")";
> conn.createStatement().execute(ddl2);
> String t3="COMPLETED_ITEMS";
> String ddl3 = "CREATE TABLE " + t3 + "(\n" + 
> "   ITEM_TYPE  BIGINT NOT NULL,\n" + 
> "   BATCH_SEQUENCE_NUM BIGINT NOT NULL,\n" + 
> "   ITEM_IDBIGINT NOT NULL,\n" + 
> "   ITEM_VALUE VARCHAR,\n" + 
> "   CONSTRAINT PK PRIMARY KEY\n" + 
> "   (\n" + 
> "  ITEM_TYPE,\n" + 
> "  BATCH_SEQUENCE_NUM,  \n" + 
> "  ITEM_ID\n" + 
> "   )\n" + 
> ")";
> conn.createStatement().execute(ddl3);
> 

[jira] [Created] (PHOENIX-4138) Create a hard limit on number of indexes per table

2017-08-28 Thread Rahul Shrivastava (JIRA)
Rahul Shrivastava created PHOENIX-4138:
--

 Summary: Create a hard limit on number of indexes per table
 Key: PHOENIX-4138
 URL: https://issues.apache.org/jira/browse/PHOENIX-4138
 Project: Phoenix
  Issue Type: Bug
Reporter: Rahul Shrivastava


There should be a config parameter to impose a hard limit on the number of indexes per table. There is a SQLException for this 
(https://github.com/apache/phoenix/blob/master/phoenix-core/src/main/java/org/apache/phoenix/exception/SQLExceptionCode.java#L260), 
but it gets triggered on the server side 
(https://github.com/apache/phoenix/blob/master/phoenix-core/src/main/java/org/apache/phoenix/coprocessor/MetaDataEndpointImpl.java#L1589).

We need a client-side limit that can be configured via a Phoenix config parameter. For example, if a user creates more than, say, 30 indexes on a table, further index creation on that specific table would not be allowed.




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (PHOENIX-418) Support approximate COUNT DISTINCT

2017-08-28 Thread Ethan Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PHOENIX-418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Wang updated PHOENIX-418:
---
Description: 
Support an "approximation" of count distinct to prevent having to hold on to 
all distinct values (since this will not scale well when the number of distinct 
values is huge). The Apache Drill folks have had some interesting discussions 
on this 
[here](http://mail-archives.apache.org/mod_mbox/incubator-drill-dev/201306.mbox/%3CJIRA.12650169.1369931282407.88049.1370645900553%40arcas%3E).
 They recommend using  [Welford's 
method](http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance_Online_algorithm).
 I'm open to having a config option that uses exact versus approximate. I don't 
have experience implementing an approximate implementation, so I'm not sure how 
much state is required to keep on the server and return to the client (other 
than realizing it'd be much less than returning all distinct values and their 
counts).

Update:
Syntax of using approximate count distinct as:
select APPROX_COUNT_DISTINCT(name) from person
select APPROX_COUNT_DISTINCT(address||name) from person

It is equivalent to Select COUNT(DISTINCT ID) from person. Implemented using HyperLogLog; see the discussion below.

Source code patch link below, co-authored with [~swapna]
https://git-wip-us.apache.org/repos/asf?p=phoenix.git;a=commitdiff;h=d6381afc3af976ccdbb874d4458ea17b1e8a1d32
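
A minimal usage sketch based on the examples above (a PERSON table with ID, NAME, and ADDRESS columns is assumed purely for illustration):

{code}
-- Exact distinct count: all distinct values must be tracked, so the state
-- grows with the cardinality of NAME.
SELECT COUNT(DISTINCT NAME) FROM PERSON;

-- Approximate distinct count backed by HyperLogLog: bounded state on the
-- server, in exchange for a small error in the returned count.
SELECT APPROX_COUNT_DISTINCT(NAME) FROM PERSON;

-- Approximate distinct count over a concatenation of columns.
SELECT APPROX_COUNT_DISTINCT(ADDRESS || NAME) FROM PERSON;
{code}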

  was:
Support an "approximation" of count distinct to prevent having to hold on to 
all distinct values (since this will not scale well when the number of distinct 
values is huge). The Apache Drill folks have had some interesting discussions 
on this 
[here](http://mail-archives.apache.org/mod_mbox/incubator-drill-dev/201306.mbox/%3CJIRA.12650169.1369931282407.88049.1370645900553%40arcas%3E).
 They recommend using  [Welford's 
method](http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance_Online_algorithm).
 I'm open to having a config option that uses exact versus approximate. I don't 
have experience implementing an approximate implementation, so I'm not sure how 
much state is required to keep on the server and return to the client (other 
than realizing it'd be much less than returning all distinct values and their 
counts).

Update:
Syntax of using approximate count distinct as:
select APPROX_COUNT_DISTINCT(name) from person
select APPROX_COUNT_DISTINCT(address||name) from person

It is equivalent to Select COUNT(DISTINCT ID) from person. Implemented using HyperLogLog; see the discussion below.

Merged patch link below, co-authored with [~swapna]
https://git-wip-us.apache.org/repos/asf?p=phoenix.git;a=commitdiff;h=d6381afc3af976ccdbb874d4458ea17b1e8a1d32


> Support approximate COUNT DISTINCT
> --
>
> Key: PHOENIX-418
> URL: https://issues.apache.org/jira/browse/PHOENIX-418
> Project: Phoenix
>  Issue Type: Task
>Reporter: James Taylor
>Assignee: Ethan Wang
>  Labels: gsoc2016
> Attachments: PHOENIX-418-v1.patch, PHOENIX-418-v2.patch, 
> PHOENIX-418-v3.patch, PHOENIX-418-v4.patch, PHOENIX-418-v5.patch, 
> PHOENIX-418-v6.patch
>
>
> Support an "approximation" of count distinct to prevent having to hold on to 
> all distinct values (since this will not scale well when the number of 
> distinct values is huge). The Apache Drill folks have had some interesting 
> discussions on this 
> [here](http://mail-archives.apache.org/mod_mbox/incubator-drill-dev/201306.mbox/%3CJIRA.12650169.1369931282407.88049.1370645900553%40arcas%3E).
>  They recommend using  [Welford's 
> method](http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance_Online_algorithm).
>  I'm open to having a config option that uses exact versus approximate. I 
> don't have experience implementing an approximate implementation, so I'm not 
> sure how much state is required to keep on the server and return to the 
> client (other than realizing it'd be much less than returning all distinct 
> values and their counts).
> Update:
> Syntax of using approximate count distinct as:
> select APPROX_COUNT_DISTINCT(name) from person
> select APPROX_COUNT_DISTINCT(address||name) from person
> It is equivalent to Select COUNT(DISTINCT ID) from person. Implemented using HyperLogLog; see the discussion below.
> Source code patch link below, co-authored with [~swapna]
> https://git-wip-us.apache.org/repos/asf?p=phoenix.git;a=commitdiff;h=d6381afc3af976ccdbb874d4458ea17b1e8a1d32



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (PHOENIX-153) Implement TABLESAMPLE clause

2017-08-28 Thread Ethan Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PHOENIX-153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Wang updated PHOENIX-153:
---
Description: 
Support the standard SQL TABLESAMPLE clause by implementing a filter that uses 
a skip next hint based on the region boundaries of the table to only return n 
rows per region.

[Update]
Source Code Patch: 
https://git-wip-us.apache.org/repos/asf?p=phoenix.git;a=commitdiff;h=5e33dc12bc088bd0008d89f0a5cd7d5c368efa25
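
A minimal usage sketch of the clause this adds (the table name MY_TABLE and the percentage-style sampling argument are assumptions for illustration, not the final grammar):

{code}
-- Sample roughly 10% of MY_TABLE; the filter uses a skip-next hint based on
-- region boundaries instead of reading and discarding every row.
SELECT * FROM MY_TABLE TABLESAMPLE(10);
{code}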

  was:
Support the standard SQL TABLESAMPLE clause by implementing a filter that uses 
a skip next hint based on the region boundaries of the table to only return n 
rows per region.

[Update]
Patch: 
https://git-wip-us.apache.org/repos/asf?p=phoenix.git;a=commitdiff;h=5e33dc12bc088bd0008d89f0a5cd7d5c368efa25


> Implement TABLESAMPLE clause
> 
>
> Key: PHOENIX-153
> URL: https://issues.apache.org/jira/browse/PHOENIX-153
> Project: Phoenix
>  Issue Type: Task
>Reporter: James Taylor
>Assignee: Ethan Wang
>  Labels: enhancement
> Fix For: 4.12.0
>
> Attachments: Sampling_Accuracy_Performance.jpg
>
>
> Support the standard SQL TABLESAMPLE clause by implementing a filter that 
> uses a skip next hint based on the region boundaries of the table to only 
> return n rows per region.
> [Update]
> Source Code Patch: 
> https://git-wip-us.apache.org/repos/asf?p=phoenix.git;a=commitdiff;h=5e33dc12bc088bd0008d89f0a5cd7d5c368efa25



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (PHOENIX-3815) Only disable indexes on which write failures occurred

2017-08-28 Thread James Taylor (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-3815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16144741#comment-16144741
 ] 

James Taylor commented on PHOENIX-3815:
---

I think we should just get rid of ParallelWriterIndexCommitter completely. That would remove a bunch of code duplication, and we need to know which index the write failed on anyway. Do you think we'll take a perf hit, and if so, is there anything we can do to improve TrackingParallelWriterIndexCommitter?

> Only disable indexes on which write failures occurred
> -
>
> Key: PHOENIX-3815
> URL: https://issues.apache.org/jira/browse/PHOENIX-3815
> Project: Phoenix
>  Issue Type: Bug
>Reporter: James Taylor
>Assignee: Vincent Poon
> Fix For: 4.12.0
>
> Attachments: PHOENIX-3815.v1.patch
>
>
> We currently disable all indexes if any of them fail to be written to. We 
> really only should disable the one in which the write failed.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (PHOENIX-2460) Implement scrutiny command to validate whether or not an index is in sync with the data table

2017-08-28 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-2460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16144711#comment-16144711
 ] 

Hudson commented on PHOENIX-2460:
-

FAILURE: Integrated in Jenkins build Phoenix-master #1754 (See 
[https://builds.apache.org/job/Phoenix-master/1754/])
PHOENIX-2460 Implement scrutiny command to validate whether  or not an 
(jtaylor: rev fc659488361c91b569f15a26dcbab5cbb24c276b)
* (edit) phoenix-core/src/main/java/org/apache/phoenix/util/QueryUtil.java
* (add) 
phoenix-core/src/main/java/org/apache/phoenix/mapreduce/index/IndexScrutinyTool.java
* (add) 
phoenix-core/src/test/java/org/apache/phoenix/mapreduce/util/IndexColumnNamesTest.java
* (add) 
phoenix-core/src/main/java/org/apache/phoenix/mapreduce/index/SourceTargetColumnNames.java
* (add) 
phoenix-core/src/it/java/org/apache/phoenix/end2end/IndexScrutinyToolIT.java
* (add) 
phoenix-core/src/main/java/org/apache/phoenix/mapreduce/index/PhoenixScrutinyJobCounters.java
* (edit) phoenix-core/src/main/java/org/apache/phoenix/util/SchemaUtil.java
* (edit) 
phoenix-core/src/main/java/org/apache/phoenix/mapreduce/util/PhoenixConfigurationUtil.java
* (add) 
phoenix-core/src/main/java/org/apache/phoenix/mapreduce/index/IndexScrutinyTableOutput.java
* (edit) phoenix-core/src/test/java/org/apache/phoenix/util/QueryUtilTest.java
* (add) 
phoenix-core/src/main/java/org/apache/phoenix/mapreduce/index/IndexScrutinyMapper.java
* (add) 
phoenix-core/src/test/java/org/apache/phoenix/mapreduce/index/IndexScrutinyTableOutputTest.java
* (add) 
phoenix-core/src/test/java/org/apache/phoenix/mapreduce/index/BaseIndexTest.java
* (edit) 
phoenix-core/src/main/java/org/apache/phoenix/mapreduce/index/PhoenixIndexDBWritable.java
* (add) 
phoenix-core/src/main/java/org/apache/phoenix/mapreduce/util/IndexColumnNames.java


> Implement scrutiny command to validate whether or not an index is in sync 
> with the data table
> -
>
> Key: PHOENIX-2460
> URL: https://issues.apache.org/jira/browse/PHOENIX-2460
> Project: Phoenix
>  Issue Type: Bug
>Reporter: James Taylor
>Assignee: Vincent Poon
> Fix For: 4.12.0
>
> Attachments: PHOENIX-2460.patch
>
>
> We should have a process that runs to verify that an index is valid against a 
> data table and potentially fixes it if discrepancies are found. This could 
> either be a MR job or a low priority background task.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (PHOENIX-418) Support approximate COUNT DISTINCT

2017-08-28 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16144645#comment-16144645
 ] 

Hudson commented on PHOENIX-418:


FAILURE: Integrated in Jenkins build Phoenix-master #1753 (See 
[https://builds.apache.org/job/Phoenix-master/1753/])
PHOENIX-418 Support approximate COUNT DISTINCT (Ethan Wang) (jtaylor: rev 
d6381afc3af976ccdbb874d4458ea17b1e8a1d32)
* (add) 
phoenix-core/src/main/java/org/apache/phoenix/expression/function/DistinctCountHyperLogLogAggregateFunction.java
* (add) 
phoenix-core/src/it/java/org/apache/phoenix/end2end/CountDistinctApproximateHyperLogLogIT.java
* (edit) dev/release_files/NOTICE
* (edit) phoenix-core/pom.xml
* (edit) 
phoenix-core/src/main/java/org/apache/phoenix/expression/ExpressionType.java
* (add) 
phoenix-core/src/main/java/org/apache/phoenix/parse/DistinctCountHyperLogLogAggregateParseNode.java


> Support approximate COUNT DISTINCT
> --
>
> Key: PHOENIX-418
> URL: https://issues.apache.org/jira/browse/PHOENIX-418
> Project: Phoenix
>  Issue Type: Task
>Reporter: James Taylor
>Assignee: Ethan Wang
>  Labels: gsoc2016
> Attachments: PHOENIX-418-v1.patch, PHOENIX-418-v2.patch, 
> PHOENIX-418-v3.patch, PHOENIX-418-v4.patch, PHOENIX-418-v5.patch, 
> PHOENIX-418-v6.patch
>
>
> Support an "approximation" of count distinct to prevent having to hold on to 
> all distinct values (since this will not scale well when the number of 
> distinct values is huge). The Apache Drill folks have had some interesting 
> discussions on this 
> [here](http://mail-archives.apache.org/mod_mbox/incubator-drill-dev/201306.mbox/%3CJIRA.12650169.1369931282407.88049.1370645900553%40arcas%3E).
>  They recommend using  [Welford's 
> method](http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance_Online_algorithm).
>  I'm open to having a config option that uses exact versus approximate. I 
> don't have experience implementing an approximate implementation, so I'm not 
> sure how much state is required to keep on the server and return to the 
> client (other than realizing it'd be much less than returning all distinct 
> values and their counts).
> Update:
> Syntax of using approximate count distinct as:
> select APPROX_COUNT_DISTINCT(name) from person
> select APPROX_COUNT_DISTINCT(address||name) from person
> It is equivalent to Select COUNT(DISTINCT ID) from person. Implemented using HyperLogLog; see the discussion below.
> Merged patch link below, co-authored with [~swapna]
> https://git-wip-us.apache.org/repos/asf?p=phoenix.git;a=commitdiff;h=d6381afc3af976ccdbb874d4458ea17b1e8a1d32



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (PHOENIX-3815) Only disable indexes on which write failures occurred

2017-08-28 Thread Vincent Poon (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-3815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16144631#comment-16144631
 ] 

Vincent Poon commented on PHOENIX-3815:
---

[~jamestaylor] Can you review? Note that this requires TrackingParallelWriterIndexCommitter to be set via config, unless we change it to be the default.

> Only disable indexes on which write failures occurred
> -
>
> Key: PHOENIX-3815
> URL: https://issues.apache.org/jira/browse/PHOENIX-3815
> Project: Phoenix
>  Issue Type: Bug
>Reporter: James Taylor
>Assignee: Vincent Poon
> Fix For: 4.12.0
>
> Attachments: PHOENIX-3815.v1.patch
>
>
> We currently disable all indexes if any of them fail to be written to. We 
> really only should disable the one in which the write failed.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (PHOENIX-3815) Only disable indexes on which write failures occurred

2017-08-28 Thread Vincent Poon (JIRA)

 [ 
https://issues.apache.org/jira/browse/PHOENIX-3815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vincent Poon updated PHOENIX-3815:
--
Attachment: PHOENIX-3815.v1.patch

> Only disable indexes on which write failures occurred
> -
>
> Key: PHOENIX-3815
> URL: https://issues.apache.org/jira/browse/PHOENIX-3815
> Project: Phoenix
>  Issue Type: Bug
>Reporter: James Taylor
>Assignee: Vincent Poon
> Fix For: 4.12.0
>
> Attachments: PHOENIX-3815.v1.patch
>
>
> We currently disable all indexes if any of them fail to be written to. We 
> really only should disable the one in which the write failed.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (PHOENIX-3815) Only disable indexes on which write failures occurred

2017-08-28 Thread Vincent Poon (JIRA)

 [ 
https://issues.apache.org/jira/browse/PHOENIX-3815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vincent Poon reassigned PHOENIX-3815:
-

Assignee: Vincent Poon  (was: James Taylor)

> Only disable indexes on which write failures occurred
> -
>
> Key: PHOENIX-3815
> URL: https://issues.apache.org/jira/browse/PHOENIX-3815
> Project: Phoenix
>  Issue Type: Bug
>Reporter: James Taylor
>Assignee: Vincent Poon
> Fix For: 4.12.0
>
>
> We currently disable all indexes if any of them fail to be written to. We 
> really only should disable the one in which the write failed.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (PHOENIX-4080) The error message for version mismatch is not accurate.

2017-08-28 Thread Ethan Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-4080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16144550#comment-16144550
 ] 

Ethan Wang commented on PHOENIX-4080:
-

[~gjacoby]
Yes, I have re-run the ITs locally and they all pass. I also reviewed the IT errors from the Apache Jenkins run, and they don't look related to me. Thanks for following up.

> The error message for version mismatch is not accurate.
> ---
>
> Key: PHOENIX-4080
> URL: https://issues.apache.org/jira/browse/PHOENIX-4080
> Project: Phoenix
>  Issue Type: Wish
>Affects Versions: 4.11.0
>Reporter: Ethan Wang
>Assignee: Ethan Wang
> Attachments: PHOENIX-4080.patch, PHOENIX-4080-v2.patch
>
>
> When accessing a 4.10 running cluster with 4.11 client, it referred as 
> The following servers require an updated phoenix.jar to be put in the 
> classpath of HBase: region=SYSTEM.CATALOG
> It should be phoenix-[version]-server.jar rather than phoenix.jar



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (PHOENIX-418) Support approximate COUNT DISTINCT

2017-08-28 Thread Ethan Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PHOENIX-418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Wang updated PHOENIX-418:
---
Description: 
Support an "approximation" of count distinct to prevent having to hold on to 
all distinct values (since this will not scale well when the number of distinct 
values is huge). The Apache Drill folks have had some interesting discussions 
on this 
[here](http://mail-archives.apache.org/mod_mbox/incubator-drill-dev/201306.mbox/%3CJIRA.12650169.1369931282407.88049.1370645900553%40arcas%3E).
 They recommend using  [Welford's 
method](http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance_Online_algorithm).
 I'm open to having a config option that uses exact versus approximate. I don't 
have experience implementing an approximate implementation, so I'm not sure how 
much state is required to keep on the server and return to the client (other 
than realizing it'd be much less than returning all distinct values and their 
counts).

Update:
Syntax of using approximate count distinct as:
select APPROX_COUNT_DISTINCT(name) from person
select APPROX_COUNT_DISTINCT(address||name) from person

It is equivalent to Select COUNT(DISTINCT ID) from person. Implemented using HyperLogLog; see the discussion below.

Merged patch link below, co-authored with [~swapna]
https://git-wip-us.apache.org/repos/asf?p=phoenix.git;a=commitdiff;h=d6381afc3af976ccdbb874d4458ea17b1e8a1d32

  was:
Support an "approximation" of count distinct to prevent having to hold on to 
all distinct values (since this will not scale well when the number of distinct 
values is huge). The Apache Drill folks have had some interesting discussions 
on this 
[here](http://mail-archives.apache.org/mod_mbox/incubator-drill-dev/201306.mbox/%3CJIRA.12650169.1369931282407.88049.1370645900553%40arcas%3E).
 They recommend using  [Welford's 
method](http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance_Online_algorithm).
 I'm open to having a config option that uses exact versus approximate. I don't 
have experience implementing an approximate implementation, so I'm not sure how 
much state is required to keep on the server and return to the client (other 
than realizing it'd be much less than returning all distinct values and their 
counts).

Update:
Syntax of using approximate count distinct as:
select APPROX_COUNT_DISTINCT(name) from person
select APPROX_COUNT_DISTINCT(address||name) from person

It is equivalent to Select COUNT(DISTINCT ID) from person, but with a much smaller memory footprint. Implemented using HyperLogLog.

Merged patch link below, co-authored with [~swapna]
https://git-wip-us.apache.org/repos/asf?p=phoenix.git;a=commitdiff;h=d6381afc3af976ccdbb874d4458ea17b1e8a1d32


> Support approximate COUNT DISTINCT
> --
>
> Key: PHOENIX-418
> URL: https://issues.apache.org/jira/browse/PHOENIX-418
> Project: Phoenix
>  Issue Type: Task
>Reporter: James Taylor
>Assignee: Ethan Wang
>  Labels: gsoc2016
> Attachments: PHOENIX-418-v1.patch, PHOENIX-418-v2.patch, 
> PHOENIX-418-v3.patch, PHOENIX-418-v4.patch, PHOENIX-418-v5.patch, 
> PHOENIX-418-v6.patch
>
>
> Support an "approximation" of count distinct to prevent having to hold on to 
> all distinct values (since this will not scale well when the number of 
> distinct values is huge). The Apache Drill folks have had some interesting 
> discussions on this 
> [here](http://mail-archives.apache.org/mod_mbox/incubator-drill-dev/201306.mbox/%3CJIRA.12650169.1369931282407.88049.1370645900553%40arcas%3E).
>  They recommend using  [Welford's 
> method](http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance_Online_algorithm).
>  I'm open to having a config option that uses exact versus approximate. I 
> don't have experience implementing an approximate implementation, so I'm not 
> sure how much state is required to keep on the server and return to the 
> client (other than realizing it'd be much less than returning all distinct 
> values and their counts).
> Update:
> Syntax of using approximate count distinct as:
> select APPROX_COUNT_DISTINCT(name) from person
> select APPROX_COUNT_DISTINCT(address||name) from person
> It is equivalent to Select COUNT(DISTINCT ID) from person. Implemented using HyperLogLog; see the discussion below.
> Merged patch link below, co-authored with [~swapna]
> https://git-wip-us.apache.org/repos/asf?p=phoenix.git;a=commitdiff;h=d6381afc3af976ccdbb874d4458ea17b1e8a1d32



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (PHOENIX-418) Support approximate COUNT DISTINCT

2017-08-28 Thread Ethan Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PHOENIX-418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Wang updated PHOENIX-418:
---
Description: 
Support an "approximation" of count distinct to prevent having to hold on to 
all distinct values (since this will not scale well when the number of distinct 
values is huge). The Apache Drill folks have had some interesting discussions 
on this 
[here](http://mail-archives.apache.org/mod_mbox/incubator-drill-dev/201306.mbox/%3CJIRA.12650169.1369931282407.88049.1370645900553%40arcas%3E).
 They recommend using  [Welford's 
method](http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance_Online_algorithm).
 I'm open to having a config option that uses exact versus approximate. I don't 
have experience implementing an approximate implementation, so I'm not sure how 
much state is required to keep on the server and return to the client (other 
than realizing it'd be much less than returning all distinct values and their 
counts).

Update:
Syntax of using approximate count distinct as:
select APPROX_COUNT_DISTINCT(name) from person
select APPROX_COUNT_DISTINCT(address||name) from person

It is equivalent to Select COUNT(DISTINCT ID) from person, but with a much smaller memory footprint. Implemented using HyperLogLog.

Merged patch link below, co-authored with [~swapna]
https://git-wip-us.apache.org/repos/asf?p=phoenix.git;a=commitdiff;h=d6381afc3af976ccdbb874d4458ea17b1e8a1d32

  was:
Support an "approximation" of count distinct to prevent having to hold on to 
all distinct values (since this will not scale well when the number of distinct 
values is huge). The Apache Drill folks have had some interesting discussions 
on this 
[here](http://mail-archives.apache.org/mod_mbox/incubator-drill-dev/201306.mbox/%3CJIRA.12650169.1369931282407.88049.1370645900553%40arcas%3E).
 They recommend using  [Welford's 
method](http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance_Online_algorithm).
 I'm open to having a config option that uses exact versus approximate. I don't 
have experience implementing an approximate implementation, so I'm not sure how 
much state is required to keep on the server and return to the client (other 
than realizing it'd be much less than returning all distinct values and their 
counts).

Updated:
Syntax of using approximate count distinct as:
Select COUNT(DISTINCT ID) from person
Select APPROX_COUNT_DISTINCT(ID) from person

https://git-wip-us.apache.org/repos/asf?p=phoenix.git;a=commitdiff;h=d6381afc3af976ccdbb874d4458ea17b1e8a1d32


> Support approximate COUNT DISTINCT
> --
>
> Key: PHOENIX-418
> URL: https://issues.apache.org/jira/browse/PHOENIX-418
> Project: Phoenix
>  Issue Type: Task
>Reporter: James Taylor
>Assignee: Ethan Wang
>  Labels: gsoc2016
> Attachments: PHOENIX-418-v1.patch, PHOENIX-418-v2.patch, 
> PHOENIX-418-v3.patch, PHOENIX-418-v4.patch, PHOENIX-418-v5.patch, 
> PHOENIX-418-v6.patch
>
>
> Support an "approximation" of count distinct to prevent having to hold on to 
> all distinct values (since this will not scale well when the number of 
> distinct values is huge). The Apache Drill folks have had some interesting 
> discussions on this 
> [here](http://mail-archives.apache.org/mod_mbox/incubator-drill-dev/201306.mbox/%3CJIRA.12650169.1369931282407.88049.1370645900553%40arcas%3E).
>  They recommend using  [Welford's 
> method](http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance_Online_algorithm).
>  I'm open to having a config option that uses exact versus approximate. I 
> don't have experience implementing an approximate implementation, so I'm not 
> sure how much state is required to keep on the server and return to the 
> client (other than realizing it'd be much less than returning all distinct 
> values and their counts).
> Update:
> Syntax of using approximate count distinct as:
> select APPROX_COUNT_DISTINCT(name) from person
> select APPROX_COUNT_DISTINCT(address||name) from person
> It is equivalent to Select COUNT(DISTINCT ID) from person, but with a much smaller memory footprint. Implemented using HyperLogLog.
> Merged patch link below, co-authored with [~swapna]
> https://git-wip-us.apache.org/repos/asf?p=phoenix.git;a=commitdiff;h=d6381afc3af976ccdbb874d4458ea17b1e8a1d32



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (PHOENIX-418) Support approximate COUNT DISTINCT

2017-08-28 Thread Ethan Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PHOENIX-418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Wang updated PHOENIX-418:
---
Description: 
Support an "approximation" of count distinct to prevent having to hold on to 
all distinct values (since this will not scale well when the number of distinct 
values is huge). The Apache Drill folks have had some interesting discussions 
on this 
[here](http://mail-archives.apache.org/mod_mbox/incubator-drill-dev/201306.mbox/%3CJIRA.12650169.1369931282407.88049.1370645900553%40arcas%3E).
 They recommend using  [Welford's 
method](http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance_Online_algorithm).
 I'm open to having a config option that uses exact versus approximate. I don't 
have experience implementing an approximate implementation, so I'm not sure how 
much state is required to keep on the server and return to the client (other 
than realizing it'd be much less than returning all distinct values and their 
counts).

Updated:
Syntax of using approximate count distinct as:
Select COUNT(DISTINCT ID) from person
Select APPROX_COUNT_DISTINCT(ID) from person

https://git-wip-us.apache.org/repos/asf?p=phoenix.git;a=commitdiff;h=d6381afc3af976ccdbb874d4458ea17b1e8a1d32

  was:
Support an "approximation" of count distinct to prevent having to hold on to 
all distinct values (since this will not scale well when the number of distinct 
values is huge). The Apache Drill folks have had some interesting discussions 
on this 
[here](http://mail-archives.apache.org/mod_mbox/incubator-drill-dev/201306.mbox/%3CJIRA.12650169.1369931282407.88049.1370645900553%40arcas%3E).
 They recommend using  [Welford's 
method](http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance_Online_algorithm).
 I'm open to having a config option that uses exact versus approximate. I don't 
have experience implementing an approximate implementation, so I'm not sure how 
much state is required to keep on the server and return to the client (other 
than realizing it'd be much less than returning all distinct values and their 
counts).

Updated:
Syntax of using approximate count distinct as:


https://git-wip-us.apache.org/repos/asf?p=phoenix.git;a=commitdiff;h=d6381afc3af976ccdbb874d4458ea17b1e8a1d32


> Support approximate COUNT DISTINCT
> --
>
> Key: PHOENIX-418
> URL: https://issues.apache.org/jira/browse/PHOENIX-418
> Project: Phoenix
>  Issue Type: Task
>Reporter: James Taylor
>Assignee: Ethan Wang
>  Labels: gsoc2016
> Attachments: PHOENIX-418-v1.patch, PHOENIX-418-v2.patch, 
> PHOENIX-418-v3.patch, PHOENIX-418-v4.patch, PHOENIX-418-v5.patch, 
> PHOENIX-418-v6.patch
>
>
> Support an "approximation" of count distinct to prevent having to hold on to 
> all distinct values (since this will not scale well when the number of 
> distinct values is huge). The Apache Drill folks have had some interesting 
> discussions on this 
> [here](http://mail-archives.apache.org/mod_mbox/incubator-drill-dev/201306.mbox/%3CJIRA.12650169.1369931282407.88049.1370645900553%40arcas%3E).
>  They recommend using  [Welford's 
> method](http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance_Online_algorithm).
>  I'm open to having a config option that uses exact versus approximate. I 
> don't have experience implementing an approximate implementation, so I'm not 
> sure how much state is required to keep on the server and return to the 
> client (other than realizing it'd be much less than returning all distinct 
> values and their counts).
> Updated:
> Syntax of using approximate count distinct as:
> Select COUNT(DISTINCT ID) from person
> Select APPROX_COUNT_DISTINCT(ID) from person
> https://git-wip-us.apache.org/repos/asf?p=phoenix.git;a=commitdiff;h=d6381afc3af976ccdbb874d4458ea17b1e8a1d32



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (PHOENIX-418) Support approximate COUNT DISTINCT

2017-08-28 Thread Ethan Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PHOENIX-418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Wang updated PHOENIX-418:
---
Description: 
Support an "approximation" of count distinct to prevent having to hold on to 
all distinct values (since this will not scale well when the number of distinct 
values is huge). The Apache Drill folks have had some interesting discussions 
on this 
[here](http://mail-archives.apache.org/mod_mbox/incubator-drill-dev/201306.mbox/%3CJIRA.12650169.1369931282407.88049.1370645900553%40arcas%3E).
 They recommend using  [Welford's 
method](http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance_Online_algorithm).
 I'm open to having a config option that uses exact versus approximate. I don't 
have experience implementing an approximate implementation, so I'm not sure how 
much state is required to keep on the server and return to the client (other 
than realizing it'd be much less than returning all distinct values and their 
counts).

Updated:
Syntax of using approximate count distinct as:


https://git-wip-us.apache.org/repos/asf?p=phoenix.git;a=commitdiff;h=d6381afc3af976ccdbb874d4458ea17b1e8a1d32

  was:Support an "approximation" of count distinct to prevent having to hold on 
to all distinct values (since this will not scale well when the number of 
distinct values is huge). The Apache Drill folks have had some interesting 
discussions on this 
[here](http://mail-archives.apache.org/mod_mbox/incubator-drill-dev/201306.mbox/%3CJIRA.12650169.1369931282407.88049.1370645900553%40arcas%3E).
 They recommend using  [Welford's 
method](http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance_Online_algorithm).
 I'm open to having a config option that uses exact versus approximate. I don't 
have experience implementing an approximate implementation, so I'm not sure how 
much state is required to keep on the server and return to the client (other 
than realizing it'd be much less than returning all distinct values and their 
counts).


> Support approximate COUNT DISTINCT
> --
>
> Key: PHOENIX-418
> URL: https://issues.apache.org/jira/browse/PHOENIX-418
> Project: Phoenix
>  Issue Type: Task
>Reporter: James Taylor
>Assignee: Ethan Wang
>  Labels: gsoc2016
> Attachments: PHOENIX-418-v1.patch, PHOENIX-418-v2.patch, 
> PHOENIX-418-v3.patch, PHOENIX-418-v4.patch, PHOENIX-418-v5.patch, 
> PHOENIX-418-v6.patch
>
>
> Support an "approximation" of count distinct to prevent having to hold on to 
> all distinct values (since this will not scale well when the number of 
> distinct values is huge). The Apache Drill folks have had some interesting 
> discussions on this 
> [here](http://mail-archives.apache.org/mod_mbox/incubator-drill-dev/201306.mbox/%3CJIRA.12650169.1369931282407.88049.1370645900553%40arcas%3E).
>  They recommend using  [Welford's 
> method](http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance_Online_algorithm).
>  I'm open to having a config option that uses exact versus approximate. I 
> don't have experience implementing an approximate implementation, so I'm not 
> sure how much state is required to keep on the server and return to the 
> client (other than realizing it'd be much less than returning all distinct 
> values and their counts).
> Updated:
> Syntax of using approximate count distinct as:
> https://git-wip-us.apache.org/repos/asf?p=phoenix.git;a=commitdiff;h=d6381afc3af976ccdbb874d4458ea17b1e8a1d32



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (PHOENIX-3999) Optimize inner joins as SKIP-SCAN-JOIN when possible

2017-08-28 Thread James Taylor (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-3999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16144542#comment-16144542
 ] 

James Taylor commented on PHOENIX-3999:
---

bq. only during inner join: ITEMS table as parent will receive a HashCacheImpl 
from RHS in order to look up the b.BATCH_SEQUENCE_NUM

It's good that a skip scan is used during the scan of ITEMS. However, rather 
than scan the RHS and broadcast it, it'd be more efficient to take the same 
approach as with the semi join and drive the join on the client side as the LHS 
is being scanned.

WDYT, [~maryannxue]?
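
As a point of comparison, a sketch (assuming the schemas from the description; not literal optimizer output) of the semi-join shape that already drives the join from the client side while the LHS is scanned:

{code}
-- Semi-join on the leading PK column of ITEMS: the distinct BATCH_IDs from
-- COMPLETED_BATCHES are gathered and pushed back down as a skip-scan
-- (dynamic) filter, i.e. batches of point lookups rather than a broadcast.
SELECT i.ITEM_TYPE, i.ITEM_ID, i.ITEM_VALUE
FROM ITEMS i
WHERE i.BATCH_ID IN (
    SELECT BATCH_ID FROM COMPLETED_BATCHES
    WHERE BATCH_SEQUENCE_NUM > 1000 AND BATCH_SEQUENCE_NUM < 2000
);
{code}

The suggestion above is to drive the inner join the same way, rather than broadcasting the RHS.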

> Optimize inner joins as SKIP-SCAN-JOIN when possible
> 
>
> Key: PHOENIX-3999
> URL: https://issues.apache.org/jira/browse/PHOENIX-3999
> Project: Phoenix
>  Issue Type: Bug
>Reporter: James Taylor
>
> Semi joins on the leading part of the primary key end up doing batches of point queries (as opposed to a broadcast hash join); however, inner joins do not.
> Here's a set of example schemas for which a skip scan is executed on the inner query:
> {code}
> CREATE TABLE COMPLETED_BATCHES (
> BATCH_SEQUENCE_NUM BIGINT NOT NULL,
> BATCH_ID   BIGINT NOT NULL,
> CONSTRAINT PK PRIMARY KEY
> (
> BATCH_SEQUENCE_NUM,
> BATCH_ID
> )
> );
> CREATE TABLE ITEMS (
>BATCH_ID BIGINT NOT NULL,
>ITEM_ID BIGINT NOT NULL,
>ITEM_TYPE BIGINT,
>ITEM_VALUE VARCHAR,
>CONSTRAINT PK PRIMARY KEY
>(
> BATCH_ID,
> ITEM_ID
>)
> );
> CREATE TABLE COMPLETED_ITEMS (
>ITEM_TYPE  BIGINT NOT NULL,
>BATCH_SEQUENCE_NUM BIGINT NOT NULL,
>ITEM_IDBIGINT NOT NULL,
>ITEM_VALUE VARCHAR,
>CONSTRAINT PK PRIMARY KEY
>(
>   ITEM_TYPE,
>   BATCH_SEQUENCE_NUM,  
>   ITEM_ID
>)
> );
> {code}
> The explain plan of these indicates that a dynamic filter will be performed, like this:
> {code}
> UPSERT SELECT
> CLIENT PARALLEL 1-WAY FULL SCAN OVER ITEMS
> SKIP-SCAN-JOIN TABLE 0
> CLIENT PARALLEL 1-WAY RANGE SCAN OVER COMPLETED_BATCHES [1] - [2]
> SERVER FILTER BY FIRST KEY ONLY
> SERVER AGGREGATE INTO DISTINCT ROWS BY [BATCH_ID]
> CLIENT MERGE SORT
> DYNAMIC SERVER FILTER BY I.BATCH_ID IN ($8.$9)
> {code}
> We should also be able to leverage this optimization when an inner join is 
> used such as this:
> {code}
> UPSERT INTO COMPLETED_ITEMS (ITEM_TYPE, BATCH_SEQUENCE_NUM, ITEM_ID, 
> ITEM_VALUE)
>SELECT i.ITEM_TYPE, b.BATCH_SEQUENCE_NUM, i.ITEM_ID, i.ITEM_VALUE   
>FROM  ITEMS i, COMPLETED_BATCHES b
>WHERE b.BATCH_ID = i.BATCH_ID AND  
>b.BATCH_SEQUENCE_NUM > 1000 AND b.BATCH_SEQUENCE_NUM < 2000;
> {code}
> A complete unit test looks like this:
> {code}
> @Test
> public void testNestedLoopJoin() throws Exception {
> try (Connection conn = DriverManager.getConnection(getUrl())) {
> String t1="COMPLETED_BATCHES";
> String ddl1 = "CREATE TABLE " + t1 + " (\n" + 
> "BATCH_SEQUENCE_NUM BIGINT NOT NULL,\n" + 
> "BATCH_ID   BIGINT NOT NULL,\n" + 
> "CONSTRAINT PK PRIMARY KEY\n" + 
> "(\n" + 
> "BATCH_SEQUENCE_NUM,\n" + 
> "BATCH_ID\n" + 
> ")\n" + 
> ")" + 
> "";
> conn.createStatement().execute(ddl1);
> 
> String t2="ITEMS";
> String ddl2 = "CREATE TABLE " + t2 + " (\n" + 
> "   BATCH_ID BIGINT NOT NULL,\n" + 
> "   ITEM_ID BIGINT NOT NULL,\n" + 
> "   ITEM_TYPE BIGINT,\n" + 
> "   ITEM_VALUE VARCHAR,\n" + 
> "   CONSTRAINT PK PRIMARY KEY\n" + 
> "   (\n" + 
> "BATCH_ID,\n" + 
> "ITEM_ID\n" + 
> "   )\n" + 
> ")";
> conn.createStatement().execute(ddl2);
> String t3="COMPLETED_ITEMS";
> String ddl3 = "CREATE TABLE " + t3 + "(\n" + 
> "   ITEM_TYPE  BIGINT NOT NULL,\n" + 
> "   BATCH_SEQUENCE_NUM BIGINT NOT NULL,\n" + 
> "   ITEM_IDBIGINT NOT NULL,\n" + 
> "   ITEM_VALUE VARCHAR,\n" + 
> "   CONSTRAINT PK PRIMARY KEY\n" + 
> "   (\n" + 
> "  ITEM_TYPE,\n" + 
> "  BATCH_SEQUENCE_NUM,  \n" + 
> "  ITEM_ID\n" + 
> "   )\n" + 
> ")";
> conn.createStatement().execute(ddl3);
>  

[jira] [Created] (PHOENIX-4137) Document IndexScrutinyTool

2017-08-28 Thread James Taylor (JIRA)
James Taylor created PHOENIX-4137:
-

 Summary: Document IndexScrutinyTool
 Key: PHOENIX-4137
 URL: https://issues.apache.org/jira/browse/PHOENIX-4137
 Project: Phoenix
  Issue Type: Task
Reporter: James Taylor
Assignee: Vincent Poon


Now that PHOENIX-2460 has been committed, we need to update our website 
documentation to describe how to use it. For an overview of updating the 
website, see http://phoenix.apache.org/building_website.html. For 
IndexScrutinyTool, it's probably enough to add a section in 
https://phoenix.apache.org/secondary_indexing.html (which lives in 
./site/source/src/site/markdown/secondary_indexing.md) describing the purpose 
and possible arguments to the MR job. Something similar to the table for our 
bulk loader here: 
https://phoenix.apache.org/bulk_dataload.html#Loading_via_MapReduce.




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (PHOENIX-4137) Document IndexScrutinyTool

2017-08-28 Thread James Taylor (JIRA)

 [ 
https://issues.apache.org/jira/browse/PHOENIX-4137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Taylor updated PHOENIX-4137:
--
Fix Version/s: 4.12.0

> Document IndexScrutinyTool
> --
>
> Key: PHOENIX-4137
> URL: https://issues.apache.org/jira/browse/PHOENIX-4137
> Project: Phoenix
>  Issue Type: Task
>Reporter: James Taylor
>Assignee: Vincent Poon
> Fix For: 4.12.0
>
>
> Now that PHOENIX-2460 has been committed, we need to update our website 
> documentation to describe how to use it. For an overview of updating the 
> website, see http://phoenix.apache.org/building_website.html. For 
> IndexScrutinyTool, it's probably enough to add a section in 
> https://phoenix.apache.org/secondary_indexing.html (which lives in 
> ./site/source/src/site/markdown/secondary_indexing.md) describing the purpose 
> and possible arguments to the MR job. Something similar to the table for our 
> bulk loader here: 
> https://phoenix.apache.org/bulk_dataload.html#Loading_via_MapReduce.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (PHOENIX-4136) Document APPROXIMATE_COUNT_DISTINCT function

2017-08-28 Thread James Taylor (JIRA)

 [ 
https://issues.apache.org/jira/browse/PHOENIX-4136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Taylor updated PHOENIX-4136:
--
Description: Now that PHOENIX-418 has been committed, we need to document 
this new function by including  APPROXIMATE_COUNT_DISTINCT in our list of 
functions (which lives in phoenix.csv) so that it shows up here: 
https://phoenix.apache.org/language/functions.html  (was: Now that is fixed, we 
need to document this new function by including  APPROXIMATE_COUNT_DISTINCT in 
our list of functions (which lives in phoenix.csv) so that it shows up here: 
https://phoenix.apache.org/language/functions.html)

> Document APPROXIMATE_COUNT_DISTINCT function
> 
>
> Key: PHOENIX-4136
> URL: https://issues.apache.org/jira/browse/PHOENIX-4136
> Project: Phoenix
>  Issue Type: Task
>Reporter: James Taylor
>Assignee: Ethan Wang
> Fix For: 4.12.0
>
>
> Now that PHOENIX-418 has been committed, we need to document this new 
> function by including  APPROXIMATE_COUNT_DISTINCT in our list of functions 
> (which lives in phoenix.csv) so that it shows up here: 
> https://phoenix.apache.org/language/functions.html



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (PHOENIX-4002) Document FETCH NEXT| n ROWS from Cursor

2017-08-28 Thread James Taylor (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-4002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16144494#comment-16144494
 ] 

James Taylor commented on PHOENIX-4002:
---

Ping [~gsbiju].

> Document FETCH NEXT| n ROWS from Cursor
> ---
>
> Key: PHOENIX-4002
> URL: https://issues.apache.org/jira/browse/PHOENIX-4002
> Project: Phoenix
>  Issue Type: Sub-task
>Reporter: James Taylor
>Assignee: Biju Nair
>
> Now that PHOENIX-3572 is resolved and released, we need to add documentation 
> for this new functionality on our website. For directions on how to do that, 
> see http://phoenix.apache.org/building_website.html. I'd recommend adding a 
> new top level page linked off of our Features menu that explains from a users 
> perspective how to use it, and also updating our reference grammar here 
> (which is derived from content in phoenix.csv): 
> http://phoenix.apache.org/language/index.html



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (PHOENIX-4136) Document APPROXIMATE_COUNT_DISTINCT function

2017-08-28 Thread James Taylor (JIRA)
James Taylor created PHOENIX-4136:
-

 Summary: Document APPROXIMATE_COUNT_DISTINCT function
 Key: PHOENIX-4136
 URL: https://issues.apache.org/jira/browse/PHOENIX-4136
 Project: Phoenix
  Issue Type: Task
Reporter: James Taylor
Assignee: Ethan Wang
 Fix For: 4.12.0


Now that is fixed, we need to document this new function by including  
APPROXIMATE_COUNT_DISTINCT in our list of functions (which lives in 
phoenix.csv) so that it shows up here: 
https://phoenix.apache.org/language/functions.html



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Resolved] (PHOENIX-2460) Implement scrutiny command to validate whether or not an index is in sync with the data table

2017-08-28 Thread James Taylor (JIRA)

 [ 
https://issues.apache.org/jira/browse/PHOENIX-2460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Taylor resolved PHOENIX-2460.
---
   Resolution: Fixed
Fix Version/s: 4.12.0

Thanks for the contribution, [~vincentpoon]. I've pushed it to 4.x and master 
branches. Nice work!

> Implement scrutiny command to validate whether or not an index is in sync 
> with the data table
> -
>
> Key: PHOENIX-2460
> URL: https://issues.apache.org/jira/browse/PHOENIX-2460
> Project: Phoenix
>  Issue Type: Bug
>Reporter: James Taylor
>Assignee: Vincent Poon
> Fix For: 4.12.0
>
> Attachments: PHOENIX-2460.patch
>
>
> We should have a process that runs to verify that an index is valid against a 
> data table and potentially fixes it if discrepancies are found. This could 
> either be a MR job or a low priority background task.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (PHOENIX-418) Support approximate COUNT DISTINCT

2017-08-28 Thread James Taylor (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16144482#comment-16144482
 ] 

James Taylor commented on PHOENIX-418:
--

Do you know what the differences are, [~mujtabachohan], between "ubuntu-eu2" and "H4"? Different JVMs?

> Support approximate COUNT DISTINCT
> --
>
> Key: PHOENIX-418
> URL: https://issues.apache.org/jira/browse/PHOENIX-418
> Project: Phoenix
>  Issue Type: Task
>Reporter: James Taylor
>Assignee: Ethan Wang
>  Labels: gsoc2016
> Attachments: PHOENIX-418-v1.patch, PHOENIX-418-v2.patch, 
> PHOENIX-418-v3.patch, PHOENIX-418-v4.patch, PHOENIX-418-v5.patch, 
> PHOENIX-418-v6.patch
>
>
> Support an "approximation" of count distinct to prevent having to hold on to 
> all distinct values (since this will not scale well when the number of 
> distinct values is huge). The Apache Drill folks have had some interesting 
> discussions on this 
> [here](http://mail-archives.apache.org/mod_mbox/incubator-drill-dev/201306.mbox/%3CJIRA.12650169.1369931282407.88049.1370645900553%40arcas%3E).
>  They recommend using  [Welford's 
> method](http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance_Online_algorithm).
>  I'm open to having a config option that uses exact versus approximate. I 
> don't have experience implementing an approximate implementation, so I'm not 
> sure how much state is required to keep on the server and return to the 
> client (other than realizing it'd be much less than returning all distinct 
> values and their counts).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (PHOENIX-418) Support approximate COUNT DISTINCT

2017-08-28 Thread James Taylor (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16144459#comment-16144459
 ] 

James Taylor commented on PHOENIX-418:
--

Thanks, [~aertoria]. I've pushed your last patch to 4.x and master branches. 
Nice work!

> Support approximate COUNT DISTINCT
> --
>
> Key: PHOENIX-418
> URL: https://issues.apache.org/jira/browse/PHOENIX-418
> Project: Phoenix
>  Issue Type: Task
>Reporter: James Taylor
>Assignee: Ethan Wang
>  Labels: gsoc2016
> Attachments: PHOENIX-418-v1.patch, PHOENIX-418-v2.patch, 
> PHOENIX-418-v3.patch, PHOENIX-418-v4.patch, PHOENIX-418-v5.patch, 
> PHOENIX-418-v6.patch
>
>
> Support an "approximation" of count distinct to prevent having to hold on to 
> all distinct values (since this will not scale well when the number of 
> distinct values is huge). The Apache Drill folks have had some interesting 
> discussions on this 
> [here](http://mail-archives.apache.org/mod_mbox/incubator-drill-dev/201306.mbox/%3CJIRA.12650169.1369931282407.88049.1370645900553%40arcas%3E).
>  They recommend using  [Welford's 
> method](http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance_Online_algorithm).
>  I'm open to having a config option that uses exact versus approximate. I 
> don't have experience implementing an approximate implementation, so I'm not 
> sure how much state is required to keep on the server and return to the 
> client (other than realizing it'd be much less than returning all distinct 
> values and their counts).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (PHOENIX-418) Support approximate COUNT DISTINCT

2017-08-28 Thread Ethan Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16144380#comment-16144380
 ] 

Ethan Wang commented on PHOENIX-418:


Changes have been made to the pom to use 2.9.5. The flaky IT tests went away with this patch (v6). [~jamestaylor]

During the investigation, one thing I observed is that both successful runs happened when the Jenkins node "ubuntu-eu2" picked the build up (versus "H4" during the weekend). I'm not sure whether this is related. [~samarthjain]

> Support approximate COUNT DISTINCT
> --
>
> Key: PHOENIX-418
> URL: https://issues.apache.org/jira/browse/PHOENIX-418
> Project: Phoenix
>  Issue Type: Task
>Reporter: James Taylor
>Assignee: Ethan Wang
>  Labels: gsoc2016
> Attachments: PHOENIX-418-v1.patch, PHOENIX-418-v2.patch, 
> PHOENIX-418-v3.patch, PHOENIX-418-v4.patch, PHOENIX-418-v5.patch, 
> PHOENIX-418-v6.patch
>
>
> Support an "approximation" of count distinct to prevent having to hold on to 
> all distinct values (since this will not scale well when the number of 
> distinct values is huge). The Apache Drill folks have had some interesting 
> discussions on this 
> [here](http://mail-archives.apache.org/mod_mbox/incubator-drill-dev/201306.mbox/%3CJIRA.12650169.1369931282407.88049.1370645900553%40arcas%3E).
>  They recommend using  [Welford's 
> method](http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance_Online_algorithm).
>  I'm open to having a config option that uses exact versus approximate. I 
> don't have experience implementing an approximate implementation, so I'm not 
> sure how much state is required to keep on the server and return to the 
> client (other than realizing it'd be much less than returning all distinct 
> values and their counts).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (PHOENIX-418) Support approximate COUNT DISTINCT

2017-08-28 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16144378#comment-16144378
 ] 

Hadoop QA commented on PHOENIX-418:
---

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12884107/PHOENIX-418-v6.patch
  against master branch at commit 435441ea8ba336e1967b03cf84f1868c5ef14790.
  ATTACHMENT ID: 12884107

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:red}-1 javadoc{color}.  The javadoc tool appears to have generated 
56 warning messages.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 lineLengths{color}.  The patch introduces the following lines 
longer than 100:
+   String query = "SELECT APPROX_COUNT_DISTINCT(a.i1||a.i2||b.i2) 
FROM " + tableName + " a, " + tableName
+   final private void prepareTableWithValues(final Connection conn, final 
int nRows) throws Exception {
+   final PreparedStatement stmt = conn.prepareStatement("upsert 
into " + tableName + " VALUES (?, ?)");
+@BuiltInFunction(name=DistinctCountHyperLogLogAggregateFunction.NAME, 
nodeClass=DistinctCountHyperLogLogAggregateParseNode.class, args= {@Argument()} 
)
+public DistinctCountHyperLogLogAggregateFunction(List 
childExpressions, CountAggregateFunction delegate){
+   private HyperLogLogPlus hll = new 
HyperLogLogPlus(DistinctCountHyperLogLogAggregateFunction.NormalSetPrecision, 
DistinctCountHyperLogLogAggregateFunction.SparseSetPrecision);
+   private HyperLogLogPlus hll = new 
HyperLogLogPlus(DistinctCountHyperLogLogAggregateFunction.NormalSetPrecision, 
DistinctCountHyperLogLogAggregateFunction.SparseSetPrecision);
+   protected final ImmutableBytesWritable valueByteArray = new 
ImmutableBytesWritable(ByteUtil.EMPTY_BYTE_ARRAY);
+public DistinctCountHyperLogLogAggregateParseNode(String name, 
List children, BuiltInFunctionInfo info) {
+public FunctionExpression create(List children, 
StatementContext context) throws SQLException {

 {color:red}-1 core tests{color}.  The patch failed these unit tests:
 
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.QueryIT

Test results: 
https://builds.apache.org/job/PreCommit-PHOENIX-Build/1314//testReport/
Javadoc warnings: 
https://builds.apache.org/job/PreCommit-PHOENIX-Build/1314//artifact/patchprocess/patchJavadocWarnings.txt
Console output: 
https://builds.apache.org/job/PreCommit-PHOENIX-Build/1314//console

This message is automatically generated.

> Support approximate COUNT DISTINCT
> --
>
> Key: PHOENIX-418
> URL: https://issues.apache.org/jira/browse/PHOENIX-418
> Project: Phoenix
>  Issue Type: Task
>Reporter: James Taylor
>Assignee: Ethan Wang
>  Labels: gsoc2016
> Attachments: PHOENIX-418-v1.patch, PHOENIX-418-v2.patch, 
> PHOENIX-418-v3.patch, PHOENIX-418-v4.patch, PHOENIX-418-v5.patch, 
> PHOENIX-418-v6.patch
>
>
> Support an "approximation" of count distinct to prevent having to hold on to 
> all distinct values (since this will not scale well when the number of 
> distinct values is huge). The Apache Drill folks have had some interesting 
> discussions on this 
> [here](http://mail-archives.apache.org/mod_mbox/incubator-drill-dev/201306.mbox/%3CJIRA.12650169.1369931282407.88049.1370645900553%40arcas%3E).
>  They recommend using  [Welford's 
> method](http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance_Online_algorithm).
>  I'm open to having a config option that uses exact versus approximate. I 
> don't have experience implementing an approximate implementation, so I'm not 
> sure how much state is required to keep on the server and return to the 
> client (other than realizing it'd be much less than returning all distinct 
> values and their counts).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (PHOENIX-4135) CSV Bulk Load get slow in last phase of JOB (after 70% reduce completion)

2017-08-28 Thread Parakram (JIRA)
Parakram created PHOENIX-4135:
-

 Summary: CSV Bulk Load get slow in last phase of JOB (after 70% 
reduce completion)
 Key: PHOENIX-4135
 URL: https://issues.apache.org/jira/browse/PHOENIX-4135
 Project: Phoenix
  Issue Type: Wish
 Environment: When I run the CSV bulk load tool, it gets slow in the last phase, 
i.e. after 70% execution, for just 70 GB of data.
I guess at the end only a single reducer works on the entire data to sort it.

Is there any way to optimize it?
Reporter: Parakram
Priority: Critical
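
Not an answer from this ticket, but a common mitigation worth checking: as far as I know the bulk load MR job gets roughly one reducer per region of the target table, so a table with few regions funnels most of the final sort through a single reducer. Pre-splitting the target table (for example with SALT_BUCKETS) gives the job more reducers. A hedged sketch with illustrative names:

{code:java}
// Hedged sketch, not from this ticket: pre-split the target table so the bulk load
// job can use more reducers (reducer count generally follows the region count).
// The connection URL, table and column names are illustrative assumptions.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class PreSplitTargetTable {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost");
             Statement stmt = conn.createStatement()) {
            // SALT_BUCKETS pre-splits the table into 16 regions up front, so the
            // final sort/write phase is spread across 16 reducers instead of one.
            stmt.execute("CREATE TABLE IF NOT EXISTS BULK_TARGET (" +
                    "ID BIGINT NOT NULL PRIMARY KEY, " +
                    "PAYLOAD VARCHAR) SALT_BUCKETS = 16");
        }
    }
}
{code}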






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (PHOENIX-418) Support approximate COUNT DISTINCT

2017-08-28 Thread Ethan Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PHOENIX-418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Wang updated PHOENIX-418:
---
Attachment: PHOENIX-418-v6.patch

> Support approximate COUNT DISTINCT
> --
>
> Key: PHOENIX-418
> URL: https://issues.apache.org/jira/browse/PHOENIX-418
> Project: Phoenix
>  Issue Type: Task
>Reporter: James Taylor
>Assignee: Ethan Wang
>  Labels: gsoc2016
> Attachments: PHOENIX-418-v1.patch, PHOENIX-418-v2.patch, 
> PHOENIX-418-v3.patch, PHOENIX-418-v4.patch, PHOENIX-418-v5.patch, 
> PHOENIX-418-v6.patch
>
>
> Support an "approximation" of count distinct to prevent having to hold on to 
> all distinct values (since this will not scale well when the number of 
> distinct values is huge). The Apache Drill folks have had some interesting 
> discussions on this 
> [here](http://mail-archives.apache.org/mod_mbox/incubator-drill-dev/201306.mbox/%3CJIRA.12650169.1369931282407.88049.1370645900553%40arcas%3E).
>  They recommend using  [Welford's 
> method](http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance_Online_algorithm).
>  I'm open to having a config option that uses exact versus approximate. I 
> don't have experience implementing an approximate implementation, so I'm not 
> sure how much state is required to keep on the server and return to the 
> client (other than realizing it'd be much less than returning all distinct 
> values and their counts).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (PHOENIX-4134) Per the changes at Hbase 1.3.1, upgrade ConnectionQueryServiceImpl to use ClusterConnection rather than HConnection

2017-08-28 Thread Ethan Wang (JIRA)
Ethan Wang created PHOENIX-4134:
---

 Summary: Per the changes at Hbase 1.3.1, upgrade 
ConnectionQueryServiceImpl to use ClusterConnection rather than HConnection
 Key: PHOENIX-4134
 URL: https://issues.apache.org/jira/browse/PHOENIX-4134
 Project: Phoenix
  Issue Type: Bug
Reporter: Ethan Wang


Since HBase 1.2.4, HConnection has been deprecated in favor of ClusterConnection. 
This change needs to be applied in Phoenix as well.

[~alexaraujo]
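
For reference, a rough before/after sketch of the API shift being described; this is not the actual Phoenix change, and the class and variable names are illustrative.

{code:java}
// Rough before/after sketch of the HBase client API shift, not the actual Phoenix patch.
// Assumes HBase 1.2+ client jars on the classpath.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.ClusterConnection;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.HConnection;
import org.apache.hadoop.hbase.client.HConnectionManager;

public class ConnectionUpgradeSketch {

    // Before: the deprecated HConnection API.
    @SuppressWarnings("deprecation")
    static HConnection openOld(Configuration conf) throws Exception {
        return HConnectionManager.createConnection(conf);
    }

    // After: obtain a Connection from ConnectionFactory; code that needs the richer
    // (internal) ClusterConnection interface casts the default implementation.
    static ClusterConnection openNew(Configuration conf) throws Exception {
        Connection connection = ConnectionFactory.createConnection(conf);
        return (ClusterConnection) connection;
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = openNew(conf)) {
            System.out.println("connected: " + !conn.isClosed());
        }
    }
}
{code}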



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (PHOENIX-418) Support approximate COUNT DISTINCT

2017-08-28 Thread James Taylor (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16144178#comment-16144178
 ] 

James Taylor commented on PHOENIX-418:
--

[~aertoria], thanks for the updated patch. Please update 
phoenix-core/pom.xml to use version 2.9.5 of 
com.clearspring.analytics:stream instead of 2.7.0 if there's no reason not to 
go with the latest.

Any idea why the test runs are so flaky? I'm going to try running them locally 
too, but it seems like more than the usual suspects are failing.

> Support approximate COUNT DISTINCT
> --
>
> Key: PHOENIX-418
> URL: https://issues.apache.org/jira/browse/PHOENIX-418
> Project: Phoenix
>  Issue Type: Task
>Reporter: James Taylor
>Assignee: Ethan Wang
>  Labels: gsoc2016
> Attachments: PHOENIX-418-v1.patch, PHOENIX-418-v2.patch, 
> PHOENIX-418-v3.patch, PHOENIX-418-v4.patch, PHOENIX-418-v5.patch
>
>
> Support an "approximation" of count distinct to prevent having to hold on to 
> all distinct values (since this will not scale well when the number of 
> distinct values is huge). The Apache Drill folks have had some interesting 
> discussions on this 
> [here](http://mail-archives.apache.org/mod_mbox/incubator-drill-dev/201306.mbox/%3CJIRA.12650169.1369931282407.88049.1370645900553%40arcas%3E).
>  They recommend using  [Welford's 
> method](http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance_Online_algorithm).
>  I'm open to having a config option that uses exact versus approximate. I 
> don't have experience implementing an approximate implementation, so I'm not 
> sure how much state is required to keep on the server and return to the 
> client (other than realizing it'd be much less than returning all distinct 
> values and their counts).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (PHOENIX-4132) TestIndexWriter causes builds to hang sometimes

2017-08-28 Thread James Taylor (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-4132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16144158#comment-16144158
 ] 

James Taylor commented on PHOENIX-4132:
---

+1

> TestIndexWriter causes builds to hang sometimes
> ---
>
> Key: PHOENIX-4132
> URL: https://issues.apache.org/jira/browse/PHOENIX-4132
> Project: Phoenix
>  Issue Type: Bug
>Reporter: Samarth Jain
>Assignee: Samarth Jain
> Attachments: PHOENIX-4132_4.x-HBase-0.98.patch
>
>
> Below is the jstack of the threads:
> {code}
> "main" #1 prio=5 os_prio=31 tid=0x7fdd3f805000 nid=0x1c03 waiting on 
> condition [0x79bda000]
>java.lang.Thread.State: WAITING (parking)
>   at sun.misc.Unsafe.park(Native Method)
>   - parking to wait for  <0x0007a00bb8b0> (a 
> com.google.common.util.concurrent.AbstractFuture$Sync)
>   at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
>   at 
> com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:280)
>   at 
> com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
>   at 
> org.apache.phoenix.hbase.index.parallel.BaseTaskRunner.submit(BaseTaskRunner.java:66)
>   at 
> org.apache.phoenix.hbase.index.parallel.BaseTaskRunner.submitUninterruptible(BaseTaskRunner.java:99)
>   at 
> org.apache.phoenix.hbase.index.write.ParallelWriterIndexCommitter.write(ParallelWriterIndexCommitter.java:197)
>   at 
> org.apache.phoenix.hbase.index.write.IndexWriter.write(IndexWriter.java:189)
>   at 
> org.apache.phoenix.hbase.index.write.IndexWriter.write(IndexWriter.java:175)
>   at 
> org.apache.phoenix.hbase.index.write.TestIndexWriter.testFailureOnRunningUpdateAbortsPending(TestIndexWriter.java:212)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55)
>   at org.junit.rules.RunRules.evaluate(RunRules.java:20)
>   at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
>   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>   at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:272)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:236)
>   at 
> org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:386)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:323)
>   at 
> org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:143)
> {code}
> {code}
> "pool-8-thread-1" #25 prio=5 os_prio=31 tid=0x7fdd3ef1a000 nid=0x130b 
> waiting on condition [0x7bf44000]
>java.lang.Thread.State: WAITING (parking)
>   at sun.misc.Unsafe.park(Native Method)
>   - parking to wait for  <0x0007a00add50> (a 
> java.util.concurrent.CountDownLatch$Sync)
>   at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>   

[jira] [Updated] (PHOENIX-4133) [hive] ColumnInfo list should be reordered and filtered refer the hive tables

2017-08-28 Thread ZhuQQ (JIRA)

 [ 
https://issues.apache.org/jira/browse/PHOENIX-4133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ZhuQQ updated PHOENIX-4133:
---
Description: 
In some cases, we create Hive tables with a different column order that may not 
contain all of the columns in the Phoenix table, and we then found that 
`INSERT INTO test SELECT ...` does not work well.

For example:
{code:sql}
-- In Phoenix:
CREATE TABLE IF NOT EXISTS test (
 key1 VARCHAR NOT NULL,
 key2 INTEGER NOT NULL,
 key3 VARCHAR,
 pv BIGINT,
 uv BIGINT,
 CONSTRAINT PK PRIMARY KEY (key1, key2, key3)
);
{code}
{code:sql}
-- In Hive:
CREATE EXTERNAL TABLE test.test_part (
 key1 string,
 key2 int,
 pv bigint
)
STORED BY 'org.apache.phoenix.hive.PhoenixStorageHandler'
TBLPROPERTIES (
  "phoenix.table.name" = "test",
  "phoenix.zookeeper.quorum" = "localhost",
  "phoenix.zookeeper.znode.parent" = "/hbase",
  "phoenix.zookeeper.client.port" = "2181",
  "phoenix.rowkeys" = "key1,key2",
  "phoenix.column.mapping" = "key1:key1,key2:key2,pv:pv"
);
CREATE EXTERNAL TABLE test.test_uv (
 key1 string,
 key2 int,
 key3 string,
 app_version string,
 channel string,
 uv bigint
)
STORED BY 'org.apache.phoenix.hive.PhoenixStorageHandler'
TBLPROPERTIES (
  "phoenix.table.name" = "test",
  "phoenix.zookeeper.quorum" = "localhost",
  "phoenix.zookeeper.znode.parent" = "/hbase",
  "phoenix.zookeeper.client.port" = "2181",
  "phoenix.rowkeys" = "key1,key2,key3",
  "phoenix.column.mapping" = "key1:key1,key2:key2,key3:key3,uv:uv"
);
{code}

Then insert to {{test.test_part}}:
{code:sql}
INSERT INTO test.test_part SELECT 'some key', 20170828,80;
{code}
throws error: 
{code:java}
ERROR 203 (22005): Type mismatch. BIGINT cannot be coerced to VARCHAR
{code}
And insert to {{test.test_uv}}:
{code:sql}
INSERT INTO test.test_uv SELECT 'some key',20170828,'linux',11;
{code}
The job executed successfully, but pv was overridden to 11 and uv is still NULL.

PS: I haven't tested other versions, but from checking the latest source code, newer 
versions may have the same problem.


  was:
In some cases, we create Hive tables with a different column order that may not 
contain all of the columns in the Phoenix table, and we then found that 
`INSERT INTO test SELECT ...` does not work well.

For example:
{code:sql}
-- In Phoenix:
CREATE TABLE IF NOT EXISTS test (
 key1 VARCHAR NOT NULL,
 key2 INTEGER NOT NULL,
 key3 VARCHAR,
 pv BIGINT,
 uv BIGINT,
 CONSTRAINT PK PRIMARY KEY (key1, key2, key3)
);
{code}
{code:sql}
-- In Hive:
CREATE EXTERNAL TABLE test.test_part (
 key1 string,
 key2 int,
 pv bigint
)
STORED BY 'org.apache.phoenix.hive.PhoenixStorageHandler'
TBLPROPERTIES (
  "phoenix.table.name" = "test",
  "phoenix.zookeeper.quorum" = "localhost",
  "phoenix.zookeeper.znode.parent" = "/hbase",
  "phoenix.zookeeper.client.port" = "2181",
  "phoenix.rowkeys" = "key1,key2",
  "phoenix.column.mapping" = "key1:key1,key2:key2,pv:pv"
);
CREATE EXTERNAL TABLE test.test_uv (
 key1 string,
 key2 int,
 key3 string,
 app_version string,
 channel string,
 uv bigint
)
STORED BY 'org.apache.phoenix.hive.PhoenixStorageHandler'
TBLPROPERTIES (
  "phoenix.table.name" = "test",
  "phoenix.zookeeper.quorum" = "localhost",
  "phoenix.zookeeper.znode.parent" = "/hbase",
  "phoenix.zookeeper.client.port" = "2181",
  "phoenix.rowkeys" = "key1,key2,key3",
  "phoenix.column.mapping" = "key1:key1,key2:key2,key3:key3,uv:uv"
);
{code}

Then insert to {{test.test_part}}:
{code:sql}
INSERT INTO test.test_part SELECT 'some key', 20170828,80;
{code}
throws error: 
{code:java}
ERROR 203 (22005): Type mismatch. BIGINT cannot be coerced to VARCHAR
{code}
And insert to {{test.test_uv}}:
{code:sql}
INSERT INTO test.test_uv SELECT 'some key',20170828,'linux',11;
{code}
The job executed successfully, but pv was overridden to 11 and uv is still NULL.

PS: I haven't tested other versions, but from checking the latest source code, newer 
versions may have the same problem.



> [hive] ColumnInfo list should be reordered and filtered refer the hive tables
> -
>
> Key: PHOENIX-4133
> URL: https://issues.apache.org/jira/browse/PHOENIX-4133
> Project: Phoenix
>  Issue Type: Bug
>Affects Versions: 4.9.0
>Reporter: ZhuQQ
>
> In some cases, we create Hive tables with a different column order, and may

[jira] [Updated] (PHOENIX-4133) [hive] ColumnInfo list should be reordered and filtered refer the hive tables

2017-08-28 Thread ZhuQQ (JIRA)

 [ 
https://issues.apache.org/jira/browse/PHOENIX-4133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ZhuQQ updated PHOENIX-4133:
---
Description: 
In some cases, we create Hive tables with a different column order that may not 
contain all of the columns in the Phoenix table, and we then found that 
`INSERT INTO test SELECT ...` does not work well.

For example:
{code:sql}
-- In Phoenix:
CREATE TABLE IF NOT EXISTS test (
 key1 VARCHAR NOT NULL,
 key2 INTEGER NOT NULL,
 key3 VARCHAR,
 pv BIGINT,
 uv BIGINT,
 CONSTRAINT PK PRIMARY KEY (key1, key2, key3)
);
{code}
{code:sql}
-- In Hive:
CREATE EXTERNAL TABLE test.test_part (
 key1 string,
 key2 int,
 pv bigint
)
STORED BY 'org.apache.phoenix.hive.PhoenixStorageHandler'
TBLPROPERTIES (
  "phoenix.table.name" = "test",
  "phoenix.zookeeper.quorum" = "localhost",
  "phoenix.zookeeper.znode.parent" = "/hbase",
  "phoenix.zookeeper.client.port" = "2181",
  "phoenix.rowkeys" = "key1,key2",
  "phoenix.column.mapping" = "key1:key1,key2:key2,pv:pv"
);
CREATE EXTERNAL TABLE test.test_uv (
 key1 string,
 key2 int,
 key3 string,
 app_version string,
 channel string,
 uv bigint
)
STORED BY 'org.apache.phoenix.hive.PhoenixStorageHandler'
TBLPROPERTIES (
  "phoenix.table.name" = "test",
  "phoenix.zookeeper.quorum" = "localhost",
  "phoenix.zookeeper.znode.parent" = "/hbase",
  "phoenix.zookeeper.client.port" = "2181",
  "phoenix.rowkeys" = "key1,key2,key3",
  "phoenix.column.mapping" = "key1:key1,key2:key2,key3:key3,uv:uv"
);
{code}

Then insert to {{test.test_part}}:
{code:sql}
INSERT INTO test.test_part SELECT 'some key', 20170828,80;
{code}
throws error: 
{code:java}
ERROR 203 (22005): Type mismatch. BIGINT cannot be coerced to VARCHAR
{code}
And insert to {{test.test_uv}}:
{code:sql}
INSERT INTO test.test_uv SELECT 'some key',20170828,'linux',11;
{code}
The job executed successfully, but pv was overridden to 11 and uv is still NULL.

PS: I haven't tested other versions, but from checking the latest source code, newer 
versions may have the same problem.


  was:
In some cases, we create Hive tables with a different column order that may not 
contain all of the columns in the Phoenix table, and we then found that 
`INSERT INTO test SELECT ...` does not work well.

For example:
{code:sql}
-- In Phoenix:
CREATE TABLE IF NOT EXISTS test (
 key1 VARCHAR NOT NULL,
 key2 INTEGER NOT NULL,
 key3 VARCHAR,
 pv BIGINT,
 uv BIGINT,
 CONSTRAINT PK PRIMARY KEY (key1, key2, key3)
);
{code}
{code:sql}
-- In Hive:
CREATE EXTERNAL TABLE test.test_part (
 key1 string,
 key2 int,
 pv bigint
)
STORED BY 'org.apache.phoenix.hive.PhoenixStorageHandler'
TBLPROPERTIES (
  "phoenix.table.name" = "test",
  "phoenix.zookeeper.quorum" = "localhost",
  "phoenix.zookeeper.znode.parent" = "/hbase",
  "phoenix.zookeeper.client.port" = "2181",
  "phoenix.rowkeys" = "key1,key2",
  "phoenix.column.mapping" = "key1:key1,key2:key2,pv:pv"
);
CREATE EXTERNAL TABLE test.test_uv (
 key1 string,
 key2 int,
 key3 string,
 app_version string,
 channel string,
 uv bigint
)
STORED BY 'org.apache.phoenix.hive.PhoenixStorageHandler'
TBLPROPERTIES (
  "phoenix.table.name" = "test",
  "phoenix.zookeeper.quorum" = "localhost",
  "phoenix.zookeeper.znode.parent" = "/hbase",
  "phoenix.zookeeper.client.port" = "2181",
  "phoenix.rowkeys" = "key1,key2,key3",
  "phoenix.column.mapping" = "key1:key1,key2:key2,key3:key3,uv:uv"
);
{code}

Then insert to {{test.test_part}}:
{code:sql}
INSERT INTO test.test_part SELECT 'some key', 20170828,80;
{code}
throws error: 
{code:java}
ERROR 203 (22005): Type mismatch. BIGINT cannot be coerced to VARCHAR
{code}
And insert to {{test.test_uv}}:
{code:sql}
INSERT INTO test.test_uv SELECT 'some key', 20170828,'linux',11;
{code}
The job executed successfully, but pv was overridden to 11 and uv is still NULL.



> [hive] ColumnInfo list should be reordered and filtered refer the hive tables
> -
>
> Key: PHOENIX-4133
> URL: https://issues.apache.org/jira/browse/PHOENIX-4133
> Project: Phoenix
>  Issue Type: Bug
>Affects Versions: 4.9.0
>Reporter: ZhuQQ
>
> In some cases, we create Hive tables with a different column order that may not 
> contain all of the columns in the Phoenix table, and we then found that `INSERT INTO test SELECT 
> ...` does not work well.
> For exam

[jira] [Created] (PHOENIX-4133) [hive] ColumnInfo list should be reordered and filtered refer the hive tables

2017-08-28 Thread ZhuQQ (JIRA)
ZhuQQ created PHOENIX-4133:
--

 Summary: [hive] ColumnInfo list should be reordered and filtered 
refer the hive tables
 Key: PHOENIX-4133
 URL: https://issues.apache.org/jira/browse/PHOENIX-4133
 Project: Phoenix
  Issue Type: Bug
Affects Versions: 4.9.0
Reporter: ZhuQQ


In some cases, we create Hive tables with a different column order that may not 
contain all of the columns in the Phoenix table, and we then found that 
`INSERT INTO test SELECT ...` does not work well.

For example:
{code:sql}
-- In Phoenix:
CREATE TABLE IF NOT EXISTS test (
 key1 VARCHAR NOT NULL,
 key2 INTEGER NOT NULL,
 key3 VARCHAR,
 pv BIGINT,
 uv BIGINT,
 CONSTRAINT PK PRIMARY KEY (key1, key2, key3)
);
{code}
{code:sql}
-- In Hive:
CREATE EXTERNAL TABLE test.test_part (
 key1 string,
 key2 int,
 pv bigint
)
STORED BY 'org.apache.phoenix.hive.PhoenixStorageHandler'
TBLPROPERTIES (
  "phoenix.table.name" = "test",
  "phoenix.zookeeper.quorum" = "localhost",
  "phoenix.zookeeper.znode.parent" = "/hbase",
  "phoenix.zookeeper.client.port" = "2181",
  "phoenix.rowkeys" = "key1,key2",
  "phoenix.column.mapping" = "key1:key1,key2:key2,pv:pv"
);
CREATE EXTERNAL TABLE test.test_uv (
 key1 string,
 key2 int,
 key3 string,
 app_version string,
 channel string,
 uv bigint
)
STORED BY 'org.apache.phoenix.hive.PhoenixStorageHandler'
TBLPROPERTIES (
  "phoenix.table.name" = "test",
  "phoenix.zookeeper.quorum" = "localhost",
  "phoenix.zookeeper.znode.parent" = "/hbase",
  "phoenix.zookeeper.client.port" = "2181",
  "phoenix.rowkeys" = "key1,key2,key3",
  "phoenix.column.mapping" = "key1:key1,key2:key2,key3:key3,uv:uv"
);
{code}

Then insert to {{test.test_part}}:
{code:sql}
INSERT INTO test.test_part SELECT 'some key', 20170828,80;
{code}
throws error: 
{code:java}
ERROR 203 (22005): Type mismatch. BIGINT cannot be coerced to VARCHAR
{code}
And insert to {{test.test_uv}}:
{code:sql}
INSERT INTO test.test_uv SELECT 'some key', 20170828,'linux',11;
{code}
The job executed successfully, but pv was overridden to 11 and uv is still NULL.
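
To make the suggestion in the title concrete, here is a small, self-contained sketch of the reorder-and-filter idea: project the Phoenix column list down to the Hive table's columns, in the Hive table's order, before writing. ColumnSpec and reorderForHive are made-up stand-ins, not the real phoenix-hive classes.

{code:java}
// Illustrative sketch only: reorder/filter the Phoenix column list so it matches
// the Hive table definition. ColumnSpec and reorderForHive are made-up stand-ins,
// not the actual phoenix-hive storage handler classes.
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ColumnReorderSketch {

    static final class ColumnSpec {
        final String name;
        final String sqlType;
        ColumnSpec(String name, String sqlType) { this.name = name; this.sqlType = sqlType; }
        @Override public String toString() { return name + " " + sqlType; }
    }

    /** Keep only the Phoenix columns mapped in the Hive table, in Hive's column order. */
    static List<ColumnSpec> reorderForHive(List<ColumnSpec> phoenixColumns, List<String> hiveColumns) {
        Map<String, ColumnSpec> byName = new LinkedHashMap<>();
        for (ColumnSpec c : phoenixColumns) {
            byName.put(c.name.toUpperCase(), c);
        }
        List<ColumnSpec> reordered = new ArrayList<>();
        for (String hiveCol : hiveColumns) {
            ColumnSpec match = byName.get(hiveCol.toUpperCase());
            if (match != null) {          // skip Hive columns with no Phoenix mapping
                reordered.add(match);
            }
        }
        return reordered;
    }

    public static void main(String[] args) {
        List<ColumnSpec> phoenix = Arrays.asList(
                new ColumnSpec("KEY1", "VARCHAR"), new ColumnSpec("KEY2", "INTEGER"),
                new ColumnSpec("KEY3", "VARCHAR"), new ColumnSpec("PV", "BIGINT"),
                new ColumnSpec("UV", "BIGINT"));
        // test.test_part from the report only maps key1, key2 and pv.
        List<String> hive = Arrays.asList("key1", "key2", "pv");
        // Prints [KEY1 VARCHAR, KEY2 INTEGER, PV BIGINT]: pv lines up with pv
        // instead of being written into the wrong slot by position.
        System.out.println(reorderForHive(phoenix, hive));
    }
}
{code}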




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (PHOENIX-418) Support approximate COUNT DISTINCT

2017-08-28 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/PHOENIX-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16143482#comment-16143482
 ] 

Hadoop QA commented on PHOENIX-418:
---

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12883988/PHOENIX-418-v5.patch
  against master branch at commit 435441ea8ba336e1967b03cf84f1868c5ef14790.
  ATTACHMENT ID: 12883988

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:red}-1 javadoc{color}.  The javadoc tool appears to have generated 
56 warning messages.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 lineLengths{color}.  The patch introduces the following lines 
longer than 100:
+   String query = "SELECT APPROX_COUNT_DISTINCT(a.i1||a.i2||b.i2) 
FROM " + tableName + " a, " + tableName
+   final private void prepareTableWithValues(final Connection conn, final 
int nRows) throws Exception {
+   final PreparedStatement stmt = conn.prepareStatement("upsert 
into " + tableName + " VALUES (?, ?)");
+@BuiltInFunction(name=DistinctCountHyperLogLogAggregateFunction.NAME, 
nodeClass=DistinctCountHyperLogLogAggregateParseNode.class, args= {@Argument()} 
)
+public DistinctCountHyperLogLogAggregateFunction(List 
childExpressions, CountAggregateFunction delegate){
+   private HyperLogLogPlus hll = new 
HyperLogLogPlus(DistinctCountHyperLogLogAggregateFunction.NormalSetPrecision, 
DistinctCountHyperLogLogAggregateFunction.SparseSetPrecision);
+   private HyperLogLogPlus hll = new 
HyperLogLogPlus(DistinctCountHyperLogLogAggregateFunction.NormalSetPrecision, 
DistinctCountHyperLogLogAggregateFunction.SparseSetPrecision);
+   protected final ImmutableBytesWritable valueByteArray = new 
ImmutableBytesWritable(ByteUtil.EMPTY_BYTE_ARRAY);
+public DistinctCountHyperLogLogAggregateParseNode(String name, 
List children, BuiltInFunctionInfo info) {
+public FunctionExpression create(List children, 
StatementContext context) throws SQLException {

 {color:red}-1 core tests{color}.  The patch failed these unit tests:
 
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.salted.SaltedTableVarLengthRowKeyIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.DerivedTableIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.SequenceIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.CustomEntityDataIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.TransactionalViewIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.index.MutableIndexReplicationIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.CreateTableIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.OrderByIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.FirstValuesFunctionIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.StatementHintsIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.RowValueConstructorIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.hadoop.hbase.regionserver.wal.WALRecoveryRegionPostOpenIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.PowerFunctionEnd2EndIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.InListIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.IsNullIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.index.txn.RollbackIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.StatsCollectorIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.DynamicUpsertIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.QueryExecWithoutSCNIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.SubqueryUsingSortMergeJoinIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.RenewLeaseIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.index.ChildViewsUseParentViewIndexIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.TopNIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.EvaluationOfORIT
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.CsvBulkLoadToolIT