[jira] [Comment Edited] (PHOENIX-3999) Optimize inner joins as SKIP-SCAN-JOIN when possible
[ https://issues.apache.org/jira/browse/PHOENIX-3999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16144780#comment-16144780 ]

Maryann Xue edited comment on PHOENIX-3999 at 8/29/17 5:53 AM:
---
The reason hash cache is used in this query is that "b.BATCH_SEQUENCE_NUM" appears in the SELECT clause, so we have to do the actual join operation for the referenced fields (in this case, only one field) from the RHS. In some sense the join is still driven by the client side through the skip-scan filter; we just cannot omit the join operation as we would for semi-joins. So my point is, for this query, the best has already been done.

was (Author: maryannxue):
The reason hash cache is used in this query is that "b.BATCH_SEQUENCE_NUM" appears in the SELECT clause, so we have to do the actual join operation for the referenced fields (in this case, only one field) from the RHS. In some sense the join is still driven by the client side through the skip-scan filter; we just cannot omit the join operation as we would for semi-joins.

> Optimize inner joins as SKIP-SCAN-JOIN when possible
> ----------------------------------------------------
>
> Key: PHOENIX-3999
> URL: https://issues.apache.org/jira/browse/PHOENIX-3999
> Project: Phoenix
> Issue Type: Bug
> Reporter: James Taylor
>
> Semi joins on the leading part of the primary key end up doing batches of point queries (as opposed to a broadcast hash join); however, inner joins do not.
> Here's a set of example schemas that executes a skip scan on the inner query:
> {code}
> CREATE TABLE COMPLETED_BATCHES (
>     BATCH_SEQUENCE_NUM BIGINT NOT NULL,
>     BATCH_ID BIGINT NOT NULL,
>     CONSTRAINT PK PRIMARY KEY
>     (
>         BATCH_SEQUENCE_NUM,
>         BATCH_ID
>     )
> );
> CREATE TABLE ITEMS (
>     BATCH_ID BIGINT NOT NULL,
>     ITEM_ID BIGINT NOT NULL,
>     ITEM_TYPE BIGINT,
>     ITEM_VALUE VARCHAR,
>     CONSTRAINT PK PRIMARY KEY
>     (
>         BATCH_ID,
>         ITEM_ID
>     )
> );
> CREATE TABLE COMPLETED_ITEMS (
>     ITEM_TYPE BIGINT NOT NULL,
>     BATCH_SEQUENCE_NUM BIGINT NOT NULL,
>     ITEM_ID BIGINT NOT NULL,
>     ITEM_VALUE VARCHAR,
>     CONSTRAINT PK PRIMARY KEY
>     (
>         ITEM_TYPE,
>         BATCH_SEQUENCE_NUM,
>         ITEM_ID
>     )
> );
> {code}
> The explain plan of these indicates that a dynamic filter will be performed, like this:
> {code}
> UPSERT SELECT
> CLIENT PARALLEL 1-WAY FULL SCAN OVER ITEMS
>     SKIP-SCAN-JOIN TABLE 0
>         CLIENT PARALLEL 1-WAY RANGE SCAN OVER COMPLETED_BATCHES [1] - [2]
>             SERVER FILTER BY FIRST KEY ONLY
>             SERVER AGGREGATE INTO DISTINCT ROWS BY [BATCH_ID]
>         CLIENT MERGE SORT
>     DYNAMIC SERVER FILTER BY I.BATCH_ID IN ($8.$9)
> {code}
> We should also be able to leverage this optimization when an inner join is used, such as this:
> {code}
> UPSERT INTO COMPLETED_ITEMS (ITEM_TYPE, BATCH_SEQUENCE_NUM, ITEM_ID, ITEM_VALUE)
>     SELECT i.ITEM_TYPE, b.BATCH_SEQUENCE_NUM, i.ITEM_ID, i.ITEM_VALUE
>     FROM ITEMS i, COMPLETED_BATCHES b
>     WHERE b.BATCH_ID = i.BATCH_ID AND
>         b.BATCH_SEQUENCE_NUM > 1000 AND b.BATCH_SEQUENCE_NUM < 2000;
> {code}
> A complete unit test looks like this:
> {code}
> @Test
> public void testNestedLoopJoin() throws Exception {
>     try (Connection conn = DriverManager.getConnection(getUrl())) {
>         String t1 = "COMPLETED_BATCHES";
>         String ddl1 = "CREATE TABLE " + t1 + " (\n" +
>             "    BATCH_SEQUENCE_NUM BIGINT NOT NULL,\n" +
>             "    BATCH_ID BIGINT NOT NULL,\n" +
>             "    CONSTRAINT PK PRIMARY KEY\n" +
>             "    (\n" +
>             "        BATCH_SEQUENCE_NUM,\n" +
>             "        BATCH_ID\n" +
>             "    )\n" +
>             ")";
>         conn.createStatement().execute(ddl1);
>
>         String t2 = "ITEMS";
>         String ddl2 = "CREATE TABLE " + t2 + " (\n" +
>             "    BATCH_ID BIGINT NOT NULL,\n" +
>             "    ITEM_ID BIGINT NOT NULL,\n" +
>             "    ITEM_TYPE BIGINT,\n" +
>             "    ITEM_VALUE VARCHAR,\n" +
>             "    CONSTRAINT PK PRIMARY KEY\n" +
>             "    (\n" +
>             "        BATCH_ID,\n" +
>             "        ITEM_ID\n" +
>             "    )\n" +
>             ")";
>         conn.createStatement().execute(ddl2);
>
>         String t3 = "COMPLETED_ITEMS";
>         String ddl3 = "CREATE TABLE " + t3 + " (\n" +
>             "    ITEM_TYPE BIGINT NOT NULL,\n" +
>             "    BATCH_SEQUENCE_NUM BIGINT NOT NULL,\n" +
[jira] [Commented] (PHOENIX-3999) Optimize inner joins as SKIP-SCAN-JOIN when possible
[ https://issues.apache.org/jira/browse/PHOENIX-3999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16144780#comment-16144780 ]

Maryann Xue commented on PHOENIX-3999:
--
The reason hash cache is used in this query is that "b.BATCH_SEQUENCE_NUM" appears in the SELECT clause, so we have to do the actual join operation for the referenced fields (in this case, only one field) from the RHS. In some sense the join is still driven by the client side through the skip-scan filter; we just cannot omit the join operation as we would for semi-joins.

> Optimize inner joins as SKIP-SCAN-JOIN when possible
> ----------------------------------------------------
>
> Key: PHOENIX-3999
> URL: https://issues.apache.org/jira/browse/PHOENIX-3999
> Project: Phoenix
> Issue Type: Bug
> Reporter: James Taylor
>
> Semi joins on the leading part of the primary key end up doing batches of point queries (as opposed to a broadcast hash join); however, inner joins do not.
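The mechanism discussed in this issue — run the RHS scan first, collect its distinct join keys, then turn the LHS full scan into batches of point lookups on the leading PK column — can be simulated outside Phoenix. The sketch below is a minimal client-side illustration (a `TreeMap` stands in for a sorted HBase table; the class and method names are hypothetical, not Phoenix internals):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;

public class SkipScanJoinSketch {
    /**
     * Simulates a skip-scan join: instead of scanning every ITEMS row,
     * jump directly to the row ranges whose leading key (BATCH_ID)
     * appears in the distinct-key set produced by the RHS scan.
     * items maps BATCH_ID -> (ITEM_ID -> ITEM_VALUE).
     */
    static List<String> skipScanJoin(TreeMap<Long, TreeMap<Long, String>> items,
                                     SortedSet<Long> completedBatchIds) {
        List<String> joined = new ArrayList<>();
        for (long batchId : completedBatchIds) {   // point lookups, not a full scan
            TreeMap<Long, String> batchRows = items.get(batchId);
            if (batchRows == null) continue;
            for (Map.Entry<Long, String> e : batchRows.entrySet()) {
                joined.add(batchId + "/" + e.getKey() + "=" + e.getValue());
            }
        }
        return joined;
    }
}
```

The work done is proportional to the number of matching batches rather than the size of ITEMS, which is the payoff the issue asks for on inner joins.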
[jira] [Created] (PHOENIX-4138) Create a hard limit on number of indexes per table
Rahul Shrivastava created PHOENIX-4138:
--
Summary: Create a hard limit on number of indexes per table
Key: PHOENIX-4138
URL: https://issues.apache.org/jira/browse/PHOENIX-4138
Project: Phoenix
Issue Type: Bug
Reporter: Rahul Shrivastava

There should be a config parameter to impose a hard limit on the number of indexes per table. There is a SQLException for this case (https://github.com/apache/phoenix/blob/master/phoenix-core/src/main/java/org/apache/phoenix/exception/SQLExceptionCode.java#L260), but it gets triggered on the server side (https://github.com/apache/phoenix/blob/master/phoenix-core/src/main/java/org/apache/phoenix/coprocessor/MetaDataEndpointImpl.java#L1589). We need a client-side limit that can be configured via a Phoenix config parameter: if a user creates more than, say, 30 indexes on a table, further index creation on that specific table would be rejected.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
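A client-side guard of the kind proposed could look like the sketch below. Note that the property key `phoenix.index.maxPerTable`, the default of 30, and the `checkIndexLimit` method are all hypothetical illustrations, not an existing Phoenix config or API:

```java
import java.util.Properties;

public class IndexLimitCheck {
    static final String MAX_INDEXES_PROP = "phoenix.index.maxPerTable"; // hypothetical config key
    static final int DEFAULT_MAX_INDEXES = 30;

    /**
     * Fails fast on the client, before any server RPC is made, if the
     * table already carries the configured maximum number of indexes.
     */
    static void checkIndexLimit(Properties config, String tableName, int existingIndexCount) {
        int max = Integer.parseInt(
            config.getProperty(MAX_INDEXES_PROP, String.valueOf(DEFAULT_MAX_INDEXES)));
        if (existingIndexCount >= max) {
            throw new IllegalStateException(
                "Cannot create index on " + tableName + ": limit of " + max + " indexes reached");
        }
    }
}
```

In a real implementation this check would run in the CREATE INDEX code path, with the existing index count taken from the client's cached table metadata.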
[jira] [Updated] (PHOENIX-418) Support approximate COUNT DISTINCT
[ https://issues.apache.org/jira/browse/PHOENIX-418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ethan Wang updated PHOENIX-418:
---
Description:
Support an "approximation" of count distinct to prevent having to hold on to all distinct values (since this will not scale well when the number of distinct values is huge). The Apache Drill folks have had some interesting discussions on this [here](http://mail-archives.apache.org/mod_mbox/incubator-drill-dev/201306.mbox/%3CJIRA.12650169.1369931282407.88049.1370645900553%40arcas%3E). They recommend using [Welford's method](http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance_Online_algorithm).

I'm open to having a config option that uses exact versus approximate. I don't have experience implementing an approximate implementation, so I'm not sure how much state is required to keep on the server and return to the client (other than realizing it'd be much less than returning all distinct values and their counts).

Update:
Syntax of using approximate count distinct:
select APPROX_COUNT_DISTINCT(name) from person
select APPROX_COUNT_DISTINCT(address||name) from person
It is equivalent to Select COUNT(DISTINCT ID) from person. Implemented using HyperLogLog; see the discussion below.
Source code patch link below, co-authored with [~swapna]:
https://git-wip-us.apache.org/repos/asf?p=phoenix.git;a=commitdiff;h=d6381afc3af976ccdbb874d4458ea17b1e8a1d32

was:
Support an "approximation" of count distinct to prevent having to hold on to all distinct values (since this will not scale well when the number of distinct values is huge). The Apache Drill folks have had some interesting discussions on this [here](http://mail-archives.apache.org/mod_mbox/incubator-drill-dev/201306.mbox/%3CJIRA.12650169.1369931282407.88049.1370645900553%40arcas%3E). They recommend using [Welford's method](http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance_Online_algorithm).

I'm open to having a config option that uses exact versus approximate. I don't have experience implementing an approximate implementation, so I'm not sure how much state is required to keep on the server and return to the client (other than realizing it'd be much less than returning all distinct values and their counts).

Update:
Syntax of using approximate count distinct:
select APPROX_COUNT_DISTINCT(name) from person
select APPROX_COUNT_DISTINCT(address||name) from person
It is equivalent to Select COUNT(DISTINCT ID) from person. Implemented using HyperLogLog; see the discussion below.
Merged patch link below, co-authored with [~swapna]:
https://git-wip-us.apache.org/repos/asf?p=phoenix.git;a=commitdiff;h=d6381afc3af976ccdbb874d4458ea17b1e8a1d32

> Support approximate COUNT DISTINCT
> ----------------------------------
>
> Key: PHOENIX-418
> URL: https://issues.apache.org/jira/browse/PHOENIX-418
> Project: Phoenix
> Issue Type: Task
> Reporter: James Taylor
> Assignee: Ethan Wang
> Labels: gsoc2016
> Attachments: PHOENIX-418-v1.patch, PHOENIX-418-v2.patch, PHOENIX-418-v3.patch, PHOENIX-418-v4.patch, PHOENIX-418-v5.patch, PHOENIX-418-v6.patch
>
> Support an "approximation" of count distinct to prevent having to hold on to all distinct values (since this will not scale well when the number of distinct values is huge). The Apache Drill folks have had some interesting discussions on this [here](http://mail-archives.apache.org/mod_mbox/incubator-drill-dev/201306.mbox/%3CJIRA.12650169.1369931282407.88049.1370645900553%40arcas%3E). They recommend using [Welford's method](http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance_Online_algorithm).
> I'm open to having a config option that uses exact versus approximate. I don't have experience implementing an approximate implementation, so I'm not sure how much state is required to keep on the server and return to the client (other than realizing it'd be much less than returning all distinct values and their counts).
> Update:
> Syntax of using approximate count distinct:
> select APPROX_COUNT_DISTINCT(name) from person
> select APPROX_COUNT_DISTINCT(address||name) from person
> It is equivalent to Select COUNT(DISTINCT ID) from person. Implemented using HyperLogLog; see the discussion below.
> Source code patch link below, co-authored with [~swapna]:
> https://git-wip-us.apache.org/repos/asf?p=phoenix.git;a=commitdiff;h=d6381afc3af976ccdbb874d4458ea17b1e8a1d32
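The HyperLogLog idea behind APPROX_COUNT_DISTINCT — hash each value, use the leading bits to pick a register, keep only the maximum "rank" (position of the first 1-bit in the rest of the hash) per register, and combine the registers into a cardinality estimate — can be illustrated with a self-contained sketch. This is the textbook algorithm, not the implementation Phoenix ships; the splitmix64-style finalizer is used here only as a stand-in hash:

```java
public class SimpleHyperLogLog {
    private final int p;          // precision: 2^p registers
    private final byte[] regs;

    SimpleHyperLogLog(int p) { this.p = p; this.regs = new byte[1 << p]; }

    /** 64-bit mixing finalizer (splitmix64 style), used as a stand-in hash. */
    static long hash64(long z) {
        z = (z ^ (z >>> 33)) * 0xff51afd7ed558ccdL;
        z = (z ^ (z >>> 33)) * 0xc4ceb9fe1a85ec53L;
        return z ^ (z >>> 33);
    }

    void add(long value) {
        long h = hash64(value);
        int idx = (int) (h >>> (64 - p));                 // leading p bits pick a register
        int rank = Long.numberOfLeadingZeros(h << p) + 1; // first 1-bit in the remaining bits
        if (rank > regs[idx]) regs[idx] = (byte) rank;
    }

    double estimate() {
        int m = regs.length;
        double sum = 0;
        int zeros = 0;
        for (byte r : regs) {
            sum += Math.pow(2.0, -r);
            if (r == 0) zeros++;
        }
        double alpha = 0.7213 / (1 + 1.079 / m);          // bias-correction constant
        double e = alpha * m * m / sum;
        if (e <= 2.5 * m && zeros > 0) {                  // small-range (linear counting) correction
            e = m * Math.log((double) m / zeros);
        }
        return e;
    }
}
```

With 2^14 registers (16 KB of state) the standard error is roughly 0.8%, which is the trade the description asks for: a tiny, fixed amount of server-side state instead of holding every distinct value.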
[jira] [Updated] (PHOENIX-153) Implement TABLESAMPLE clause
[ https://issues.apache.org/jira/browse/PHOENIX-153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ethan Wang updated PHOENIX-153:
---
Description:
Support the standard SQL TABLESAMPLE clause by implementing a filter that uses a skip next hint based on the region boundaries of the table to only return n rows per region.

[Update]
Source Code Patch:
https://git-wip-us.apache.org/repos/asf?p=phoenix.git;a=commitdiff;h=5e33dc12bc088bd0008d89f0a5cd7d5c368efa25

was:
Support the standard SQL TABLESAMPLE clause by implementing a filter that uses a skip next hint based on the region boundaries of the table to only return n rows per region.

[Update]
Patch:
https://git-wip-us.apache.org/repos/asf?p=phoenix.git;a=commitdiff;h=5e33dc12bc088bd0008d89f0a5cd7d5c368efa25

> Implement TABLESAMPLE clause
> ----------------------------
>
> Key: PHOENIX-153
> URL: https://issues.apache.org/jira/browse/PHOENIX-153
> Project: Phoenix
> Issue Type: Task
> Reporter: James Taylor
> Assignee: Ethan Wang
> Labels: enhancement
> Fix For: 4.12.0
>
> Attachments: Sampling_Accuracy_Performance.jpg
>
> Support the standard SQL TABLESAMPLE clause by implementing a filter that uses a skip next hint based on the region boundaries of the table to only return n rows per region.
> [Update]
> Source Code Patch:
> https://git-wip-us.apache.org/repos/asf?p=phoenix.git;a=commitdiff;h=5e33dc12bc088bd0008d89f0a5cd7d5c368efa25
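The per-region sampling described above (emit at most n rows per region, then skip ahead to the next region boundary) can be simulated client-side. In the sketch below, row keys and region start keys are plain longs for illustration; a real HBase filter would instead return a SEEK_NEXT_USING_HINT once the per-region quota is filled:

```java
import java.util.ArrayList;
import java.util.List;

public class RegionSampleSketch {
    /**
     * Returns at most n keys per region from a sorted key list.
     * regionStarts holds the sorted first key of each region; crossing
     * a boundary resets the per-region quota, mimicking a filter that
     * seeks to the next region once n rows have been emitted.
     */
    static List<Long> sampleFirstNPerRegion(List<Long> sortedKeys, List<Long> regionStarts, int n) {
        List<Long> out = new ArrayList<>();
        int region = 0, taken = 0;
        for (long key : sortedKeys) {
            while (region + 1 < regionStarts.size() && key >= regionStarts.get(region + 1)) {
                region++;      // crossed into the next region: reset the quota
                taken = 0;
            }
            if (taken < n) { out.add(key); taken++; }
        }
        return out;
    }
}
```

Because every region contributes equally regardless of size, the sample is spread across the whole key range rather than clustered at its start.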
[jira] [Commented] (PHOENIX-3815) Only disable indexes on which write failures occurred
[ https://issues.apache.org/jira/browse/PHOENIX-3815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16144741#comment-16144741 ]

James Taylor commented on PHOENIX-3815:
---
I think we should just get rid of ParallelWriterIndexCommitter completely. It'll get rid of a bunch of code duplication, and we need to know on which index the write failed anyway. Do you think we'll take a perf hit, and if so, is there anything we can do to improve TrackingParallelWriterIndexCommitter?

> Only disable indexes on which write failures occurred
> -----------------------------------------------------
>
> Key: PHOENIX-3815
> URL: https://issues.apache.org/jira/browse/PHOENIX-3815
> Project: Phoenix
> Issue Type: Bug
> Reporter: James Taylor
> Assignee: Vincent Poon
> Fix For: 4.12.0
>
> Attachments: PHOENIX-3815.v1.patch
>
> We currently disable all indexes if any of them fail to be written to. We really should only disable the ones on which the write failed.
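The point of a tracking committer — write to all indexes in parallel, but remember exactly which ones failed so only those get disabled — can be sketched as follows. The writer interface and names here are illustrative stand-ins, not the real IndexCommitter API:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Predicate;

public class TrackingCommitterSketch {
    /**
     * Attempts a write to each index in parallel and returns the names of
     * the indexes whose write failed. Successful indexes are untouched,
     * so the caller can disable only the failed ones.
     */
    static Set<String> commitAndTrackFailures(List<String> indexNames,
                                              Predicate<String> write) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, indexNames.size()));
        try {
            Map<String, Future<Boolean>> results = new LinkedHashMap<>();
            for (String index : indexNames) {
                results.put(index, pool.submit(() -> write.test(index)));
            }
            Set<String> failed = new TreeSet<>();
            for (Map.Entry<String, Future<Boolean>> e : results.entrySet()) {
                if (!e.getValue().get()) failed.add(e.getKey()); // record only failed indexes
            }
            return failed;
        } finally {
            pool.shutdown();
        }
    }
}
```

The per-index bookkeeping is the only extra cost over a fire-and-forget parallel writer, which is why the perf-hit question in the comment is mostly about the failure-tracking path.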
[jira] [Commented] (PHOENIX-2460) Implement scrutiny command to validate whether or not an index is in sync with the data table
[ https://issues.apache.org/jira/browse/PHOENIX-2460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16144711#comment-16144711 ]

Hudson commented on PHOENIX-2460:
--
FAILURE: Integrated in Jenkins build Phoenix-master #1754 (See [https://builds.apache.org/job/Phoenix-master/1754/])
PHOENIX-2460 Implement scrutiny command to validate whether or not an (jtaylor: rev fc659488361c91b569f15a26dcbab5cbb24c276b)
* (edit) phoenix-core/src/main/java/org/apache/phoenix/util/QueryUtil.java
* (add) phoenix-core/src/main/java/org/apache/phoenix/mapreduce/index/IndexScrutinyTool.java
* (add) phoenix-core/src/test/java/org/apache/phoenix/mapreduce/util/IndexColumnNamesTest.java
* (add) phoenix-core/src/main/java/org/apache/phoenix/mapreduce/index/SourceTargetColumnNames.java
* (add) phoenix-core/src/it/java/org/apache/phoenix/end2end/IndexScrutinyToolIT.java
* (add) phoenix-core/src/main/java/org/apache/phoenix/mapreduce/index/PhoenixScrutinyJobCounters.java
* (edit) phoenix-core/src/main/java/org/apache/phoenix/util/SchemaUtil.java
* (edit) phoenix-core/src/main/java/org/apache/phoenix/mapreduce/util/PhoenixConfigurationUtil.java
* (add) phoenix-core/src/main/java/org/apache/phoenix/mapreduce/index/IndexScrutinyTableOutput.java
* (edit) phoenix-core/src/test/java/org/apache/phoenix/util/QueryUtilTest.java
* (add) phoenix-core/src/main/java/org/apache/phoenix/mapreduce/index/IndexScrutinyMapper.java
* (add) phoenix-core/src/test/java/org/apache/phoenix/mapreduce/index/IndexScrutinyTableOutputTest.java
* (add) phoenix-core/src/test/java/org/apache/phoenix/mapreduce/index/BaseIndexTest.java
* (edit) phoenix-core/src/main/java/org/apache/phoenix/mapreduce/index/PhoenixIndexDBWritable.java
* (add) phoenix-core/src/main/java/org/apache/phoenix/mapreduce/util/IndexColumnNames.java

> Implement scrutiny command to validate whether or not an index is in sync with the data table
> ---------------------------------------------------------------------------------------------
>
> Key: PHOENIX-2460
> URL: https://issues.apache.org/jira/browse/PHOENIX-2460
> Project: Phoenix
> Issue Type: Bug
> Reporter: James Taylor
> Assignee: Vincent Poon
> Fix For: 4.12.0
>
> Attachments: PHOENIX-2460.patch
>
> We should have a process that runs to verify that an index is valid against a data table and potentially fixes it if discrepancies are found. This could either be an MR job or a low-priority background task.
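At its core, a scrutiny pass joins each data-table row to its expected index row and flags missing, stale, and orphaned entries. A toy version over in-memory maps (keyed by primary key, valued by the indexed columns; the names are illustrative, not the IndexScrutinyTool API) might look like:

```java
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

public class ScrutinySketch {
    /** Returns the keys whose index row is missing, stale, or orphaned. */
    static SortedSet<String> findInvalidRows(Map<String, String> dataRows,
                                             Map<String, String> indexRows) {
        SortedSet<String> invalid = new TreeSet<>();
        for (Map.Entry<String, String> e : dataRows.entrySet()) {
            String indexed = indexRows.get(e.getKey());
            if (!e.getValue().equals(indexed)) invalid.add(e.getKey()); // missing or stale
        }
        for (String key : indexRows.keySet()) {
            if (!dataRows.containsKey(key)) invalid.add(key);           // orphaned index row
        }
        return invalid;
    }
}
```

The real tool does the same comparison as a MapReduce job so it scales to full tables, and writes the discrepancies to an output table for later repair.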
[jira] [Commented] (PHOENIX-418) Support approximate COUNT DISTINCT
[ https://issues.apache.org/jira/browse/PHOENIX-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16144645#comment-16144645 ]

Hudson commented on PHOENIX-418:
--
FAILURE: Integrated in Jenkins build Phoenix-master #1753 (See [https://builds.apache.org/job/Phoenix-master/1753/])
PHOENIX-418 Support approximate COUNT DISTINCT (Ethan Wang) (jtaylor: rev d6381afc3af976ccdbb874d4458ea17b1e8a1d32)
* (add) phoenix-core/src/main/java/org/apache/phoenix/expression/function/DistinctCountHyperLogLogAggregateFunction.java
* (add) phoenix-core/src/it/java/org/apache/phoenix/end2end/CountDistinctApproximateHyperLogLogIT.java
* (edit) dev/release_files/NOTICE
* (edit) phoenix-core/pom.xml
* (edit) phoenix-core/src/main/java/org/apache/phoenix/expression/ExpressionType.java
* (add) phoenix-core/src/main/java/org/apache/phoenix/parse/DistinctCountHyperLogLogAggregateParseNode.java

> Support approximate COUNT DISTINCT
> ----------------------------------
>
> Key: PHOENIX-418
> URL: https://issues.apache.org/jira/browse/PHOENIX-418
> Project: Phoenix
> Issue Type: Task
> Reporter: James Taylor
> Assignee: Ethan Wang
> Labels: gsoc2016
> Attachments: PHOENIX-418-v1.patch, PHOENIX-418-v2.patch, PHOENIX-418-v3.patch, PHOENIX-418-v4.patch, PHOENIX-418-v5.patch, PHOENIX-418-v6.patch
[jira] [Commented] (PHOENIX-3815) Only disable indexes on which write failures occurred
[ https://issues.apache.org/jira/browse/PHOENIX-3815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16144631#comment-16144631 ]

Vincent Poon commented on PHOENIX-3815:
---
[~jamestaylor] Can you review? Note that this requires TrackingParallelWriterIndexCommitter to be set via config, unless we change it to be the default.

> Only disable indexes on which write failures occurred
> -----------------------------------------------------
>
> Key: PHOENIX-3815
> URL: https://issues.apache.org/jira/browse/PHOENIX-3815
> Project: Phoenix
> Issue Type: Bug
> Reporter: James Taylor
> Assignee: Vincent Poon
> Fix For: 4.12.0
>
> Attachments: PHOENIX-3815.v1.patch
>
> We currently disable all indexes if any of them fail to be written to. We really should only disable the ones on which the write failed.
[jira] [Updated] (PHOENIX-3815) Only disable indexes on which write failures occurred
[ https://issues.apache.org/jira/browse/PHOENIX-3815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vincent Poon updated PHOENIX-3815:
--
Attachment: PHOENIX-3815.v1.patch

> Only disable indexes on which write failures occurred
> -----------------------------------------------------
>
> Key: PHOENIX-3815
> URL: https://issues.apache.org/jira/browse/PHOENIX-3815
> Project: Phoenix
> Issue Type: Bug
> Reporter: James Taylor
> Assignee: Vincent Poon
> Fix For: 4.12.0
>
> Attachments: PHOENIX-3815.v1.patch
>
> We currently disable all indexes if any of them fail to be written to. We really should only disable the ones on which the write failed.
[jira] [Assigned] (PHOENIX-3815) Only disable indexes on which write failures occurred
[ https://issues.apache.org/jira/browse/PHOENIX-3815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vincent Poon reassigned PHOENIX-3815:
--
Assignee: Vincent Poon (was: James Taylor)

> Only disable indexes on which write failures occurred
> -----------------------------------------------------
>
> Key: PHOENIX-3815
> URL: https://issues.apache.org/jira/browse/PHOENIX-3815
> Project: Phoenix
> Issue Type: Bug
> Reporter: James Taylor
> Assignee: Vincent Poon
> Fix For: 4.12.0
>
> We currently disable all indexes if any of them fail to be written to. We really should only disable the ones on which the write failed.
[jira] [Commented] (PHOENIX-4080) The error message for version mismatch is not accurate.
[ https://issues.apache.org/jira/browse/PHOENIX-4080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16144550#comment-16144550 ]

Ethan Wang commented on PHOENIX-4080:
--
[~gjacoby] Yes, I have re-run the ITs locally and they all pass. I also reviewed the IT errors in the Apache Jenkins build; they don't look related to me. Thanks for following up.

> The error message for version mismatch is not accurate.
> -------------------------------------------------------
>
> Key: PHOENIX-4080
> URL: https://issues.apache.org/jira/browse/PHOENIX-4080
> Project: Phoenix
> Issue Type: Wish
> Affects Versions: 4.11.0
> Reporter: Ethan Wang
> Assignee: Ethan Wang
> Attachments: PHOENIX-4080.patch, PHOENIX-4080-v2.patch
>
> When accessing a 4.10 cluster with a 4.11 client, the error reads:
> The following servers require an updated phoenix.jar to be put in the classpath of HBase: region=SYSTEM.CATALOG
> It should say phoenix-[version]-server.jar rather than phoenix.jar.
[jira] [Updated] (PHOENIX-418) Support approximate COUNT DISTINCT
[ https://issues.apache.org/jira/browse/PHOENIX-418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ethan Wang updated PHOENIX-418:
---
Description:
Support an "approximation" of count distinct to prevent having to hold on to all distinct values (since this will not scale well when the number of distinct values is huge). The Apache Drill folks have had some interesting discussions on this [here](http://mail-archives.apache.org/mod_mbox/incubator-drill-dev/201306.mbox/%3CJIRA.12650169.1369931282407.88049.1370645900553%40arcas%3E). They recommend using [Welford's method](http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance_Online_algorithm).

I'm open to having a config option that uses exact versus approximate. I don't have experience implementing an approximate implementation, so I'm not sure how much state is required to keep on the server and return to the client (other than realizing it'd be much less than returning all distinct values and their counts).

Update:
Syntax of using approximate count distinct:
select APPROX_COUNT_DISTINCT(name) from person
select APPROX_COUNT_DISTINCT(address||name) from person
It is equivalent to Select COUNT(DISTINCT ID) from person. Implemented using HyperLogLog; see the discussion below.
Merged patch link below, co-authored with [~swapna]:
https://git-wip-us.apache.org/repos/asf?p=phoenix.git;a=commitdiff;h=d6381afc3af976ccdbb874d4458ea17b1e8a1d32

was:
Support an "approximation" of count distinct to prevent having to hold on to all distinct values (since this will not scale well when the number of distinct values is huge). The Apache Drill folks have had some interesting discussions on this [here](http://mail-archives.apache.org/mod_mbox/incubator-drill-dev/201306.mbox/%3CJIRA.12650169.1369931282407.88049.1370645900553%40arcas%3E). They recommend using [Welford's method](http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance_Online_algorithm).

I'm open to having a config option that uses exact versus approximate. I don't have experience implementing an approximate implementation, so I'm not sure how much state is required to keep on the server and return to the client (other than realizing it'd be much less than returning all distinct values and their counts).

Update:
Syntax of using approximate count distinct:
select APPROX_COUNT_DISTINCT(name) from person
select APPROX_COUNT_DISTINCT(address||name) from person
It is equivalent to Select COUNT(DISTINCT ID) from person, but with a much smaller memory footprint. Implemented using HyperLogLog.
Merged patch link below, co-authored with [~swapna]:
https://git-wip-us.apache.org/repos/asf?p=phoenix.git;a=commitdiff;h=d6381afc3af976ccdbb874d4458ea17b1e8a1d32

> Support approximate COUNT DISTINCT
> ----------------------------------
>
> Key: PHOENIX-418
> URL: https://issues.apache.org/jira/browse/PHOENIX-418
> Project: Phoenix
> Issue Type: Task
> Reporter: James Taylor
> Assignee: Ethan Wang
> Labels: gsoc2016
> Attachments: PHOENIX-418-v1.patch, PHOENIX-418-v2.patch, PHOENIX-418-v3.patch, PHOENIX-418-v4.patch, PHOENIX-418-v5.patch, PHOENIX-418-v6.patch
[jira] [Updated] (PHOENIX-418) Support approximate COUNT DISTINCT
[ https://issues.apache.org/jira/browse/PHOENIX-418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Wang updated PHOENIX-418: --- Description: Support an "approximation" of count distinct to prevent having to hold on to all distinct values (since this will not scale well when the number of distinct values is huge). The Apache Drill folks have had some interesting discussions on this [here](http://mail-archives.apache.org/mod_mbox/incubator-drill-dev/201306.mbox/%3CJIRA.12650169.1369931282407.88049.1370645900553%40arcas%3E). They recommend using [Welford's method](http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance_Online_algorithm). I'm open to having a config option that uses exact versus approximate. I don't have experience implementing an approximate implementation, so I'm not sure how much state is required to keep on the server and return to the client (other than realizing it'd be much less that returning all distinct values and their counts). Update: Syntax of using approximate count distinct as: select APPROX_COUNT_DISTINCT(name) from person select APPROX_COUNT_DISTINCT(address||name) from person It is equivalent of Select COUNT(DISTINCT ID) from person. But with much smaller memory foot print. Implemented using hyperloglog. Merged patch link below, co-authorred with [~swapna] https://git-wip-us.apache.org/repos/asf?p=phoenix.git;a=commitdiff;h=d6381afc3af976ccdbb874d4458ea17b1e8a1d32 was: Support an "approximation" of count distinct to prevent having to hold on to all distinct values (since this will not scale well when the number of distinct values is huge). The Apache Drill folks have had some interesting discussions on this [here](http://mail-archives.apache.org/mod_mbox/incubator-drill-dev/201306.mbox/%3CJIRA.12650169.1369931282407.88049.1370645900553%40arcas%3E). They recommend using [Welford's method](http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance_Online_algorithm). 
I'm open to having a config option that uses exact versus approximate. I don't have experience implementing an approximate implementation, so I'm not sure how much state is required to keep on the server and return to the client (other than realizing it'd be much less that returning all distinct values and their counts). Updated: Syntax of using approximate count distinct as: Select COUNT(DISTINCT ID) from person Select APPROX_COUNT_DISTINCT(ID) from person https://git-wip-us.apache.org/repos/asf?p=phoenix.git;a=commitdiff;h=d6381afc3af976ccdbb874d4458ea17b1e8a1d32 > Support approximate COUNT DISTINCT > -- > > Key: PHOENIX-418 > URL: https://issues.apache.org/jira/browse/PHOENIX-418 > Project: Phoenix > Issue Type: Task >Reporter: James Taylor >Assignee: Ethan Wang > Labels: gsoc2016 > Attachments: PHOENIX-418-v1.patch, PHOENIX-418-v2.patch, > PHOENIX-418-v3.patch, PHOENIX-418-v4.patch, PHOENIX-418-v5.patch, > PHOENIX-418-v6.patch > > > Support an "approximation" of count distinct to prevent having to hold on to > all distinct values (since this will not scale well when the number of > distinct values is huge). The Apache Drill folks have had some interesting > discussions on this > [here](http://mail-archives.apache.org/mod_mbox/incubator-drill-dev/201306.mbox/%3CJIRA.12650169.1369931282407.88049.1370645900553%40arcas%3E). > They recommend using [Welford's > method](http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance_Online_algorithm). > I'm open to having a config option that uses exact versus approximate. I > don't have experience implementing an approximate implementation, so I'm not > sure how much state is required to keep on the server and return to the > client (other than realizing it'd be much less that returning all distinct > values and their counts). 
> Update: > Syntax of using approximate count distinct as: > select APPROX_COUNT_DISTINCT(name) from person > select APPROX_COUNT_DISTINCT(address||name) from person > It is equivalent to Select COUNT(DISTINCT ID) from person, but with a much > smaller memory footprint. Implemented using HyperLogLog. > Merged patch link below, co-authored with [~swapna] > https://git-wip-us.apache.org/repos/asf?p=phoenix.git;a=commitdiff;h=d6381afc3af976ccdbb874d4458ea17b1e8a1d32 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
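Welford's method, recommended in the Drill discussion above, keeps a running mean and variance in one pass with O(1) state per aggregate. A minimal standalone sketch of the technique (illustrative only; the class name and shape are assumptions, not Phoenix or Drill code):

```java
// One-pass (online) mean/variance per Welford's method. Only three values of
// state are kept, and it is numerically stabler than accumulating the raw
// sum and sum of squares.
public class Welford {
    private long n;
    private double mean;
    private double m2; // running sum of squared distances from the current mean

    public void add(double x) {
        n++;
        double delta = x - mean;
        mean += delta / n;
        m2 += delta * (x - mean); // second factor uses the *updated* mean
    }

    public double mean() { return mean; }

    public double populationVariance() { return n > 0 ? m2 / n : 0.0; }

    public static void main(String[] args) {
        Welford w = new Welford();
        for (double x : new double[] {2, 4, 4, 4, 5, 5, 7, 9}) w.add(x);
        System.out.println(w.mean() + " " + w.populationVariance());
    }
}
```

The point relevant to this issue is the footprint: the server would hold and return a constant-size aggregate rather than every distinct value.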
[jira] [Updated] (PHOENIX-418) Support approximate COUNT DISTINCT
[ https://issues.apache.org/jira/browse/PHOENIX-418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Wang updated PHOENIX-418: --- Description: Support an "approximation" of count distinct to prevent having to hold on to all distinct values (since this will not scale well when the number of distinct values is huge). The Apache Drill folks have had some interesting discussions on this [here](http://mail-archives.apache.org/mod_mbox/incubator-drill-dev/201306.mbox/%3CJIRA.12650169.1369931282407.88049.1370645900553%40arcas%3E). They recommend using [Welford's method](http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance_Online_algorithm). I'm open to having a config option that uses exact versus approximate. I don't have experience implementing an approximate implementation, so I'm not sure how much state is required to keep on the server and return to the client (other than realizing it'd be much less that returning all distinct values and their counts). Updated: Syntax of using approximate count distinct as: Select COUNT(DISTINCT ID) from person Select APPROX_COUNT_DISTINCT(ID) from person https://git-wip-us.apache.org/repos/asf?p=phoenix.git;a=commitdiff;h=d6381afc3af976ccdbb874d4458ea17b1e8a1d32 was: Support an "approximation" of count distinct to prevent having to hold on to all distinct values (since this will not scale well when the number of distinct values is huge). The Apache Drill folks have had some interesting discussions on this [here](http://mail-archives.apache.org/mod_mbox/incubator-drill-dev/201306.mbox/%3CJIRA.12650169.1369931282407.88049.1370645900553%40arcas%3E). They recommend using [Welford's method](http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance_Online_algorithm). I'm open to having a config option that uses exact versus approximate. 
I don't have experience implementing an approximate implementation, so I'm not sure how much state is required to keep on the server and return to the client (other than realizing it'd be much less that returning all distinct values and their counts). Updated: Syntax of using approximate count distinct as: https://git-wip-us.apache.org/repos/asf?p=phoenix.git;a=commitdiff;h=d6381afc3af976ccdbb874d4458ea17b1e8a1d32 > Support approximate COUNT DISTINCT > -- > > Key: PHOENIX-418 > URL: https://issues.apache.org/jira/browse/PHOENIX-418 > Project: Phoenix > Issue Type: Task >Reporter: James Taylor >Assignee: Ethan Wang > Labels: gsoc2016 > Attachments: PHOENIX-418-v1.patch, PHOENIX-418-v2.patch, > PHOENIX-418-v3.patch, PHOENIX-418-v4.patch, PHOENIX-418-v5.patch, > PHOENIX-418-v6.patch > > > Support an "approximation" of count distinct to prevent having to hold on to > all distinct values (since this will not scale well when the number of > distinct values is huge). The Apache Drill folks have had some interesting > discussions on this > [here](http://mail-archives.apache.org/mod_mbox/incubator-drill-dev/201306.mbox/%3CJIRA.12650169.1369931282407.88049.1370645900553%40arcas%3E). > They recommend using [Welford's > method](http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance_Online_algorithm). > I'm open to having a config option that uses exact versus approximate. I > don't have experience implementing an approximate implementation, so I'm not > sure how much state is required to keep on the server and return to the > client (other than realizing it'd be much less that returning all distinct > values and their counts). > Updated: > Syntax of using approximate count distinct as: > Select COUNT(DISTINCT ID) from person > Select APPROX_COUNT_DISTINCT(ID) from person > https://git-wip-us.apache.org/repos/asf?p=phoenix.git;a=commitdiff;h=d6381afc3af976ccdbb874d4458ea17b1e8a1d32 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (PHOENIX-418) Support approximate COUNT DISTINCT
[ https://issues.apache.org/jira/browse/PHOENIX-418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Wang updated PHOENIX-418: --- Description: Support an "approximation" of count distinct to prevent having to hold on to all distinct values (since this will not scale well when the number of distinct values is huge). The Apache Drill folks have had some interesting discussions on this [here](http://mail-archives.apache.org/mod_mbox/incubator-drill-dev/201306.mbox/%3CJIRA.12650169.1369931282407.88049.1370645900553%40arcas%3E). They recommend using [Welford's method](http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance_Online_algorithm). I'm open to having a config option that uses exact versus approximate. I don't have experience implementing an approximate implementation, so I'm not sure how much state is required to keep on the server and return to the client (other than realizing it'd be much less that returning all distinct values and their counts). Updated: Syntax of using approximate count distinct as: https://git-wip-us.apache.org/repos/asf?p=phoenix.git;a=commitdiff;h=d6381afc3af976ccdbb874d4458ea17b1e8a1d32 was:Support an "approximation" of count distinct to prevent having to hold on to all distinct values (since this will not scale well when the number of distinct values is huge). The Apache Drill folks have had some interesting discussions on this [here](http://mail-archives.apache.org/mod_mbox/incubator-drill-dev/201306.mbox/%3CJIRA.12650169.1369931282407.88049.1370645900553%40arcas%3E). They recommend using [Welford's method](http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance_Online_algorithm). I'm open to having a config option that uses exact versus approximate. 
I don't have experience implementing an approximate implementation, so I'm not sure how much state is required to keep on the server and return to the client (other than realizing it'd be much less that returning all distinct values and their counts). > Support approximate COUNT DISTINCT > -- > > Key: PHOENIX-418 > URL: https://issues.apache.org/jira/browse/PHOENIX-418 > Project: Phoenix > Issue Type: Task >Reporter: James Taylor >Assignee: Ethan Wang > Labels: gsoc2016 > Attachments: PHOENIX-418-v1.patch, PHOENIX-418-v2.patch, > PHOENIX-418-v3.patch, PHOENIX-418-v4.patch, PHOENIX-418-v5.patch, > PHOENIX-418-v6.patch > > > Support an "approximation" of count distinct to prevent having to hold on to > all distinct values (since this will not scale well when the number of > distinct values is huge). The Apache Drill folks have had some interesting > discussions on this > [here](http://mail-archives.apache.org/mod_mbox/incubator-drill-dev/201306.mbox/%3CJIRA.12650169.1369931282407.88049.1370645900553%40arcas%3E). > They recommend using [Welford's > method](http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance_Online_algorithm). > I'm open to having a config option that uses exact versus approximate. I > don't have experience implementing an approximate implementation, so I'm not > sure how much state is required to keep on the server and return to the > client (other than realizing it'd be much less that returning all distinct > values and their counts). > Updated: > Syntax of using approximate count distinct as: > https://git-wip-us.apache.org/repos/asf?p=phoenix.git;a=commitdiff;h=d6381afc3af976ccdbb874d4458ea17b1e8a1d32 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
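For context on the committed approach: Phoenix's implementation uses the clearspring stream-lib HyperLogLogPlus class, but the core register/rank idea behind HyperLogLog can be sketched in a few lines. This standalone toy (class name, hash choice, and precision are assumptions for illustration, not the Phoenix code) shows why the state is a small fixed array rather than the full distinct set:

```java
// Toy HyperLogLog: hash each value, use the top p bits to pick one of
// m = 2^p registers, and record the longest run of leading zero bits seen
// in the remaining bits. The harmonic mean of the registers estimates the
// distinct count with ~1.04/sqrt(m) relative error.
public class HllSketch {
    private final int p;           // precision: m = 2^p registers
    private final int[] registers;

    public HllSketch(int p) { this.p = p; this.registers = new int[1 << p]; }

    // splitmix64 finalizer as a cheap, well-mixing 64-bit hash (assumed
    // adequate for a demo; a production sketch would use a stronger hash).
    static long hash(long x) {
        x += 0x9E3779B97F4A7C15L;
        x = (x ^ (x >>> 30)) * 0xBF58476D1CE4E5B9L;
        x = (x ^ (x >>> 27)) * 0x94D049BB133111EBL;
        return x ^ (x >>> 31);
    }

    public void add(long value) {
        long h = hash(value);
        int idx = (int) (h >>> (64 - p));               // top p bits pick a register
        int rank = Long.numberOfLeadingZeros(h << p) + 1; // position of first 1-bit after that
        if (rank > registers[idx]) registers[idx] = rank;
    }

    public double estimate() {
        int m = registers.length;
        double sum = 0;
        for (int r : registers) sum += Math.pow(2, -r);
        double alpha = 0.7213 / (1 + 1.079 / m);        // bias correction for large m
        return alpha * m * m / sum;                     // (small-range correction omitted)
    }

    public static void main(String[] args) {
        HllSketch hll = new HllSketch(10);              // 1024 registers, ~3% std error
        for (long v = 0; v < 10_000; v++) hll.add(v);
        System.out.println(Math.round(hll.estimate())); // close to 10000
    }
}
```

With 1024 registers the whole state is about 1 KB regardless of how many distinct values are seen, which is the trade APPROX_COUNT_DISTINCT makes against an exact COUNT(DISTINCT ...).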
[jira] [Commented] (PHOENIX-3999) Optimize inner joins as SKIP-SCAN-JOIN when possible
[ https://issues.apache.org/jira/browse/PHOENIX-3999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16144542#comment-16144542 ] James Taylor commented on PHOENIX-3999: --- bq. only during inner join: ITEMS table as parent will receive a HashCacheImpl from RHS in order to look up the b.BATCH_SEQUENCE_NUM It's good that a skip scan is used during the scan of ITEMS. However, rather than scan the RHS and broadcast it, it'd be more efficient to take the same approach as with the semi join and drive the join on the client side as the LHS is being scanned. WDYT, [~maryannxue]? > Optimize inner joins as SKIP-SCAN-JOIN when possible > > > Key: PHOENIX-3999 > URL: https://issues.apache.org/jira/browse/PHOENIX-3999 > Project: Phoenix > Issue Type: Bug >Reporter: James Taylor > > Semi joins on the leading part of the primary key end up doing batches of > point queries (as opposed to a broadcast hash join), however inner joins do > not. > Here's a set of example schemas that executes a skip scan on the inner query: > {code} > CREATE TABLE COMPLETED_BATCHES ( > BATCH_SEQUENCE_NUM BIGINT NOT NULL, > BATCH_ID BIGINT NOT NULL, > CONSTRAINT PK PRIMARY KEY > ( > BATCH_SEQUENCE_NUM, > BATCH_ID > ) > ); > CREATE TABLE ITEMS ( >BATCH_ID BIGINT NOT NULL, >ITEM_ID BIGINT NOT NULL, >ITEM_TYPE BIGINT, >ITEM_VALUE VARCHAR, >CONSTRAINT PK PRIMARY KEY >( > BATCH_ID, > ITEM_ID >) > ); > CREATE TABLE COMPLETED_ITEMS ( >ITEM_TYPE BIGINT NOT NULL, >BATCH_SEQUENCE_NUM BIGINT NOT NULL, >ITEM_IDBIGINT NOT NULL, >ITEM_VALUE VARCHAR, >CONSTRAINT PK PRIMARY KEY >( > ITEM_TYPE, > BATCH_SEQUENCE_NUM, > ITEM_ID >) > ); > {code} > The explain plan of these indicate that a dynamic filter will be performed > like this: > {code} > UPSERT SELECT > CLIENT PARALLEL 1-WAY FULL SCAN OVER ITEMS > SKIP-SCAN-JOIN TABLE 0 > CLIENT PARALLEL 1-WAY RANGE SCAN OVER COMPLETED_BATCHES [1] - [2] > SERVER FILTER BY FIRST KEY ONLY > SERVER AGGREGATE INTO DISTINCT ROWS BY [BATCH_ID] > CLIENT MERGE SORT > 
DYNAMIC SERVER FILTER BY I.BATCH_ID IN ($8.$9) > {code} > We should also be able to leverage this optimization when an inner join is > used such as this: > {code} > UPSERT INTO COMPLETED_ITEMS (ITEM_TYPE, BATCH_SEQUENCE_NUM, ITEM_ID, > ITEM_VALUE) >SELECT i.ITEM_TYPE, b.BATCH_SEQUENCE_NUM, i.ITEM_ID, i.ITEM_VALUE >FROM ITEMS i, COMPLETED_BATCHES b >WHERE b.BATCH_ID = i.BATCH_ID AND >b.BATCH_SEQUENCE_NUM > 1000 AND b.BATCH_SEQUENCE_NUM < 2000; > {code} > A complete unit test looks like this: > {code} > @Test > public void testNestedLoopJoin() throws Exception { > try (Connection conn = DriverManager.getConnection(getUrl())) { > String t1="COMPLETED_BATCHES"; > String ddl1 = "CREATE TABLE " + t1 + " (\n" + > "BATCH_SEQUENCE_NUM BIGINT NOT NULL,\n" + > "BATCH_ID BIGINT NOT NULL,\n" + > "CONSTRAINT PK PRIMARY KEY\n" + > "(\n" + > "BATCH_SEQUENCE_NUM,\n" + > "BATCH_ID\n" + > ")\n" + > ")" + > ""; > conn.createStatement().execute(ddl1); > > String t2="ITEMS"; > String ddl2 = "CREATE TABLE " + t2 + " (\n" + > " BATCH_ID BIGINT NOT NULL,\n" + > " ITEM_ID BIGINT NOT NULL,\n" + > " ITEM_TYPE BIGINT,\n" + > " ITEM_VALUE VARCHAR,\n" + > " CONSTRAINT PK PRIMARY KEY\n" + > " (\n" + > "BATCH_ID,\n" + > "ITEM_ID\n" + > " )\n" + > ")"; > conn.createStatement().execute(ddl2); > String t3="COMPLETED_ITEMS"; > String ddl3 = "CREATE TABLE " + t3 + "(\n" + > " ITEM_TYPE BIGINT NOT NULL,\n" + > " BATCH_SEQUENCE_NUM BIGINT NOT NULL,\n" + > " ITEM_IDBIGINT NOT NULL,\n" + > " ITEM_VALUE VARCHAR,\n" + > " CONSTRAINT PK PRIMARY KEY\n" + > " (\n" + > " ITEM_TYPE,\n" + > " BATCH_SEQUENCE_NUM, \n" + > " ITEM_ID\n" + > " )\n" + > ")"; > conn.createStatement().execute(ddl3); >
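Conceptually, the client-driven approach discussed in the comments amounts to first collecting the distinct BATCH_IDs from COMPLETED_BATCHES and then issuing a skip scan over ITEMS keyed on those values, instead of broadcasting the RHS as a hash cache. A hypothetical helper showing the IN-list rewrite a client would drive (illustrative only; Phoenix performs this internally via the DYNAMIC SERVER FILTER shown in the explain plan, and the method/class names here are not real Phoenix APIs):

```java
import java.util.List;
import java.util.stream.Collectors;

public class SkipScanRewrite {
    // Builds the ITEMS-side query a client-driven join would issue once the
    // qualifying BATCH_IDs have been fetched from COMPLETED_BATCHES. The IN
    // clause over the leading PK column is what enables the skip scan.
    static String itemsQueryFor(List<Long> batchIds) {
        String inList = batchIds.stream()
                                .map(String::valueOf)
                                .collect(Collectors.joining(", "));
        return "SELECT ITEM_TYPE, ITEM_ID, ITEM_VALUE FROM ITEMS"
                + " WHERE BATCH_ID IN (" + inList + ")";
    }

    public static void main(String[] args) {
        System.out.println(itemsQueryFor(List.of(101L, 102L, 105L)));
    }
}
```

For a semi join the rewrite above is the whole story; the point of this issue's discussion is that for the inner join, referenced RHS columns such as b.BATCH_SEQUENCE_NUM still have to be joined back in, so the join operation itself cannot be omitted.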
[jira] [Created] (PHOENIX-4137) Document IndexScrutinyTool
James Taylor created PHOENIX-4137: - Summary: Document IndexScrutinyTool Key: PHOENIX-4137 URL: https://issues.apache.org/jira/browse/PHOENIX-4137 Project: Phoenix Issue Type: Task Reporter: James Taylor Assignee: Vincent Poon Now that PHOENIX-2460 has been committed, we need to update our website documentation to describe how to use it. For an overview of updating the website, see http://phoenix.apache.org/building_website.html. For IndexScrutinyTool, it's probably enough to add a section in https://phoenix.apache.org/secondary_indexing.html (which lives in ./site/source/src/site/markdown/secondary_indexing.md) describing the purpose and possible arguments to the MR job. Something similar to the table for our bulk loader here: https://phoenix.apache.org/bulk_dataload.html#Loading_via_MapReduce. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (PHOENIX-4137) Document IndexScrutinyTool
[ https://issues.apache.org/jira/browse/PHOENIX-4137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] James Taylor updated PHOENIX-4137: -- Fix Version/s: 4.12.0 > Document IndexScrutinyTool > -- > > Key: PHOENIX-4137 > URL: https://issues.apache.org/jira/browse/PHOENIX-4137 > Project: Phoenix > Issue Type: Task >Reporter: James Taylor >Assignee: Vincent Poon > Fix For: 4.12.0 > > > Now that PHOENIX-2460 has been committed, we need to update our website > documentation to describe how to use it. For an overview of updating the > website, see http://phoenix.apache.org/building_website.html. For > IndexScrutinyTool, it's probably enough to add a section in > https://phoenix.apache.org/secondary_indexing.html (which lives in > ./site/source/src/site/markdown/secondary_indexing.md) describing the purpose > and possible arguments to the MR job. Something similar to the table for our > bulk loader here: > https://phoenix.apache.org/bulk_dataload.html#Loading_via_MapReduce. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (PHOENIX-4136) Document APPROXIMATE_COUNT_DISTINCT function
[ https://issues.apache.org/jira/browse/PHOENIX-4136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] James Taylor updated PHOENIX-4136: -- Description: Now that PHOENIX-418 has been committed, we need to document this new function by including APPROXIMATE_COUNT_DISTINCT in our list of functions (which lives in phoenix.csv) so that it shows up here: https://phoenix.apache.org/language/functions.html (was: Now that is fixed, we need to document this new function by including APPROXIMATE_COUNT_DISTINCT in our list of functions (which lives in phoenix.csv) so that it shows up here: https://phoenix.apache.org/language/functions.html) > Document APPROXIMATE_COUNT_DISTINCT function > > > Key: PHOENIX-4136 > URL: https://issues.apache.org/jira/browse/PHOENIX-4136 > Project: Phoenix > Issue Type: Task >Reporter: James Taylor >Assignee: Ethan Wang > Fix For: 4.12.0 > > > Now that PHOENIX-418 has been committed, we need to document this new > function by including APPROXIMATE_COUNT_DISTINCT in our list of functions > (which lives in phoenix.csv) so that it shows up here: > https://phoenix.apache.org/language/functions.html -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (PHOENIX-4002) Document FETCH NEXT| n ROWS from Cursor
[ https://issues.apache.org/jira/browse/PHOENIX-4002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16144494#comment-16144494 ] James Taylor commented on PHOENIX-4002: --- Ping [~gsbiju]. > Document FETCH NEXT| n ROWS from Cursor > --- > > Key: PHOENIX-4002 > URL: https://issues.apache.org/jira/browse/PHOENIX-4002 > Project: Phoenix > Issue Type: Sub-task >Reporter: James Taylor >Assignee: Biju Nair > > Now that PHOENIX-3572 is resolved and released, we need to add documentation > for this new functionality on our website. For directions on how to do that, > see http://phoenix.apache.org/building_website.html. I'd recommend adding a > new top level page linked off of our Features menu that explains from a users > perspective how to use it, and also updating our reference grammar here > (which is derived from content in phoenix.csv): > http://phoenix.apache.org/language/index.html -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (PHOENIX-4136) Document APPROXIMATE_COUNT_DISTINCT function
James Taylor created PHOENIX-4136: - Summary: Document APPROXIMATE_COUNT_DISTINCT function Key: PHOENIX-4136 URL: https://issues.apache.org/jira/browse/PHOENIX-4136 Project: Phoenix Issue Type: Task Reporter: James Taylor Assignee: Ethan Wang Fix For: 4.12.0 Now that is fixed, we need to document this new function by including APPROXIMATE_COUNT_DISTINCT in our list of functions (which lives in phoenix.csv) so that it shows up here: https://phoenix.apache.org/language/functions.html -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Resolved] (PHOENIX-2460) Implement scrutiny command to validate whether or not an index is in sync with the data table
[ https://issues.apache.org/jira/browse/PHOENIX-2460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] James Taylor resolved PHOENIX-2460. --- Resolution: Fixed Fix Version/s: 4.12.0 Thanks for the contribution, [~vincentpoon]. I've pushed it to 4.x and master branches. Nice work! > Implement scrutiny command to validate whether or not an index is in sync > with the data table > - > > Key: PHOENIX-2460 > URL: https://issues.apache.org/jira/browse/PHOENIX-2460 > Project: Phoenix > Issue Type: Bug >Reporter: James Taylor >Assignee: Vincent Poon > Fix For: 4.12.0 > > Attachments: PHOENIX-2460.patch > > > We should have a process that runs to verify that an index is valid against a > data table and potentially fixes it if discrepancies are found. This could > either be a MR job or a low priority background task. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (PHOENIX-418) Support approximate COUNT DISTINCT
[ https://issues.apache.org/jira/browse/PHOENIX-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16144482#comment-16144482 ] James Taylor commented on PHOENIX-418: -- Do you know what the differences are, [~mujtabachohan], between "ubuntu-eu2" versus "H4"? Different JVMs? > Support approximate COUNT DISTINCT > -- > > Key: PHOENIX-418 > URL: https://issues.apache.org/jira/browse/PHOENIX-418 > Project: Phoenix > Issue Type: Task >Reporter: James Taylor >Assignee: Ethan Wang > Labels: gsoc2016 > Attachments: PHOENIX-418-v1.patch, PHOENIX-418-v2.patch, > PHOENIX-418-v3.patch, PHOENIX-418-v4.patch, PHOENIX-418-v5.patch, > PHOENIX-418-v6.patch > > > Support an "approximation" of count distinct to prevent having to hold on to > all distinct values (since this will not scale well when the number of > distinct values is huge). The Apache Drill folks have had some interesting > discussions on this > [here](http://mail-archives.apache.org/mod_mbox/incubator-drill-dev/201306.mbox/%3CJIRA.12650169.1369931282407.88049.1370645900553%40arcas%3E). > They recommend using [Welford's > method](http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance_Online_algorithm). > I'm open to having a config option that uses exact versus approximate. I > don't have experience implementing an approximate implementation, so I'm not > sure how much state is required to keep on the server and return to the > client (other than realizing it'd be much less than returning all distinct > values and their counts). -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (PHOENIX-418) Support approximate COUNT DISTINCT
[ https://issues.apache.org/jira/browse/PHOENIX-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16144459#comment-16144459 ] James Taylor commented on PHOENIX-418: -- Thanks, [~aertoria]. I've pushed your last patch to 4.x and master branches. Nice work! > Support approximate COUNT DISTINCT > -- > > Key: PHOENIX-418 > URL: https://issues.apache.org/jira/browse/PHOENIX-418 > Project: Phoenix > Issue Type: Task >Reporter: James Taylor >Assignee: Ethan Wang > Labels: gsoc2016 > Attachments: PHOENIX-418-v1.patch, PHOENIX-418-v2.patch, > PHOENIX-418-v3.patch, PHOENIX-418-v4.patch, PHOENIX-418-v5.patch, > PHOENIX-418-v6.patch > > > Support an "approximation" of count distinct to prevent having to hold on to > all distinct values (since this will not scale well when the number of > distinct values is huge). The Apache Drill folks have had some interesting > discussions on this > [here](http://mail-archives.apache.org/mod_mbox/incubator-drill-dev/201306.mbox/%3CJIRA.12650169.1369931282407.88049.1370645900553%40arcas%3E). > They recommend using [Welford's > method](http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance_Online_algorithm). > I'm open to having a config option that uses exact versus approximate. I > don't have experience implementing an approximate implementation, so I'm not > sure how much state is required to keep on the server and return to the > client (other than realizing it'd be much less that returning all distinct > values and their counts). -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (PHOENIX-418) Support approximate COUNT DISTINCT
[ https://issues.apache.org/jira/browse/PHOENIX-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16144380#comment-16144380 ] Ethan Wang commented on PHOENIX-418: Changes have been made to the pom to use 2.9.5. The flaky IT tests went away with this patch (v6). [~jamestaylor] During the investigation, one thing observed is that both successful runs happened when the Jenkins node "ubuntu-eu2" picked it up (versus "H4" during the weekend). I'm not sure if this is related. [~samarthjain] > Support approximate COUNT DISTINCT > -- > > Key: PHOENIX-418 > URL: https://issues.apache.org/jira/browse/PHOENIX-418 > Project: Phoenix > Issue Type: Task >Reporter: James Taylor >Assignee: Ethan Wang > Labels: gsoc2016 > Attachments: PHOENIX-418-v1.patch, PHOENIX-418-v2.patch, > PHOENIX-418-v3.patch, PHOENIX-418-v4.patch, PHOENIX-418-v5.patch, > PHOENIX-418-v6.patch > > > Support an "approximation" of count distinct to prevent having to hold on to > all distinct values (since this will not scale well when the number of > distinct values is huge). The Apache Drill folks have had some interesting > discussions on this > [here](http://mail-archives.apache.org/mod_mbox/incubator-drill-dev/201306.mbox/%3CJIRA.12650169.1369931282407.88049.1370645900553%40arcas%3E). > They recommend using [Welford's > method](http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance_Online_algorithm). > I'm open to having a config option that uses exact versus approximate. I > don't have experience implementing an approximate implementation, so I'm not > sure how much state is required to keep on the server and return to the > client (other than realizing it'd be much less than returning all distinct > values and their counts). -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (PHOENIX-418) Support approximate COUNT DISTINCT
[ https://issues.apache.org/jira/browse/PHOENIX-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16144378#comment-16144378 ] Hadoop QA commented on PHOENIX-418: --- {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12884107/PHOENIX-418-v6.patch against master branch at commit 435441ea8ba336e1967b03cf84f1868c5ef14790. ATTACHMENT ID: 12884107 {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:red}-1 javadoc{color}. The javadoc tool appears to have generated 56 warning messages. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 lineLengths{color}. 
The patch introduces the following lines longer than 100: + String query = "SELECT APPROX_COUNT_DISTINCT(a.i1||a.i2||b.i2) FROM " + tableName + " a, " + tableName + final private void prepareTableWithValues(final Connection conn, final int nRows) throws Exception { + final PreparedStatement stmt = conn.prepareStatement("upsert into " + tableName + " VALUES (?, ?)"); +@BuiltInFunction(name=DistinctCountHyperLogLogAggregateFunction.NAME, nodeClass=DistinctCountHyperLogLogAggregateParseNode.class, args= {@Argument()} ) +public DistinctCountHyperLogLogAggregateFunction(List childExpressions, CountAggregateFunction delegate){ + private HyperLogLogPlus hll = new HyperLogLogPlus(DistinctCountHyperLogLogAggregateFunction.NormalSetPrecision, DistinctCountHyperLogLogAggregateFunction.SparseSetPrecision); + private HyperLogLogPlus hll = new HyperLogLogPlus(DistinctCountHyperLogLogAggregateFunction.NormalSetPrecision, DistinctCountHyperLogLogAggregateFunction.SparseSetPrecision); + protected final ImmutableBytesWritable valueByteArray = new ImmutableBytesWritable(ByteUtil.EMPTY_BYTE_ARRAY); +public DistinctCountHyperLogLogAggregateParseNode(String name, List children, BuiltInFunctionInfo info) { +public FunctionExpression create(List children, StatementContext context) throws SQLException { {color:red}-1 core tests{color}. The patch failed these unit tests: ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.QueryIT Test results: https://builds.apache.org/job/PreCommit-PHOENIX-Build/1314//testReport/ Javadoc warnings: https://builds.apache.org/job/PreCommit-PHOENIX-Build/1314//artifact/patchprocess/patchJavadocWarnings.txt Console output: https://builds.apache.org/job/PreCommit-PHOENIX-Build/1314//console This message is automatically generated. 
> Support approximate COUNT DISTINCT > -- > > Key: PHOENIX-418 > URL: https://issues.apache.org/jira/browse/PHOENIX-418 > Project: Phoenix > Issue Type: Task >Reporter: James Taylor >Assignee: Ethan Wang > Labels: gsoc2016 > Attachments: PHOENIX-418-v1.patch, PHOENIX-418-v2.patch, > PHOENIX-418-v3.patch, PHOENIX-418-v4.patch, PHOENIX-418-v5.patch, > PHOENIX-418-v6.patch > > > Support an "approximation" of count distinct to prevent having to hold on to > all distinct values (since this will not scale well when the number of > distinct values is huge). The Apache Drill folks have had some interesting > discussions on this > [here](http://mail-archives.apache.org/mod_mbox/incubator-drill-dev/201306.mbox/%3CJIRA.12650169.1369931282407.88049.1370645900553%40arcas%3E). > They recommend using [Welford's > method](http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance_Online_algorithm). > I'm open to having a config option that uses exact versus approximate. I > don't have experience implementing an approximate implementation, so I'm not > sure how much state is required to keep on the server and return to the > client (other than realizing it'd be much less that returning all distinct > values and their counts). -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (PHOENIX-4135) CSV Bulk Load get slow in last phase of JOB (after 70% reduce completion)
Parakram created PHOENIX-4135: - Summary: CSV Bulk Load get slow in last phase of JOB (after 70% reduce completion) Key: PHOENIX-4135 URL: https://issues.apache.org/jira/browse/PHOENIX-4135 Project: Phoenix Issue Type: Wish Environment: When I run the CSV bulk load tool, it gets slow in the last phase, i.e. after 70% execution, for just 70 GB of data. I guess at the end only a single reducer works on the entire data to sort it. Is there any way to optimize it? Reporter: Parakram Priority: Critical -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (PHOENIX-418) Support approximate COUNT DISTINCT
[ https://issues.apache.org/jira/browse/PHOENIX-418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Wang updated PHOENIX-418: --- Attachment: PHOENIX-418-v6.patch > Support approximate COUNT DISTINCT > -- > > Key: PHOENIX-418 > URL: https://issues.apache.org/jira/browse/PHOENIX-418 > Project: Phoenix > Issue Type: Task >Reporter: James Taylor >Assignee: Ethan Wang > Labels: gsoc2016 > Attachments: PHOENIX-418-v1.patch, PHOENIX-418-v2.patch, > PHOENIX-418-v3.patch, PHOENIX-418-v4.patch, PHOENIX-418-v5.patch, > PHOENIX-418-v6.patch > > > Support an "approximation" of count distinct to prevent having to hold on to > all distinct values (since this will not scale well when the number of > distinct values is huge). The Apache Drill folks have had some interesting > discussions on this > [here](http://mail-archives.apache.org/mod_mbox/incubator-drill-dev/201306.mbox/%3CJIRA.12650169.1369931282407.88049.1370645900553%40arcas%3E). > They recommend using [Welford's > method](http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance_Online_algorithm). > I'm open to having a config option that uses exact versus approximate. I > don't have experience implementing an approximate implementation, so I'm not > sure how much state is required to keep on the server and return to the > client (other than realizing it'd be much less that returning all distinct > values and their counts). -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (PHOENIX-4134) Per the changes at Hbase 1.3.1, upgrade ConnectionQueryServiceImpl to use ClusterConnection rather than HConnection
Ethan Wang created PHOENIX-4134: --- Summary: Per the changes at Hbase 1.3.1, upgrade ConnectionQueryServiceImpl to use ClusterConnection rather than HConnection Key: PHOENIX-4134 URL: https://issues.apache.org/jira/browse/PHOENIX-4134 Project: Phoenix Issue Type: Bug Reporter: Ethan Wang Since HBase 1.2.4, HConnection has been deprecated in favor of ClusterConnection. This change needs to be made in Phoenix as well. [~alexaraujo] -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (PHOENIX-418) Support approximate COUNT DISTINCT
[ https://issues.apache.org/jira/browse/PHOENIX-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16144178#comment-16144178 ] James Taylor commented on PHOENIX-418: -- [~aertoria]. Thanks for the updated patch. Please update the phoenix-core/pom.xml to use the 2.9.5 version of com.clearspring.analytics:stream instead of 2.7.0 if there's no reason not to go with the latest. Any idea why the test runs are so flaky? I'm going to try running them locally too, but it seems like more than the usual suspects failing. > Support approximate COUNT DISTINCT > -- > > Key: PHOENIX-418 > URL: https://issues.apache.org/jira/browse/PHOENIX-418 > Project: Phoenix > Issue Type: Task >Reporter: James Taylor >Assignee: Ethan Wang > Labels: gsoc2016 > Attachments: PHOENIX-418-v1.patch, PHOENIX-418-v2.patch, > PHOENIX-418-v3.patch, PHOENIX-418-v4.patch, PHOENIX-418-v5.patch > > > Support an "approximation" of count distinct to prevent having to hold on to > all distinct values (since this will not scale well when the number of > distinct values is huge). The Apache Drill folks have had some interesting > discussions on this > [here](http://mail-archives.apache.org/mod_mbox/incubator-drill-dev/201306.mbox/%3CJIRA.12650169.1369931282407.88049.1370645900553%40arcas%3E). > They recommend using [Welford's > method](http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance_Online_algorithm). > I'm open to having a config option that uses exact versus approximate. I > don't have experience implementing an approximate implementation, so I'm not > sure how much state is required to keep on the server and return to the > client (other than realizing it'd be much less that returning all distinct > values and their counts). -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (PHOENIX-4132) TestIndexWriter causes builds to hang sometimes
[ https://issues.apache.org/jira/browse/PHOENIX-4132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16144158#comment-16144158 ] James Taylor commented on PHOENIX-4132: --- +1 > TestIndexWriter causes builds to hang sometimes > --- > > Key: PHOENIX-4132 > URL: https://issues.apache.org/jira/browse/PHOENIX-4132 > Project: Phoenix > Issue Type: Bug >Reporter: Samarth Jain >Assignee: Samarth Jain > Attachments: PHOENIX-4132_4.x-HBase-0.98.patch > > > Below is the jstack of the threads: > {code} > "main" #1 prio=5 os_prio=31 tid=0x7fdd3f805000 nid=0x1c03 waiting on > condition [0x79bda000] >java.lang.Thread.State: WAITING (parking) > at sun.misc.Unsafe.park(Native Method) > - parking to wait for <0x0007a00bb8b0> (a > com.google.common.util.concurrent.AbstractFuture$Sync) > at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304) > at > com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:280) > at > com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116) > at > org.apache.phoenix.hbase.index.parallel.BaseTaskRunner.submit(BaseTaskRunner.java:66) > at > org.apache.phoenix.hbase.index.parallel.BaseTaskRunner.submitUninterruptible(BaseTaskRunner.java:99) > at > org.apache.phoenix.hbase.index.write.ParallelWriterIndexCommitter.write(ParallelWriterIndexCommitter.java:197) > at > org.apache.phoenix.hbase.index.write.IndexWriter.write(IndexWriter.java:189) > at > org.apache.phoenix.hbase.index.write.IndexWriter.write(IndexWriter.java:175) > at > 
org.apache.phoenix.hbase.index.write.TestIndexWriter.testFailureOnRunningUpdateAbortsPending(TestIndexWriter.java:212) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) > at > org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) > at > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47) > at > org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) > at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55) > at org.junit.rules.RunRules.evaluate(RunRules.java:20) > at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325) > at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78) > at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57) > at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290) > at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71) > at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288) > at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58) > at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268) > at org.junit.runners.ParentRunner.run(ParentRunner.java:363) > at > org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365) > at > org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:272) > at > org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:236) > at > org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159) > at > 
org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:386) > at > org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:323) > at > org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:143) > {code} > {code} > "pool-8-thread-1" #25 prio=5 os_prio=31 tid=0x7fdd3ef1a000 nid=0x130b > waiting on condition [0x7bf44000] >java.lang.Thread.State: WAITING (parking) > at sun.misc.Unsafe.park(Native Method) > - parking to wait for <0x0007a00add50> (a > java.util.concurrent.CountDownLatch$Sync) > at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) >
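The two jstack excerpts above share one shape: the main thread blocks in AbstractFuture.get() inside BaseTaskRunner.submit, while the pool thread whose result it is waiting for is itself parked on a CountDownLatch; if nothing ever counts that latch down, the build hangs. A hedged, minimal Python analogue of that wait-dependency (hypothetical names, not Phoenix code) looks like this:

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# The main thread blocks on future.result() (analogous to AbstractFuture.get),
# while the worker it waits for is parked on a latch (analogous to the
# CountDownLatch in the pool thread's stack). If no one releases the latch,
# both threads wait forever -- the hang the jstack shows.
latch = threading.Event()

def index_write():
    released = latch.wait(timeout=2)   # pool thread parks here
    return "wrote" if released else "timed out"

with ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(index_write)
    latch.set()                        # remove this line and result() stalls
    result = future.result(timeout=5)  # main thread blocks here
```

The fix for such hangs is usually to guarantee the latch is counted down on every path (including abort/failure paths) or to bound the wait with a timeout, as the sketch does.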
[jira] [Updated] (PHOENIX-4133) [hive] ColumnInfo list should be reordered and filtered refer the hive tables
[ https://issues.apache.org/jira/browse/PHOENIX-4133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ZhuQQ updated PHOENIX-4133: --- Description: In some cases we create Hive tables with a different column order, and they may not contain all of the columns in the Phoenix table; we then found that `INSERT INTO test SELECT ...` does not work correctly. For example: {code:sql} -- In Phoenix: CREATE TABLE IF NOT EXISTS test ( key1 VARCHAR NOT NULL, key2 INTEGER NOT NULL, key3 VARCHAR, pv BIGINT, uv BIGINT, CONSTRAINT PK PRIMARY KEY (key1, key2, key3) ); {code} {code:sql} -- In Hive: CREATE EXTERNAL TABLE test.test_part ( key1 string, key2 int, pv bigint ) STORED BY 'org.apache.phoenix.hive.PhoenixStorageHandler' TBLPROPERTIES ( "phoenix.table.name" = "test", "phoenix.zookeeper.quorum" = "localhost", "phoenix.zookeeper.znode.parent" = "/hbase", "phoenix.zookeeper.client.port" = "2181", "phoenix.rowkeys" = "key1,key2", "phoenix.column.mapping" = "key1:key1,key2:key2,pv:pv" ); CREATE EXTERNAL TABLE test.test_uv ( key1 string, key2 int, key3 string, app_version string, channel string, uv bigint ) STORED BY 'org.apache.phoenix.hive.PhoenixStorageHandler' TBLPROPERTIES ( "phoenix.table.name" = "test", "phoenix.zookeeper.quorum" = "localhost", "phoenix.zookeeper.znode.parent" = "/hbase", "phoenix.zookeeper.client.port" = "2181", "phoenix.rowkeys" = "key1,key2,key3", "phoenix.column.mapping" = "key1:key1,key2:key2,key3:key3,uv:uv" ); {code} Then inserting into {{test.test_part}}: {code:sql} INSERT INTO test.test_part SELECT 'some key', 20170828, 80; {code} throws the error: {code:java} ERROR 203 (22005): Type mismatch. BIGINT cannot be coerced to VARCHAR {code} And inserting into {{test.test_uv}}: {code:sql} INSERT INTO test.test_uv SELECT 'some key', 20170828, 'linux', 11; {code} the job executes successfully, but pv is overwritten to 11 and uv is still NULL.
PS: I haven't tested other versions, but judging from the latest source code, newer versions likely have the same problem.
> [hive] ColumnInfo list should be reordered and filtered refer the hive tables > - > > Key: PHOENIX-4133 > URL: https://issues.apache.org/jira/browse/PHOENIX-4133 > Project: Phoenix > Issue Type: Bug >Affects Versions: 4.9.0 >Reporter: ZhuQQ > > In some case, we create hive tables with different order, and may
[jira] [Created] (PHOENIX-4133) [hive] ColumnInfo list should be reordered and filtered refer the hive tables
ZhuQQ created PHOENIX-4133: -- Summary: [hive] ColumnInfo list should be reordered and filtered refer the hive tables Key: PHOENIX-4133 URL: https://issues.apache.org/jira/browse/PHOENIX-4133 Project: Phoenix Issue Type: Bug Affects Versions: 4.9.0 Reporter: ZhuQQ In some cases we create Hive tables with a different column order, and they may not contain all of the columns in the Phoenix table; we then found that `INSERT INTO test SELECT ...` does not work correctly. For example: {code:sql} -- In Phoenix: CREATE TABLE IF NOT EXISTS test ( key1 VARCHAR NOT NULL, key2 INTEGER NOT NULL, key3 VARCHAR, pv BIGINT, uv BIGINT, CONSTRAINT PK PRIMARY KEY (key1, key2, key3) ); {code} {code:sql} -- In Hive: CREATE EXTERNAL TABLE test.test_part ( key1 string, key2 int, pv bigint ) STORED BY 'org.apache.phoenix.hive.PhoenixStorageHandler' TBLPROPERTIES ( "phoenix.table.name" = "test", "phoenix.zookeeper.quorum" = "localhost", "phoenix.zookeeper.znode.parent" = "/hbase", "phoenix.zookeeper.client.port" = "2181", "phoenix.rowkeys" = "key1,key2", "phoenix.column.mapping" = "key1:key1,key2:key2,pv:pv" ); CREATE EXTERNAL TABLE test.test_uv ( key1 string, key2 int, key3 string, app_version string, channel string, uv bigint ) STORED BY 'org.apache.phoenix.hive.PhoenixStorageHandler' TBLPROPERTIES ( "phoenix.table.name" = "test", "phoenix.zookeeper.quorum" = "localhost", "phoenix.zookeeper.znode.parent" = "/hbase", "phoenix.zookeeper.client.port" = "2181", "phoenix.rowkeys" = "key1,key2,key3", "phoenix.column.mapping" = "key1:key1,key2:key2,key3:key3,uv:uv" ); {code} Then inserting into {{test.test_part}}: {code:sql} INSERT INTO test.test_part SELECT 'some key', 20170828, 80; {code} throws the error: {code:java} ERROR 203 (22005): Type mismatch. BIGINT cannot be coerced to VARCHAR {code} And inserting into {{test.test_uv}}: {code:sql} INSERT INTO test.test_uv SELECT 'some key', 20170828, 'linux', 11; {code} the job executes successfully, but pv is overwritten to 11 and uv is still NULL.
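The fix the summary asks for amounts to building the ColumnInfo list in the Hive table's column order (translating names through phoenix.column.mapping) rather than taking the Phoenix table's full column list as-is. A hedged sketch of that logic, using hypothetical helper names rather than Phoenix's actual API:

```python
def reorder_column_infos(phoenix_columns, hive_columns, column_mapping):
    """Return Phoenix column names in the Hive table's column order,
    filtered down to the columns the Hive table actually maps.

    phoenix_columns: Phoenix column names in Phoenix table order
    hive_columns:    Hive column names in the external table's order
    column_mapping:  dict hive name -> Phoenix name (phoenix.column.mapping)
    """
    known = set(phoenix_columns)
    result = []
    for hive_col in hive_columns:
        if hive_col not in column_mapping:
            continue  # Hive-only column with no Phoenix mapping: filter it out
        phoenix_col = column_mapping[hive_col]
        if phoenix_col not in known:
            raise ValueError("Hive column %r maps to unknown Phoenix column %r"
                             % (hive_col, phoenix_col))
        result.append(phoenix_col)
    return result

# test_part declares only (key1, key2, pv), so pv must bind to pv -- not to
# whatever column happens to be third in the Phoenix table (key3 VARCHAR,
# which is what produced "BIGINT cannot be coerced to VARCHAR").
phoenix = ["key1", "key2", "key3", "pv", "uv"]
part_order = reorder_column_infos(
    phoenix, ["key1", "key2", "pv"],
    {"key1": "key1", "key2": "key2", "pv": "pv"})
```

With this ordering, the UPSERT the storage handler generates would name exactly the mapped columns in Hive order, avoiding both the coercion error on test_part and the silent misbinding (pv overwritten, uv left NULL) on test_uv.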
-- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (PHOENIX-418) Support approximate COUNT DISTINCT
[ https://issues.apache.org/jira/browse/PHOENIX-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16143482#comment-16143482 ] Hadoop QA commented on PHOENIX-418: --- {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12883988/PHOENIX-418-v5.patch against master branch at commit 435441ea8ba336e1967b03cf84f1868c5ef14790. ATTACHMENT ID: 12883988 {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:red}-1 javadoc{color}. The javadoc tool appears to have generated 56 warning messages. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 lineLengths{color}. 
The patch introduces the following lines longer than 100: + String query = "SELECT APPROX_COUNT_DISTINCT(a.i1||a.i2||b.i2) FROM " + tableName + " a, " + tableName + final private void prepareTableWithValues(final Connection conn, final int nRows) throws Exception { + final PreparedStatement stmt = conn.prepareStatement("upsert into " + tableName + " VALUES (?, ?)"); +@BuiltInFunction(name=DistinctCountHyperLogLogAggregateFunction.NAME, nodeClass=DistinctCountHyperLogLogAggregateParseNode.class, args= {@Argument()} ) +public DistinctCountHyperLogLogAggregateFunction(List childExpressions, CountAggregateFunction delegate){ + private HyperLogLogPlus hll = new HyperLogLogPlus(DistinctCountHyperLogLogAggregateFunction.NormalSetPrecision, DistinctCountHyperLogLogAggregateFunction.SparseSetPrecision); + private HyperLogLogPlus hll = new HyperLogLogPlus(DistinctCountHyperLogLogAggregateFunction.NormalSetPrecision, DistinctCountHyperLogLogAggregateFunction.SparseSetPrecision); + protected final ImmutableBytesWritable valueByteArray = new ImmutableBytesWritable(ByteUtil.EMPTY_BYTE_ARRAY); +public DistinctCountHyperLogLogAggregateParseNode(String name, List children, BuiltInFunctionInfo info) { +public FunctionExpression create(List children, StatementContext context) throws SQLException { {color:red}-1 core tests{color}. 
The patch failed these unit tests: ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.salted.SaltedTableVarLengthRowKeyIT ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.DerivedTableIT ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.SequenceIT ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.CustomEntityDataIT ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.TransactionalViewIT ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.index.MutableIndexReplicationIT ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.CreateTableIT ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.OrderByIT ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.FirstValuesFunctionIT ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.StatementHintsIT ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.RowValueConstructorIT ./phoenix-core/target/failsafe-reports/TEST-org.apache.hadoop.hbase.regionserver.wal.WALRecoveryRegionPostOpenIT ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.PowerFunctionEnd2EndIT ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.InListIT ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.IsNullIT ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.index.txn.RollbackIT ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.StatsCollectorIT ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.DynamicUpsertIT ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.QueryExecWithoutSCNIT ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.SubqueryUsingSortMergeJoinIT ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.RenewLeaseIT 
./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.index.ChildViewsUseParentViewIndexIT ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.TopNIT ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.EvaluationOfORIT ./phoenix-core/target/failsafe-reports/TEST-org.apache.phoenix.end2end.CsvBulkLoadToolIT