[jira] [Updated] (PHOENIX-5353) Incorrect results when JOINs project array elements
[ https://issues.apache.org/jira/browse/PHOENIX-5353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gerald Sangudi updated PHOENIX-5353:

Description:

There are two errors observed below, one with HASH JOIN and the other with SORT MERGE JOIN. Here are the steps to reproduce the bug in array element projection when using sort merge join. There is also an issue with hash joins (see the last example).

* Create tables and upsert test data.

0: jdbc:phoenix:localhost:2181:/hbase> create table d (c1 integer primary key, c2 integer, c3 integer[]);
0: jdbc:phoenix:localhost:2181:/hbase> upsert into d values (1, 1, array[1,2,3]);
0: jdbc:phoenix:localhost:2181:/hbase> upsert into d values (2, 10, array[2,3,4]);
0: jdbc:phoenix:localhost:2181:/hbase> create table t2 (c1 integer primary key, c2 integer);
0: jdbc:phoenix:localhost:2181:/hbase> upsert into t2 values (1, 10);
0: jdbc:phoenix:localhost:2181:/hbase> upsert into t2 values (2, 20);

* Verify the data in the tables.

0: jdbc:phoenix:localhost:2181:/hbase> select * from d;
+-----+-----+------------+
| C1  | C2  | C3         |
+-----+-----+------------+
| 1   | 1   | [1, 2, 3]  |
| 2   | 10  | [2, 3, 4]  |
+-----+-----+------------+
2 rows selected (0.027 seconds)
0: jdbc:phoenix:localhost:2181:/hbase> select * from t2;
+-----+-----+
| C1  | C2  |
+-----+-----+
| 1   | 10  |
| 2   | 20  |
+-----+-----+
2 rows selected (0.021 seconds)

* Perform a sort merge join without projecting array elements. This works fine.

0: jdbc:phoenix:localhost:2181:/hbase> select /*+ use_sort_merge_join */ d.c1, d.c2, d.c3, t2.c1, t2.c2 from d join t2 on d.c1 = t2.c1;
+-------+-------+------------+--------+--------+
| D.C1  | D.C2  | D.C3       | T2.C1  | T2.C2  |
+-------+-------+------------+--------+--------+
| 1     | 1     | [1, 2, 3]  | 1      | 10     |
| 2     | 10    | [2, 3, 4]  | 2      | 20     |
+-------+-------+------------+--------+--------+
2 rows selected (0.054 seconds)

* Perform a sort merge join while projecting array elements. This returns junk data for the array elements.

0: jdbc:phoenix:localhost:2181:/hbase> select /*+ use_sort_merge_join */ d.c1, d.c2, d.c3[1], d.c3[2], d.c3[3], t2.c1, t2.c2 from d join t2 on d.c1 = t2.c1;
+-------+-------+----------------------+----------------------+----------------------+--------+--------+
| D.C1  | D.C2  | ARRAY_ELEM(D.C3, 1)  | ARRAY_ELEM(D.C3, 2)  | ARRAY_ELEM(D.C3, 3)  | T2.C1  | T2.C2  |
+-------+-------+----------------------+----------------------+----------------------+--------+--------+
| 1     | 1     | -1937768448          | -2122317824          | -2105540608          | 1      | 10     |
| 2     | 10    | -1937768448          | -2105540608          | -2088763392          | 2      | 20     |
+-------+-------+----------------------+----------------------+----------------------+--------+--------+
2 rows selected (0.043 seconds)

* Array element projection works fine with a hash join, but the columns from the non-array table are corrupted.

0: jdbc:phoenix:localhost:2181:/hbase> select d.c1, d.c2, d.c3[1], d.c3[2], d.c3[3], t2.c1, t2.c2 from d join t2 on d.c1 = t2.c1;
+-------+-------+----------------------+----------------------+----------------------+--------------+--------------+
| D.C1  | D.C2  | ARRAY_ELEM(D.C3, 1)  | ARRAY_ELEM(D.C3, 2)  | ARRAY_ELEM(D.C3, 3)  | T2.C1        | T2.C2        |
+-------+-------+----------------------+----------------------+----------------------+--------------+--------------+
| 1     | 1     | 1                    | 2                    | 3                    | -2146795520  | -2147319808  |
| 2     | 10    | 2                    | 3                    | 4                    | -2146140160  | -2147319808  |
+-------+-------+----------------------+----------------------+----------------------+--------------+--------------+
2 rows selected (0.067 seconds)
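The junk values (e.g. -1937768448) look like real bytes decoded from the wrong position in the joined row. As an illustration only (the encoding below is a toy big-endian layout, not Phoenix's actual array or KeyValue serialization), reading a serialized int array at a shifted offset produces exactly this kind of plausible-looking but wrong integer:

```python
import struct

def decode_ints(buf, offset, count):
    """Decode `count` big-endian 32-bit ints starting at `offset`."""
    return list(struct.unpack_from('>%di' % count, buf, offset))

# Toy serialization of the array [1, 2, 3] -- NOT Phoenix's wire format.
arr = struct.pack('>3i', 1, 2, 3)
assert decode_ints(arr, 0, 3) == [1, 2, 3]   # aligned read is correct

# Prepend an unrelated cell, then decode with a stale (wrong) offset,
# simulating a projection that reads from the wrong position in a
# merged row: the result is garbage, not [1, 2, 3].
row = struct.pack('>i', 7) + arr
misread = decode_ints(row, 2, 3)
assert misread != [1, 2, 3]
```

The same mechanism would explain why either side of the join can come out corrupted depending on which plan reorders the projected cells.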
[jira] [Created] (PHOENIX-5353) Incorrect results when JOINs project array elements
Gerald Sangudi created PHOENIX-5353:
---
Summary: Incorrect results when JOINs project array elements
Key: PHOENIX-5353
URL: https://issues.apache.org/jira/browse/PHOENIX-5353
Project: Phoenix
Issue Type: Bug
Reporter: Gerald Sangudi
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
Build failure
Hi folks,

The build is failing for me on the latest master; I also tried a fresh install. The latest commit that fails is https://github.com/apache/phoenix/commit/1455763b43cc71bff9e69daf928be59251e97235. The first error is a dependency on an HBase class that is not included in Phoenix. Previous commits work fine for me.

Thanks,
Gerald
[jira] [Updated] (PHOENIX-4757) composite key salt_buckets
[ https://issues.apache.org/jira/browse/PHOENIX-4757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gerald Sangudi updated PHOENIX-4757:

Comment: was deleted (was: [~tdsilva] - id_1 and id_2 are not constant. They are meaningful in the application domain, and they vary by record. There would be no write hotspotting if new writes arrive in random order wrt id_2.)

> composite key salt_buckets
> --
>
> Key: PHOENIX-4757
> URL: https://issues.apache.org/jira/browse/PHOENIX-4757
> Project: Phoenix
> Issue Type: Improvement
> Affects Versions: 4.11.0
> Reporter: cmd
> Priority: Major
> Fix For: 4.11.0
>
> CREATE TABLE IF NOT EXISTS user_events (
>   user_id VARCHAR NOT NULL,
>   event_type VARCHAR NOT NULL,
>   event_time VARCHAR NOT NULL,
>   event_msg VARCHAR NOT NULL,
>   event_status VARCHAR NOT NULL,
>   event_opt VARCHAR NOT NULL,
>   CONSTRAINT my_pk PRIMARY KEY (user_id, event_type, event_time))
> SALT_BUCKETS=128;
> and my queries are:
> 1. select event_type, count(0) from user_events where user_id='' group by event_type
> 2. select count(0) from user_events where user_id='' and event_type='0101'
> 3. select * from user_events where user_id='' and event_type='0101' and event_time>'20180101' and event_time<'20180201' order by event_time limit 50,100
> Concurrency query ratio:
> 1: 80%
> 2: 10%
> 3: 10%
> user_events data: 50 billion
> It could allow a field/some fields of the primary key to be salted by hash, with grammar like "SALT_BUCKETS(user_id)=4" or "SALT_BUCKETS(user_id,event_type)=4"
> ref:
> https://www.safaribooksonline.com/library/view/greenplum-architecture/9781940540337/xhtml/chapter03.xhtml
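A minimal Python sketch of what the reporter is asking for, assuming a hypothetical SALT_BUCKETS(user_id)-style grammar and an illustrative hash (Phoenix computes its real salt byte differently): salting on a prefix of the composite key sends every row for a given user_id to one bucket, so queries filtered on user_id would each touch a single bucket rather than all 128:

```python
import hashlib

def salt_bucket(key_parts, num_buckets, prefix_len=None):
    # prefix_len is the proposed (hypothetical) extension: hash only the
    # first prefix_len rowkey columns. prefix_len=None mimics today's
    # behavior of hashing the full rowkey. The hash itself is illustrative.
    parts = key_parts if prefix_len is None else key_parts[:prefix_len]
    digest = hashlib.md5('|'.join(map(str, parts)).encode('utf-8')).digest()
    return digest[0] % num_buckets

# 1000 distinct events for one user.
events = [('user42', 'click', t) for t in range(1000)]

# Full-rowkey salting scatters the user's rows across many buckets...
full_buckets = {salt_bucket(e, 128) for e in events}
# ...while prefix salting on (user_id,) keeps them all in one bucket.
prefix_buckets = {salt_bucket(e, 128, prefix_len=1) for e in events}
assert len(prefix_buckets) == 1
```

Under full-rowkey salting, the three queries above must fan out to all 128 buckets; under prefix salting they can be served from one.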
Re: Salting based on partial rowkeys
Hi folks,

I would like to continue the discussion below from a few weeks ago, and to address the feedback from Thomas, Jaanai, and Sergey.

In thinking about this some more, it amounts to introducing a form of hash partitioning in Phoenix + HBase. For this to work in the general case, region splits and merges would need to be disabled and the regions pre-defined. This is already supported by DisableRegionSplitPolicy. I understand that HBase is range-partitioned. What we are proposing is to allow specialized use cases where users manage their own partitioning. All other users would be unaffected.

Jaanai also mentioned that salting is meant to address write hotspots. That is true. We are proposing an additional use of salting (or, if you prefer, another feature altogether) for specialized use cases. Again, all other users would be unaffected. We have a lot of data in HBase, and some of our queries would benefit from some form of hash partitioning. That is the crux of our proposal.

Some specific responses:

@Josh -- exactly.

@Thomas -- if we can partition the data exactly how we want, we can make sure that certain queries do not go across regions. We can scan only the matching regions, and we can perform the full aggregation within each matching region, without having to do a merge.

@Jaanai, @Sergey -- I hope I addressed hotspotting above. We would also be fine calling this something other than salting; maybe that is better, to avoid confusion.

@Lars -- you posted questions in PHOENIX-4757. I will respond there.

Thanks everyone for all the feedback on this. Our goal is to discuss all the concerns, and then finally get a yay or nay consensus from the committers.

Gerald

On Sun, Sep 16, 2018 at 9:52 PM la...@apache.org wrote:
> I added some comments on the PHOENIX-4757
>
> On Thursday, September 13, 2018, 6:42:12 PM PDT, Josh Elser <els...@apache.org> wrote:
>
> Ahh, I get you now.
> For a composite primary key made up of columns 1 through N, you want similar controls to compute the value of the salt based on a sequence of the columns 1 through M where M <= N (instead of always on all columns).
>
> For large numbers of salt buckets and a scan over a facet, you prune your search space considerably. Makes sense to me!
>
> On 9/13/18 6:37 PM, Gerald Sangudi wrote:
> > In case the text formatting is lost below, I also added it as a comment in the JIRA ticket:
> >
> > https://issues.apache.org/jira/browse/PHOENIX-4757
> >
> > On Thu, Sep 13, 2018 at 3:24 PM, Gerald Sangudi wrote:
> >> Sorry I missed Josh's reply; I've subscribed to the dev list now.
> >>
> >> Below is a copy-and-paste from our internal document. Thanks in advance for your review and additional feedback on this.
> >>
> >> Gerald
> >>
> >> Background
> >> We make extensive use of multi-column rowkeys and salting <https://phoenix.apache.org/salted.html> in our different Apache Phoenix deployments. We frequently perform group-by aggregations on these data along a specific dimension that would benefit from predictably partitioning the data along that dimension.
> >>
> >> Proposal
> >> We propose to add table metadata to allow schema designers to constrain salting to a subset of the rowkey, rather than the full rowkey as it is today. This will introduce a mechanism to partition data on a per-table basis along a single dimension, without application changes or much change to the Phoenix runtime logic. We expect this will result in substantially faster group-bys along the salted dimension and negligible penalties elsewhere. This feature has also been proposed in PHOENIX-4757 <https://issues.apache.org/jira/browse/PHOENIX-4757>, where it was pointed out that partitioning and sorting data along different dimensions is a common pattern in other datastores as well. Theoretically, it could cause hotspotting when querying along the salted dimension without the leading rowkey - that would be an anti-pattern.
> >>
> >> Usage Example
> >> Current:
> >> Schema:
> >> CREATE TABLE relationship (
> >>   id_1 BIGINT NOT NULL,
> >>   id_2 BIGINT NOT NULL,
> >>   other_key BIGINT NOT NULL,
> >>   val SMALLINT,
> >>   CONSTRAINT pk PRIMARY KEY (id_1, id_2, other_key))
> >> SALT_BUCKETS=60;
> >> Query:
> >> Select id_2, sum(val)
> >> From relationship
> >> Where id_1 in (2,3
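The pruning benefit claimed in the proposal above can be sketched with a toy in-memory store standing in for regions and an illustrative hash (not Phoenix's real salt computation): when the salt is computed from id_1 alone, a query filtered on id_1 reads exactly one bucket and still returns the same rows as a full scan:

```python
import hashlib
from collections import defaultdict

NUM_BUCKETS = 60

def salt(id_1, num_buckets=NUM_BUCKETS):
    # Hypothetical partial-rowkey salt: derived from id_1 only,
    # per the proposal. The hash function is illustrative.
    return hashlib.sha256(str(id_1).encode('utf-8')).digest()[0] % num_buckets

# Rows shaped like the relationship table: (id_1, id_2, other_key, val).
rows = [(i % 5, i, i * 10, 1) for i in range(100)]

# "Write path": each row lands in the bucket chosen by its id_1 salt.
store = defaultdict(list)
for r in rows:
    store[salt(r[0])].append(r)

# "Query path" for WHERE id_1 = 2: consult only the one matching bucket.
target = [r for r in store[salt(2)] if r[0] == 2]
# A full scan with the same filter returns the identical result set.
full = [r for r in rows if r[0] == 2]
assert sorted(target) == sorted(full)
```

Because every row with the same id_1 shares one salt byte, the group-by along id_1 can also be completed inside that bucket with no cross-region merge, which is the crux of the proposal.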
[jira] [Created] (PHOENIX-4948) Add test case for merging nullified array elements
Gerald Sangudi created PHOENIX-4948:
---
Summary: Add test case for merging nullified array elements
Key: PHOENIX-4948
URL: https://issues.apache.org/jira/browse/PHOENIX-4948
Project: Phoenix
Issue Type: Test
Reporter: Gerald Sangudi
Assignee: Gerald Sangudi

See PHOENIX-4917. That ticket fixes a ClassCastException when projecting array elements in hash join. This ticket is to add a test case that exercises the merge path in that code, specifically HashJoinRegionScanner.mergeProjectedValue().
[jira] [Updated] (PHOENIX-4948) Add test case for merging nullified array elements
[ https://issues.apache.org/jira/browse/PHOENIX-4948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gerald Sangudi updated PHOENIX-4948:

Description:

See PHOENIX-4917. That ticket fixes a ClassCastException when projecting array elements in hash join. This ticket is to add a test case that exercises the merge path in that code, specifically HashJoinRegionScanner.mergeProjectedValue().

Cc [~twdsi...@gmail.com]
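A test of the requested shape could look as follows. This is a hedged sketch only: merge_projected below is a toy stand-in and does not claim to reproduce the semantics of the real HashJoinRegionScanner.mergeProjectedValue(); the point is the pattern of feeding nullified (null) array elements through a merge and asserting they survive:

```python
def merge_projected(dest, src):
    # Toy merge: keep dest's value where present, fall back to src
    # otherwise. A nullified element survives only if both sides are
    # null, which is the edge case such a test should pin down.
    return [s if d is None else d for d, s in zip(dest, src)]

def test_merge_preserves_nullified_elements():
    # Element 2 was nullified on both sides: it must stay null.
    assert merge_projected([1, None, 3], [None, None, None]) == [1, None, 3]
    # A null on one side falls back to the other side's value.
    assert merge_projected([None, 2], [5, 9]) == [5, 2]

test_merge_preserves_nullified_elements()
```

The real test would drive the same scenario through a hash join over a table with an array column, so that the coprocessor's merge path actually executes.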
[jira] [Updated] (PHOENIX-4917) ClassCastException when projecting array elements in hash join
[ https://issues.apache.org/jira/browse/PHOENIX-4917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gerald Sangudi updated PHOENIX-4917:

Description:

This bug was introduced in the fix for PHOENIX-4791. When projecting array elements in hash join, we now generate both ProjectedValueTuple and MultiKeyValueTuple. Before the fix for PHOENIX-4791, hash join was only generating ProjectedValueTuple, and there were two lines of code with class casts that reflected this assumption. The fix is to handle both ProjectedValueTuple and MultiKeyValueTuple, while continuing to propagate the array cell as in PHOENIX-4791.

The stack trace with the ClassCastException:

Caused by: org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.DoNotRetryIOException): org.apache.hadoop.hbase.DoNotRetryIOException: GENO_DOSE,,1537598769044.1a6cb8853b036c59e7515d8e876e28c5.: org.apache.phoenix.schema.tuple.MultiKeyValueTuple cannot be cast to org.apache.phoenix.execute.TupleProjector$ProjectedValueTuple
  at org.apache.phoenix.util.ServerUtil.createIOException(ServerUtil.java:96)
  at org.apache.phoenix.util.ServerUtil.throwIOException(ServerUtil.java:62)
  at org.apache.phoenix.coprocessor.HashJoinRegionScanner.nextRaw(HashJoinRegionScanner.java:300)
  at org.apache.phoenix.coprocessor.DelegateRegionScanner.nextRaw(DelegateRegionScanner.java:82)
  at org.apache.phoenix.coprocessor.DelegateRegionScanner.nextRaw(DelegateRegionScanner.java:82)
  at org.apache.phoenix.coprocessor.BaseScannerRegionObserver$RegionScannerHolder.nextRaw(BaseScannerRegionObserver.java:294)
  at org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:2633)
  at org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:2837)
  at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:34950)
  at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2339)
  at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:123)
  at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:188)
  at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:168)
Caused by: java.lang.ClassCastException: org.apache.phoenix.schema.tuple.MultiKeyValueTuple cannot be cast to org.apache.phoenix.execute.TupleProjector$ProjectedValueTuple
  at org.apache.phoenix.coprocessor.HashJoinRegionScanner.processResults(HashJoinRegionScanner.java:220)
  at org.apache.phoenix.coprocessor.HashJoinRegionScanner.nextRaw(HashJoinRegionScanner.java:294)
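The fix described above (handle both tuple types instead of casting unconditionally) can be illustrated with a Python analogy. The class names mirror the Phoenix classes, but this is a sketch of the dispatch pattern, not the actual Java change:

```python
class ProjectedValueTuple:
    """Stand-in for org.apache.phoenix.execute.TupleProjector$ProjectedValueTuple."""
    pass

class MultiKeyValueTuple:
    """Stand-in for org.apache.phoenix.schema.tuple.MultiKeyValueTuple."""
    pass

def process_result(tup):
    # Before the fix (by analogy): the code cast every tuple to
    # ProjectedValueTuple, which raised on a MultiKeyValueTuple,
    # the equivalent of the ClassCastException above.
    # After the fix: dispatch on the actual runtime type.
    if isinstance(tup, ProjectedValueTuple):
        return "projected"
    if isinstance(tup, MultiKeyValueTuple):
        return "multi-key"
    raise TypeError("unexpected tuple type: %s" % type(tup).__name__)
```

Either tuple kind now flows through without a cast failure, while an unknown type still fails loudly.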
[jira] [Created] (PHOENIX-4917) ClassCastException when projecting array elements in hash join
Gerald Sangudi created PHOENIX-4917: --- Summary: ClassCastException when projecting array elements in hash join Key: PHOENIX-4917 URL: https://issues.apache.org/jira/browse/PHOENIX-4917 Project: Phoenix Issue Type: Bug Affects Versions: 4.15.0, 4.14.1 Reporter: Gerald Sangudi Assignee: Gerald Sangudi Fix For: 4.15.0, 4.14.1

This bug was introduced in the fix for https://issues.apache.org/jira/browse/PHOENIX-4791. When projecting array elements in a hash join, we now generate both ProjectedValueTuple and MultiKeyValueTuple. Previously, we generated only ProjectedValueTuple, and two lines of code bake this assumption into class casts. The fix is to merge into the MultiKeyValueTuple while propagating the array cell, as in PHOENIX-4791. The stack trace with the ClassCastException:

Caused by: org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.DoNotRetryIOException): org.apache.hadoop.hbase.DoNotRetryIOException: GENO_DOSE,,1537598769044.1a6cb8853b036c59e7515d8e876e28c5.: org.apache.phoenix.schema.tuple.MultiKeyValueTuple cannot be cast to org.apache.phoenix.execute.TupleProjector$ProjectedValueTuple
at org.apache.phoenix.util.ServerUtil.createIOException(ServerUtil.java:96)
at org.apache.phoenix.util.ServerUtil.throwIOException(ServerUtil.java:62)
at org.apache.phoenix.coprocessor.HashJoinRegionScanner.nextRaw(HashJoinRegionScanner.java:300)
at org.apache.phoenix.coprocessor.DelegateRegionScanner.nextRaw(DelegateRegionScanner.java:82)
at org.apache.phoenix.coprocessor.DelegateRegionScanner.nextRaw(DelegateRegionScanner.java:82)
at org.apache.phoenix.coprocessor.BaseScannerRegionObserver$RegionScannerHolder.nextRaw(BaseScannerRegionObserver.java:294)
at org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:2633)
at org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:2837)
at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:34950)
at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2339)
at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:123)
at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:188)
at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:168)
Caused by: java.lang.ClassCastException: org.apache.phoenix.schema.tuple.MultiKeyValueTuple cannot be cast to org.apache.phoenix.execute.TupleProjector$ProjectedValueTuple
at org.apache.phoenix.coprocessor.HashJoinRegionScanner.processResults(HashJoinRegionScanner.java:220)
at org.apache.phoenix.coprocessor.HashJoinRegionScanner.nextRaw(HashJoinRegionScanner.java:294)
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
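The fix described above can be sketched in Python. This is illustrative only: the real code is Java in HashJoinRegionScanner.processResults, and the class and function names here merely mirror the Phoenix classes; they are not the actual implementation.

```python
# Minimal model of the bug: the scanner assumed every tuple was a
# ProjectedValueTuple and "cast" unconditionally. After PHOENIX-4791 a
# MultiKeyValueTuple can also arrive, so the code must handle both kinds
# and propagate the array cell instead of casting.

class Tuple:
    def __init__(self, cells):
        self.cells = cells

class ProjectedValueTuple(Tuple):
    pass

class MultiKeyValueTuple(Tuple):
    pass

def process_result_buggy(tup):
    # Pre-fix behavior: blind cast -- fails when a MultiKeyValueTuple arrives.
    if not isinstance(tup, ProjectedValueTuple):
        raise TypeError(
            "MultiKeyValueTuple cannot be cast to ProjectedValueTuple")
    return tup.cells

def process_result_fixed(tup, array_cell=None):
    # Post-fix behavior: merge cells from either tuple kind, carrying the
    # projected array cell along (as PHOENIX-4791 does).
    cells = list(tup.cells)
    if array_cell is not None:
        cells.append(array_cell)
    return cells
```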
Re: Salting based on partial rowkeys
In case the text formatting is lost below, I also added it as a comment in the JIRA ticket: https://issues.apache.org/jira/browse/PHOENIX-4757 On Thu, Sep 13, 2018 at 3:24 PM, Gerald Sangudi wrote:
> Sorry I missed Josh's reply; I've subscribed to the dev list now.
>
> Below is a copy-and-paste from our internal document. Thanks in advance
> for your review and additional feedback on this.
>
> Gerald
>
> Background:
> We make extensive use of multi-column rowkeys and salting
> <https://phoenix.apache.org/salted.html> in our different Apache Phoenix
> deployments. We frequently perform group-by aggregations on these data
> along a specific dimension that would benefit from predictably
> partitioning the data along that dimension.
>
> Proposal:
> We propose to add table metadata to allow schema designers to constrain
> salting to a subset of the rowkey, rather than the full rowkey as it is
> today. This will introduce a mechanism to partition data on a per-table
> basis along a single dimension, without application changes or much
> change to the Phoenix runtime logic. We expect this will result in
> substantially faster group-bys along the salted dimension and negligible
> penalties elsewhere. This feature has also been proposed in PHOENIX-4757
> <https://issues.apache.org/jira/browse/PHOENIX-4757>, where it was
> pointed out that partitioning and sorting data along different dimensions
> is a common pattern in other datastores as well. Theoretically, it could
> cause hotspotting when querying along the salted dimension without the
> leading rowkey; that would be an anti-pattern.
>
> Usage Example
>
> Current schema:
> CREATE TABLE relationship (
>   id_1 BIGINT NOT NULL,
>   id_2 BIGINT NOT NULL,
>   other_key BIGINT NOT NULL,
>   val SMALLINT,
>   CONSTRAINT pk PRIMARY KEY (id_1, id_2, other_key)
> ) SALT_BUCKETS=60;
>
> Query:
> SELECT id_2, SUM(val) FROM relationship WHERE id_1 IN (2,3) GROUP BY id_2;
>
> Explain:
> CLIENT 60-CHUNK PARALLEL 60-WAY SKIP SCAN ON 120 KEYS OVER RELATIONSHIP [0,2] - [59,3]
>     SERVER AGGREGATE INTO DISTINCT ROWS BY [ID_2]
> CLIENT MERGE SORT
> 3 rows selected (0.048 seconds)
>
> In this case, although the group by is performed on both the client and
> the regionservers, almost all of the actual grouping happens on the
> client, because the id_2's are randomly distributed across the
> regionservers. As a result, a lot of unnecessary data is serialized to
> the client and grouped serially there. This can become quite material
> with large result sets.
>
> Proposed schema:
> CREATE TABLE relationship (
>   id_1 BIGINT NOT NULL,
>   id_2 BIGINT NOT NULL,
>   other_key BIGINT NOT NULL,
>   val SMALLINT,
>   CONSTRAINT pk PRIMARY KEY (id_1, id_2, other_key)
> ) SALT_BUCKETS=60, SALT_COLUMN=id_2;
>
> Query (unchanged). Explain (unchanged). Because the data are partitioned
> so that all rows containing the same id_2 are on the same regionserver,
> the above query will perform almost all of the grouping in parallel on
> the regionservers. No special hint or changes to the query plan would be
> required to benefit. Tables would need to be re-salted to take advantage
> of the new functionality.
>
> Technical changes proposed to Phoenix:
> - Create a new piece of table-level metadata: SALT_COLUMN. SALT_COLUMN
> will instruct the salting logic to generate a salt byte based only on the
> specified column. If unspecified, it will behave as it does today and
> default to salting the entire rowkey. This metadata may be specified only
> when the table is created and may not be modified. The specified column
> must be part of the rowkey.
> - Modify all callers of getSaltingByte
> <https://github.com/apache/phoenix/blob/master/phoenix-core/src/main/java/org/apache/phoenix/schema/SaltingUtil.java#L77>(byte[]
> value, int offset, int length, int bucketNum) to consistently leverage
> the new metadata.
> - Tests
> - Docs
>
> Design points:
> One salt column vs multiple salt columns: Based on the existing signature
> for getSaltingByte, it seems simpler to only support a single SALT_COLUMN
> rather than multiple arbi
Re: Salting based on partial rowkeys
unnecessary in cases where the SALT_COLUMN is part of the group-by expression.
- A query that contains a literal equality predicate for the SALT_COLUMN can be isolated to a single regionserver, rather than broadcast to all regionservers.
- A client-side merge-sort join based on the SALT_COLUMN could optimize the organization of merges.
- Similarly, a server-side hash join could distribute only the necessary portions of the hash table to each regionserver.
If additional advantages of these types come for free, that's great, but they can be follow-on enhancements from the initial commit. On Thu, Sep 13, 2018 at 9:33 AM, Thomas D'Silva wrote: > Gerald, > > I think you missed Josh's reply here : https://lists.apache.org/ > thread.html/c5145461805429622a410c23c1199d578e146a5c94511b2d5833438b@% > 3Cdev.phoenix.apache.org%3E > > Could you explain how using a subset of the pk columns to generate the > salt byte helps with partitioning, aggregations etc? > > Thanks, > Thomas > > On Thu, Sep 13, 2018 at 8:32 AM, Gerald Sangudi > wrote: > >> Hi folks, >> >> Any thoughts or feedback on this? >> >> Thanks, >> Gerald >> >> On Mon, Sep 10, 2018 at 1:56 PM, Gerald Sangudi >> wrote: > >> >>> Hello folks, >>> >>> We have a requirement for salting based on partial, rather than full, >>> rowkeys. My colleague Mike Polcari has identified the requirement and >>> proposed an approach. >>> >>> I found an already-open JIRA ticket for the same issue: >>> https://issues.apache.org/jira/browse/PHOENIX-4757. I can provide more >>> details from the proposal. >>> >>> The JIRA proposes a syntax of SALT_BUCKETS(col, ...) = N, whereas Mike >>> proposes SALT_COLUMN=col or SALT_COLUMNS=col, ... . >>> >>> The benefit at issue is that users gain more control over partitioning, >>> and this can be used to push some additional aggregations and hash joins >>> down to region servers. >>> >>> I would appreciate any go-ahead / thoughts / guidance / objections / >>> feedback.
I'd like to be sure that the concept at least is not >>> objectionable. We would like to work on this and submit a patch down the >>> road. I'll also add a note to the JIRA ticket. >>> >>> Thanks, >>> Gerald >>> >>> >> >
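The SALT_COLUMN proposal discussed in this thread can be sketched as follows. This is a hedged sketch, not Phoenix code: it assumes salting of the general form hash(bytes) % SALT_BUCKETS (as in SaltingUtil.getSaltingByte), uses a stand-in hash, and the names full_rowkey_salt and salt_column_salt are illustrative only.

```python
# Why SALT_COLUMN helps: salting the full rowkey scatters rows with the
# same id_2 across buckets, while salting on id_2 alone sends every row
# with a given id_2 to the same bucket (and thus the same regionserver),
# so a GROUP BY id_2 can aggregate fully in parallel on the servers.
import struct

SALT_BUCKETS = 60

def salt_byte(key_bytes: bytes, buckets: int) -> int:
    # Stand-in hash; Phoenix's real salting uses SaltingUtil's hash.
    h = 0
    for b in key_bytes:
        h = (h * 31 + b) & 0x7FFFFFFF
    return h % buckets

def full_rowkey_salt(id_1: int, id_2: int, other_key: int) -> int:
    # Today's behavior: salt over the entire rowkey.
    return salt_byte(struct.pack(">qqq", id_1, id_2, other_key), SALT_BUCKETS)

def salt_column_salt(id_2: int) -> int:
    # Proposed SALT_COLUMN=id_2: salt over id_2's bytes only, so the
    # bucket depends on nothing but id_2.
    return salt_byte(struct.pack(">q", id_2), SALT_BUCKETS)
```

The design consequence noted in the thread follows directly: with salt_column_salt, an equality predicate on id_2 identifies a single bucket, while full_rowkey_salt requires probing all buckets.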
Re: Salting based on partial rowkeys
Hi folks, Any thoughts or feedback on this? Thanks, Gerald On Mon, Sep 10, 2018 at 1:56 PM, Gerald Sangudi wrote: > Hello folks, > > We have a requirement for salting based on partial, rather than full, > rowkeys. My colleague Mike Polcari has identified the requirement and > proposed an approach. > > I found an already-open JIRA ticket for the same issue: > https://issues.apache.org/jira/browse/PHOENIX-4757. I can provide more > details from the proposal. > > The JIRA proposes a syntax of SALT_BUCKETS(col, ...) = N, whereas Mike > proposes SALT_COLUMN=col or SALT_COLUMNS=col, ... . > > The benefit at issue is that users gain more control over partitioning, > and this can be used to push some additional aggregations and hash joins > down to region servers. > > I would appreciate any go-ahead / thoughts / guidance / objections / > feedback. I'd like to be sure that the concept at least is not > objectionable. We would like to work on this and submit a patch down the > road. I'll also add a note to the JIRA ticket. > > Thanks, > Gerald > >
Salting based on partial rowkeys
Hello folks, We have a requirement for salting based on partial, rather than full, rowkeys. My colleague Mike Polcari has identified the requirement and proposed an approach. I found an already-open JIRA ticket for the same issue: https://issues.apache.org/jira/browse/PHOENIX-4757. I can provide more details from the proposal. The JIRA proposes a syntax of SALT_BUCKETS(col, ...) = N, whereas Mike proposes SALT_COLUMN=col or SALT_COLUMNS=col, ... . The benefit at issue is that users gain more control over partitioning, and this can be used to push some additional aggregations and hash joins down to region servers. I would appreciate any go-ahead / thoughts / guidance / objections / feedback. I'd like to be sure that the concept at least is not objectionable. We would like to work on this and submit a patch down the road. I'll also add a note to the JIRA ticket. Thanks, Gerald
[jira] [Updated] (PHOENIX-4791) Array elements are nullified with joins
[ https://issues.apache.org/jira/browse/PHOENIX-4791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gerald Sangudi updated PHOENIX-4791: Attachment: (was: PHOENIX-4791.patch) > Array elements are nullified with joins > --- > > Key: PHOENIX-4791 > URL: https://issues.apache.org/jira/browse/PHOENIX-4791 > Project: Phoenix > Issue Type: Bug >Affects Versions: 4.11.0, 4.12.0, 4.13.0, 4.14.0, 5.0.0 >Reporter: Tulasi P > Assignee: Gerald Sangudi >Priority: Major > Fix For: 4.15.0, 5.1.0 > > > Returning elements of an array from a table that is part of a join causes > array elements to be nullified. > {noformat} > create table array_test_1 (id integer not null primary key, arr tinyint[5]); > upsert into array_test_1 values (1001, array[0, 0, 0, 0, 0]); > upsert into array_test_1 values (1002, array[0, 0, 0, 0, 1]); > upsert into array_test_1 values (1003, array[0, 0, 0, 1, 1]); > upsert into array_test_1 values (1004, array[0, 0, 1, 1, 1]); > upsert into array_test_1 values (1005, array[1, 1, 1, 1, 1]); > {noformat} > > {noformat} > create table test_table_1 (id integer not null primary key, val varchar); > upsert into test_table_1 values (1001, 'abc'); > upsert into test_table_1 values (1002, 'def'); > upsert into test_table_1 values (1003, 'ghi');{noformat} > {noformat} > 0: jdbc:phoenix:localhost> select t1.id, t2.val, t1.arr[1], t1.arr[2], > t1.arr[3] from array_test_1 as t1 join test_table_1 as t2 on t1.id = t2.id; > ++-+-++---+ > | T1.ID | T2.VAL | ARRAY_ELEM(T1.ARR, 1) | ARRAY_ELEM(T1.ARR, 2) | > ARRAY_ELEM(T1.ARR, 3) | > ++-+-+-++ > | 1001 | abc | null | null | null | > | 1002 | def | null | null | null | > | 1003 | ghi | null | null | null | > ++-+++-+ > 3 rows selected (0.056 seconds) > {noformat} > However, directly selecting array elements from the array returns data > correctly. 
> {noformat}
> 0: jdbc:phoenix:localhost> select t1.id, t1.arr[1], t1.arr[2], t1.arr[3]
> from array_test_1 as t1;
> +-------+--------------------+--------------------+--------------------+
> |  ID   | ARRAY_ELEM(ARR, 1) | ARRAY_ELEM(ARR, 2) | ARRAY_ELEM(ARR, 3) |
> +-------+--------------------+--------------------+--------------------+
> | 1001  | 0                  | 0                  | 0                  |
> | 1002  | 0                  | 0                  | 0                  |
> | 1003  | 0                  | 0                  | 0                  |
> | 1004  | 0                  | 0                  | 1                  |
> | 1005  | 1                  | 1                  | 1                  |
> +-------+--------------------+--------------------+--------------------+
> 5 rows selected (0.044 seconds)
> {noformat}
[jira] [Updated] (PHOENIX-4791) Array elements are nullified with joins
[ https://issues.apache.org/jira/browse/PHOENIX-4791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gerald Sangudi updated PHOENIX-4791: Attachment: (was: PHOENIX-4791.patch) > Array elements are nullified with joins > --- > > Key: PHOENIX-4791 > URL: https://issues.apache.org/jira/browse/PHOENIX-4791 > Project: Phoenix > Issue Type: Bug >Affects Versions: 4.11.0, 4.12.0, 4.13.0, 4.14.0, 5.0.0 >Reporter: Tulasi P > Assignee: Gerald Sangudi >Priority: Major > Fix For: 4.15.0, 5.1.0 > > Attachments: PHOENIX-4791.patch > > > Returning elements of an array from a table that is part of a join causes > array elements to be nullified. > {noformat} > create table array_test_1 (id integer not null primary key, arr tinyint[5]); > upsert into array_test_1 values (1001, array[0, 0, 0, 0, 0]); > upsert into array_test_1 values (1002, array[0, 0, 0, 0, 1]); > upsert into array_test_1 values (1003, array[0, 0, 0, 1, 1]); > upsert into array_test_1 values (1004, array[0, 0, 1, 1, 1]); > upsert into array_test_1 values (1005, array[1, 1, 1, 1, 1]); > {noformat} > > {noformat} > create table test_table_1 (id integer not null primary key, val varchar); > upsert into test_table_1 values (1001, 'abc'); > upsert into test_table_1 values (1002, 'def'); > upsert into test_table_1 values (1003, 'ghi');{noformat} > {noformat} > 0: jdbc:phoenix:localhost> select t1.id, t2.val, t1.arr[1], t1.arr[2], > t1.arr[3] from array_test_1 as t1 join test_table_1 as t2 on t1.id = t2.id; > ++-+-++---+ > | T1.ID | T2.VAL | ARRAY_ELEM(T1.ARR, 1) | ARRAY_ELEM(T1.ARR, 2) | > ARRAY_ELEM(T1.ARR, 3) | > ++-+-+-++ > | 1001 | abc | null | null | null | > | 1002 | def | null | null | null | > | 1003 | ghi | null | null | null | > ++-+++-+ > 3 rows selected (0.056 seconds) > {noformat} > However, directly selecting array elements from the array returns data > correctly. 
> {noformat}
> 0: jdbc:phoenix:localhost> select t1.id, t1.arr[1], t1.arr[2], t1.arr[3]
> from array_test_1 as t1;
> +-------+--------------------+--------------------+--------------------+
> |  ID   | ARRAY_ELEM(ARR, 1) | ARRAY_ELEM(ARR, 2) | ARRAY_ELEM(ARR, 3) |
> +-------+--------------------+--------------------+--------------------+
> | 1001  | 0                  | 0                  | 0                  |
> | 1002  | 0                  | 0                  | 0                  |
> | 1003  | 0                  | 0                  | 0                  |
> | 1004  | 0                  | 0                  | 1                  |
> | 1005  | 1                  | 1                  | 1                  |
> +-------+--------------------+--------------------+--------------------+
> 5 rows selected (0.044 seconds)
> {noformat}
[jira] [Updated] (PHOENIX-4791) Array elements are nullified with joins
[ https://issues.apache.org/jira/browse/PHOENIX-4791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gerald Sangudi updated PHOENIX-4791: Attachment: (was: PHOENIX-4791.patch) > Array elements are nullified with joins > --- > > Key: PHOENIX-4791 > URL: https://issues.apache.org/jira/browse/PHOENIX-4791 > Project: Phoenix > Issue Type: Bug >Affects Versions: 4.11.0, 4.12.0, 4.13.0, 4.14.0, 5.0.0 >Reporter: Tulasi P >Priority: Major > Fix For: 4.15.0, 5.1.0 > > Attachments: PHOENIX-4791.patch > > > Returning elements of an array from a table that is part of a join causes > array elements to be nullified. > {noformat} > create table array_test_1 (id integer not null primary key, arr tinyint[5]); > upsert into array_test_1 values (1001, array[0, 0, 0, 0, 0]); > upsert into array_test_1 values (1002, array[0, 0, 0, 0, 1]); > upsert into array_test_1 values (1003, array[0, 0, 0, 1, 1]); > upsert into array_test_1 values (1004, array[0, 0, 1, 1, 1]); > upsert into array_test_1 values (1005, array[1, 1, 1, 1, 1]); > {noformat} > > {noformat} > create table test_table_1 (id integer not null primary key, val varchar); > upsert into test_table_1 values (1001, 'abc'); > upsert into test_table_1 values (1002, 'def'); > upsert into test_table_1 values (1003, 'ghi');{noformat} > {noformat} > 0: jdbc:phoenix:localhost> select t1.id, t2.val, t1.arr[1], t1.arr[2], > t1.arr[3] from array_test_1 as t1 join test_table_1 as t2 on t1.id = t2.id; > ++-+-++---+ > | T1.ID | T2.VAL | ARRAY_ELEM(T1.ARR, 1) | ARRAY_ELEM(T1.ARR, 2) | > ARRAY_ELEM(T1.ARR, 3) | > ++-+-+-++ > | 1001 | abc | null | null | null | > | 1002 | def | null | null | null | > | 1003 | ghi | null | null | null | > ++-+++-+ > 3 rows selected (0.056 seconds) > {noformat} > However, directly selecting array elements from the array returns data > correctly. 
> {noformat}
> 0: jdbc:phoenix:localhost> select t1.id, t1.arr[1], t1.arr[2], t1.arr[3]
> from array_test_1 as t1;
> +-------+--------------------+--------------------+--------------------+
> |  ID   | ARRAY_ELEM(ARR, 1) | ARRAY_ELEM(ARR, 2) | ARRAY_ELEM(ARR, 3) |
> +-------+--------------------+--------------------+--------------------+
> | 1001  | 0                  | 0                  | 0                  |
> | 1002  | 0                  | 0                  | 0                  |
> | 1003  | 0                  | 0                  | 0                  |
> | 1004  | 0                  | 0                  | 1                  |
> | 1005  | 1                  | 1                  | 1                  |
> +-------+--------------------+--------------------+--------------------+
> 5 rows selected (0.044 seconds)
> {noformat}
Re: Null array elements with joins
Hi folks, I submitted PR https://github.com/apache/phoenix/pull/335 Appreciate your review. Thanks, Gerald On Mon, Aug 13, 2018 at 1:10 PM, James Taylor wrote: > I commented on the JIRA you filed here: PHOENIX-4791. Best to keep > discussion there. > Thanks, > James > > On Mon, Aug 13, 2018 at 11:08 AM, Gerald Sangudi > wrote: > >> Hello all, >> >> Any suggestions or pointers on the issue below? >> >> Projecting array elements works when not using joins, and does not work >> when we use hash joins. Is there an issue with the ProjectionCompiler for >> joins? I have not been able to isolate the specific cause, and would >> appreciate any pointers or suggestions. >> >> Thanks, >> Gerald >> >> On Tue, Jun 19, 2018 at 10:02 AM, Tulasi Paradarami < >> tulasi.krishn...@gmail.com> wrote: >> >>> Hi, >>> >>> I'm running few tests against Phoenix array and running into this bug >>> where array elements return null values when a join is involved. Is this a >>> known issue/limitation of arrays? >>> >>> create table array_test_1 (id integer not null primary key, arr >>> tinyint[5]); >>> upsert into array_test_1 values (1001, array[0, 0, 0, 0, 0]); >>> upsert into array_test_1 values (1002, array[0, 0, 0, 0, 1]); >>> upsert into array_test_1 values (1003, array[0, 0, 0, 1, 1]); >>> upsert into array_test_1 values (1004, array[0, 0, 1, 1, 1]); >>> upsert into array_test_1 values (1005, array[1, 1, 1, 1, 1]); >>> >>> create table test_table_1 (id integer not null primary key, val varchar); >>> upsert into test_table_1 values (1001, 'abc'); >>> upsert into test_table_1 values (1002, 'def'); >>> upsert into test_table_1 values (1003, 'ghi'); >>> >>> 0: jdbc:phoenix:localhost> select t1.id, t2.val, t1.arr[1], t1.arr[2], >>> t1.arr[3] from array_test_1 as t1 join test_table_1 as t2 on t1.id = >>> t2.id; >>> ++-++--- >>> -++ >>> | T1.ID | T2.VAL | ARRAY_ELEM(T1.ARR, 1) | ARRAY_ELEM(T1.ARR, 2) | >>> ARRAY_ELEM(T1.ARR, 3) | >>> ++-++--- >>> -++ >>> | 1001 | abc | null | null | >>> null | 
>>> | 1002  | def     | null                  | null                  | null                  |
>>> | 1003  | ghi     | null                  | null                  | null                  |
>>> +--------+---------+-----------------------+-----------------------+-----------------------+
>>> 3 rows selected (0.056 seconds)
>>>
>>> However, directly selecting array elements from the array returns data
>>> correctly.
>>> 0: jdbc:phoenix:localhost> select t1.id, t1.arr[1], t1.arr[2],
>>> t1.arr[3] from array_test_1 as t1;
>>> +-------+--------------------+--------------------+--------------------+
>>> |  ID   | ARRAY_ELEM(ARR, 1) | ARRAY_ELEM(ARR, 2) | ARRAY_ELEM(ARR, 3) |
>>> +-------+--------------------+--------------------+--------------------+
>>> | 1001  | 0                  | 0                  | 0                  |
>>> | 1002  | 0                  | 0                  | 0                  |
>>> | 1003  | 0                  | 0                  | 0                  |
>>> | 1004  | 0                  | 0                  | 1                  |
>>> | 1005  | 1                  | 1                  | 1                  |
>>> +-------+--------------------+--------------------+--------------------+
>>> 5 rows selected (0.044 seconds)
>>>
>>
>
Re: Null array elements with joins
Hello all, Any suggestions or pointers on the issue below? Projecting array elements works when not using joins, and does not work when we use hash joins. Is there an issue with the ProjectionCompiler for joins? I have not been able to isolate the specific cause, and would appreciate any pointers or suggestions. Thanks, Gerald On Tue, Jun 19, 2018 at 10:02 AM, Tulasi Paradarami < tulasi.krishn...@gmail.com> wrote: > Hi, > > I'm running few tests against Phoenix array and running into this bug > where array elements return null values when a join is involved. Is this a > known issue/limitation of arrays? > > create table array_test_1 (id integer not null primary key, arr > tinyint[5]); > upsert into array_test_1 values (1001, array[0, 0, 0, 0, 0]); > upsert into array_test_1 values (1002, array[0, 0, 0, 0, 1]); > upsert into array_test_1 values (1003, array[0, 0, 0, 1, 1]); > upsert into array_test_1 values (1004, array[0, 0, 1, 1, 1]); > upsert into array_test_1 values (1005, array[1, 1, 1, 1, 1]); > > create table test_table_1 (id integer not null primary key, val varchar); > upsert into test_table_1 values (1001, 'abc'); > upsert into test_table_1 values (1002, 'def'); > upsert into test_table_1 values (1003, 'ghi'); > > 0: jdbc:phoenix:localhost> select t1.id, t2.val, t1.arr[1], t1.arr[2], > t1.arr[3] from array_test_1 as t1 join test_table_1 as t2 on t1.id = t2.id > ; > ++-++--- > -++ > | T1.ID | T2.VAL | ARRAY_ELEM(T1.ARR, 1) | ARRAY_ELEM(T1.ARR, 2) | > ARRAY_ELEM(T1.ARR, 3) | > ++-++--- > -++ > | 1001 | abc | null | null | > null | > | 1002 | def | null | null | > null | > | 1003 | ghi | null | null | > null | > ++-++--- > -++ > 3 rows selected (0.056 seconds) > > However, directly selecting array elements from the array returns data > correctly. 
> 0: jdbc:phoenix:localhost> select t1.id, t1.arr[1], t1.arr[2], t1.arr[3]
> from array_test_1 as t1;
> +-------+--------------------+--------------------+--------------------+
> |  ID   | ARRAY_ELEM(ARR, 1) | ARRAY_ELEM(ARR, 2) | ARRAY_ELEM(ARR, 3) |
> +-------+--------------------+--------------------+--------------------+
> | 1001  | 0                  | 0                  | 0                  |
> | 1002  | 0                  | 0                  | 0                  |
> | 1003  | 0                  | 0                  | 0                  |
> | 1004  | 0                  | 0                  | 1                  |
> | 1005  | 1                  | 1                  | 1                  |
> +-------+--------------------+--------------------+--------------------+
> 5 rows selected (0.044 seconds)
>
[jira] [Updated] (PHOENIX-4751) Support client-side hash aggregation with SORT_MERGE_JOIN
[ https://issues.apache.org/jira/browse/PHOENIX-4751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gerald Sangudi updated PHOENIX-4751: Attachment: 0001-PHOENIX-4751-Implement-client-side-hash-aggre.master.patch > Support client-side hash aggregation with SORT_MERGE_JOIN > - > > Key: PHOENIX-4751 > URL: https://issues.apache.org/jira/browse/PHOENIX-4751 > Project: Phoenix > Issue Type: Improvement >Affects Versions: 4.14.0, 4.13.1 > Reporter: Gerald Sangudi > Assignee: Gerald Sangudi >Priority: Major > Attachments: > 0001-PHOENIX-4751-Add-HASH_AGGREGATE-hint.4.x-HBase-1.4.patch, > 0001-PHOENIX-4751-Implement-client-side-has.4.x-HBase-1.4.patch, > 0001-PHOENIX-4751-Implement-client-side-hash-aggre.master.patch, > 0002-PHOENIX-4751-Begin-implementation-of-c.4.x-HBase-1.4.patch, > 0003-PHOENIX-4751-Generated-aggregated-resu.4.x-HBase-1.4.patch, > 0004-PHOENIX-4751-Sort-results-of-client-ha.4.x-HBase-1.4.patch, > 0005-PHOENIX-4751-Add-integration-test-for-.4.x-HBase-1.4.patch, > 0006-PHOENIX-4751-Fix-and-run-integration-t.4.x-HBase-1.4.patch, > 0007-PHOENIX-4751-Add-integration-test-for-.4.x-HBase-1.4.patch, > 0008-PHOENIX-4751-Verify-EXPLAIN-plan-for-b.4.x-HBase-1.4.patch, > 0009-PHOENIX-4751-Standardize-null-checks-a.4.x-HBase-1.4.patch, > 0010-PHOENIX-4751-Abort-when-client-aggrega.4.x-HBase-1.4.patch, > 0011-PHOENIX-4751-Use-Phoenix-memory-mgmt-t.4.x-HBase-1.4.patch, > 0012-PHOENIX-4751-Remove-extra-memory-limit.4.x-HBase-1.4.patch, > 0013-PHOENIX-4751-Sort-only-when-necessary.4.x-HBase-1.4.patch, > 0014-PHOENIX-4751-Sort-only-when-necessary-.4.x-HBase-1.4.patch, > 0015-PHOENIX-4751-Show-client-hash-aggregat.4.x-HBase-1.4.patch, > 0016-PHOENIX-4751-Handle-reverse-sort-add-c.4.x-HBase-1.4.patch > > > A GROUP BY that follows a SORT_MERGE_JOIN should be able to use hash > aggregation in some cases, for improved performance. > When a GROUP BY follows a SORT_MERGE_JOIN, the GROUP BY does not use hash > aggregation. 
> It instead performs a CLIENT SORT followed by a CLIENT AGGREGATE. The
> performance can be improved if (a) the GROUP BY output does not need to
> be sorted, and (b) the GROUP BY input is large enough and has low
> cardinality.
> The hash aggregation can initially be a hint. Here is an example from
> Phoenix 4.13.1 that would benefit from hash aggregation if the GROUP BY
> input is large with low cardinality.
> CREATE TABLE unsalted (
>   keyA BIGINT NOT NULL,
>   keyB BIGINT NOT NULL,
>   val SMALLINT,
>   CONSTRAINT pk PRIMARY KEY (keyA, keyB)
> );
> EXPLAIN
> SELECT /*+ USE_SORT_MERGE_JOIN */
>   t1.val v1, t2.val v2, COUNT(*) c
> FROM unsalted t1 JOIN unsalted t2
>   ON (t1.keyA = t2.keyA)
> GROUP BY t1.val, t2.val;
>
> PLAN (EST_BYTES_READ and EST_ROWS_READ are null):
> SORT-MERGE-JOIN (INNER) TABLES
>     CLIENT 1-CHUNK PARALLEL 1-WAY FULL SCAN OVER UNSALTED
> AND
>     CLIENT 1-CHUNK PARALLEL 1-WAY FULL SCAN OVER UNSALTED
> CLIENT SORTED BY [TO_DECIMAL(T1.VAL), T2.VAL]
> CLIENT AGGREGATE INTO DISTINCT ROWS BY [T1.VAL, T2.VAL]
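The trade-off described above can be sketched in Python. This is illustrative only: Phoenix's client aggregation operates on tuples through its own memory management, not a plain dict, and the function names here are hypothetical.

```python
# Two client-side GROUP BY strategies over rows streaming from a
# sort-merge join. Hash aggregation groups in one pass with a dict and
# needs no pre-sort; the current CLIENT SORT + CLIENT AGGREGATE path
# must first sort all rows by the group-by key.
from collections import defaultdict
from itertools import groupby

rows = [(1, 2), (3, 4), (1, 2), (3, 1), (1, 5)]  # (t1.val, t2.val) pairs

def hash_aggregate(rows):
    # One pass, no sort; wins when the input is large with low cardinality.
    counts = defaultdict(int)
    for key in rows:
        counts[key] += 1
    return dict(counts)

def sort_aggregate(rows):
    # Current behavior: sort first, then aggregate consecutive runs.
    ordered = sorted(rows)
    return {key: sum(1 for _ in run) for key, run in groupby(ordered)}
```

Both produce identical GROUP BY t1.val, t2.val counts; the difference is that sort_aggregate pays an O(n log n) sort even when the output order is never needed.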
[jira] [Updated] (PHOENIX-4751) Support client-side hash aggregation with SORT_MERGE_JOIN
[ https://issues.apache.org/jira/browse/PHOENIX-4751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gerald Sangudi updated PHOENIX-4751: Attachment: 0001-PHOENIX-4751-Implement-client-side-has.4.x-HBase-1.4.patch > Support client-side hash aggregation with SORT_MERGE_JOIN > - > > Key: PHOENIX-4751 > URL: https://issues.apache.org/jira/browse/PHOENIX-4751 > Project: Phoenix > Issue Type: Improvement >Affects Versions: 4.14.0, 4.13.1 > Reporter: Gerald Sangudi > Assignee: Gerald Sangudi >Priority: Major > Attachments: > 0001-PHOENIX-4751-Add-HASH_AGGREGATE-hint.4.x-HBase-1.4.patch, > 0001-PHOENIX-4751-Implement-client-side-has.4.x-HBase-1.4.patch, > 0002-PHOENIX-4751-Begin-implementation-of-c.4.x-HBase-1.4.patch, > 0003-PHOENIX-4751-Generated-aggregated-resu.4.x-HBase-1.4.patch, > 0004-PHOENIX-4751-Sort-results-of-client-ha.4.x-HBase-1.4.patch, > 0005-PHOENIX-4751-Add-integration-test-for-.4.x-HBase-1.4.patch, > 0006-PHOENIX-4751-Fix-and-run-integration-t.4.x-HBase-1.4.patch, > 0007-PHOENIX-4751-Add-integration-test-for-.4.x-HBase-1.4.patch, > 0008-PHOENIX-4751-Verify-EXPLAIN-plan-for-b.4.x-HBase-1.4.patch, > 0009-PHOENIX-4751-Standardize-null-checks-a.4.x-HBase-1.4.patch, > 0010-PHOENIX-4751-Abort-when-client-aggrega.4.x-HBase-1.4.patch, > 0011-PHOENIX-4751-Use-Phoenix-memory-mgmt-t.4.x-HBase-1.4.patch, > 0012-PHOENIX-4751-Remove-extra-memory-limit.4.x-HBase-1.4.patch, > 0013-PHOENIX-4751-Sort-only-when-necessary.4.x-HBase-1.4.patch, > 0014-PHOENIX-4751-Sort-only-when-necessary-.4.x-HBase-1.4.patch, > 0015-PHOENIX-4751-Show-client-hash-aggregat.4.x-HBase-1.4.patch, > 0016-PHOENIX-4751-Handle-reverse-sort-add-c.4.x-HBase-1.4.patch > > > A GROUP BY that follows a SORT_MERGE_JOIN should be able to use hash > aggregation in some cases, for improved performance. > When a GROUP BY follows a SORT_MERGE_JOIN, the GROUP BY does not use hash > aggregation. It instead performs a CLIENT SORT followed by a CLIENT > AGGREGATE. 
> The performance can be improved if (a) the GROUP BY output does not need
> to be sorted, and (b) the GROUP BY input is large enough and has low
> cardinality.
> The hash aggregation can initially be a hint. Here is an example from
> Phoenix 4.13.1 that would benefit from hash aggregation if the GROUP BY
> input is large with low cardinality.
> CREATE TABLE unsalted (
>   keyA BIGINT NOT NULL,
>   keyB BIGINT NOT NULL,
>   val SMALLINT,
>   CONSTRAINT pk PRIMARY KEY (keyA, keyB)
> );
> EXPLAIN
> SELECT /*+ USE_SORT_MERGE_JOIN */
>   t1.val v1, t2.val v2, COUNT(*) c
> FROM unsalted t1 JOIN unsalted t2
>   ON (t1.keyA = t2.keyA)
> GROUP BY t1.val, t2.val;
>
> PLAN (EST_BYTES_READ and EST_ROWS_READ are null):
> SORT-MERGE-JOIN (INNER) TABLES
>     CLIENT 1-CHUNK PARALLEL 1-WAY FULL SCAN OVER UNSALTED
> AND
>     CLIENT 1-CHUNK PARALLEL 1-WAY FULL SCAN OVER UNSALTED
> CLIENT SORTED BY [TO_DECIMAL(T1.VAL), T2.VAL]
> CLIENT AGGREGATE INTO DISTINCT ROWS BY [T1.VAL, T2.VAL]
[jira] [Updated] (PHOENIX-4751) Support client-side hash aggregation with SORT_MERGE_JOIN
[ https://issues.apache.org/jira/browse/PHOENIX-4751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gerald Sangudi updated PHOENIX-4751: Attachment: 0016-PHOENIX-4751-Handle-reverse-sort-add-c.4.x-HBase-1.4.patch

> Support client-side hash aggregation with SORT_MERGE_JOIN
> ---------------------------------------------------------
>
> Key: PHOENIX-4751
> URL: https://issues.apache.org/jira/browse/PHOENIX-4751
> Project: Phoenix
> Issue Type: Improvement
> Affects Versions: 4.14.0, 4.13.1
> Reporter: Gerald Sangudi
> Assignee: Gerald Sangudi
> Priority: Major
> Attachments: 0001-PHOENIX-4751-Add-HASH_AGGREGATE-hint.4.x-HBase-1.4.patch,
> 0002-PHOENIX-4751-Begin-implementation-of-c.4.x-HBase-1.4.patch,
> 0003-PHOENIX-4751-Generated-aggregated-resu.4.x-HBase-1.4.patch,
> 0004-PHOENIX-4751-Sort-results-of-client-ha.4.x-HBase-1.4.patch,
> 0005-PHOENIX-4751-Add-integration-test-for-.4.x-HBase-1.4.patch,
> 0006-PHOENIX-4751-Fix-and-run-integration-t.4.x-HBase-1.4.patch,
> 0007-PHOENIX-4751-Add-integration-test-for-.4.x-HBase-1.4.patch,
> 0008-PHOENIX-4751-Verify-EXPLAIN-plan-for-b.4.x-HBase-1.4.patch,
> 0009-PHOENIX-4751-Standardize-null-checks-a.4.x-HBase-1.4.patch,
> 0010-PHOENIX-4751-Abort-when-client-aggrega.4.x-HBase-1.4.patch,
> 0011-PHOENIX-4751-Use-Phoenix-memory-mgmt-t.4.x-HBase-1.4.patch,
> 0012-PHOENIX-4751-Remove-extra-memory-limit.4.x-HBase-1.4.patch,
> 0013-PHOENIX-4751-Sort-only-when-necessary.4.x-HBase-1.4.patch,
> 0014-PHOENIX-4751-Sort-only-when-necessary-.4.x-HBase-1.4.patch,
> 0015-PHOENIX-4751-Show-client-hash-aggregat.4.x-HBase-1.4.patch,
> 0016-PHOENIX-4751-Handle-reverse-sort-add-c.4.x-HBase-1.4.patch
>
> A GROUP BY that follows a SORT_MERGE_JOIN should be able to use hash aggregation in some cases, for improved performance.
> When a GROUP BY follows a SORT_MERGE_JOIN, the GROUP BY does not use hash aggregation. It instead performs a CLIENT SORT followed by a CLIENT AGGREGATE. The performance can be improved if (a) the GROUP BY output does not need to be sorted, and (b) the GROUP BY input is large enough and has low cardinality.
> The hash aggregation can initially be a hint.
[jira] [Updated] (PHOENIX-4751) Support client-side hash aggregation with SORT_MERGE_JOIN
[ https://issues.apache.org/jira/browse/PHOENIX-4751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gerald Sangudi updated PHOENIX-4751: Attachment:
0015-PHOENIX-4751-Show-client-hash-aggregat.4.x-HBase-1.4.patch
0014-PHOENIX-4751-Sort-only-when-necessary-.4.x-HBase-1.4.patch
0013-PHOENIX-4751-Sort-only-when-necessary.4.x-HBase-1.4.patch
0012-PHOENIX-4751-Remove-extra-memory-limit.4.x-HBase-1.4.patch
0011-PHOENIX-4751-Use-Phoenix-memory-mgmt-t.4.x-HBase-1.4.patch
0010-PHOENIX-4751-Abort-when-client-aggrega.4.x-HBase-1.4.patch
0009-PHOENIX-4751-Standardize-null-checks-a.4.x-HBase-1.4.patch
0008-PHOENIX-4751-Verify-EXPLAIN-plan-for-b.4.x-HBase-1.4.patch
0007-PHOENIX-4751-Add-integration-test-for-.4.x-HBase-1.4.patch
0006-PHOENIX-4751-Fix-and-run-integration-t.4.x-HBase-1.4.patch
0005-PHOENIX-4751-Add-integration-test-for-.4.x-HBase-1.4.patch
0004-PHOENIX-4751-Sort-results-of-client-ha.4.x-HBase-1.4.patch
0003-PHOENIX-4751-Generated-aggregated-resu.4.x-HBase-1.4.patch
0002-PHOENIX-4751-Begin-implementation-of-c.4.x-HBase-1.4.patch
0001-PHOENIX-4751-Add-HASH_AGGREGATE-hint.4.x-HBase-1.4.patch
[jira] [Updated] (PHOENIX-4751) Support client-side hash aggregation with SORT_MERGE_JOIN
[ https://issues.apache.org/jira/browse/PHOENIX-4751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gerald Sangudi updated PHOENIX-4751: Attachment: (was: PHOENIX-4751-4.x-HBase-1.4.patch)
[jira] [Updated] (PHOENIX-4751) Support client-side hash aggregation with SORT_MERGE_JOIN
[ https://issues.apache.org/jira/browse/PHOENIX-4751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gerald Sangudi updated PHOENIX-4751: Attachment: (was: PHOENIX-4751.patch)
[jira] [Updated] (PHOENIX-4751) Support client-side hash aggregation with SORT_MERGE_JOIN
[ https://issues.apache.org/jira/browse/PHOENIX-4751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gerald Sangudi updated PHOENIX-4751: Attachment: PHOENIX-4751-4.x-HBase-1.4.patch
[jira] [Updated] (PHOENIX-4751) Support client-side hash aggregation with SORT_MERGE_JOIN
[ https://issues.apache.org/jira/browse/PHOENIX-4751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gerald Sangudi updated PHOENIX-4751: Attachment: PHOENIX-4751.patch
[jira] [Commented] (PHOENIX-4751) Support client-side hash aggregation with SORT_MERGE_JOIN
[ https://issues.apache.org/jira/browse/PHOENIX-4751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16537785#comment-16537785 ] Gerald Sangudi commented on PHOENIX-4751: - [~sergey.soldatov] I removed those unnecessary commits using (git rebase -i + git push --force). Let me know if I need to do anything else. Thanks.
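The cleanup workflow mentioned in this comment can be sketched as follows. The repository, file names, and commit messages are made up for illustration, and `git rebase --onto` is used as a non-interactive stand-in for deleting a line from the `git rebase -i` todo list.

```shell
set -e

# Set up a throwaway repo with three commits, the middle one unwanted.
tmp=$(mktemp -d)
cd "$tmp"
git init -q
git config user.email "dev@example.com"
git config user.name "dev"

echo base  > base.txt;  git add base.txt;  git commit -qm "base"
echo noise > noise.txt; git add noise.txt; git commit -qm "unwanted upstream commit"
echo patch > patch.txt; git add patch.txt; git commit -qm "PHOENIX-4751 patch work"

# Replay the commits after HEAD~1 onto HEAD~2, skipping HEAD~1 entirely --
# the same effect as dropping that commit in `git rebase -i`.
git rebase -q --onto HEAD~2 HEAD~1

# Only "base" and "PHOENIX-4751 patch work" remain in history.
git rev-list --count HEAD

# History was rewritten, so the remote branch must be force-updated:
# git push --force origin <branch>   (no remote is configured in this sketch)
```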
[jira] [Commented] (PHOENIX-4751) Support client-side hash aggregation with SORT_MERGE_JOIN
[ https://issues.apache.org/jira/browse/PHOENIX-4751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16537722#comment-16537722 ] Gerald Sangudi commented on PHOENIX-4751: - Hi [~sergey.soldatov], how do I do that? Is there a link? I'll also check the docs and Google. Those commits came in because I wanted to keep the patch in sync with upstream.
[jira] [Comment Edited] (PHOENIX-4751) Support client-side hash aggregation with SORT_MERGE_JOIN
[ https://issues.apache.org/jira/browse/PHOENIX-4751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16530425#comment-16530425 ] Gerald Sangudi edited comment on PHOENIX-4751 at 7/2/18 8:51 PM: - Ready for review. Pull request https://github.com/apache/phoenix/pull/308 The hashing is done in memory, instead of using SpillableGroupByCache. This is consistent with the current client-side in-memory sort, and avoids any dependencies on SpillableGroupByCache. was (Author: sangudi): Ready for review. Pull request https://github.com/apache/phoenix/pull/308
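The in-memory approach described in this comment can be sketched roughly as follows. This is illustrative Python, not Phoenix's Java; `HashAggregationError` and the distinct-key budget are made-up stand-ins for Phoenix's memory-manager accounting, and the abort-instead-of-spill behavior reflects the "Abort-when-client-aggrega" patch above.

```python
from collections import defaultdict

class HashAggregationError(Exception):
    """Raised when the aggregation exceeds its memory budget."""

def hash_aggregate_bounded(keys, max_distinct_keys):
    """Aggregate COUNT(*) per key entirely in client memory.

    Unlike a spillable cache, this never writes to disk: if the number of
    distinct groups exceeds the budget, the query aborts, so the approach
    pays off only when the GROUP BY input has low cardinality.
    """
    counts = defaultdict(int)
    for key in keys:
        counts[key] += 1
        if len(counts) > max_distinct_keys:
            raise HashAggregationError("group-by cardinality exceeds budget")
    return dict(counts)
```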
[jira] [Commented] (PHOENIX-4751) Support client-side hash aggregation with SORT_MERGE_JOIN
[ https://issues.apache.org/jira/browse/PHOENIX-4751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16530425#comment-16530425 ] Gerald Sangudi commented on PHOENIX-4751: - Ready for review. Pull request https://github.com/apache/phoenix/pull/308
[jira] [Updated] (PHOENIX-4751) Support client-side hash aggregation with SORT_MERGE_JOIN
[ https://issues.apache.org/jira/browse/PHOENIX-4751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gerald Sangudi updated PHOENIX-4751: Description: A GROUP BY that follows a SORT_MERGE_JOIN should be able to use hash aggregation in some cases, for improved performance. When a GROUP BY follows a SORT_MERGE_JOIN, the GROUP BY does not use hash aggregation. It instead performs a CLIENT SORT followed by a CLIENT AGGREGATE. The performance can be improved if (a) the GROUP BY output does not need to be sorted, and (b) the GROUP BY input is large enough and has low cardinality. The hash aggregation can initially be a hint.
[jira] [Commented] (PHOENIX-4751) Support client-side hash aggregation with SORT_MERGE_JOIN
[ https://issues.apache.org/jira/browse/PHOENIX-4751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16488387#comment-16488387 ] Gerald Sangudi commented on PHOENIX-4751: - [~jamestaylor], my understanding is that SORT-MERGE-JOIN takes place on the client. In the EXPLAIN plan, I see CLIENT SCANs as part of the SORT-MERGE-JOIN. If that is the case, the results of the SORT-MERGE-JOIN might be in client-side memory before performing the GROUP BY. (1) Depending on the data size, would the client be able to aggregate these faster than writing them back to a temp table on the region servers for aggregation? (2) Is SpillableGroupByCache currently used anywhere, e.g. in server-side hash aggregation?
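For illustration of why the join results are already on the client: a sort-merge join consumes two inputs that are each sorted on the join key and finds matches in one coordinated pass, without building a hash table. This is a minimal Python sketch, not Phoenix's implementation.

```python
def sort_merge_join(left, right):
    """Inner-join two lists of (key, value) pairs, each sorted by key."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1            # left key too small: advance left
        elif lk > rk:
            j += 1            # right key too small: advance right
        else:
            # Keys match: emit this left row against the whole run of
            # right rows sharing the key, then advance the left side.
            j0 = j
            while j0 < len(right) and right[j0][0] == lk:
                out.append((lk, left[i][1], right[j0][1]))
                j0 += 1
            i += 1
    return out
```

The merged output stays in client memory, which is what makes a follow-on client-side hash aggregation (rather than a round trip to the region servers) attractive when the grouped cardinality is low.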
[jira] [Updated] (PHOENIX-4751) Support client-side hash aggregation with SORT_MERGE_JOIN
[ https://issues.apache.org/jira/browse/PHOENIX-4751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gerald Sangudi updated PHOENIX-4751: Description: A GROUP BY that follows a SORT_MERGE_JOIN should be able to use hash aggregation in some cases, for improved performance. When a GROUP BY follows a SORT_MERGE_JOIN, the GROUP BY does not use hash aggregation. It instead performs a CLIENT SORT followed by a CLIENT AGGREGATE. The performance can be improved if (a) the GROUP BY output does not need to be sorted, and (b) the GROUP BY input is large enough and has low cardinality. The hash aggregation can initially be a hint. Here is an example from Phoenix 4.13.1 that would benefit from hash aggregation if the GROUP BY input is large with low cardinality. CREATE TABLE unsalted ( keyA BIGINT NOT NULL, keyB BIGINT NOT NULL, val SMALLINT, CONSTRAINT pk PRIMARY KEY (keyA, keyB) ); EXPLAIN SELECT /*+ USE_SORT_MERGE_JOIN */ t1.val v1, t2.val v2, COUNT(*) c FROM unsalted t1 JOIN unsalted t2 ON (t1.keyA = t2.keyA) GROUP BY t1.val, t2.val; +-+++--+ |PLAN|EST_BYTES_READ|EST_ROWS_READ| | +-+++--+ |SORT-MERGE-JOIN (INNER) TABLES|null|null| | | CLIENT 1-CHUNK PARALLEL 1-WAY FULL SCAN OVER UNSALTED|null|null| | |AND|null|null| | | CLIENT 1-CHUNK PARALLEL 1-WAY FULL SCAN OVER UNSALTED|null|null| | |CLIENT SORTED BY [TO_DECIMAL(T1.VAL), T2.VAL]|null|null| | |CLIENT AGGREGATE INTO DISTINCT ROWS BY [T1.VAL, T2.VAL]|null|null| | +-+++--+ was: A GROUP BY that follows a SORT_MERGE_JOIN should be able to use hash aggregation in some cases, for improved performance. When a GROUP BY follows a SORT_MERGE_JOIN, the GROUP BY does not use hash aggregation. It instead performs a CLIENT SORT followed by a CLIENT AGGREGATE. The performance can be improved if (a) the GROUP BY output does not need to be sorted, and (b) the GROUP BY input is large enough and has low cardinality. The hash aggregation can initially be a hint. 
> Support client-side hash aggregation with SORT_MERGE_JOIN
> ---------------------------------------------------------
>
>                 Key: PHOENIX-4751
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-4751
>             Project: Phoenix
>          Issue Type: Improvement
>    Affects Versions: 4.14.0, 4.13.1
>            Reporter: Gerald Sangudi
>            Priority: Major
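The tradeoff the issue describes can be illustrated outside Phoenix: the current plan's CLIENT SORTED BY + CLIENT AGGREGATE INTO DISTINCT ROWS must order the entire join output before grouping, while hash aggregation keeps one accumulator per distinct grouping key and never sorts, which wins when the input is large and cardinality is low. A minimal standalone Python sketch of the two strategies (this is not Phoenix code; the function names are invented for illustration):

```python
from collections import defaultdict

def sort_aggregate(rows):
    """Sort-based GROUP BY ... COUNT(*): order the full input by the
    grouping key, then emit a result row each time the key changes.
    This mirrors CLIENT SORTED BY followed by CLIENT AGGREGATE."""
    out = []
    current, count = None, 0
    for key in sorted(rows):          # O(n log n) over the whole input
        if key != current:
            if current is not None:
                out.append((current, count))
            current, count = key, 0
        count += 1
    if current is not None:
        out.append((current, count))
    return out

def hash_aggregate(rows):
    """Hash-based GROUP BY ... COUNT(*): a single pass with one
    accumulator per distinct key. Output order is arbitrary, so this
    only applies when the GROUP BY output need not be sorted."""
    counts = defaultdict(int)
    for key in rows:                  # O(n) time, O(distinct keys) memory
        counts[key] += 1
    return list(counts.items())

# Low-cardinality input like the example: many rows, few (v1, v2) keys.
rows = [((i % 3), (i % 3) * 10) for i in range(100_000)]
assert sorted(hash_aggregate(rows)) == sort_aggregate(rows)
```

With only a handful of distinct (t1.val, t2.val) pairs, the hash table stays tiny regardless of input size, which is exactly the case (b) in the description where hash aggregation pays off.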
[jira] [Updated] (PHOENIX-4751) Support client-side hash aggregation with SORT_MERGE_JOIN
[ https://issues.apache.org/jira/browse/PHOENIX-4751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gerald Sangudi updated PHOENIX-4751: Summary: Support client-side hash aggregation with SORT_MERGE_JOIN (was: Support hash aggregation with SORT_MERGE_JOIN)
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (PHOENIX-4751) Support hash aggregation with SORT_MERGE_JOIN
[ https://issues.apache.org/jira/browse/PHOENIX-4751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gerald Sangudi updated PHOENIX-4751: Description: (formatting cleanup only; same description text and example as above)
[jira] [Created] (PHOENIX-4751) Support hash aggregation with SORT_MERGE_JOIN
Gerald Sangudi created PHOENIX-4751:
---
            Summary: Support hash aggregation with SORT_MERGE_JOIN
                Key: PHOENIX-4751
                URL: https://issues.apache.org/jira/browse/PHOENIX-4751
            Project: Phoenix
         Issue Type: Improvement
   Affects Versions: 4.14.0, 4.13.1
           Reporter: Gerald Sangudi

A GROUP BY that follows a SORT_MERGE_JOIN should be able to use hash aggregation in some cases, for improved performance. When a GROUP BY follows a SORT_MERGE_JOIN, the GROUP BY does not use hash aggregation. It instead performs a CLIENT SORT followed by a CLIENT AGGREGATE. The performance can be improved if (a) the GROUP BY output does not need to be sorted, and (b) the GROUP BY input is large enough and has low cardinality. The hash aggregation can initially be a hint.