Looks like the second query is sorting the entire PEOPLE table (though it seems like that shouldn’t be necessary as it’s probably already sorted by PEOPLE_ID) while the first one is sorting only part of MY_TABLE (which is likely less data). Might be a bug as the queries look the same.
Please log a JIRA and thanks for all the details. On Wed, Dec 27, 2017 at 3:44 PM Flavio Pompermaier <pomperma...@okkam.it> wrote: > Ok. So why the 2nd query requires more memory than the first one > (nonetheless USE_SORT_MERGE_JOIN is used) and can't complete? > > > On 28 Dec 2017 00:33, "James Taylor" <jamestay...@apache.org> wrote: > > A hash join (the default) will be faster but the tables being cached (last > or RHS table being joined) must be small enough to fit into memory on the > region server. If it's too big, you can use the USE_SORT_MERGE_JOIN which > would not have this restriction. > > On Wed, Dec 27, 2017 at 3:16 PM, Flavio Pompermaier <pomperma...@okkam.it> > wrote: > >> Just to summarize things...is the best approach, in terms of required >> memory, for Apache Phoenix queries to use sort merge join? Should inner >> queries be avoided? >> >> >> On 22 Dec 2017 22:47, "Flavio Pompermaier" <pomperma...@okkam.it> wrote: >> >> MYTABLE is definitely much bigger than PEOPLE table, in terms of >> cardinality. In terms of cells (rows x columns) PEOPLE is probably bigger >> >> On 22 Dec 2017 22:36, "Ethan" <ew...@apache.org> wrote: >> >>> I see. I think client side probably hold on to the iterators from the >>> both sides and crawling forward to do the merge sort. in this case should >>> be no much memory footprint either way where the filter is performed. >>> >>> On December 22, 2017 at 1:04:18 PM, James Taylor (jamestay...@apache.org) >>> wrote: >>> >>> There’s no shipping of any tables with a sort merge join. >>> >>> On Fri, Dec 22, 2017 at 1:02 PM Ethan Wang <aerto...@gmail.com> wrote: >>> >>>> I see. Looks like it's possible the rhs (MYTABLE) is too big to ship >>>> around without get filtered first. Just for experiment, if you took out >>>> hint USE_SORT_MERGE_JOIN, what will be the plan? >>>> >>>> >>>> On December 22, 2017 at 12:46:25 PM, James Taylor ( >>>> jamestay...@apache.org) wrote: >>>> >>>> For sort merge join, both post-filtered table results are sorted on the >>>> server side and then a merge sort is done on the client-side. >>>> >>>> On Fri, Dec 22, 2017 at 12:44 PM, Ethan <ew...@apache.org> wrote: >>>> >>>>> Hello Flavio, >>>>> >>>>> From the plan looks like to me the second query is doing the filter at >>>>> parent table (PEOPLE). So what is the size of your PEOPLE and MYTABLE >>>>> (after filtered) respectively? >>>>> >>>>> For sort merge join, anyone knows are the both sides get shipped to >>>>> client to do the merge sort? >>>>> >>>>> Thanks, >>>>> >>>>> >>>>> On December 22, 2017 at 9:58:30 AM, Flavio Pompermaier ( >>>>> pomperma...@okkam.it) wrote: >>>>> >>>>> Any help here...? >>>>> >>>>> On 20 Dec 2017 17:58, "Flavio Pompermaier" <pomperma...@okkam.it> >>>>> wrote: >>>>> >>>>>> Hi to all, >>>>>> I'm trying to find the best query for my use case but I found that >>>>>> one version work and the other one does not (unless that I don't apply >>>>>> some >>>>>> tuning to timeouts etc like explained in [1]). >>>>>> >>>>>> The 2 queries extract the same data but, while the first query >>>>>> terminates the second does not. >>>>>> *PS*: without the USE_SORT_MERGE_JOIN both queries weren't working >>>>>> >>>>>> SELECT /*+ USE_SORT_MERGE_JOIN */ COUNT(*) >>>>>> FROM PEOPLE ds JOIN MYTABLE l ON ds.PERSON_ID = l.LOCALID >>>>>> WHERE l.EID IS NULL AND l.DSID = 'PEOPLE' AND l.HAS_CANDIDATES = >>>>>> FALSE; >>>>>> >>>>>> +---------------------------------------------------------------------------------------------------------------+-----------------+----------------+----------------+ >>>>>> | PLAN >>>>>> | EST_BYTES_READ | >>>>>> EST_ROWS_READ | EST_INFO_TS | >>>>>> >>>>>> +---------------------------------------------------------------------------------------------------------------+-----------------+----------------+----------------+ >>>>>> | SORT-MERGE-JOIN (INNER) TABLES >>>>>> | 14155777900 >>>>>> <(415)%20577-7900> | 12077867 | 1513754378759 | >>>>>> | CLIENT 42-CHUNK 6168903 ROWS 11324622221 BYTES PARALLEL 3-WAY >>>>>> FULL SCAN OVER PEOPLE | 14155777900 >>>>>> <(415)%20577-7900> | 12077867 | 1513754378759 | >>>>>> | SERVER FILTER BY FIRST KEY ONLY >>>>>> | 14155777900 >>>>>> <(415)%20577-7900> | 12077867 | 1513754378759 | >>>>>> | CLIENT MERGE SORT >>>>>> | 14155777900 >>>>>> <(415)%20577-7900> | 12077867 | 1513754378759 | >>>>>> | AND (SKIP MERGE) >>>>>> | 14155777900 >>>>>> <(415)%20577-7900> | 12077867 | 1513754378759 | >>>>>> | CLIENT 15-CHUNK 5908964 ROWS 2831155679 BYTES PARALLEL 15-WAY >>>>>> RANGE SCAN OVER MYTABLE [0] - [2] | 14155777900 <(415)%20577-7900> >>>>>> | 12077867 | 1513754378759 | >>>>>> | SERVER FILTER BY (EID IS NULL AND DSID = 'PEOPLE' AND >>>>>> HAS_CANDIDATES = false) | 14155777900 >>>>>> <(415)%20577-7900> | 12077867 | 1513754378759 | >>>>>> | SERVER SORTED BY [L.LOCALID] >>>>>> | 14155777900 >>>>>> <(415)%20577-7900> | 12077867 | 1513754378759 | >>>>>> | CLIENT MERGE SORT >>>>>> | 14155777900 >>>>>> <(415)%20577-7900> | 12077867 | 1513754378759 | >>>>>> | CLIENT AGGREGATE INTO SINGLE ROW >>>>>> | 14155777900 >>>>>> <(415)%20577-7900> | 12077867 | 1513754378759 | >>>>>> >>>>>> +---------------------------------------------------------------------------------------------------------------+-----------------+----------------+----------------+ >>>>>> >>>>>> >>>>>> SELECT /*+ USE_SORT_MERGE_JOIN */ COUNT(*) >>>>>> FROM (SELECT LOCALID FROM MYTABLE >>>>>> WHERE EID IS NULL AND DSID = 'PEOPLE' AND HAS_CANDIDATES = FALSE) l >>>>>> JOIN PEOPLE ds ON ds.PERSON_ID = l.LOCALID; >>>>>> >>>>>> +--------------------------------------------------------------------------------------------------------------+-----------------+----------------+----------------+ >>>>>> | PLAN >>>>>> | EST_BYTES_READ | >>>>>> EST_ROWS_READ >>>>>> | EST_INFO_TS | >>>>>> >>>>>> +--------------------------------------------------------------------------------------------------------------+-----------------+----------------+----------------+ >>>>>> | SORT-MERGE-JOIN (INNER) TABLES >>>>>> | 14155777900 >>>>>> <(415)%20577-7900> | 12077867 | 1513754378759 | >>>>>> | CLIENT 15-CHUNK 5908964 ROWS 2831155679 BYTES PARALLEL 3-WAY >>>>>> RANGE SCAN OVER MYTABLE [0] - [2] | 14155777900 <(415)%20577-7900> >>>>>> | 12077867 | 1513754378759 | >>>>>> | SERVER FILTER BY (EID IS NULL AND DSID = 'PEOPLE' AND >>>>>> HAS_CANDIDATES = false) | 14155777900 >>>>>> <(415)%20577-7900> | 12077867 | 1513754378759 | >>>>>> | CLIENT MERGE SORT >>>>>> | 14155777900 >>>>>> <(415)%20577-7900> | 12077867 | 1513754378759 | >>>>>> | AND (SKIP MERGE) >>>>>> | 14155777900 >>>>>> <(415)%20577-7900> | 12077867 | 1513754378759 | >>>>>> | CLIENT 42-CHUNK 6168903 ROWS 11324622221 BYTES PARALLEL 42-WAY >>>>>> FULL SCAN OVER PEOPLE | 14155777900 <(415)%20577-7900> >>>>>> | 12077867 | 1513754378759 | >>>>>> | SERVER FILTER BY FIRST KEY ONLY >>>>>> | 14155777900 >>>>>> <(415)%20577-7900> | 12077867 | 1513754378759 | >>>>>> | SERVER SORTED BY [DS.PERSON_ID] >>>>>> | 14155777900 >>>>>> <(415)%20577-7900> | 12077867 | 1513754378759 | >>>>>> | CLIENT MERGE SORT >>>>>> | 14155777900 >>>>>> <(415)%20577-7900> | 12077867 | 1513754378759 | >>>>>> | CLIENT AGGREGATE INTO SINGLE ROW >>>>>> | 14155777900 >>>>>> <(415)%20577-7900> | 12077867 | 1513754378759 | >>>>>> >>>>>> +--------------------------------------------------------------------------------------------------------------+-----------------+----------------+----------------+ >>>>>> 10 rows selected (0.041 seconds) >>>>>> >>>>>> What do you think? Whould I try to give more resources to >>>>>> HBase/Phoenix or is the first query the best (and reliable) one? >>>>>> Any insight about this is highly appreciated.. >>>>>> >>>>>> Best, >>>>>> Flavio >>>>>> >>>>> >>>> >> > >