Re: SELECT + ORDER BY vs self-join
Sorry, I forgot to include the DDL for the index: CREATE LOCAL INDEX ix_0 ON traces (UPPER(number)) INCLUDE (time, + some other columns used for filtering). I'm going to file a JIRA with the details, but local indexes are being used in both cases. My suspicion is that the first query will attempt to load all 30M rows of data from HBase and then sort it, and the second will only sort 30M (rowkey, time) pairs. On Tue, Oct 31, 2017 at 8:02 AM, Sergey Soldatovwrote: > I agree with James that this happens because the index was not involved > because it doesn't cover all columns. I believe that in the second case, the > RHT is using the index to create a list of rowkeys and they are used for > point lookups by skipscan. > > bq. When is using the self-join a worse choice than the simple select? > > Hash join has it's own limitations: > 1. RHT is supposed to be small, so it's better to keep LIMIT small (far > less than 30 mil). > 2. Client is always involved to collect data from RHT, build the hash join > cache and send it to all RSes. > > > bq. Is there a better way to construct this query? > > Using local index may help in this case. > > Thanks, > Sergey > > > On Mon, Oct 30, 2017 at 11:26 PM, James Taylor > wrote: >> >> Please file a JIRA and include the explain plan for each of the queries. I >> suspect your index is not being used in the first query due to the selection >> of all the columns. You can try hinting the query to force your index to be >> used. See https://phoenix.apache.org/secondary_indexing.html#Index_Usage >> >> Thanks, >> James >> >> On Mon, Oct 30, 2017 at 7:02 AM, Marcin Januszkiewicz >> wrote: >>> >>> We have a wide table with 100M records created with the following DDL: >>> >>> CREATE TABLE traces ( >>> rowkey VARCHAR PRIMARY KEY, >>> time VARCHAR, >>> number VARCHAR, >>> +40 more columns) >>> >>> We want to select a large (~30M records) subset of this data with the >>> query: >>> >>> SELECT *all columns* >>> FROM traces >>> WHERE (UPPER(number) LIKE 'PO %') >>> ORDER BY time DESC, ROWKEY >>> LIMIT 101; >>> >>> This times out after 15 minutes and puts a huge load on our cluster. >>> We have an alternate way of selecting this data: >>> >>> SELECT t.rowkey, *all columns* >>> FROM TRACES t >>> JOIN ( >>> SELECT rowkey >>> FROM TRACES >>> WHERE (UPPER(number) LIKE 'PO %') >>> ORDER BY time DESC, ROWKEY >>> LIMIT 101 >>> ) ix >>> ON t.ROWKEY = ix.ROWKEY >>> order by t.ROWKEY; >>> >>> Which completes in just under a minute. >>> Is there a better way to construct this query? >>> When is using the self-join a worse choice than the simple select? >>> Given that we have a functional index on UPPER(number), could this >>> potentially be a statistics-based optimizer decision? >>> >>> -- >>> Pozdrawiam, >>> Marcin Januszkiewicz >> >> >
Re: SELECT + ORDER BY vs self-join
I agree with James that this happens because the index was not involved because it doesn't cover all columns. I believe that in the second case, the RHT is using the index to create a list of rowkeys and they are used for point lookups by skipscan. bq. When is using the self-join a worse choice than the simple select? Hash join has it's own limitations: 1. RHT is supposed to be small, so it's better to keep LIMIT small (far less than 30 mil). 2. Client is always involved to collect data from RHT, build the hash join cache and send it to all RSes. bq. Is there a better way to construct this query? Using local index may help in this case. Thanks, Sergey On Mon, Oct 30, 2017 at 11:26 PM, James Taylorwrote: > Please file a JIRA and include the explain plan for each of the queries. I > suspect your index is not being used in the first query due to the > selection of all the columns. You can try hinting the query to force your > index to be used. See https://phoenix.apache.org/secondary_indexing.html# > Index_Usage > > Thanks, > James > > On Mon, Oct 30, 2017 at 7:02 AM, Marcin Januszkiewicz < > januszkiewicz.mar...@gmail.com> wrote: > >> We have a wide table with 100M records created with the following DDL: >> >> CREATE TABLE traces ( >> rowkey VARCHAR PRIMARY KEY, >> time VARCHAR, >> number VARCHAR, >> +40 more columns) >> >> We want to select a large (~30M records) subset of this data with the >> query: >> >> SELECT *all columns* >> FROM traces >> WHERE (UPPER(number) LIKE 'PO %') >> ORDER BY time DESC, ROWKEY >> LIMIT 101; >> >> This times out after 15 minutes and puts a huge load on our cluster. >> We have an alternate way of selecting this data: >> >> SELECT t.rowkey, *all columns* >> FROM TRACES t >> JOIN ( >> SELECT rowkey >> FROM TRACES >> WHERE (UPPER(number) LIKE 'PO %') >> ORDER BY time DESC, ROWKEY >> LIMIT 101 >> ) ix >> ON t.ROWKEY = ix.ROWKEY >> order by t.ROWKEY; >> >> Which completes in just under a minute. >> Is there a better way to construct this query? >> When is using the self-join a worse choice than the simple select? >> Given that we have a functional index on UPPER(number), could this >> potentially be a statistics-based optimizer decision? >> >> -- >> Pozdrawiam, >> Marcin Januszkiewicz >> > >
Re: SELECT + ORDER BY vs self-join
I suspect this problem is similar to PHOENIX-4288. On Mon, Oct 30, 2017 at 11:26 PM James Taylorwrote: > Please file a JIRA and include the explain plan for each of the queries. I > suspect your index is not being used in the first query due to the > selection of all the columns. You can try hinting the query to force your > index to be used. See > https://phoenix.apache.org/secondary_indexing.html#Index_Usage > > Thanks, > James > > On Mon, Oct 30, 2017 at 7:02 AM, Marcin Januszkiewicz < > januszkiewicz.mar...@gmail.com> wrote: > >> We have a wide table with 100M records created with the following DDL: >> >> CREATE TABLE traces ( >> rowkey VARCHAR PRIMARY KEY, >> time VARCHAR, >> number VARCHAR, >> +40 more columns) >> >> We want to select a large (~30M records) subset of this data with the >> query: >> >> SELECT *all columns* >> FROM traces >> WHERE (UPPER(number) LIKE 'PO %') >> ORDER BY time DESC, ROWKEY >> LIMIT 101; >> >> This times out after 15 minutes and puts a huge load on our cluster. >> We have an alternate way of selecting this data: >> >> SELECT t.rowkey, *all columns* >> FROM TRACES t >> JOIN ( >> SELECT rowkey >> FROM TRACES >> WHERE (UPPER(number) LIKE 'PO %') >> ORDER BY time DESC, ROWKEY >> LIMIT 101 >> ) ix >> ON t.ROWKEY = ix.ROWKEY >> order by t.ROWKEY; >> >> Which completes in just under a minute. >> Is there a better way to construct this query? >> When is using the self-join a worse choice than the simple select? >> Given that we have a functional index on UPPER(number), could this >> potentially be a statistics-based optimizer decision? >> >> -- >> Pozdrawiam, >> Marcin Januszkiewicz >> > >
Re: SELECT + ORDER BY vs self-join
Please file a JIRA and include the explain plan for each of the queries. I suspect your index is not being used in the first query due to the selection of all the columns. You can try hinting the query to force your index to be used. See https://phoenix.apache.org/secondary_indexing.html#Index_Usage Thanks, James On Mon, Oct 30, 2017 at 7:02 AM, Marcin Januszkiewicz < januszkiewicz.mar...@gmail.com> wrote: > We have a wide table with 100M records created with the following DDL: > > CREATE TABLE traces ( > rowkey VARCHAR PRIMARY KEY, > time VARCHAR, > number VARCHAR, > +40 more columns) > > We want to select a large (~30M records) subset of this data with the > query: > > SELECT *all columns* > FROM traces > WHERE (UPPER(number) LIKE 'PO %') > ORDER BY time DESC, ROWKEY > LIMIT 101; > > This times out after 15 minutes and puts a huge load on our cluster. > We have an alternate way of selecting this data: > > SELECT t.rowkey, *all columns* > FROM TRACES t > JOIN ( > SELECT rowkey > FROM TRACES > WHERE (UPPER(number) LIKE 'PO %') > ORDER BY time DESC, ROWKEY > LIMIT 101 > ) ix > ON t.ROWKEY = ix.ROWKEY > order by t.ROWKEY; > > Which completes in just under a minute. > Is there a better way to construct this query? > When is using the self-join a worse choice than the simple select? > Given that we have a functional index on UPPER(number), could this > potentially be a statistics-based optimizer decision? > > -- > Pozdrawiam, > Marcin Januszkiewicz >