Re: Select date range performance issue
Jonathan, such helpful explanations! A skip scan, and an index that includes every column a query uses, make a big difference. I'll experiment more with ROW_TIMESTAMP and see what happens.

Thanks,
NaHeon

2017-02-24 13:25 GMT+09:00 Jonathan Leech:
> [...]
Re: Select date range performance issue
1. No, I am not confused. A skip scan would "skip" over entire ranges of obj_id, and over all create_dt values within each such range. It is only effective if there are far fewer distinct values of obj_id than there are total rows. If there are too many distinct obj_ids it either won't speed the query up at all, or not enough, but it's simple to try it and see.

2. Your index isn't used because it doesn't contain the other columns used in the query, i.e. your query isn't "covered". You get the column(s) defined in the index plus anything in the row key. You can use the INCLUDE keyword to add other columns to the index. Alternatively, you can look at local indexes, or it may be possible to nest the query with a sub-select that fetches the desired primary key values from the index by create_dt, and the other columns from the main table.

3. No, not at all. Phoenix will assign the internal HBase timestamp of the row from whatever you set create_dt to. It can also automatically set it to the current time when you create the row, if you want it to. This has other implications, e.g. if you set a TTL, versions, etc. in HBase. It can speed up queries, especially ones over the most recently written data, prior to HBase compaction. Advanced stuff, and performance is highly dependent on your specific use case and HBase compaction settings...

On Feb 23, 2017, at 7:59 PM, NaHeon Kim wrote:
> [...]
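Jonathan's point 2 can be sketched in Phoenix SQL. A hedged example of a covering index for the query in this thread, assuming the MY_TABLE schema from NaHeon's first message; the index name is illustrative:

```sql
-- Covering index: CREATE_DT leads the index row key. KEYWORD rides
-- along automatically because it is part of MY_TABLE's primary key;
-- COUNT is not, so it must be added with INCLUDE.
CREATE INDEX IDX_CREATE_DT_COV ON MY_TABLE (CREATE_DT)
    INCLUDE (COUNT);

-- With every referenced column present in the index, this query is
-- "covered" and can be served from the index alone as a range scan.
SELECT KEYWORD, SUM(COUNT)
FROM MY_TABLE
WHERE CREATE_DT > TO_TIMESTAMP('2016-03-01 00:00')
  AND CREATE_DT < TO_TIMESTAMP('2016-04-01 00:00')
GROUP BY KEYWORD;
```

The sub-select alternative Jonathan mentions would instead fetch row keys by CREATE_DT from a plain index and look up the remaining columns in the main table.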
Re: Select date range performance issue
Thanks for the suggestion.

Here are further questions:

1. create_dt (not obj_id; I think you mixed them up) would have a large set of distinct dates, so the SKIP_SCAN hint might not be useful.

2. I created a secondary index on create_dt:

   CREATE INDEX IDX1_CREATE_DT ON MY_TABLE (CREATE_DT);

However, EXPLAIN still shows a FULL SCAN query plan, and giving an index hint on the SELECT doesn't work either:

CLIENT 2-CHUNK 0 ROWS 0 BYTES PARALLEL 2-WAY FULL SCAN OVER NEWS_KEYWORD_COUNT
    SERVER FILTER BY (CREATE_DT > TIMESTAMP '2017-03-01 00:00:00.000' AND CREATE_DT < TIMESTAMP '2017-04-01 00:00:00.000')
    SERVER AGGREGATE INTO DISTINCT ROWS BY [KEYWORD]
CLIENT MERGE SORT
CLIENT 100 ROW LIMIT

3. ROW_TIMESTAMP is the time of the current query execution, right? Then it's not the right choice. :-(

2017-02-24 1:54 GMT+09:00 Jonathan Leech:
> [...]
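For reference, the index hint NaHeon describes would presumably look like this (a sketch using Phoenix's INDEX hint; the index name is taken from the message, and whether the hint helps still depends on the query being covered):

```sql
-- Hypothetical: forcing IDX1_CREATE_DT with Phoenix's INDEX hint.
-- KEYWORD and COUNT are not covered by this index, so Phoenix may
-- still fall back to scanning the main table.
SELECT /*+ INDEX(MY_TABLE IDX1_CREATE_DT) */ KEYWORD, SUM(COUNT)
FROM MY_TABLE
WHERE CREATE_DT > TO_TIMESTAMP('2016-03-01 00:00')
  AND CREATE_DT < TO_TIMESTAMP('2016-04-01 00:00')
GROUP BY KEYWORD;
```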
Re: Select date range performance issue
If there is not a large number of distinct values of obj_id, try a SKIP_SCAN hint. Otherwise, the secondary index should work; make sure it's actually used via EXPLAIN. Finally, you might try the ROW_TIMESTAMP feature if it fits your use case.

On Feb 22, 2017, at 11:30 PM, NaHeon Kim wrote:
> [...]
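The two suggestions can be sketched in Phoenix SQL (a hedged illustration, assuming the MY_TABLE schema in NaHeon's original message below; MY_TABLE_TS is a hypothetical name):

```sql
-- SKIP_SCAN hint on the original query: Phoenix skips between distinct
-- OBJ_ID prefixes, scanning only the CREATE_DT range within each.
SELECT /*+ SKIP_SCAN */ KEYWORD, SUM(COUNT)
FROM MY_TABLE
WHERE CREATE_DT > TO_TIMESTAMP('2016-03-01 00:00')
  AND CREATE_DT < TO_TIMESTAMP('2016-04-01 00:00')
GROUP BY KEYWORD;

-- ROW_TIMESTAMP variant: CREATE_DT maps onto the native HBase cell
-- timestamp. The designated column must be a primary key column of a
-- date/time or long type; it is shown leading the key here, as in the
-- Phoenix documentation's examples.
CREATE TABLE MY_TABLE_TS (
    CREATE_DT timestamp not null,
    OBJ_ID varchar(20) not null,
    KEYWORD varchar(100) not null,
    COUNT integer,
    CONSTRAINT PK PRIMARY KEY (CREATE_DT ROW_TIMESTAMP, OBJ_ID, KEYWORD)
);
```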
Select date range performance issue
Hi all,
I've seen a performance problem when selecting rows within a date range.

My table schema is:

CREATE TABLE MY_TABLE (
    OBJ_ID varchar(20) not null,
    CREATE_DT timestamp not null,
    KEYWORD varchar(100) not null,
    COUNT integer,
    CONSTRAINT PK PRIMARY KEY (OBJ_ID, CREATE_DT, KEYWORD)
);

MY_TABLE has almost 5,200,000 rows, and CREATE_DT spans about 6 months.

The executed query:

SELECT KEYWORD, SUM(COUNT)
FROM MY_TABLE
WHERE CREATE_DT > to_timestamp('2016-03-01 00:00')
  AND CREATE_DT < to_timestamp('2016-04-01 00:00')
GROUP BY KEYWORD;

It took 46 seconds, much slower than expected, since CREATE_DT is part of the row key.
I created a secondary index on CREATE_DT, but there was no improvement.

The query plan looks weird:

CLIENT 2-CHUNK 0 ROWS 0 BYTES PARALLEL 2-WAY FULL SCAN OVER NEWS_KEYWORD_COUNT
    SERVER FILTER BY (CREATE_DT > TIMESTAMP '2017-03-01 00:00:00.000' AND CREATE_DT < TIMESTAMP '2017-04-01 00:00:00.000')
    SERVER AGGREGATE INTO DISTINCT ROWS BY [KEYWORD]
CLIENT MERGE SORT
CLIENT 100 ROW LIMIT

But if CREATE_DT comes first in the row key, the plan says a range scan will be done.

Any suggestion? : )

Thanks,
NaHeon
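The closing observation (CREATE_DT first in the row key yields a range scan) can be sketched as follows; the table name here is hypothetical:

```sql
-- With the timestamp leading the primary key, the CREATE_DT predicate
-- constrains a contiguous row-key prefix, so EXPLAIN reports a RANGE
-- SCAN instead of a FULL SCAN. The trade-off: lookups by OBJ_ID alone
-- no longer hit a key prefix.
CREATE TABLE MY_TABLE_BY_DT (
    CREATE_DT timestamp not null,
    OBJ_ID varchar(20) not null,
    KEYWORD varchar(100) not null,
    COUNT integer,
    CONSTRAINT PK PRIMARY KEY (CREATE_DT, OBJ_ID, KEYWORD)
);
```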