Re: Select date range performance issue

2017-02-24 Thread NaHeon Kim
Jonathan,

Very helpful explanations!
Skip scans, and indexes that include the other columns a query uses, make a big difference.
I'll experiment more with ROW_TIMESTAMP and see what happens.

Thanks,
NaHeon



Re: Select date range performance issue

2017-02-23 Thread Jonathan Leech
1. No, I am not confused. A skip scan would "skip" over entire ranges of obj_id 
and all create_dt values for it. This will only be effective if there are many 
fewer distinct values of obj_id than there are total rows. If there are too many 
distinct obj_ids then it either won't speed the query up at all, or not enough, 
but it's simple to try it and see.

2. Your index isn't used because it doesn't contain the other columns used in 
the query; i.e., your query isn't "covered". You get the column(s) defined in 
the index + anything in the rowkey. You can also use the "include" keyword to 
add other columns to the index. Alternatively, you can look at "local" indexes, 
or it may be possible to nest the query with a sub-select to fetch the desired 
primary key values from the index by create_dt and the others from the main 
table.
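A covered index along these lines is one way to sketch that (index names here are made up; column names assume the MY_TABLE schema from earlier in the thread):

```sql
-- KEYWORD and OBJ_ID are primary key columns, so they travel with the
-- index rowkey automatically; only COUNT needs to be INCLUDEd for the
-- GROUP BY query to be fully covered. (A sketch, not tested here.)
CREATE INDEX IDX_CREATE_DT_COV ON MY_TABLE (CREATE_DT) INCLUDE (COUNT);

-- Alternatively, a local index keeps index data co-located with the
-- table's regions and can serve non-covered queries more cheaply:
CREATE LOCAL INDEX IDX_CREATE_DT_LOC ON MY_TABLE (CREATE_DT);
```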

3. No, not at all. Phoenix will assign the internal HBase timestamp of the row 
to whatever you set create_dt to. It can also automatically set it to the 
current time when you create the row, if you want it to. This has other 
implications, e.g. if you set a TTL, versions, etc. in HBase. It can speed up 
queries, especially those that execute on the most recent data written, but 
prior to HBase compaction. Advanced stuff, and performance is highly dependent 
on your specific use case and HBase compaction settings...
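A sketch of what that declaration could look like on the schema from this thread (check the Phoenix docs for the exact constraints; only a single date/time or long primary key column may be marked this way):

```sql
-- CREATE_DT doubles as the HBase cell timestamp; Phoenix can then turn
-- CREATE_DT range predicates into HBase time-range scans. (A sketch.)
CREATE TABLE MY_TABLE (
    OBJ_ID    VARCHAR(20)  NOT NULL,
    CREATE_DT TIMESTAMP    NOT NULL,
    KEYWORD   VARCHAR(100) NOT NULL,
    COUNT     INTEGER,
    CONSTRAINT PK PRIMARY KEY (OBJ_ID, CREATE_DT ROW_TIMESTAMP, KEYWORD)
);
```

Beware the TTL/versions interactions mentioned above if you adopt this.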



Re: Select date range performance issue

2017-02-23 Thread NaHeon Kim
Thanks for the suggestions.

Here are further questions:
1. create_dt (not obj_id; I think you're confused) would have a large set of
distinct dates, so the SKIP_SCAN hint might not be useful.

2. I created a secondary index on create_dt:
   CREATE INDEX IDX1_CREATE_DT ON MY_TABLE(CREATE_DT);

However, EXPLAIN still shows a FULL SCAN query plan.
Giving an index hint on the SELECT doesn't work either.

CLIENT 2-CHUNK 0 ROWS 0 BYTES PARALLEL 2-WAY FULL SCAN OVER
NEWS_KEYWORD_COUNT
   SERVER FILTER BY (CREATE_DT > TIMESTAMP '2017-03-01 00:00:00.000' AND
CREATE_DT < TIMESTAMP '2017-04-01 00:00:00.000')
   SERVER AGGREGATE INTO DISTINCT ROWS BY [KEYWORD]
CLIENT MERGE SORT
CLIENT 100 ROW LIMIT
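For reference, the hinted form tried was along these lines (a sketch; IDX1_CREATE_DT is the index created above):

```sql
-- Explicitly naming the secondary index in an INDEX hint:
SELECT /*+ INDEX(MY_TABLE IDX1_CREATE_DT) */ KEYWORD, SUM(COUNT)
FROM MY_TABLE
WHERE CREATE_DT > TO_TIMESTAMP('2016-03-01 00:00')
  AND CREATE_DT < TO_TIMESTAMP('2016-04-01 00:00')
GROUP BY KEYWORD;
```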

3. ROW_TIMESTAMP is the time the query executes, right?
Then it's not the right choice. :-(




Re: Select date range performance issue

2017-02-23 Thread Jonathan Leech
If there are not a large number of distinct values of obj_id, try a SKIP_SCAN 
hint. Otherwise, the secondary index should work; make sure it's actually used 
via EXPLAIN. Finally, you might try the ROW_TIMESTAMP feature if it fits your 
use case.
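For example (a sketch against the MY_TABLE schema from the original message):

```sql
-- SKIP_SCAN asks Phoenix to seek between obj_id prefixes and jump
-- straight to the CREATE_DT range within each one; only worthwhile
-- when obj_id has relatively few distinct values.
SELECT /*+ SKIP_SCAN */ KEYWORD, SUM(COUNT)
FROM MY_TABLE
WHERE CREATE_DT > TO_TIMESTAMP('2016-03-01 00:00')
  AND CREATE_DT < TO_TIMESTAMP('2016-04-01 00:00')
GROUP BY KEYWORD;
```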



Select date range performance issue

2017-02-22 Thread NaHeon Kim
Hi all,
I've seen a performance problem when selecting rows within a date range.

My table schema is:

CREATE TABLE MY_TABLE (
   OBJ_ID varchar(20) not null,
   CREATE_DT timestamp not null,
   KEYWORD varchar(100) not null,
   COUNT integer,
   CONSTRAINT PK PRIMARY KEY (OBJ_ID,CREATE_DT,KEYWORD)
);

MY_TABLE has almost 5,200,000 rows,
and CREATE_DT spans about 6 months.

The executed query:

SELECT KEYWORD, SUM(COUNT)
FROM MY_TABLE
WHERE CREATE_DT > to_timestamp('2016-03-01 00:00')
AND CREATE_DT < to_timestamp('2016-04-01 00:00')
GROUP BY KEYWORD;

It took 46 seconds, much slower than expected, given that CREATE_DT is part of
the row key.
I created a secondary index on CREATE_DT but there's no improvement.

The query plan looks weird:
CLIENT 2-CHUNK 0 ROWS 0 BYTES PARALLEL 2-WAY FULL SCAN OVER
NEWS_KEYWORD_COUNT
 SERVER FILTER BY (CREATE_DT > TIMESTAMP '2017-03-01 00:00:00.000' AND
CREATE_DT < TIMESTAMP '2017-04-01 00:00:00.000')
 SERVER AGGREGATE INTO DISTINCT ROWS BY [KEYWORD]
CLIENT MERGE SORT
CLIENT 100 ROW LIMIT

But if CREATE_DT comes first in the row key, the plan says a range scan will be done.
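For comparison, the date-leading key order would look like this (a sketch; MY_TABLE_BY_DT is a hypothetical name, and re-keying means re-creating and re-loading the table):

```sql
-- With CREATE_DT leading, the date predicate maps to one contiguous
-- slice of row keys, so EXPLAIN reports a RANGE SCAN instead of a
-- FULL SCAN.
CREATE TABLE MY_TABLE_BY_DT (
    CREATE_DT TIMESTAMP    NOT NULL,
    OBJ_ID    VARCHAR(20)  NOT NULL,
    KEYWORD   VARCHAR(100) NOT NULL,
    COUNT     INTEGER,
    CONSTRAINT PK PRIMARY KEY (CREATE_DT, OBJ_ID, KEYWORD)
);
```

Note, though, that a monotonically increasing leading key tends to hotspot writes onto a single region.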

Any suggestion? : )

Thanks,
NaHeon