Re: Problem in finding the largest value of an indexed column

James Taylor Thu, 02 Jul 2015 19:56:11 -0700

On further investigation, I believe it should have been working before. I
did a bit of cleanup and attached a new patch to PHOENIX-2096, but this
would only prevent a merge sort when one is not required (basically
improving performance).


Maybe your index is invalid? You can try rebuilding with this command:
https://phoenix.apache.org/language/index.html#alter_index

On Thu, Jul 2, 2015 at 5:26 PM, Yufan Liu <[email protected]> wrote:

> The query on test dataset is returning the expected result with the patch.
> But on the original dataset (10million rows, 6 regions), it still return
> the same unexpected result, I will dig more into this. Thank you, James!
>
> 2015-07-02 9:58 GMT-07:00 Yufan Liu <[email protected]>:
>
>> Sure, let me have a try
>>
>> 2015-07-02 9:46 GMT-07:00 James Taylor <[email protected]>:
>>
>>> Thanks, Yufan. I found an issue and filed PHOENIX-2096 with a patch.
>>> Would you mind confirming that this fixes the issue you're seeing?
>>>
>>>     James
>>>
>>> On Thu, Jul 2, 2015 at 9:45 AM, Yufan Liu <[email protected]> wrote:
>>>
>>>> I'm using 4.4.0-HBase-0.98
>>>>
>>>> 2015-07-01 22:31 GMT-07:00 James Taylor <[email protected]>:
>>>>
>>>>> Yufan,
>>>>> What version of Phoenix are you using?
>>>>> Thanks,
>>>>> James
>>>>>
>>>>> On Wed, Jul 1, 2015 at 2:34 PM, Yufan Liu <[email protected]> wrote:
>>>>>
>>>>>> When I made more tests, I find that this problem happens after table
>>>>>> got split.
>>>>>>
>>>>>> Here is the DDL I use to create table and index:
>>>>>> CREATE TABLE IF NOT EXISTS t1 (
>>>>>> uid BIGINT NOT NULL,
>>>>>> timestamp BIGINT NOT NULL,
>>>>>> eventName VARCHAR
>>>>>> CONSTRAINT my_pk PRIMARY KEY (uid,  timestamp)) COMPRESSION='SNAPPY';
>>>>>>
>>>>>> CREATE INDEX timestamp_index ON t1 (timestamp) INCLUDE (eventName)
>>>>>>
>>>>>> Attach is the sample data I used for test. It has about 4000 rows,
>>>>>> when the timestamp_index table has one region, the query returns correct
>>>>>> result: 1433334048443, but when I manually split it into 4 regions (use
>>>>>> hbase tool), it returns 1433333024961.
>>>>>>
>>>>>> Let know if you find anything. Thanks!
>>>>>>
>>>>>>
>>>>>> 2015-07-01 11:27 GMT-07:00 James Taylor <[email protected]>:
>>>>>>
>>>>>>> If you could put a complete test (including your DDL and upsert of
>>>>>>> data), that would be much appreciated.
>>>>>>> Thanks,
>>>>>>> James
>>>>>>>
>>>>>>> On Wed, Jul 1, 2015 at 11:20 AM, Yufan Liu <[email protected]> wrote:
>>>>>>>
>>>>>>>> I have tried to use query "SELECT timestamp FROM t1 ORDER BY
>>>>>>>> timestamp DESC NULLS LAST LIMIT 1". But it still returns the same
>>>>>>>> unexpected result. There seems to be some internal problems related.
>>>>>>>>
>>>>>>>> 2015-06-30 18:03 GMT-07:00 James Taylor <[email protected]>:
>>>>>>>>
>>>>>>>>> Yes, reverse scan will be leveraged when possible. Make you use
>>>>>>>>> NULLS LAST in your ORDER BY as rows are ordered with nulls first.
>>>>>>>>>
>>>>>>>>> On Tue, Jun 30, 2015 at 5:25 PM, Yufan Liu <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> I used the HBase reverse scan to find the last row on the index
>>>>>>>>>> table. It returned the expected result. I would like to know is 
>>>>>>>>>> Phoenix's
>>>>>>>>>> "ORDER BY"
>>>>>>>>>> and "DESC" implemented based on HBase reverse scan?
>>>>>>>>>>
>>>>>>>>>> 2015-06-26 17:25 GMT-07:00 Yufan Liu <[email protected]>:
>>>>>>>>>>
>>>>>>>>>>> Thank you anyway, Michael!
>>>>>>>>>>>
>>>>>>>>>>> 2015-06-26 17:21 GMT-07:00 Michael McAllister <
>>>>>>>>>>> [email protected]>:
>>>>>>>>>>>
>>>>>>>>>>>>  OK, I’m a Phoenix newbie, so that was the extent of the
>>>>>>>>>>>> advice I could give you. There are people here far more 
>>>>>>>>>>>> experienced than I
>>>>>>>>>>>> am who should be able to give you deeper advice. Have a great 
>>>>>>>>>>>> weekend!
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Mike
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> *From:* Yufan Liu [mailto:[email protected]]
>>>>>>>>>>>> *Sent:* Friday, June 26, 2015 7:19 PM
>>>>>>>>>>>> *To:* [email protected]
>>>>>>>>>>>> *Subject:* Re: Problem in finding the largest value of an
>>>>>>>>>>>> indexed column
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Hi Michael,
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks for the advice, for the first one, it's "CLIENT 67-CHUNK
>>>>>>>>>>>> PARALLEL 1-WAY FULL SCAN OVER TIMESTAMP_INDEX; SERVER FILTER BY 
>>>>>>>>>>>> FIRST KEY
>>>>>>>>>>>> ONLY; SERVER AGGREGATE INTO SINGLE ROW" which is as expected. For 
>>>>>>>>>>>> the
>>>>>>>>>>>> second one, it's "CLIENT 67-CHUNK SERIAL 1-WAY REVERSE FULL SCAN 
>>>>>>>>>>>> OVER
>>>>>>>>>>>> TIMESTAMP_INDEX; SERVER FILTER BY FIRST KEY ONLY; SERVER 1 ROW 
>>>>>>>>>>>> LIMIT" which
>>>>>>>>>>>> looks correct, but still returns the unexpected result.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> 2015-06-26 16:59 GMT-07:00 Michael McAllister <
>>>>>>>>>>>> [email protected]>:
>>>>>>>>>>>>
>>>>>>>>>>>> Yufan
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Have you tried using the EXPLAIN command to see what plan is
>>>>>>>>>>>> being used to access the data?
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Michael McAllister
>>>>>>>>>>>>
>>>>>>>>>>>> Staff Data Warehouse Engineer | Decision Systems
>>>>>>>>>>>>
>>>>>>>>>>>> [email protected] | C: 512.423.7447 | skype:
>>>>>>>>>>>> michael.mcallister.ha <[email protected]> | webex:
>>>>>>>>>>>> https://h.a/mikewebex
>>>>>>>>>>>>
>>>>>>>>>>>> [image: Description: Description: cid:3410354473_30269081]
>>>>>>>>>>>>
>>>>>>>>>>>> This electronic communication (including any attachment) is
>>>>>>>>>>>> confidential.  If you are not an intended recipient of this 
>>>>>>>>>>>> communication,
>>>>>>>>>>>> please be advised that any disclosure, dissemination, 
>>>>>>>>>>>> distribution, copying
>>>>>>>>>>>> or other use of this communication or any attachment is strictly
>>>>>>>>>>>> prohibited.  If you have received this communication in error, 
>>>>>>>>>>>> please
>>>>>>>>>>>> notify the sender immediately by reply e-mail and promptly destroy 
>>>>>>>>>>>> all
>>>>>>>>>>>> electronic and printed copies of this communication and any 
>>>>>>>>>>>> attachment.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> *From:* Yufan Liu [mailto:[email protected]]
>>>>>>>>>>>> *Sent:* Friday, June 26, 2015 6:31 PM
>>>>>>>>>>>> *To:* [email protected]
>>>>>>>>>>>> *Subject:* Problem in finding the largest value of an indexed
>>>>>>>>>>>> column
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> We have created a table (eg, t1), and a global index of one
>>>>>>>>>>>> numeric column of t1 (eg, timestamp). Now we want to find the 
>>>>>>>>>>>> largest value
>>>>>>>>>>>> of timestamp, we have tried two approaches:
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> 1. select max(timestamp) from t1; This query takes forever to
>>>>>>>>>>>> finish, so I think it maybe doing a full table scan/comparison .
>>>>>>>>>>>>
>>>>>>>>>>>> 2. select timestamp from t1 order by timestamp desc limit 1;
>>>>>>>>>>>> This query finished fast, but the result it returns is far from 
>>>>>>>>>>>> the largest
>>>>>>>>>>>> value. It seems it just return the largest value for a certain 
>>>>>>>>>>>> range of
>>>>>>>>>>>> data.
>>>>>>>>>>>>
>>>>>>>>>>>> Did anyone else encounter this issue/have any suggestion?
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>
>>>>>>>>>>>> Yufan
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>>
>>>>>>>>>>>> best,
>>>>>>>>>>>> Yufan
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> best,
>>>>>>>>>>> Yufan
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> best,
>>>>>>>>>> Yufan
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> best,
>>>>>>>> Yufan
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> best,
>>>>>> Yufan
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> best,
>>>> Yufan
>>>>
>>>>
>>>
>>
>>
>> --
>> best,
>> Yufan
>>
>>
>
>
> --
> best,
> Yufan
>
>

Re: Problem in finding the largest value of an indexed column

Reply via email to