Re: Problem in finding the largest value of an indexed column

James Taylor Thu, 09 Jul 2015 23:36:31 -0700

Sounds like something else is going wrong. Can you adapt your test by
setting the MAX_FILESIZE very low for your table (so that it splits after 4
or 5 rows are added) and package it up as a unit test?


On Thu, Jul 9, 2015 at 1:44 PM, Yufan Liu <[email protected]> wrote:

> Just got a chance to revisit this issue: I have rebuilt the index and it
> still returns the unexpected result. By using the test case, I tried to
> insert enough rows to make it auto-split and it reproduces the problem too.
> It seems it still has trouble returning last row sorted by first component
> of primary key on split tables. Maybe there is another issue than
> PHOENIX-2096? The phoenix I am using is pulled from latest 4.x-HBase-0.98
> branch which includes the patch of PHOENIX-2096.
>
> 2015-07-02 19:55 GMT-07:00 James Taylor <[email protected]>:
>
>> On further investigation, I believe it should have been working before. I
>> did a bit of cleanup and attached a new patch to PHOENIX-2096, but this
>> would only prevent a merge sort when one is not required (basically
>> improving performance).
>>
>> Maybe your index is invalid? You can try rebuilding with this command:
>> https://phoenix.apache.org/language/index.html#alter_index
>>
>> On Thu, Jul 2, 2015 at 5:26 PM, Yufan Liu <[email protected]> wrote:
>>
>>> The query on test dataset is returning the expected result with the
>>> patch. But on the original dataset (10million rows, 6 regions), it still
>>> return the same unexpected result, I will dig more into this. Thank you,
>>> James!
>>>
>>> 2015-07-02 9:58 GMT-07:00 Yufan Liu <[email protected]>:
>>>
>>>> Sure, let me have a try
>>>>
>>>> 2015-07-02 9:46 GMT-07:00 James Taylor <[email protected]>:
>>>>
>>>>> Thanks, Yufan. I found an issue and filed PHOENIX-2096 with a patch.
>>>>> Would you mind confirming that this fixes the issue you're seeing?
>>>>>
>>>>>     James
>>>>>
>>>>> On Thu, Jul 2, 2015 at 9:45 AM, Yufan Liu <[email protected]> wrote:
>>>>>
>>>>>> I'm using 4.4.0-HBase-0.98
>>>>>>
>>>>>> 2015-07-01 22:31 GMT-07:00 James Taylor <[email protected]>:
>>>>>>
>>>>>>> Yufan,
>>>>>>> What version of Phoenix are you using?
>>>>>>> Thanks,
>>>>>>> James
>>>>>>>
>>>>>>> On Wed, Jul 1, 2015 at 2:34 PM, Yufan Liu <[email protected]> wrote:
>>>>>>>
>>>>>>>> When I made more tests, I find that this problem happens after
>>>>>>>> table got split.
>>>>>>>>
>>>>>>>> Here is the DDL I use to create table and index:
>>>>>>>> CREATE TABLE IF NOT EXISTS t1 (
>>>>>>>> uid BIGINT NOT NULL,
>>>>>>>> timestamp BIGINT NOT NULL,
>>>>>>>> eventName VARCHAR
>>>>>>>> CONSTRAINT my_pk PRIMARY KEY (uid,  timestamp))
>>>>>>>> COMPRESSION='SNAPPY';
>>>>>>>>
>>>>>>>> CREATE INDEX timestamp_index ON t1 (timestamp) INCLUDE (eventName)
>>>>>>>>
>>>>>>>> Attach is the sample data I used for test. It has about 4000 rows,
>>>>>>>> when the timestamp_index table has one region, the query returns 
>>>>>>>> correct
>>>>>>>> result: 1433334048443, but when I manually split it into 4 regions (use
>>>>>>>> hbase tool), it returns 1433333024961.
>>>>>>>>
>>>>>>>> Let know if you find anything. Thanks!
>>>>>>>>
>>>>>>>>
>>>>>>>> 2015-07-01 11:27 GMT-07:00 James Taylor <[email protected]>:
>>>>>>>>
>>>>>>>>> If you could put a complete test (including your DDL and upsert of
>>>>>>>>> data), that would be much appreciated.
>>>>>>>>> Thanks,
>>>>>>>>> James
>>>>>>>>>
>>>>>>>>> On Wed, Jul 1, 2015 at 11:20 AM, Yufan Liu <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> I have tried to use query "SELECT timestamp FROM t1 ORDER BY
>>>>>>>>>> timestamp DESC NULLS LAST LIMIT 1". But it still returns the same
>>>>>>>>>> unexpected result. There seems to be some internal problems related.
>>>>>>>>>>
>>>>>>>>>> 2015-06-30 18:03 GMT-07:00 James Taylor <[email protected]>:
>>>>>>>>>>
>>>>>>>>>>> Yes, reverse scan will be leveraged when possible. Make you use
>>>>>>>>>>> NULLS LAST in your ORDER BY as rows are ordered with nulls first.
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Jun 30, 2015 at 5:25 PM, Yufan Liu <[email protected]>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I used the HBase reverse scan to find the last row on the index
>>>>>>>>>>>> table. It returned the expected result. I would like to know is 
>>>>>>>>>>>> Phoenix's
>>>>>>>>>>>> "ORDER BY"
>>>>>>>>>>>> and "DESC" implemented based on HBase reverse scan?
>>>>>>>>>>>>
>>>>>>>>>>>> 2015-06-26 17:25 GMT-07:00 Yufan Liu <[email protected]>:
>>>>>>>>>>>>
>>>>>>>>>>>>> Thank you anyway, Michael!
>>>>>>>>>>>>>
>>>>>>>>>>>>> 2015-06-26 17:21 GMT-07:00 Michael McAllister <
>>>>>>>>>>>>> [email protected]>:
>>>>>>>>>>>>>
>>>>>>>>>>>>>>  OK, I’m a Phoenix newbie, so that was the extent of the
>>>>>>>>>>>>>> advice I could give you. There are people here far more 
>>>>>>>>>>>>>> experienced than I
>>>>>>>>>>>>>> am who should be able to give you deeper advice. Have a great 
>>>>>>>>>>>>>> weekend!
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Mike
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> *From:* Yufan Liu [mailto:[email protected]]
>>>>>>>>>>>>>> *Sent:* Friday, June 26, 2015 7:19 PM
>>>>>>>>>>>>>> *To:* [email protected]
>>>>>>>>>>>>>> *Subject:* Re: Problem in finding the largest value of an
>>>>>>>>>>>>>> indexed column
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Michael,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks for the advice, for the first one, it's "CLIENT
>>>>>>>>>>>>>> 67-CHUNK PARALLEL 1-WAY FULL SCAN OVER TIMESTAMP_INDEX; SERVER 
>>>>>>>>>>>>>> FILTER BY
>>>>>>>>>>>>>> FIRST KEY ONLY; SERVER AGGREGATE INTO SINGLE ROW" which is as 
>>>>>>>>>>>>>> expected. For
>>>>>>>>>>>>>> the second one, it's "CLIENT 67-CHUNK SERIAL 1-WAY REVERSE FULL 
>>>>>>>>>>>>>> SCAN OVER
>>>>>>>>>>>>>> TIMESTAMP_INDEX; SERVER FILTER BY FIRST KEY ONLY; SERVER 1 ROW 
>>>>>>>>>>>>>> LIMIT" which
>>>>>>>>>>>>>> looks correct, but still returns the unexpected result.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 2015-06-26 16:59 GMT-07:00 Michael McAllister <
>>>>>>>>>>>>>> [email protected]>:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Yufan
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Have you tried using the EXPLAIN command to see what plan is
>>>>>>>>>>>>>> being used to access the data?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Michael McAllister
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Staff Data Warehouse Engineer | Decision Systems
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> [email protected] | C: 512.423.7447 | skype:
>>>>>>>>>>>>>> michael.mcallister.ha <[email protected]> | webex:
>>>>>>>>>>>>>> https://h.a/mikewebex
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> [image: Description: Description: cid:3410354473_30269081]
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This electronic communication (including any attachment) is
>>>>>>>>>>>>>> confidential.  If you are not an intended recipient of this 
>>>>>>>>>>>>>> communication,
>>>>>>>>>>>>>> please be advised that any disclosure, dissemination, 
>>>>>>>>>>>>>> distribution, copying
>>>>>>>>>>>>>> or other use of this communication or any attachment is strictly
>>>>>>>>>>>>>> prohibited.  If you have received this communication in error, 
>>>>>>>>>>>>>> please
>>>>>>>>>>>>>> notify the sender immediately by reply e-mail and promptly 
>>>>>>>>>>>>>> destroy all
>>>>>>>>>>>>>> electronic and printed copies of this communication and any 
>>>>>>>>>>>>>> attachment.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> *From:* Yufan Liu [mailto:[email protected]]
>>>>>>>>>>>>>> *Sent:* Friday, June 26, 2015 6:31 PM
>>>>>>>>>>>>>> *To:* [email protected]
>>>>>>>>>>>>>> *Subject:* Problem in finding the largest value of an
>>>>>>>>>>>>>> indexed column
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> We have created a table (eg, t1), and a global index of one
>>>>>>>>>>>>>> numeric column of t1 (eg, timestamp). Now we want to find the 
>>>>>>>>>>>>>> largest value
>>>>>>>>>>>>>> of timestamp, we have tried two approaches:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 1. select max(timestamp) from t1; This query takes forever to
>>>>>>>>>>>>>> finish, so I think it maybe doing a full table scan/comparison .
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 2. select timestamp from t1 order by timestamp desc limit 1;
>>>>>>>>>>>>>> This query finished fast, but the result it returns is far from 
>>>>>>>>>>>>>> the largest
>>>>>>>>>>>>>> value. It seems it just return the largest value for a certain 
>>>>>>>>>>>>>> range of
>>>>>>>>>>>>>> data.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Did anyone else encounter this issue/have any suggestion?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Yufan
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> best,
>>>>>>>>>>>>>> Yufan
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> best,
>>>>>>>>>>>>> Yufan
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> best,
>>>>>>>>>>>> Yufan
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> best,
>>>>>>>>>> Yufan
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> best,
>>>>>>>> Yufan
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> best,
>>>>>> Yufan
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> best,
>>>> Yufan
>>>>
>>>>
>>>
>>>
>>> --
>>> best,
>>> Yufan
>>>
>>>
>>
>
>
> --
> best,
> Yufan
>
>

Re: Problem in finding the largest value of an indexed column

Reply via email to