If your filter matches few rows due to filter on leading part of PK then
your data might only reside in a single block which leads to less
overall disk reads for non-salted case vs need for multiple blocks reads for
salted one.

On Tuesday, February 2, 2016, Serega Sheypak <serega.shey...@gmail.com>
wrote:

> > then you would be better off not using salt buckets all together rather
> than having 100 parallel scan and block reads in your case. I
> Didn't understand you correctly. What is difference between salted/not
> salted table in case of "primary key leading-part select"?
>
> 2016-02-02 1:18 GMT+01:00 Mujtaba Chohan <mujt...@apache.org
> <javascript:_e(%7B%7D,'cvml','mujt...@apache.org');>>:
>
>> If you are filtering on leading part of row key which is highly selective
>> then you would be better off not using salt buckets all together rather
>> than having 100 parallel scan and block reads in your case. In our test
>> with billion+ row table, non-salted table offer much better performance
>> since it ends up reading fewer blocks from a single region.
>>
>> //mujtaba
>>
>> On Mon, Feb 1, 2016 at 1:16 PM, Serega Sheypak <serega.shey...@gmail.com
>> <javascript:_e(%7B%7D,'cvml','serega.shey...@gmail.com');>> wrote:
>>
>>> Hi, here is my table DDL:
>>> CREATE TABLE IF NOT EXISTS id_ref
>>> (
>>>    id1     VARCHAR   NOT NULL,
>>>    value1  VARCHAR,
>>>
>>>    id2     VARCHAR NOT NULL,
>>>    value2  VARCHAR
>>>    CONSTRAINT id_ref_pk  PRIMARY KEY (id1, id2)
>>> )IMMUTABLE_ROWS=true,SALT_BUCKETS=100, VERSIONS=1, TTL=691200
>>>
>>> I'm trying to analyze result of explain:
>>>
>>> explain select id1, value1, id2, value2 from id_ref where id1 = 'xxx'
>>>
>>> . . . . . . . . . . . . . . . . . . . . . . .> ;
>>>
>>> *+------------------------------------------+*
>>>
>>> *| **                  PLAN                  ** |*
>>>
>>> *+------------------------------------------+*
>>>
>>> *| *CLIENT 100-CHUNK PARALLEL 100-WAY RANGE SCAN OVER ID_REF
>>> [0,'1fd5c44a75549162ca1602dda55f6d129cab61a6']* |*
>>>
>>> *| *CLIENT MERGE SORT                       * |*
>>>
>>> *+------------------------------------------+*
>>>
>>>
>>> What happens? Client spawns 100 parallel scans (because of bucketing)
>>> and waits for 100 responses?
>>>
>>> Is it effective? What is the right way to optimize such query pattern:
>>> "select by first part of primary key"? Reduce the amount of buckets? I get
>>> exeption a while after restarting app:
>>>
>>>
>>> *Task org.apache.phoenix.job.JobManager$JobFutureTask@60a40644 rejected
>>> from org.apache.phoenix.job.JobManager$1@58e3fe9aRunning, pool size = 128,
>>> active threads = 121, queued tasks = 5000, completed tasks = 2629565*
>>>
>>>
>>>
>>
>

Reply via email to