If your filter matches few rows due to filter on leading part of PK then your data might only reside in a single block which leads to less overall disk reads for non-salted case vs need for multiple blocks reads for salted one.
On Tuesday, February 2, 2016, Serega Sheypak <serega.shey...@gmail.com> wrote: > > then you would be better off not using salt buckets all together rather > than having 100 parallel scan and block reads in your case. I > Didn't understand you correctly. What is difference between salted/not > salted table in case of "primary key leading-part select"? > > 2016-02-02 1:18 GMT+01:00 Mujtaba Chohan <mujt...@apache.org > <javascript:_e(%7B%7D,'cvml','mujt...@apache.org');>>: > >> If you are filtering on leading part of row key which is highly selective >> then you would be better off not using salt buckets all together rather >> than having 100 parallel scan and block reads in your case. In our test >> with billion+ row table, non-salted table offer much better performance >> since it ends up reading fewer blocks from a single region. >> >> //mujtaba >> >> On Mon, Feb 1, 2016 at 1:16 PM, Serega Sheypak <serega.shey...@gmail.com >> <javascript:_e(%7B%7D,'cvml','serega.shey...@gmail.com');>> wrote: >> >>> Hi, here is my table DDL: >>> CREATE TABLE IF NOT EXISTS id_ref >>> ( >>> id1 VARCHAR NOT NULL, >>> value1 VARCHAR, >>> >>> id2 VARCHAR NOT NULL, >>> value2 VARCHAR >>> CONSTRAINT id_ref_pk PRIMARY KEY (id1, id2) >>> )IMMUTABLE_ROWS=true,SALT_BUCKETS=100, VERSIONS=1, TTL=691200 >>> >>> I'm trying to analyze result of explain: >>> >>> explain select id1, value1, id2, value2 from id_ref where id1 = 'xxx' >>> >>> . . . . . . . . . . . . . . . . . . . . . . .> ; >>> >>> *+------------------------------------------+* >>> >>> *| ** PLAN ** |* >>> >>> *+------------------------------------------+* >>> >>> *| *CLIENT 100-CHUNK PARALLEL 100-WAY RANGE SCAN OVER ID_REF >>> [0,'1fd5c44a75549162ca1602dda55f6d129cab61a6']* |* >>> >>> *| *CLIENT MERGE SORT * |* >>> >>> *+------------------------------------------+* >>> >>> >>> What happens? Client spawns 100 parallel scans (because of bucketing) >>> and waits for 100 responses? >>> >>> Is it effective? What is the right way to optimize such query pattern: >>> "select by first part of primary key"? Reduce the amount of buckets? I get >>> exeption a while after restarting app: >>> >>> >>> *Task org.apache.phoenix.job.JobManager$JobFutureTask@60a40644 rejected >>> from org.apache.phoenix.job.JobManager$1@58e3fe9aRunning, pool size = 128, >>> active threads = 121, queued tasks = 5000, completed tasks = 2629565* >>> >>> >>> >> >