My guess is that in order to enforce the limit, it's effectively single-threaded in either the select or the upsert.
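One way to check is to look at the query plan for the statement. Below is a minimal sketch using the standard Phoenix JDBC driver (the ZK quorum in the URL is a placeholder; the table names are the ones from this thread):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class ExplainUpsert {
        public static void main(String[] args) throws Exception {
            // Placeholder ZK quorum; adjust for your cluster.
            try (Connection conn =
                         DriverManager.getConnection("jdbc:phoenix:zkhost:2181");
                 Statement stmt = conn.createStatement();
                 // EXPLAIN works on UPSERT SELECT as well as plain SELECT
                 ResultSet rs = stmt.executeQuery(
                         "EXPLAIN UPSERT INTO table2 SELECT * FROM table1 LIMIT 2000000")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1)); // one plan step per row
                }
            }
        }
    }

A plan step that shows the row limit being applied on the client rather than on the server would be consistent with the behavior described below.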
> On Dec 17, 2018, at 6:43 PM, Shawn Li <shawnli...@gmail.com> wrote:
>
> Hi Vincent,
>
> Thanks for explaining. That makes much more sense now, and it explains the
> high memory usage without the "limit" clause. Because "upsert select"
> without "limit" upserts much more quickly, the memory usage on the client
> machine is much higher than with "upsert select ... limit".
>
> So back to the other question. Can you explain the underlying Phoenix
> implementation of "upsert select limit"? Why is it slower than without
> "limit" when inserting a huge number of rows (2M), as in "upsert into
> table2 select * from table1 limit 2,000,000;"? This is much slower than
> inserting the whole table (upsert into table2 select * from table1;).
>
> Thanks,
> Xiang
>
>> On Mon, Dec 17, 2018 at 1:56 PM Vincent Poon <vincentp...@apache.org> wrote:
>> Shawn,
>> Your query "upsert into table2 select * from table1;" would not be run on
>> the server - the source and target table are different. It would have to
>> be something like:
>> upsert into table1 select * from table1;
>>
>> If you want to run a server-side upsert select on a target table that is
>> different from the source table, you need to set
>> "phoenix.client.enable.server.upsert.select" to true on your client.
>> There are some other restrictions: the table can't have any global
>> indexes, and the statement can't have a join or a WHERE subquery. We need
>> to update the documentation with this information.
>>
>> The reason there are all these hurdles is that it's generally not
>> recommended to do a server-side upsert select across different tables,
>> because that means you're doing cross-regionserver RPCs (e.g. reading data
>> from a region of the source table and writing to a region of the target
>> table on a different regionserver), potentially tying up handlers in an
>> unpredictable way.
>>
>>> On Sun, Dec 16, 2018 at 7:12 PM Shawn Li <shawnli...@gmail.com> wrote:
>>> Hi Jaanai,
>>>
>>> According to the Phoenix website, "If auto commit is on, and both a) the
>>> target table matches the source table, and b) the select performs no
>>> aggregation, then the population of the target table will be done
>>> completely on the server-side (with constraint violations logged, but
>>> otherwise ignored). Otherwise, data is buffered on the client and, if
>>> auto commit is on, committed in row batches as specified by the
>>> UpsertBatchSize connection property (or the phoenix.mutate.upsertBatchSize
>>> HBase config property which defaults to 10000 rows)".
>>>
>>> And our SQL statement is just "upsert into table2 select * from table1;",
>>> which should match the first case, so all operations should happen on the
>>> server side. But the memory usage on the client machine is higher than
>>> with the "upsert select ... limit" version. The memory usage was checked
>>> by running the 'top' command under Linux, so we are sure it is caused by
>>> the Phoenix "upsert select" and not by something else, and we can't
>>> explain why there is such high memory usage on the client/gateway machine
>>> when all operations are supposed to happen on the server side.
>>>
>>> Thanks,
>>> Shawn
>>>
>>>> On Thu, Dec 13, 2018 at 3:15 AM Jaanai Zhang <cloud.pos...@gmail.com> wrote:
>>>> Shawn,
>>>>
>>>> The UPSERT SELECT will run in a coprocessor only if it has no LIMIT
>>>> clause, queries a single table, performs no aggregation, uses no
>>>> sequences, and auto commit is on. Please check your SQL ... and you can
>>>> also check whether some resources have not been released.
>>>>
>>>> ----------------------------------------
>>>>    Jaanai Zhang
>>>>    Best regards!
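For reference, a minimal sketch of setting the property Vincent mentions on a Phoenix JDBC connection (the ZK quorum in the URL is a placeholder; the table names are the ones used in this thread):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;
    import java.util.Properties;

    public class ServerSideUpsertSelect {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            // Property name from Vincent's email: allows server-side
            // UPSERT SELECT even when source and target tables differ.
            props.setProperty("phoenix.client.enable.server.upsert.select", "true");
            try (Connection conn =
                         DriverManager.getConnection("jdbc:phoenix:zkhost:2181", props);
                 Statement stmt = conn.createStatement()) {
                conn.setAutoCommit(true); // server-side execution also requires auto commit
                stmt.executeUpdate("UPSERT INTO table2 SELECT * FROM table1");
            }
        }
    }

Note that the other restrictions Vincent lists still apply: no global indexes on the table, and no join or WHERE subquery in the statement.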
>>>> On Thu, Dec 13, 2018 at 12:10 PM Shawn Li <shawnli...@gmail.com> wrote:
>>>>> Hi Jaanai,
>>>>>
>>>>> Thanks for sharing your thoughts. The behavior you describe is correct
>>>>> on the HBase region server side: the memory usage for the block cache
>>>>> and memstore will be high under such high throughput. But our Phoenix
>>>>> client is on a gateway machine (with no HBase region server or any
>>>>> other HBase service on it), so we are not sure how to explain such high
>>>>> memory usage for upsert select without the "limit" clause. The high
>>>>> memory usage looks as if all select results are sent to the client
>>>>> machine, cached in the client machine's memory, and then inserted back
>>>>> into the target table - which is not the behavior that should happen;
>>>>> all of this should be done on the server side, since the table schemas
>>>>> are exactly the same. By the way, this happens on both Phoenix 4.7 and
>>>>> Phoenix 4.14.
>>>>>
>>>>> Thanks,
>>>>> Shawn
>>>>>
>>>>>> On Wed, Dec 12, 2018 at 10:26 PM Jaanai Zhang <cloud.pos...@gmail.com> wrote:
>>>>>> Shawn,
>>>>>>
>>>>>> The upsert without limit will read the source table and write the
>>>>>> target table on the server side. I think the higher memory usage is
>>>>>> caused by the scan cache and memstore under the higher throughput.
>>>>>>
>>>>>> ----------------------------------------
>>>>>>    Jaanai Zhang
>>>>>>    Best regards!
>>>>>>
>>>>>> On Thu, Dec 13, 2018 at 10:13 AM Shawn Li <shawnli...@gmail.com> wrote:
>>>>>>> Hi Vincent,
>>>>>>>
>>>>>>> So you are describing that with "limit" the results are sent to the
>>>>>>> client side and then written back to the server; that might explain
>>>>>>> why upsert with "limit" is slower than without it. But it doesn't
>>>>>>> seem to explain the memory usage: the memory usage on the client
>>>>>>> machine is 8 GB (without "limit") vs 2 GB (with "limit"), and
>>>>>>> sometimes upsert without "limit" can even reach 20 GB for a big
>>>>>>> table.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Shawn
>>>>>>>
>>>>>>>> On Wed, Dec 12, 2018 at 6:34 PM Vincent Poon <vincentp...@apache.org> wrote:
>>>>>>>> I think it's done client-side if you have a LIMIT. If you have e.g.
>>>>>>>> LIMIT 1000, it would be incorrect for each regionserver to upsert
>>>>>>>> 1000 rows if you have more than one regionserver. So instead,
>>>>>>>> results are sent back to the client, where the LIMIT is applied and
>>>>>>>> then written back to the server in the UPSERT.
>>>>>>>>
>>>>>>>>> On Wed, Dec 12, 2018 at 1:18 PM Shawn Li <shawnli...@gmail.com> wrote:
>>>>>>>>> Hi Vincent,
>>>>>>>>>
>>>>>>>>> The table creation statement is similar to the one below. We have
>>>>>>>>> about 200 fields. The table is mutable and doesn't have any
>>>>>>>>> indexes.
>>>>>>>>>
>>>>>>>>> CREATE TABLE IF NOT EXISTS us_population (
>>>>>>>>>     state CHAR(2) NOT NULL,
>>>>>>>>>     city VARCHAR,
>>>>>>>>>     population BIGINT,
>>>>>>>>>     …
>>>>>>>>>     CONSTRAINT my_pk PRIMARY KEY (state));
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Shawn
>>>>>>>>>
>>>>>>>>>> On Wed, Dec 12, 2018, 13:42 Vincent Poon <vincentp...@apache.org> wrote:
>>>>>>>>>> For #2, can you provide the table definition and the statement
>>>>>>>>>> used? e.g. Is the table immutable, or does it have indexes?
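For reference, a minimal sketch of bounding client-side buffering via the UpsertBatchSize connection property from the documentation quoted above in the thread (the ZK quorum in the URL is a placeholder, and the batch size of 1000 is just an example value):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;
    import java.util.Properties;

    public class BatchedUpsertSelect {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            // Connection property from the Phoenix docs; defaults to 10000 rows.
            // A smaller batch should cap how many rows the client buffers per commit.
            props.setProperty("UpsertBatchSize", "1000");
            try (Connection conn =
                         DriverManager.getConnection("jdbc:phoenix:zkhost:2181", props);
                 Statement stmt = conn.createStatement()) {
                conn.setAutoCommit(true); // commit in row batches as rows stream back
                stmt.executeUpdate("UPSERT INTO table2 SELECT * FROM table1 LIMIT 2000000");
            }
        }
    }

This only matters for the client-side path (e.g. with LIMIT, or when the target table differs without the server-side property enabled); a fully server-side upsert select should not buffer rows on the client at all.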
>>>>>>>>>>
>>>>>>>>>>> On Tue, Dec 11, 2018 at 6:08 PM Shawn/Xiang Li <shawnli...@gmail.com> wrote:
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> 1. I want to check what runs underneath the limit clause used in
>>>>>>>>>>> the following upsert statement (does it involve any coprocessor
>>>>>>>>>>> working behind the scenes?):
>>>>>>>>>>>
>>>>>>>>>>>     upsert into table2 select * from table1 limit 3000000;
>>>>>>>>>>>     (table1 and table2 have the same schema)
>>>>>>>>>>>
>>>>>>>>>>> The above statement runs a lot slower than the version without
>>>>>>>>>>> the "limit" clause shown below, even though the above statement
>>>>>>>>>>> upserts less data:
>>>>>>>>>>>
>>>>>>>>>>>     upsert into table2 select * from table1;
>>>>>>>>>>>
>>>>>>>>>>> 2. We also observe that memory usage is pretty high without the
>>>>>>>>>>> limit clause (8 GB vs 2 GB); for a big table it can sometimes
>>>>>>>>>>> reach 20 GB without the limit clause. According to the Phoenix
>>>>>>>>>>> website's description of upsert select, "If auto commit is on,
>>>>>>>>>>> and both a) the target table matches the source table, and b)
>>>>>>>>>>> the select performs no aggregation, then the population of the
>>>>>>>>>>> target table will be done completely on the server-side (with
>>>>>>>>>>> constraint violations logged, but otherwise ignored)."
>>>>>>>>>>>
>>>>>>>>>>> My question is: if everything is done on the server side, how
>>>>>>>>>>> come we have such high memory usage on the client machine?
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Shawn