Hi Jonathan, Running single-threaded on one side sounds logical to me. Hopefully Vincent can confirm it.
Thanks, Shawn On Mon, Dec 17, 2018 at 9:25 PM Jonathan Leech <jonat...@gmail.com> wrote: > My guess is that in order to enforce the limit, it’s effectively > single-threaded in either the select or the upsert. > > On Dec 17, 2018, at 6:43 PM, Shawn Li <shawnli...@gmail.com> wrote: > > Hi Vincent, > > Thanks for explaining. That makes much more sense now, and it explains the > high memory usage without the "limit" clause. Because it upserts much more > quickly when using "upsert select" without "limit", the memory usage on the > client machine is much higher than with "upsert select" with "limit". > > So back to the other question. Can you explain the underlying Phoenix > implementation of "upsert select limit"? Why is it slower than without > "limit" when inserting a huge number of rows (2M), as in "upsert into table2 > select * from table1 limit 2000000;"? This is much slower than inserting > the whole table (upsert into table2 select * from table1;). > > Thanks, > Xiang > > > On Mon, Dec 17, 2018 at 1:56 PM Vincent Poon <vincentp...@apache.org> > wrote: > >> Shawn, >> Your query *upsert into table2 select * from table1; *would not be run >> on the server - the source and target tables are different. It would have >> to be something like: >> *upsert into table1 select * from table1;* >> >> If you want to run server-side upsert select on a target table that is >> different from the source table, you need to set >> "phoenix.client.enable.server.upsert.select" to true on your client. >> There are some other restrictions: the table can't have any global indexes, >> and the statement can't have a join or a where subquery. We need to update >> the documentation with this information. >> >> The reason there are all these hurdles is that it's generally not >> recommended to do server-side upsert select across different tables, >> because that means you're doing cross-regionserver RPCs (e.g. 
read data >> from a region of sourcetable, and write to a region of targettable on a >> different regionserver), potentially tying up handlers in an unpredictable >> way. >> >> On Sun, Dec 16, 2018 at 7:12 PM Shawn Li <shawnli...@gmail.com> wrote: >> >>> Hi Jaanai, >>> >>> According to the Phoenix website, "If auto commit is on, and both a) the >>> target table matches the source table, and b) the select performs no >>> aggregation, then the population of the target table will be done >>> completely on the server-side (with constraint violations logged, but >>> otherwise ignored). Otherwise, data is buffered on the client and, if auto >>> commit is on, committed in row batches as specified by the UpsertBatchSize >>> connection property (or the phoenix.mutate.upsertBatchSize HBase config >>> property which defaults to 10000 rows)" >>> >>> And our SQL statement is just *upsert into table2 select * from >>> table1;*, which should match the first case, so all operations should be on >>> the server side. But the memory usage on the client machine is higher than with >>> the "upsert select limit" clause. The memory usage was checked by running the 'top' >>> command on Linux, so we are sure it is caused by the "upsert select" in >>> Phoenix and nothing else, and we can't explain why memory usage is so high >>> on the client/gateway machine when all operations are supposed to happen on the >>> server side. >>> >>> Thanks, >>> Shawn >>> >>> On Thu, Dec 13, 2018 at 3:15 AM Jaanai Zhang <cloud.pos...@gmail.com> >>> wrote: >>> >>>> Shawn, >>>> >>>> The UPSERT SELECT will run in a coprocessor only if it has no limit >>>> clause, queries only one table, performs no aggregation, uses no sequences, >>>> and auto commit is on. Please check your SQL ... and you can also check >>>> whether some resources have not been released. >>>> >>>> ---------------------------------------- >>>> Jaanai Zhang >>>> Best regards! 
>>>> >>>> >>>> Shawn Li <shawnli...@gmail.com> wrote on Thu, Dec 13, 2018 at 12:10 PM: >>>> >>>>> Hi Jaanai, >>>>> >>>>> Thanks for sharing your thoughts. The behavior you describe is correct >>>>> on the HBase region server side. The memory usage for blockcache and >>>>> memstore will be high under such high throughput. But our Phoenix client >>>>> is >>>>> on a gateway machine (no HBase region server sitting on it, nor any other HBase >>>>> service), so we're not sure how to explain such high memory usage for >>>>> upsert select without the "limit" clause. The high memory usage looks as if >>>>> all select results are sent to the client machine, cached in the client machine's >>>>> memory, and then inserted back into the target table, which is not the >>>>> behavior that should happen; all of this should be done on the server side >>>>> since the table schemas are exactly the same. By the way, this happens on both >>>>> Phoenix 4.7 and Phoenix 4.14. >>>>> >>>>> >>>>> Thanks, >>>>> Shawn >>>>> >>>>> On Wed, Dec 12, 2018 at 10:26 PM Jaanai Zhang <cloud.pos...@gmail.com> >>>>> wrote: >>>>> >>>>>> Shawn, >>>>>> >>>>>> >>>>>> The upsert without limit will read the source table and >>>>>> write the target table on the server side. I think the higher memory >>>>>> usage is caused by the scan cache and memstore under the higher >>>>>> throughput. >>>>>> >>>>>> ---------------------------------------- >>>>>> Jaanai Zhang >>>>>> Best regards! >>>>>> >>>>>> >>>>>> >>>>>> Shawn Li <shawnli...@gmail.com> wrote on Thu, Dec 13, 2018 at 10:13 AM: >>>>>> >>>>>>> Hi Vincent, >>>>>>> >>>>>>> So you're describing that with limit, results are sent to the client side and then >>>>>>> written back to the server; this might explain why upsert with limit is slower than >>>>>>> without >>>>>>> limit. But it doesn't seem to explain the memory usage: the memory >>>>>>> usage >>>>>>> on the client machine is 8 GB (without "limit") vs 2 GB (with limit), and sometimes >>>>>>> upsert without "limit" can even reach 20 GB for a big table. 
>>>>>>> >>>>>>> Thanks, >>>>>>> Shawn >>>>>>> >>>>>>> On Wed, Dec 12, 2018 at 6:34 PM Vincent Poon <vincentp...@apache.org> >>>>>>> wrote: >>>>>>> >>>>>>>> I think it's done client-side if you have LIMIT. If you have e.g. >>>>>>>> LIMIT 1000, it would be incorrect for each regionserver to upsert >>>>>>>> 1000, if >>>>>>>> you have more than one regionserver. So instead results are sent back >>>>>>>> to >>>>>>>> the client, where the LIMIT is applied and then written back to the >>>>>>>> server >>>>>>>> in the UPSERT. >>>>>>>> >>>>>>>> On Wed, Dec 12, 2018 at 1:18 PM Shawn Li <shawnli...@gmail.com> >>>>>>>> wrote: >>>>>>>>> >>>>>>>>> Hi Vincent, >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> The table creation statement is similar to the one below. We have about >>>>>>>>> 200 fields. The table is mutable and doesn't have any indexes. >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> CREATE TABLE IF NOT EXISTS us_population ( >>>>>>>>> >>>>>>>>> state CHAR(2) NOT NULL, >>>>>>>>> >>>>>>>>> city VARCHAR, >>>>>>>>> >>>>>>>>> population BIGINT, >>>>>>>>> >>>>>>>>> … >>>>>>>>> >>>>>>>>> CONSTRAINT my_pk PRIMARY KEY (state)); >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> Thanks, >>>>>>>>> >>>>>>>>> Shawn >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> On Wed, Dec 12, 2018, 13:42 Vincent Poon <vincentp...@apache.org >>>>>>>>> wrote: >>>>>>>>> >>>>>>>>>> For #2, can you provide the table definition and the statement >>>>>>>>>> used? e.g. Is the table immutable, or does it have indexes? >>>>>>>>>> >>>>>>>>>> On Tue, Dec 11, 2018 at 6:08 PM Shawn/Xiang Li < >>>>>>>>>> shawnli...@gmail.com> wrote: >>>>>>>>>> >>>>>>>>>>> Hi, >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> 1. 
Want to check what runs underneath for the limit >>>>>>>>>>> clause used in the following upsert statement (does it involve any >>>>>>>>>>> coprocessor working behind the scenes?): >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> * upsert into table2 select * >>>>>>>>>>> from table1 limit 3000000; * (table 1 and table 2 have the same >>>>>>>>>>> schema) >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> The above statement runs a lot slower than >>>>>>>>>>> without the “limit” clause, as shown in the following, even though the above >>>>>>>>>>> statement >>>>>>>>>>> upserts less data: >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> * upsert into table2 select * >>>>>>>>>>> from table1;* >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> 2. We also observe that memory usage is pretty high without >>>>>>>>>>> the limit clause (8 GB vs 2 GB); sometimes for a big table it can reach >>>>>>>>>>> 20 GB >>>>>>>>>>> without using the limit clause. According to the Phoenix website >>>>>>>>>>> description for >>>>>>>>>>> upsert select, “If auto commit is on, and both a) the target table >>>>>>>>>>> matches >>>>>>>>>>> the source table, and b) the select performs no aggregation, then >>>>>>>>>>> the >>>>>>>>>>> population of the target table will be done completely on the >>>>>>>>>>> server-side >>>>>>>>>>> (with constraint violations logged, but >>>>>>>>>>> otherwise ignored).” >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> My question is: if everything is done on the >>>>>>>>>>> server side, how come we have such high memory usage on the client >>>>>>>>>>> machine? >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> Thanks, >>>>>>>>>>> >>>>>>>>>>> Shawn >>>>>>>>>>
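[Editor's note: a minimal sketch summarizing the execution paths discussed in this thread, for future readers. The behavior comments restate what Vincent and Jaanai describe above (Phoenix 4.7/4.14 era); table names are the hypothetical table1/table2 from the thread, not a verified implementation reference.]

```sql
-- Runs entirely server-side in a coprocessor only when ALL of these hold:
-- auto commit is on, the target table matches the source table, and the
-- statement has no LIMIT, no aggregation, and no sequences:
UPSERT INTO table1 SELECT * FROM table1;

-- Different target table: by default rows are buffered on the client and
-- committed in batches of phoenix.mutate.upsertBatchSize (default 10000):
UPSERT INTO table2 SELECT * FROM table1;

-- With LIMIT, results are always pulled back to the client so the limit can
-- be applied globally across regionservers, then written back in the UPSERT:
UPSERT INTO table2 SELECT * FROM table1 LIMIT 2000000;
```

Per Vincent's message, server-side execution for the cross-table case can be opted into with the client-side property below, subject to the restrictions he lists (no global indexes on the table, no joins or WHERE subqueries), at the cost of cross-regionserver RPCs:

```
phoenix.client.enable.server.upsert.select=true
```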