Shawn,

Your query *upsert into table2 select * from table1;* would not be run on the server - the source and target tables are different. It would have to be something like: *upsert into table1 select * from table1;*

If you want to run a server-side upsert select on a target table that is different from the source table, you need to set "phoenix.client.enable.server.upsert.select" to true on your client. There are some other restrictions: the table can't have any global indexes, and the statement can't have a join or a subquery in the WHERE clause. We need to update the documentation with this information. The reason for all these hurdles is that server-side upsert select across different tables is generally not recommended, because it means cross-regionserver RPCs (e.g. reading data from a region of the source table and writing to a region of the target table on a different regionserver), potentially tying up handlers in an unpredictable way.
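For example, the property can be passed as a JDBC connection property on the client. A rough, untested sketch - the ZooKeeper quorum in the URL is a placeholder, and the table names are the ones from this thread:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;
    import java.util.Properties;

    public class ServerSideUpsert {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            // Opt in to server-side UPSERT SELECT when source and target tables differ
            props.setProperty("phoenix.client.enable.server.upsert.select", "true");
            try (Connection conn = DriverManager.getConnection("jdbc:phoenix:zkhost", props)) {
                conn.setAutoCommit(true); // auto commit must be on for server-side execution
                try (Statement stmt = conn.createStatement()) {
                    stmt.executeUpdate("UPSERT INTO table2 SELECT * FROM table1");
                }
            }
        }
    }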
On Sun, Dec 16, 2018 at 7:12 PM Shawn Li <shawnli...@gmail.com> wrote:

> Hi Jaanai,
>
> According to the Phoenix website, "If auto commit is on, and both a) the
> target table matches the source table, and b) the select performs no
> aggregation, then the population of the target table will be done
> completely on the server-side (with constraint violations logged, but
> otherwise ignored). Otherwise, data is buffered on the client and, if
> auto commit is on, committed in row batches as specified by the
> UpsertBatchSize connection property (or the phoenix.mutate.upsertBatchSize
> HBase config property which defaults to 10000 rows)."
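> For reference, a sketch of how that UpsertBatchSize connection property
> can be set (untested; the batch size and URL are illustrative only):
>
>     import java.sql.Connection;
>     import java.sql.DriverManager;
>     import java.sql.Statement;
>     import java.util.Properties;
>
>     public class UpsertBatchSize {
>         public static void main(String[] args) throws Exception {
>             Properties props = new Properties();
>             // Rows per commit batch when the upsert is buffered on the client
>             props.setProperty("UpsertBatchSize", "1000");
>             try (Connection conn = DriverManager.getConnection("jdbc:phoenix:zkhost", props);
>                  Statement stmt = conn.createStatement()) {
>                 conn.setAutoCommit(true);
>                 stmt.executeUpdate("UPSERT INTO table2 SELECT * FROM table1");
>             }
>         }
>     }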
> And our sql statement is just: *upsert into table2 select * from table1;*
> which should match the first case, so all operations should happen on the
> server side. But the memory usage on the client machine is higher than
> with an "upsert select limit" clause. The memory usage was checked by
> running the 'top' command under Linux, so we are sure it is caused by the
> "upsert select" in Phoenix and nothing else, and we can't explain why
> there is such high memory usage on the client/gateway machine when all
> operations are supposed to happen on the server side.
>
> Thanks,
> Shawn
>
> On Thu, Dec 13, 2018 at 3:15 AM Jaanai Zhang <cloud.pos...@gmail.com>
> wrote:
>
>> Shawn,
>>
>> The UPSERT SELECT will run in a coprocessor only if it has no limit
>> clause, queries only one table, performs no aggregation, uses no
>> sequences, and auto commit is on. Please check your SQL ... and you can
>> also check whether some resources have not been released.
>>
>> ----------------------------------------
>> Jaanai Zhang
>> Best regards!
>>
>> Shawn Li <shawnli...@gmail.com> wrote on Thu, Dec 13, 2018 at 12:10 PM:
>>
>>> Hi Jaanai,
>>>
>>> Thanks for your thoughts. The behavior you describe is correct on the
>>> HBase region server side: the memory usage for the block cache and
>>> memstore will be high under such high throughput. But our Phoenix
>>> client is on a gateway machine (no HBase region server or any other
>>> HBase service sits on it), so we are not sure how to explain such high
>>> memory usage for an upsert select without a "limit" clause. The high
>>> memory usage looks as if all select results were sent to the client
>>> machine, cached in the client machine's memory, and then inserted back
>>> into the target table, which is not the behavior that should happen:
>>> all of this should be done on the server side, as the table schemas
>>> are exactly the same. By the way, this happens on both Phoenix 4.7 and
>>> Phoenix 4.14.
>>>
>>> Thanks,
>>> Shawn
>>>
>>> On Wed, Dec 12, 2018 at 10:26 PM Jaanai Zhang <cloud.pos...@gmail.com>
>>> wrote:
>>>
>>>> Shawn,
>>>>
>>>> The upsert without a limit will read the source table and write the
>>>> target table on the server side. I think the higher memory usage is
>>>> caused by the scan cache and memstore under the higher throughput.
>>>>
>>>> ----------------------------------------
>>>> Jaanai Zhang
>>>> Best regards!
>>>>
>>>> Shawn Li <shawnli...@gmail.com> wrote on Thu, Dec 13, 2018 at 10:13 AM:
>>>>
>>>>> Hi Vincent,
>>>>>
>>>>> So you are describing that with a limit the results are sent to the
>>>>> client side and then written back to the server; this might explain
>>>>> why an upsert with a limit is slower than one without. But that
>>>>> doesn't seem to explain the memory usage: the memory usage on the
>>>>> client machine is 8 GB (without "limit") vs 2 GB (with limit), and
>>>>> an upsert without "limit" can even reach 20 GB for a big table.
>>>>>
>>>>> Thanks,
>>>>> Shawn
>>>>>
>>>>> On Wed, Dec 12, 2018 at 6:34 PM Vincent Poon <vincentp...@apache.org>
>>>>> wrote:
>>>>>
>>>>>> I think it's done client-side if you have LIMIT. If you have e.g.
>>>>>> LIMIT 1000, it would be incorrect for each regionserver to upsert
>>>>>> 1000 rows if you have more than one regionserver. So instead the
>>>>>> results are sent back to the client, where the LIMIT is applied,
>>>>>> and then written back to the server in the UPSERT.
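>>>>>> A quick way to check where a given statement will run is to look at
>>>>>> its plan. A sketch, untested - it assumes your Phoenix version
>>>>>> supports EXPLAIN on an UPSERT SELECT, and the URL is a placeholder:
>>>>>>
>>>>>>     import java.sql.Connection;
>>>>>>     import java.sql.DriverManager;
>>>>>>     import java.sql.ResultSet;
>>>>>>     import java.sql.Statement;
>>>>>>
>>>>>>     public class ExplainUpsert {
>>>>>>         public static void main(String[] args) throws Exception {
>>>>>>             try (Connection conn = DriverManager.getConnection("jdbc:phoenix:zkhost");
>>>>>>                  Statement stmt = conn.createStatement();
>>>>>>                  // EXPLAIN returns the plan as rows of text
>>>>>>                  ResultSet rs = stmt.executeQuery(
>>>>>>                          "EXPLAIN UPSERT INTO table2 SELECT * FROM table1 LIMIT 3000000")) {
>>>>>>                 while (rs.next()) {
>>>>>>                     System.out.println(rs.getString(1));
>>>>>>                 }
>>>>>>             }
>>>>>>         }
>>>>>>     }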
>>>>>> On Wed, Dec 12, 2018 at 1:18 PM Shawn Li <shawnli...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Vincent,
>>>>>>>
>>>>>>> The table creation statement is similar to the one below. We have
>>>>>>> about 200 fields. The table is mutable and doesn't have any
>>>>>>> indexes on it.
>>>>>>>
>>>>>>> CREATE TABLE IF NOT EXISTS us_population (
>>>>>>>     state CHAR(2) NOT NULL,
>>>>>>>     city VARCHAR,
>>>>>>>     population BIGINT,
>>>>>>>     …
>>>>>>>     CONSTRAINT my_pk PRIMARY KEY (state));
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Shawn
>>>>>>>
>>>>>>> On Wed, Dec 12, 2018, 13:42 Vincent Poon <vincentp...@apache.org>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> For #2, can you provide the table definition and the statement
>>>>>>>> used? e.g. Is the table immutable, or does it have indexes?
>>>>>>>>
>>>>>>>> On Tue, Dec 11, 2018 at 6:08 PM Shawn/Xiang Li <
>>>>>>>> shawnli...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> 1. I want to check what actually runs for the "limit" clause
>>>>>>>>> used in the following upsert statement (does it involve any
>>>>>>>>> coprocessor behind the scenes?):
>>>>>>>>>
>>>>>>>>> *upsert into table2 select * from table1 limit 3000000;*
>>>>>>>>> (table1 and table2 have the same schema)
>>>>>>>>>
>>>>>>>>> The above statement runs a lot slower than the statement without
>>>>>>>>> the "limit" clause shown below, even though it upserts less
>>>>>>>>> data:
>>>>>>>>>
>>>>>>>>> *upsert into table2 select * from table1;*
>>>>>>>>>
>>>>>>>>> 2. We also observe that memory usage is pretty high without the
>>>>>>>>> limit clause (8 GB vs 2 GB); for a big table it can sometimes
>>>>>>>>> reach 20 GB. According to the Phoenix website description for
>>>>>>>>> upsert select, "If auto commit is on, and both a) the target
>>>>>>>>> table matches the source table, and b) the select performs no
>>>>>>>>> aggregation, then the population of the target table will be
>>>>>>>>> done completely on the server-side (with constraint violations
>>>>>>>>> logged, but otherwise ignored)."
>>>>>>>>>
>>>>>>>>> My question is: if everything is done on the server side, how
>>>>>>>>> come we have such high memory usage on the client machine?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Shawn