Re: "upsert select" with "limit" clause

Vincent Poon Tue, 18 Dec 2018 14:31:43 -0800

Shawn, that sounds like a bug, I would file a JIRA.

On Tue, Dec 18, 2018 at 12:33 PM Shawn Li <shawnli...@gmail.com> wrote:


> Hi Vincent & William,
>
>
>
> Below is the explain plan, both are PARALLEL excuted in plan:
>
>
>
> explain upsert into table1 select * from table2;
>
>
>
> UPSERT
> SELECT
>                       |
>
> CLIENT 27-CHUNK 915799 ROWS 2831155510 BYTES PARALLEL 18-WAY ROUND ROBIN
> FULL SCAN OVER table2
>
>
>
> explain upsert into table1 select * from table2 limit 2000000;
>
>
>
> UPSERT
> SELECT
>                       |
>
> | CLIENT 27-CHUNK 36000000 ROWS 481140000000 BYTES PARALLEL 18-WAY ROUND
> ROBIN FULL SCAN OVER table2 |
>
> |     SERVER 2000000 ROW
> LIMIT
>                     |
>
> | CLIENT 2000000 ROW LIMIT
>
>
>
>
>
>
>
> Thanks,
>
> Shawn
>
> On Tue, Dec 18, 2018, 13:30 Vincent Poon <vincentp...@apache.org wrote:
>
>> Shawn,
>>
>> Can you do an "explain" to show what your two statements are doing?  That
>> might give some clues.  Perhaps one is able to be run on the server for
>> some reason and the other is not.
>> Otherwise, I don't see why one would be substantially slower than the
>> other.
>>
>> Vincent
>>
>> On Mon, Dec 17, 2018 at 9:14 PM Shawn Li <shawnli...@gmail.com> wrote:
>>
>>> Hi Jonathan,
>>>
>>> The single threaded on one side sounds logical to me. Hopefully Vincent
>>> can confirm it.
>>>
>>> Thanks,
>>> Shawn
>>>
>>> On Mon, Dec 17, 2018 at 9:25 PM Jonathan Leech <jonat...@gmail.com>
>>> wrote:
>>>
>>>> My guess is that in order to enforce the limit that it’s effectively
>>>> single threaded in either the select or the upsert.
>>>>
>>>> On Dec 17, 2018, at 6:43 PM, Shawn Li <shawnli...@gmail.com> wrote:
>>>>
>>>> Hi Vincent,
>>>>
>>>> Thanks for explaining. That makes much more sense now and it explains
>>>> the high memory usage when without "limit" clause. Because it upserts much
>>>> quickly when using "upsert select" without "limit", the memory usage in
>>>> client machine is much higher than "upsert select" with "limit" .
>>>>
>>>> So back to the other question. Can you explain what is underlying
>>>> Phoenix implementation for "upsert select limit"? Why it is slower than
>>>> without "limit" when insert a huge number (2m rows) like ""upsert into
>>>> table2 select * from table1 limit 2,000,000;". This is much slower than
>>>> inserting the whole table (upsert into table2 select * from table1;).
>>>>
>>>> Thanks,
>>>> Xiang
>>>>
>>>>
>>>> On Mon, Dec 17, 2018 at 1:56 PM Vincent Poon <vincentp...@apache.org>
>>>> wrote:
>>>>
>>>>> Shawn,
>>>>> Your query *upsert into table2 select * from table1;  *would not be
>>>>> run on the server - the source and target table are different.  It would
>>>>> have to be something like:
>>>>> *upsert into table1 select * from table1;*
>>>>>
>>>>> If you want to run server-side upsert select on a target table that is
>>>>> different from the source table, you need to set
>>>>> "phoenix.client.enable.server.upsert.select" to true on your client.
>>>>> The are some other restrictions: the table can't have any global
>>>>> indexes, and the statement can't have a join or where subquery.  We need 
>>>>> to
>>>>> update the documentation with this information.
>>>>>
>>>>> The reason there are all these hurdles is because it's generally not
>>>>> recommended to do server-side upsert select across different tables,
>>>>> because that means you're doing cross-regionserver RPCs (e.g. read data
>>>>> from a region of sourcetable, and write to a region of targettable on a
>>>>> different regionserver), potentially tying up handlers in an unpredictable
>>>>> way.
>>>>>
>>>>> On Sun, Dec 16, 2018 at 7:12 PM Shawn Li <shawnli...@gmail.com> wrote:
>>>>>
>>>>>> Hi Jaanai,
>>>>>>
>>>>>> According to Phoenix website, " If auto commit is on, and both a)
>>>>>> the target table matches the source table, and b) the select performs no
>>>>>> aggregation, then the population of the target table will be done
>>>>>> completely on the server-side (with constraint violations logged, but
>>>>>> otherwise ignored). Otherwise, data is buffered on the client and, if 
>>>>>> auto
>>>>>> commit is on, committed in row batches as specified by the 
>>>>>> UpsertBatchSize
>>>>>> connection property (or the phoenix.mutate.upsertBatchSize HBase config
>>>>>> property which defaults to 10000 rows)"
>>>>>>
>>>>>> And our sql statement is just: *upsert into table2 select * from
>>>>>> table1; *which should match the first case, all operations should be
>>>>>> in server site. But the memory usage on the client machine is higher than
>>>>>> "upsert select limit" clause. And the memory usage is check by run 'top'
>>>>>> command under Linux. So we are sure it is caused by "select upsert" in
>>>>>> Phoenix and not others, and can't explain why there is so high memory 
>>>>>> usage
>>>>>> on client/gateway machine when all operations are supposed to happen on 
>>>>>> the
>>>>>> serve side.
>>>>>>
>>>>>> Thanks,
>>>>>> Shawn
>>>>>>
>>>>>> On Thu, Dec 13, 2018 at 3:15 AM Jaanai Zhang <cloud.pos...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Shawn,
>>>>>>>
>>>>>>> The UPSERT SELECT will run in a coprocessor on if it hasn't limit
>>>>>>> clause, only query 1 table, the query is doing aggregation, no sequences
>>>>>>> and auto commit is on. Please check your SQL ... and you can also check
>>>>>>> whether some resources have not been released.
>>>>>>>
>>>>>>> ----------------------------------------
>>>>>>>    Jaanai Zhang
>>>>>>>    Best regards!
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Shawn Li <shawnli...@gmail.com> 于2018年12月13日周四 下午12:10写道：
>>>>>>>
>>>>>>>> Hi Jaanai,
>>>>>>>>
>>>>>>>> Thanks for putting your thought. The behavior you describe is
>>>>>>>> correct on the Hbase region sever side. The memory usage for 
>>>>>>>> blockcache and
>>>>>>>> memstore will be high under such high throughput. But our phoenix 
>>>>>>>> client is
>>>>>>>> on a gateway machine (no hbase region server sitting on it or any Hbase
>>>>>>>> service on it), so not sure how to explain such high memory usage for
>>>>>>>> upsert select without "limit" clause. The high memory usage behavior 
>>>>>>>> like
>>>>>>>> all select results send to client machine, cached in client machine's
>>>>>>>> memory, and then insert back to target table, which is not like the
>>>>>>>> behavior that should happen, all of this should be done on the server 
>>>>>>>> side
>>>>>>>> as the table schema is exactly the same. By the way, this happens on 
>>>>>>>> both
>>>>>>>> Phoenix 4.7 and Phoenix 4.14.
>>>>>>>>
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Shawn
>>>>>>>>
>>>>>>>> On Wed, Dec 12, 2018 at 10:26 PM Jaanai Zhang <
>>>>>>>> cloud.pos...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Shawn,
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> For the upsert without limit,  which will read the source table
>>>>>>>>> and write the target tables on the server side.  I think the higher 
>>>>>>>>> memory
>>>>>>>>> usage is caused by using scan cache and memstore under the higher
>>>>>>>>> throughput.
>>>>>>>>>
>>>>>>>>> ----------------------------------------
>>>>>>>>>    Jaanai Zhang
>>>>>>>>>    Best regards!
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Shawn Li <shawnli...@gmail.com> 于2018年12月13日周四 上午10:13写道：
>>>>>>>>>
>>>>>>>>>> Hi Vincent,
>>>>>>>>>>
>>>>>>>>>> So you describe limit will sent result to client side and then
>>>>>>>>>> write to server, this might explain why upsert with limit is slower 
>>>>>>>>>> than
>>>>>>>>>> without limit. But looks like it can't explain the memory usage? The 
>>>>>>>>>> memory
>>>>>>>>>> usage on client machine is 8gb (without "limit") vs 2gb (with limit),
>>>>>>>>>> sometime upsert without "limit" can even reach 20gb for big table.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Shawn
>>>>>>>>>>
>>>>>>>>>> On Wed, Dec 12, 2018 at 6:34 PM Vincent Poon <
>>>>>>>>>> vincentp...@apache.org> wrote:
>>>>>>>>>>
>>>>>>>>>>> I think it's done client-side if you have LIMIT.  If you have
>>>>>>>>>>> e.g. LIMIT 1000 , it would be incorrect for each regionserver to 
>>>>>>>>>>> upsert
>>>>>>>>>>> 100, if you have more than one regionserver.  So instead results 
>>>>>>>>>>> are sent
>>>>>>>>>>> back to the client, where the LIMIT is applied and then written 
>>>>>>>>>>> back to the
>>>>>>>>>>> server in the UPSERT.
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Dec 12, 2018 at 1:18 PM Shawn Li <shawnli...@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Vincent,
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> The table creation statement is similar to below. We have about
>>>>>>>>>>>> 200 fields. Table is mutable and don’t have any index on the table.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> CREATE TABLE IF NOT EXISTS us_population (
>>>>>>>>>>>>
>>>>>>>>>>>>       state CHAR(2) NOT NULL,
>>>>>>>>>>>>
>>>>>>>>>>>>       city VARCHAR,
>>>>>>>>>>>>
>>>>>>>>>>>>       population BIGINT,
>>>>>>>>>>>>
>>>>>>>>>>>>       …
>>>>>>>>>>>>
>>>>>>>>>>>>       CONSTRAINT my_pk PRIMARY KEY (state));
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>
>>>>>>>>>>>> Shawn
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Dec 12, 2018, 13:42 Vincent Poon <
>>>>>>>>>>>> vincentp...@apache.org wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> For #2, can you provide the table definition and the statement
>>>>>>>>>>>>> used?  e.g. Is the table immutable, or does it have indexes?
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Dec 11, 2018 at 6:08 PM Shawn/Xiang Li <
>>>>>>>>>>>>> shawnli...@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 1.       Want to check what is underlying running for limit
>>>>>>>>>>>>>> clause used in the following Upsert statement (is it involving 
>>>>>>>>>>>>>> any
>>>>>>>>>>>>>> coprocessor working behind?):
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> *                                  upsert into table2 select
>>>>>>>>>>>>>> * from table1 limit 3000000; * (table 1 and table 2 have
>>>>>>>>>>>>>> same schema)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>               The above statement is running a lot slower
>>>>>>>>>>>>>> than without “limit”  clause as shown in following, even the 
>>>>>>>>>>>>>> above
>>>>>>>>>>>>>> statement upsert less data:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> *                                upsert into table2 select *
>>>>>>>>>>>>>> from table1;*
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 2.       We also observe memory usable is pretty high
>>>>>>>>>>>>>> without the limit clause (8gb vs 2gb), sometimes for big table 
>>>>>>>>>>>>>> it can reach
>>>>>>>>>>>>>> 20gb without using limit clause.  According to phoenix website 
>>>>>>>>>>>>>> description
>>>>>>>>>>>>>> for upsert select “If auto commit is on, and both a) the target 
>>>>>>>>>>>>>> table
>>>>>>>>>>>>>> matches the source table, and b) the select performs no 
>>>>>>>>>>>>>> aggregation, then
>>>>>>>>>>>>>> the population of the target table will be done completely on the
>>>>>>>>>>>>>> server-side (with constraint violations logged, but otherwise 
>>>>>>>>>>>>>> ignored).”
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>                My question is If everything is done on
>>>>>>>>>>>>>> server-side, how come we have such high memory usage on the 
>>>>>>>>>>>>>> client machine?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Shawn
>>>>>>>>>>>>>>
>>>>>>>>>>>>>

Re: "upsert select" with "limit" clause

Reply via email to