Hi Vincent & William,
Below are the explain plans; both are executed in PARALLEL:

explain upsert into table1 select * from table2;

    UPSERT SELECT
    CLIENT 27-CHUNK 915799 ROWS 2831155510 BYTES PARALLEL 18-WAY ROUND ROBIN FULL SCAN OVER table2

explain upsert into table1 select * from table2 limit 2000000;

    UPSERT SELECT
    CLIENT 27-CHUNK 36000000 ROWS 481140000000 BYTES PARALLEL 18-WAY ROUND ROBIN FULL SCAN OVER table2
        SERVER 2000000 ROW LIMIT
    CLIENT 2000000 ROW LIMIT

Thanks,
Shawn

On Tue, Dec 18, 2018, 13:30 Vincent Poon <vincentp...@apache.org> wrote:

> Shawn,
>
> Can you do an "explain" to show what your two statements are doing? That
> might give some clues. Perhaps one is able to be run on the server for
> some reason and the other is not. Otherwise, I don't see why one would be
> substantially slower than the other.
>
> Vincent
>
> On Mon, Dec 17, 2018 at 9:14 PM Shawn Li <shawnli...@gmail.com> wrote:
>
>> Hi Jonathan,
>>
>> The single-threaded explanation sounds logical to me. Hopefully Vincent
>> can confirm it.
>>
>> Thanks,
>> Shawn
>>
>> On Mon, Dec 17, 2018 at 9:25 PM Jonathan Leech <jonat...@gmail.com>
>> wrote:
>>
>>> My guess is that in order to enforce the limit, it's effectively
>>> single-threaded in either the select or the upsert.
>>>
>>> On Dec 17, 2018, at 6:43 PM, Shawn Li <shawnli...@gmail.com> wrote:
>>>
>>> Hi Vincent,
>>>
>>> Thanks for explaining. That makes much more sense now, and it explains
>>> the high memory usage without the "limit" clause: because "upsert
>>> select" without "limit" upserts much more quickly, the memory usage on
>>> the client machine is much higher than with "limit".
>>>
>>> So back to the other question: can you explain the underlying Phoenix
>>> implementation of "upsert select" with "limit"? Why is it slower than
>>> the version without "limit" when inserting a huge number of rows (2M),
>>> as in "upsert into table2 select * from table1 limit 2000000;"?
>>> This is much slower than inserting the whole table
>>> (upsert into table2 select * from table1;).
>>>
>>> Thanks,
>>> Xiang
>>>
>>> On Mon, Dec 17, 2018 at 1:56 PM Vincent Poon <vincentp...@apache.org>
>>> wrote:
>>>
>>>> Shawn,
>>>> Your query *upsert into table2 select * from table1;* would not be run
>>>> on the server - the source and target table are different. It would
>>>> have to be something like:
>>>> *upsert into table1 select * from table1;*
>>>>
>>>> If you want to run a server-side upsert select on a target table that
>>>> is different from the source table, you need to set
>>>> "phoenix.client.enable.server.upsert.select" to true on your client.
>>>> There are some other restrictions: the table can't have any global
>>>> indexes, and the statement can't have a join or a where subquery. We
>>>> need to update the documentation with this information.
>>>>
>>>> The reason for all these hurdles is that it's generally not
>>>> recommended to do a server-side upsert select across different tables,
>>>> because that means you're doing cross-regionserver RPCs (e.g. reading
>>>> data from a region of sourcetable and writing to a region of
>>>> targettable on a different regionserver), potentially tying up
>>>> handlers in an unpredictable way.
>>>>
>>>> On Sun, Dec 16, 2018 at 7:12 PM Shawn Li <shawnli...@gmail.com> wrote:
>>>>
>>>>> Hi Jaanai,
>>>>>
>>>>> According to the Phoenix website: "If auto commit is on, and both a)
>>>>> the target table matches the source table, and b) the select performs
>>>>> no aggregation, then the population of the target table will be done
>>>>> completely on the server-side (with constraint violations logged, but
>>>>> otherwise ignored).
>>>>> Otherwise, data is buffered on the client and, if auto commit is on,
>>>>> committed in row batches as specified by the UpsertBatchSize
>>>>> connection property (or the phoenix.mutate.upsertBatchSize HBase
>>>>> config property, which defaults to 10000 rows)."
>>>>>
>>>>> And our sql statement is just *upsert into table2 select * from
>>>>> table1;*, which should match the first case, so all operations should
>>>>> happen on the server side. But the memory usage on the client machine
>>>>> is higher than with the "upsert select ... limit" statement. The
>>>>> memory usage was checked by running the 'top' command on Linux, so we
>>>>> are sure it is caused by the Phoenix "upsert select" and not by
>>>>> anything else, and we can't explain why there is such high memory
>>>>> usage on the client/gateway machine when all operations are supposed
>>>>> to happen on the server side.
>>>>>
>>>>> Thanks,
>>>>> Shawn
>>>>>
>>>>> On Thu, Dec 13, 2018 at 3:15 AM Jaanai Zhang <cloud.pos...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Shawn,
>>>>>>
>>>>>> The UPSERT SELECT will run in a coprocessor only if it has no limit
>>>>>> clause, queries only one table, performs no aggregation, uses no
>>>>>> sequences, and auto commit is on. Please check your SQL, and also
>>>>>> check whether some resources have not been released.
>>>>>>
>>>>>> ----------------------------------------
>>>>>> Jaanai Zhang
>>>>>> Best regards!
>>>>>>
>>>>>> Shawn Li <shawnli...@gmail.com> wrote on Thu, Dec 13, 2018 at
>>>>>> 12:10 PM:
>>>>>>
>>>>>>> Hi Jaanai,
>>>>>>>
>>>>>>> Thanks for sharing your thoughts. The behavior you describe is
>>>>>>> correct on the HBase region server side: the memory usage for the
>>>>>>> block cache and the memstore will be high under such high
>>>>>>> throughput. But our Phoenix client is on a gateway machine (with no
>>>>>>> HBase region server or any other HBase service on it), so we are
>>>>>>> not sure how to explain such high memory usage for upsert select
>>>>>>> without the "limit" clause.
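The client-side buffering that the docs excerpt above describes can be sketched in a few lines (a minimal Python model, not Phoenix internals; the function and variable names are invented, and the batch size simply mirrors the stated phoenix.mutate.upsertBatchSize default of 10000):

```python
# Toy model of the client-side UPSERT SELECT path from the docs excerpt:
# selected rows are buffered on the client and committed in fixed-size
# batches. Illustrative only - all names here are made up, not Phoenix code.

UPSERT_BATCH_SIZE = 10000  # mirrors the phoenix.mutate.upsertBatchSize default


def upsert_select_client_side(source_rows, commit):
    """Buffer rows on the client; commit in batches of UPSERT_BATCH_SIZE.

    Returns the number of commits issued. The client holds at most one
    batch in memory at a time in this model.
    """
    buffer = []
    commits = 0
    for row in source_rows:
        buffer.append(row)
        if len(buffer) >= UPSERT_BATCH_SIZE:
            commit(buffer)  # one round trip of mutations to the servers
            commits += 1
            buffer = []
    if buffer:  # flush the final partial batch
        commit(buffer)
        commits += 1
    return commits


# 2,000,000 rows -> 200 commits of 10,000 rows each
n_commits = upsert_select_client_side(iter(range(2_000_000)),
                                      commit=lambda batch: None)
print(n_commits)  # 200
```

In this model the client never holds more than one batch of rows at a time, so on its own it would not account for many gigabytes of client memory.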
>>>>>>> It looks as if all the select results were sent to the client
>>>>>>> machine, cached in the client machine's memory, and then inserted
>>>>>>> back into the target table - which is not the behavior that should
>>>>>>> happen, since all of this should be done on the server side because
>>>>>>> the table schemas are exactly the same. By the way, this happens on
>>>>>>> both Phoenix 4.7 and Phoenix 4.14.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Shawn
>>>>>>>
>>>>>>> On Wed, Dec 12, 2018 at 10:26 PM Jaanai Zhang
>>>>>>> <cloud.pos...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Shawn,
>>>>>>>>
>>>>>>>> The upsert without limit reads the source table and writes the
>>>>>>>> target table on the server side. I think the higher memory usage is
>>>>>>>> caused by the scan cache and the memstore under the higher
>>>>>>>> throughput.
>>>>>>>>
>>>>>>>> ----------------------------------------
>>>>>>>> Jaanai Zhang
>>>>>>>> Best regards!
>>>>>>>>
>>>>>>>> Shawn Li <shawnli...@gmail.com> wrote on Thu, Dec 13, 2018 at
>>>>>>>> 10:13 AM:
>>>>>>>>
>>>>>>>>> Hi Vincent,
>>>>>>>>>
>>>>>>>>> So you are saying that with limit, the results are sent to the
>>>>>>>>> client side and then written back to the server; that might
>>>>>>>>> explain why upsert with limit is slower than without limit. But it
>>>>>>>>> doesn't seem to explain the memory usage: the memory usage on the
>>>>>>>>> client machine is 8 GB (without "limit") vs 2 GB (with "limit"),
>>>>>>>>> and upsert without "limit" can even reach 20 GB for a big table.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Shawn
>>>>>>>>>
>>>>>>>>> On Wed, Dec 12, 2018 at 6:34 PM Vincent Poon
>>>>>>>>> <vincentp...@apache.org> wrote:
>>>>>>>>>
>>>>>>>>>> I think it's done client-side if you have LIMIT. If you have
>>>>>>>>>> e.g. LIMIT 1000, it would be incorrect for each regionserver to
>>>>>>>>>> upsert 1000 if you have more than one regionserver.
>>>>>>>>>> So instead, results are sent back to the client, where the LIMIT
>>>>>>>>>> is applied, and they are then written back to the server in the
>>>>>>>>>> UPSERT.
>>>>>>>>>>
>>>>>>>>>> On Wed, Dec 12, 2018 at 1:18 PM Shawn Li <shawnli...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Vincent,
>>>>>>>>>>>
>>>>>>>>>>> The table creation statement is similar to the one below. We
>>>>>>>>>>> have about 200 fields. The table is mutable and doesn't have any
>>>>>>>>>>> indexes.
>>>>>>>>>>>
>>>>>>>>>>> CREATE TABLE IF NOT EXISTS us_population (
>>>>>>>>>>>     state CHAR(2) NOT NULL,
>>>>>>>>>>>     city VARCHAR,
>>>>>>>>>>>     population BIGINT,
>>>>>>>>>>>     ...
>>>>>>>>>>>     CONSTRAINT my_pk PRIMARY KEY (state));
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Shawn
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Dec 12, 2018, 13:42 Vincent Poon
>>>>>>>>>>> <vincentp...@apache.org> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> For #2, can you provide the table definition and the statement
>>>>>>>>>>>> used? E.g. is the table immutable, or does it have indexes?
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Dec 11, 2018 at 6:08 PM Shawn/Xiang Li
>>>>>>>>>>>> <shawnli...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> 1.
>>>>>>>>>>>>> Want to check what Phoenix does under the hood for the limit
>>>>>>>>>>>>> clause used in the following upsert statement (does it involve
>>>>>>>>>>>>> any coprocessor working behind the scenes?):
>>>>>>>>>>>>>
>>>>>>>>>>>>> *upsert into table2 select * from table1 limit 3000000;*
>>>>>>>>>>>>> (table1 and table2 have the same schema)
>>>>>>>>>>>>>
>>>>>>>>>>>>> The above statement runs a lot slower than the same statement
>>>>>>>>>>>>> without the "limit" clause, shown below, even though the
>>>>>>>>>>>>> statement with "limit" upserts less data:
>>>>>>>>>>>>>
>>>>>>>>>>>>> *upsert into table2 select * from table1;*
>>>>>>>>>>>>>
>>>>>>>>>>>>> 2. We also observe that memory usage is pretty high without
>>>>>>>>>>>>> the limit clause (8 GB vs 2 GB); for a big table it can
>>>>>>>>>>>>> sometimes reach 20 GB without the limit clause. According to
>>>>>>>>>>>>> the Phoenix website's description of upsert select: "If auto
>>>>>>>>>>>>> commit is on, and both a) the target table matches the source
>>>>>>>>>>>>> table, and b) the select performs no aggregation, then the
>>>>>>>>>>>>> population of the target table will be done completely on the
>>>>>>>>>>>>> server-side (with constraint violations logged, but otherwise
>>>>>>>>>>>>> ignored)."
>>>>>>>>>>>>>
>>>>>>>>>>>>> My question is: if everything is done on the server side, how
>>>>>>>>>>>>> come we have such high memory usage on the client machine?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Shawn
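Vincent's point upthread about why LIMIT forces a client-side step can be illustrated with a toy model (plain Python with invented data; this is not Phoenix code). With more than one regionserver, each server can at best stop its scan early at the limit; only the client, seeing the merged results, can apply the final LIMIT, which matches the separate SERVER and CLIENT ROW LIMIT lines in the explain plan above:

```python
# Toy model: why a LIMIT cannot simply be applied per regionserver.
# Three "regions" of made-up row data; none of this is Phoenix code.

def scan_region(region_rows, limit=None):
    """Simulate one regionserver scan, optionally stopping at `limit` rows."""
    return region_rows[:limit] if limit is not None else region_rows


def per_server_limit_only(regions, limit):
    """Incorrect: each regionserver applies the LIMIT locally and nothing more."""
    out = []
    for rows in regions:
        out.extend(scan_region(rows, limit))
    return out


def client_side_limit(regions, limit):
    """Servers stop early at `limit` rows each (SERVER ... ROW LIMIT in the
    plan), then the client truncates the merged results to the final LIMIT
    (CLIENT ... ROW LIMIT in the plan)."""
    merged = []
    for rows in regions:
        merged.extend(scan_region(rows, limit))
    return merged[:limit]


regions = [list(range(0, 500)), list(range(500, 900)), list(range(900, 1600))]
limit = 1000

print(len(per_server_limit_only(regions, limit)))  # 1600 - exceeds the limit
print(len(client_side_limit(regions, limit)))      # 1000 - correct
```

The single merge-and-truncate step on the client is consistent with Jonathan's guess that the limited path is effectively single-threaded on one side, and with the rows flowing through the client in the UPSERT.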