Shawn, I have done some tests with version 4.14.1-HBase-1.4. The details are as follows:
CREATE TABLE test (id VARCHAR PRIMARY KEY, c1 VARCHAR, c2 VARCHAR) SALT_BUCKETS = 10;
explain select * from test where c1 = 'xx' limit 5 offset 100;

CREATE TABLE test1 (id VARCHAR PRIMARY KEY, c1 VARCHAR, c2 VARCHAR) SALT_BUCKETS = 10;
explain upsert into test1 select * from test limit 10;

0: jdbc:phoenix:thin:url=http://localhost:876> explain upsert into test1 select * from test limit 10;
+------------------------------------------------------------------------------------+-----------------+----------------+--------------+
| PLAN                                                                               | EST_BYTES_READ  | EST_ROWS_READ  | EST_INFO_TS  |
+------------------------------------------------------------------------------------+-----------------+----------------+--------------+
| UPSERT SELECT                                                                      | 2040            | 10             | 0            |
| CLIENT 10-CHUNK 10 ROWS 2040 BYTES *SERIAL* 10-WAY ROUND ROBIN FULL SCAN OVER TEST | 2040            | 10             | 0            |
| SERVER 10 ROW LIMIT                                                                | 2040            | 10             | 0            |
| CLIENT 10 ROW LIMIT                                                                | 2040            | 10             | 0            |
+------------------------------------------------------------------------------------+-----------------+----------------+--------------+
4 rows selected (0.028 seconds)

0: jdbc:phoenix:thin:url=http://localhost:876> explain upsert into test1 select * from test;
+-----------------------------------------------------------------+-----------------+----------------+--------------+
| PLAN                                                            | EST_BYTES_READ  | EST_ROWS_READ  | EST_INFO_TS  |
+-----------------------------------------------------------------+-----------------+----------------+--------------+
| UPSERT SELECT                                                   | null            | null           | null         |
| CLIENT 10-CHUNK PARALLEL 10-WAY ROUND ROBIN FULL SCAN OVER TEST | null            | null           | null         |
+-----------------------------------------------------------------+-----------------+----------------+--------------+
2 rows selected (0.033 seconds)

I notice that the UPSERT produces serial scans when the LIMIT clause is present. Which Phoenix version are you running? @Vincent FYI

----------------------------------------
   Jaanai Zhang
   Best regards!

Vincent Poon <vincentp...@apache.org> wrote on Thu, Dec 20, 2018 at 6:04 AM:

> Shawn,
>
> Took a quick look; I think what is happening is that the UPSERT is done
> serially when you have a LIMIT.
> Parallel scans are issued for the SELECT, which is why the explain plan
> shows PARALLEL, but the results are then concatenated via a single
> LimitingResultIterator in order to apply the CLIENT LIMIT.
> The upsert then reads from that iterator and does the mutations in
> batches.
>
> To insert in parallel, we would need some sort of shared state between
> the writing threads to ensure we respect the limit, and I don't think we
> currently have something like that.
>
> Vincent
>
> On Tue, Dec 18, 2018 at 2:31 PM Vincent Poon <vincentp...@apache.org> wrote:
>
>> Shawn, that sounds like a bug, I would file a JIRA.
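To make the serial write path Vincent describes above concrete, here is a
minimal, self-contained Java sketch of that pipeline shape. It is
illustrative only (the class and row names are invented, and this is not
Phoenix source code): the chunk scans run in parallel, but the LIMIT is
enforced by one concatenating iterator, and the writer drains that single
stream.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustrative sketch only -- not Phoenix source code.
public class SerialLimitSketch {

    // Stand-in for one chunk scan; a real scan returns table rows.
    static Callable<List<String>> scanChunk(int chunk) {
        return () -> List.of("row-" + chunk + "-a", "row-" + chunk + "-b");
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);

        // 1. The SELECT side is genuinely parallel: one task per chunk,
        //    which is why the explain plan still says PARALLEL.
        List<Future<List<String>>> chunks = new ArrayList<>();
        for (int c = 0; c < 10; c++) {
            chunks.add(pool.submit(scanChunk(c)));
        }

        // 2. The LIMIT is applied on one concatenated stream (the role
        //    Phoenix's LimitingResultIterator plays), so everything
        //    downstream of this point is single-threaded.
        int limit = 5;
        List<String> limited = new ArrayList<>();
        collect:
        for (Future<List<String>> chunk : chunks) {
            for (String row : chunk.get()) {
                if (limited.size() == limit) break collect;
                limited.add(row);
            }
        }

        // 3. The UPSERT reads from that one stream and issues its
        //    mutations in batches -- hence the serial write path.
        limited.forEach(row -> System.out.println("UPSERT " + row));
        pool.shutdown();
    }
}

The shared state Vincent mentions would amount to replacing the single
limited list above with, say, an atomic counter checked by all writer
threads, which, per his note, Phoenix does not currently have.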
>>
>> On Tue, Dec 18, 2018 at 12:33 PM Shawn Li <shawnli...@gmail.com> wrote:
>>
>>> Hi Vincent & William,
>>>
>>> Below is the explain plan; both statements are PARALLEL in the plan:
>>>
>>> explain upsert into table1 select * from table2;
>>>
>>> UPSERT SELECT
>>> CLIENT 27-CHUNK 915799 ROWS 2831155510 BYTES PARALLEL 18-WAY ROUND ROBIN FULL SCAN OVER table2
>>>
>>> explain upsert into table1 select * from table2 limit 2000000;
>>>
>>> UPSERT SELECT
>>> CLIENT 27-CHUNK 36000000 ROWS 481140000000 BYTES PARALLEL 18-WAY ROUND ROBIN FULL SCAN OVER table2
>>> SERVER 2000000 ROW LIMIT
>>> CLIENT 2000000 ROW LIMIT
>>>
>>> Thanks,
>>> Shawn
>>>
>>> On Tue, Dec 18, 2018, 13:30 Vincent Poon <vincentp...@apache.org> wrote:
>>>
>>>> Shawn,
>>>>
>>>> Can you do an "explain" to show what your two statements are doing?
>>>> That might give some clues. Perhaps one is able to be run on the
>>>> server for some reason and the other is not.
>>>> Otherwise, I don't see why one would be substantially slower than
>>>> the other.
>>>>
>>>> Vincent
>>>>
>>>> On Mon, Dec 17, 2018 at 9:14 PM Shawn Li <shawnli...@gmail.com> wrote:
>>>>
>>>>> Hi Jonathan,
>>>>>
>>>>> The single-threaded-on-one-side explanation sounds logical to me.
>>>>> Hopefully Vincent can confirm it.
>>>>>
>>>>> Thanks,
>>>>> Shawn
>>>>>
>>>>> On Mon, Dec 17, 2018 at 9:25 PM Jonathan Leech <jonat...@gmail.com> wrote:
>>>>>
>>>>>> My guess is that in order to enforce the limit, it's effectively
>>>>>> single-threaded in either the select or the upsert.
>>>>>>
>>>>>> On Dec 17, 2018, at 6:43 PM, Shawn Li <shawnli...@gmail.com> wrote:
>>>>>>
>>>>>> Hi Vincent,
>>>>>>
>>>>>> Thanks for explaining. That makes much more sense now, and it
>>>>>> explains the high memory usage without the "limit" clause: because
>>>>>> "upsert select" without "limit" upserts much more quickly, the
>>>>>> memory usage on the client machine is much higher than with
>>>>>> "limit".
>>>>>>
>>>>>> So back to the other question: can you explain the underlying
>>>>>> Phoenix implementation of "upsert select" with "limit"? Why is it
>>>>>> slower than the version without "limit" when inserting a huge
>>>>>> number of rows (2M), as in "upsert into table2 select * from
>>>>>> table1 limit 2000000;"? This is much slower than inserting the
>>>>>> whole table (upsert into table2 select * from table1;).
>>>>>>
>>>>>> Thanks,
>>>>>> Xiang
>>>>>>
>>>>>> On Mon, Dec 17, 2018 at 1:56 PM Vincent Poon <vincentp...@apache.org> wrote:
>>>>>>
>>>>>>> Shawn,
>>>>>>> Your query "upsert into table2 select * from table1;" would not
>>>>>>> be run on the server - the source and target table are different.
>>>>>>> It would have to be something like:
>>>>>>> upsert into table1 select * from table1;
>>>>>>>
>>>>>>> If you want to run a server-side upsert select on a target table
>>>>>>> that is different from the source table, you need to set
>>>>>>> "phoenix.client.enable.server.upsert.select" to true on your
>>>>>>> client. There are some other restrictions: the table can't have
>>>>>>> any global indexes, and the statement can't have a join or a
>>>>>>> where subquery. We need to update the documentation with this
>>>>>>> information.
>>>>>>>
>>>>>>> The reason there are all these hurdles is that it's generally not
>>>>>>> recommended to do server-side upsert select across different
>>>>>>> tables, because that means you're doing cross-regionserver RPCs
>>>>>>> (e.g. read data from a region of sourcetable, and write to a
>>>>>>> region of targettable on a different regionserver), potentially
>>>>>>> tying up handlers in an unpredictable way.
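As a minimal sketch of what enabling that client-side switch could look
like (the JDBC URL is a placeholder, table1/table2 are the tables from
this thread, and the property name and the restrictions in the comments
are the ones Vincent lists above):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;
import java.util.Properties;

// Sketch: enabling server-side UPSERT SELECT across different tables.
public class ServerUpsertSelectSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Client-side switch Vincent describes; subject to the
        // restrictions above (no global indexes on the target, no join
        // or WHERE subquery in the statement).
        props.setProperty("phoenix.client.enable.server.upsert.select", "true");

        // Placeholder URL -- substitute your own quorum address.
        try (Connection conn =
                 DriverManager.getConnection("jdbc:phoenix:localhost:2181", props)) {
            conn.setAutoCommit(true); // auto commit on is a precondition
            try (Statement stmt = conn.createStatement()) {
                stmt.executeUpdate("UPSERT INTO table2 SELECT * FROM table1");
            }
        }
    }
}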
>>>>>>>
>>>>>>> On Sun, Dec 16, 2018 at 7:12 PM Shawn Li <shawnli...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Jaanai,
>>>>>>>>
>>>>>>>> According to the Phoenix website: "If auto commit is on, and both
>>>>>>>> a) the target table matches the source table, and b) the select
>>>>>>>> performs no aggregation, then the population of the target table
>>>>>>>> will be done completely on the server-side (with constraint
>>>>>>>> violations logged, but otherwise ignored). Otherwise, data is
>>>>>>>> buffered on the client and, if auto commit is on, committed in
>>>>>>>> row batches as specified by the UpsertBatchSize connection
>>>>>>>> property (or the phoenix.mutate.upsertBatchSize HBase config
>>>>>>>> property, which defaults to 10000 rows)."
>>>>>>>>
>>>>>>>> And our SQL statement is just "upsert into table2 select * from
>>>>>>>> table1;", which should match the first case, so all operations
>>>>>>>> should be on the server side. But the memory usage on the client
>>>>>>>> machine is higher than with the "upsert select limit" clause. The
>>>>>>>> memory usage was checked by running 'top' under Linux, so we are
>>>>>>>> sure it is caused by the "upsert select" in Phoenix and nothing
>>>>>>>> else, and we can't explain why there is such high memory usage on
>>>>>>>> the client/gateway machine when all operations are supposed to
>>>>>>>> happen on the server side.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Shawn
>>>>>>>>
>>>>>>>> On Thu, Dec 13, 2018 at 3:15 AM Jaanai Zhang <cloud.pos...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Shawn,
>>>>>>>>>
>>>>>>>>> The UPSERT SELECT will run in a coprocessor only if it has no
>>>>>>>>> limit clause, only one table is queried, the select performs no
>>>>>>>>> aggregation, no sequences are used, and auto commit is on.
>>>>>>>>> Please check your SQL, and you can also check whether some
>>>>>>>>> resources have not been released.
>>>>>>>>>
>>>>>>>>> ----------------------------------------
>>>>>>>>>    Jaanai Zhang
>>>>>>>>>    Best regards!
>>>>>>>>>
>>>>>>>>> Shawn Li <shawnli...@gmail.com> wrote on Thu, Dec 13, 2018 at 12:10 PM:
>>>>>>>>>
>>>>>>>>>> Hi Jaanai,
>>>>>>>>>>
>>>>>>>>>> Thanks for sharing your thoughts. The behavior you describe is
>>>>>>>>>> correct on the HBase region server side: the memory usage for
>>>>>>>>>> the block cache and memstore will be high under such high
>>>>>>>>>> throughput. But our Phoenix client is on a gateway machine (no
>>>>>>>>>> HBase region server or any other HBase service sitting on it),
>>>>>>>>>> so I am not sure how to explain such high memory usage for
>>>>>>>>>> upsert select without the "limit" clause. The behavior looks as
>>>>>>>>>> if all select results were sent to the client machine, cached
>>>>>>>>>> in its memory, and then inserted back into the target table,
>>>>>>>>>> which is not what should happen, since all of this should be
>>>>>>>>>> done on the server side when the table schemas are exactly the
>>>>>>>>>> same. By the way, this happens on both Phoenix 4.7 and Phoenix
>>>>>>>>>> 4.14.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Shawn
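The client-side path in the documentation excerpt above can be pictured
with a hand-rolled equivalent like the sketch below: rows are pulled to
the client and committed in fixed-size batches. It is illustrative only
(the URL is a placeholder and the columns are borrowed from the test
tables earlier in this thread); the real client performs this buffering
internally whenever the statement cannot run server-side, and that buffer
is one place client memory can go.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

// Hand-rolled equivalent of the client-side UPSERT SELECT path.
public class ClientBatchedUpsertSketch {
    public static void main(String[] args) throws Exception {
        int batchSize = 10_000; // mirrors the documented UpsertBatchSize default
        try (Connection conn =
                 DriverManager.getConnection("jdbc:phoenix:localhost:2181")) {
            conn.setAutoCommit(false); // commit manually, one batch at a time
            try (Statement sel = conn.createStatement();
                 ResultSet rs = sel.executeQuery("SELECT id, c1, c2 FROM table1");
                 PreparedStatement ups = conn.prepareStatement(
                     "UPSERT INTO table2 (id, c1, c2) VALUES (?, ?, ?)")) {
                int pending = 0;
                while (rs.next()) {
                    ups.setString(1, rs.getString(1));
                    ups.setString(2, rs.getString(2));
                    ups.setString(3, rs.getString(3));
                    ups.executeUpdate();   // buffered client-side until commit
                    if (++pending == batchSize) {
                        conn.commit();     // flush one batch of mutations
                        pending = 0;
                    }
                }
                conn.commit();             // flush the tail
            }
        }
    }
}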
>>>>>>>>>>
>>>>>>>>>> On Wed, Dec 12, 2018 at 10:26 PM Jaanai Zhang <cloud.pos...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Shawn,
>>>>>>>>>>>
>>>>>>>>>>> The upsert without limit reads the source table and writes the
>>>>>>>>>>> target table on the server side. I think the higher memory
>>>>>>>>>>> usage is caused by the scan cache and the memstore under the
>>>>>>>>>>> higher throughput.
>>>>>>>>>>>
>>>>>>>>>>> ----------------------------------------
>>>>>>>>>>>    Jaanai Zhang
>>>>>>>>>>>    Best regards!
>>>>>>>>>>>
>>>>>>>>>>> Shawn Li <shawnli...@gmail.com> wrote on Thu, Dec 13, 2018 at 10:13 AM:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Vincent,
>>>>>>>>>>>>
>>>>>>>>>>>> So you are describing that with "limit" the results are sent
>>>>>>>>>>>> to the client side and then written back to the server; that
>>>>>>>>>>>> might explain why upsert with "limit" is slower than without
>>>>>>>>>>>> it. But it doesn't seem to explain the memory usage: the
>>>>>>>>>>>> memory usage on the client machine is 8 GB (without "limit")
>>>>>>>>>>>> vs 2 GB (with "limit"), and for a big table upsert without
>>>>>>>>>>>> "limit" can even reach 20 GB.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Shawn
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Dec 12, 2018 at 6:34 PM Vincent Poon <vincentp...@apache.org> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> I think it's done client-side if you have LIMIT. With e.g.
>>>>>>>>>>>>> LIMIT 1000 and more than one regionserver, it would be
>>>>>>>>>>>>> incorrect for each regionserver to apply the limit on its
>>>>>>>>>>>>> own. So instead the results are sent back to the client,
>>>>>>>>>>>>> where the LIMIT is applied and then written back to the
>>>>>>>>>>>>> server in the UPSERT.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Dec 12, 2018 at 1:18 PM Shawn Li <shawnli...@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Vincent,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The table creation statement is similar to the one below.
>>>>>>>>>>>>>> We have about 200 fields. The table is mutable and has no
>>>>>>>>>>>>>> indexes.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> CREATE TABLE IF NOT EXISTS us_population (
>>>>>>>>>>>>>>     state CHAR(2) NOT NULL,
>>>>>>>>>>>>>>     city VARCHAR,
>>>>>>>>>>>>>>     population BIGINT,
>>>>>>>>>>>>>>     ...
>>>>>>>>>>>>>>     CONSTRAINT my_pk PRIMARY KEY (state));
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Shawn
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Dec 12, 2018, 13:42 Vincent Poon <vincentp...@apache.org> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> For #2, can you provide the table definition and the
>>>>>>>>>>>>>>> statement used? E.g., is the table immutable, or does it
>>>>>>>>>>>>>>> have indexes?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Tue, Dec 11, 2018 at 6:08 PM Shawn/Xiang Li <shawnli...@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 1. I want to check what the LIMIT clause in the following
>>>>>>>>>>>>>>>> upsert statement does under the hood (does it involve any
>>>>>>>>>>>>>>>> coprocessor behind the scenes?):
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>     upsert into table2 select * from table1 limit 3000000;
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> (table1 and table2 have the same schema)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The statement above runs a lot slower than the version
>>>>>>>>>>>>>>>> without the "limit" clause shown below, even though it
>>>>>>>>>>>>>>>> upserts less data:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>     upsert into table2 select * from table1;
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 2. We also observe that memory usage is pretty high
>>>>>>>>>>>>>>>> without the "limit" clause (8 GB vs 2 GB); for a big
>>>>>>>>>>>>>>>> table it can even reach 20 GB. According to the Phoenix
>>>>>>>>>>>>>>>> website's description of upsert select: "If auto commit
>>>>>>>>>>>>>>>> is on, and both a) the target table matches the source
>>>>>>>>>>>>>>>> table, and b) the select performs no aggregation, then
>>>>>>>>>>>>>>>> the population of the target table will be done
>>>>>>>>>>>>>>>> completely on the server-side (with constraint violations
>>>>>>>>>>>>>>>> logged, but otherwise ignored)."
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> My question is: if everything is done on the server side,
>>>>>>>>>>>>>>>> how come we have such high memory usage on the client
>>>>>>>>>>>>>>>> machine?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>> Shawn