Shawn, that sounds like a bug, I would file a JIRA. On Tue, Dec 18, 2018 at 12:33 PM Shawn Li <shawnli...@gmail.com> wrote:
> Hi Vincent & William, > > > > Below is the explain plan, both are PARALLEL excuted in plan: > > > > explain upsert into table1 select * from table2; > > > > UPSERT > SELECT > | > > CLIENT 27-CHUNK 915799 ROWS 2831155510 BYTES PARALLEL 18-WAY ROUND ROBIN > FULL SCAN OVER table2 > > > > explain upsert into table1 select * from table2 limit 2000000; > > > > UPSERT > SELECT > | > > | CLIENT 27-CHUNK 36000000 ROWS 481140000000 BYTES PARALLEL 18-WAY ROUND > ROBIN FULL SCAN OVER table2 | > > | SERVER 2000000 ROW > LIMIT > | > > | CLIENT 2000000 ROW LIMIT > > > > > > > > Thanks, > > Shawn > > On Tue, Dec 18, 2018, 13:30 Vincent Poon <vincentp...@apache.org wrote: > >> Shawn, >> >> Can you do an "explain" to show what your two statements are doing? That >> might give some clues. Perhaps one is able to be run on the server for >> some reason and the other is not. >> Otherwise, I don't see why one would be substantially slower than the >> other. >> >> Vincent >> >> On Mon, Dec 17, 2018 at 9:14 PM Shawn Li <shawnli...@gmail.com> wrote: >> >>> Hi Jonathan, >>> >>> The single threaded on one side sounds logical to me. Hopefully Vincent >>> can confirm it. >>> >>> Thanks, >>> Shawn >>> >>> On Mon, Dec 17, 2018 at 9:25 PM Jonathan Leech <jonat...@gmail.com> >>> wrote: >>> >>>> My guess is that in order to enforce the limit that it’s effectively >>>> single threaded in either the select or the upsert. >>>> >>>> On Dec 17, 2018, at 6:43 PM, Shawn Li <shawnli...@gmail.com> wrote: >>>> >>>> Hi Vincent, >>>> >>>> Thanks for explaining. That makes much more sense now and it explains >>>> the high memory usage when without "limit" clause. Because it upserts much >>>> quickly when using "upsert select" without "limit", the memory usage in >>>> client machine is much higher than "upsert select" with "limit" . >>>> >>>> So back to the other question. Can you explain what is underlying >>>> Phoenix implementation for "upsert select limit"? Why it is slower than >>>> without "limit" when insert a huge number (2m rows) like ""upsert into >>>> table2 select * from table1 limit 2,000,000;". This is much slower than >>>> inserting the whole table (upsert into table2 select * from table1;). >>>> >>>> Thanks, >>>> Xiang >>>> >>>> >>>> On Mon, Dec 17, 2018 at 1:56 PM Vincent Poon <vincentp...@apache.org> >>>> wrote: >>>> >>>>> Shawn, >>>>> Your query *upsert into table2 select * from table1; *would not be >>>>> run on the server - the source and target table are different. It would >>>>> have to be something like: >>>>> *upsert into table1 select * from table1;* >>>>> >>>>> If you want to run server-side upsert select on a target table that is >>>>> different from the source table, you need to set >>>>> "phoenix.client.enable.server.upsert.select" to true on your client. >>>>> The are some other restrictions: the table can't have any global >>>>> indexes, and the statement can't have a join or where subquery. We need >>>>> to >>>>> update the documentation with this information. >>>>> >>>>> The reason there are all these hurdles is because it's generally not >>>>> recommended to do server-side upsert select across different tables, >>>>> because that means you're doing cross-regionserver RPCs (e.g. read data >>>>> from a region of sourcetable, and write to a region of targettable on a >>>>> different regionserver), potentially tying up handlers in an unpredictable >>>>> way. >>>>> >>>>> On Sun, Dec 16, 2018 at 7:12 PM Shawn Li <shawnli...@gmail.com> wrote: >>>>> >>>>>> Hi Jaanai, >>>>>> >>>>>> According to Phoenix website, " If auto commit is on, and both a) >>>>>> the target table matches the source table, and b) the select performs no >>>>>> aggregation, then the population of the target table will be done >>>>>> completely on the server-side (with constraint violations logged, but >>>>>> otherwise ignored). Otherwise, data is buffered on the client and, if >>>>>> auto >>>>>> commit is on, committed in row batches as specified by the >>>>>> UpsertBatchSize >>>>>> connection property (or the phoenix.mutate.upsertBatchSize HBase config >>>>>> property which defaults to 10000 rows)" >>>>>> >>>>>> And our sql statement is just: *upsert into table2 select * from >>>>>> table1; *which should match the first case, all operations should be >>>>>> in server site. But the memory usage on the client machine is higher than >>>>>> "upsert select limit" clause. And the memory usage is check by run 'top' >>>>>> command under Linux. So we are sure it is caused by "select upsert" in >>>>>> Phoenix and not others, and can't explain why there is so high memory >>>>>> usage >>>>>> on client/gateway machine when all operations are supposed to happen on >>>>>> the >>>>>> serve side. >>>>>> >>>>>> Thanks, >>>>>> Shawn >>>>>> >>>>>> On Thu, Dec 13, 2018 at 3:15 AM Jaanai Zhang <cloud.pos...@gmail.com> >>>>>> wrote: >>>>>> >>>>>>> Shawn, >>>>>>> >>>>>>> The UPSERT SELECT will run in a coprocessor on if it hasn't limit >>>>>>> clause, only query 1 table, the query is doing aggregation, no sequences >>>>>>> and auto commit is on. Please check your SQL ... and you can also check >>>>>>> whether some resources have not been released. >>>>>>> >>>>>>> ---------------------------------------- >>>>>>> Jaanai Zhang >>>>>>> Best regards! >>>>>>> >>>>>>> >>>>>>> >>>>>>> Shawn Li <shawnli...@gmail.com> 于2018年12月13日周四 下午12:10写道: >>>>>>> >>>>>>>> Hi Jaanai, >>>>>>>> >>>>>>>> Thanks for putting your thought. The behavior you describe is >>>>>>>> correct on the Hbase region sever side. The memory usage for >>>>>>>> blockcache and >>>>>>>> memstore will be high under such high throughput. But our phoenix >>>>>>>> client is >>>>>>>> on a gateway machine (no hbase region server sitting on it or any Hbase >>>>>>>> service on it), so not sure how to explain such high memory usage for >>>>>>>> upsert select without "limit" clause. The high memory usage behavior >>>>>>>> like >>>>>>>> all select results send to client machine, cached in client machine's >>>>>>>> memory, and then insert back to target table, which is not like the >>>>>>>> behavior that should happen, all of this should be done on the server >>>>>>>> side >>>>>>>> as the table schema is exactly the same. By the way, this happens on >>>>>>>> both >>>>>>>> Phoenix 4.7 and Phoenix 4.14. >>>>>>>> >>>>>>>> >>>>>>>> Thanks, >>>>>>>> Shawn >>>>>>>> >>>>>>>> On Wed, Dec 12, 2018 at 10:26 PM Jaanai Zhang < >>>>>>>> cloud.pos...@gmail.com> wrote: >>>>>>>> >>>>>>>>> Shawn, >>>>>>>>> >>>>>>>>> >>>>>>>>> For the upsert without limit, which will read the source table >>>>>>>>> and write the target tables on the server side. I think the higher >>>>>>>>> memory >>>>>>>>> usage is caused by using scan cache and memstore under the higher >>>>>>>>> throughput. >>>>>>>>> >>>>>>>>> ---------------------------------------- >>>>>>>>> Jaanai Zhang >>>>>>>>> Best regards! >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> Shawn Li <shawnli...@gmail.com> 于2018年12月13日周四 上午10:13写道: >>>>>>>>> >>>>>>>>>> Hi Vincent, >>>>>>>>>> >>>>>>>>>> So you describe limit will sent result to client side and then >>>>>>>>>> write to server, this might explain why upsert with limit is slower >>>>>>>>>> than >>>>>>>>>> without limit. But looks like it can't explain the memory usage? The >>>>>>>>>> memory >>>>>>>>>> usage on client machine is 8gb (without "limit") vs 2gb (with limit), >>>>>>>>>> sometime upsert without "limit" can even reach 20gb for big table. >>>>>>>>>> >>>>>>>>>> Thanks, >>>>>>>>>> Shawn >>>>>>>>>> >>>>>>>>>> On Wed, Dec 12, 2018 at 6:34 PM Vincent Poon < >>>>>>>>>> vincentp...@apache.org> wrote: >>>>>>>>>> >>>>>>>>>>> I think it's done client-side if you have LIMIT. If you have >>>>>>>>>>> e.g. LIMIT 1000 , it would be incorrect for each regionserver to >>>>>>>>>>> upsert >>>>>>>>>>> 100, if you have more than one regionserver. So instead results >>>>>>>>>>> are sent >>>>>>>>>>> back to the client, where the LIMIT is applied and then written >>>>>>>>>>> back to the >>>>>>>>>>> server in the UPSERT. >>>>>>>>>>> >>>>>>>>>>> On Wed, Dec 12, 2018 at 1:18 PM Shawn Li <shawnli...@gmail.com> >>>>>>>>>>> wrote: >>>>>>>>>>> >>>>>>>>>>>> Hi Vincent, >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> The table creation statement is similar to below. We have about >>>>>>>>>>>> 200 fields. Table is mutable and don’t have any index on the table. >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> CREATE TABLE IF NOT EXISTS us_population ( >>>>>>>>>>>> >>>>>>>>>>>> state CHAR(2) NOT NULL, >>>>>>>>>>>> >>>>>>>>>>>> city VARCHAR, >>>>>>>>>>>> >>>>>>>>>>>> population BIGINT, >>>>>>>>>>>> >>>>>>>>>>>> … >>>>>>>>>>>> >>>>>>>>>>>> CONSTRAINT my_pk PRIMARY KEY (state)); >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> Thanks, >>>>>>>>>>>> >>>>>>>>>>>> Shawn >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> On Wed, Dec 12, 2018, 13:42 Vincent Poon < >>>>>>>>>>>> vincentp...@apache.org wrote: >>>>>>>>>>>> >>>>>>>>>>>>> For #2, can you provide the table definition and the statement >>>>>>>>>>>>> used? e.g. Is the table immutable, or does it have indexes? >>>>>>>>>>>>> >>>>>>>>>>>>> On Tue, Dec 11, 2018 at 6:08 PM Shawn/Xiang Li < >>>>>>>>>>>>> shawnli...@gmail.com> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>>> Hi, >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> 1. Want to check what is underlying running for limit >>>>>>>>>>>>>> clause used in the following Upsert statement (is it involving >>>>>>>>>>>>>> any >>>>>>>>>>>>>> coprocessor working behind?): >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> * upsert into table2 select >>>>>>>>>>>>>> * from table1 limit 3000000; * (table 1 and table 2 have >>>>>>>>>>>>>> same schema) >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> The above statement is running a lot slower >>>>>>>>>>>>>> than without “limit” clause as shown in following, even the >>>>>>>>>>>>>> above >>>>>>>>>>>>>> statement upsert less data: >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> * upsert into table2 select * >>>>>>>>>>>>>> from table1;* >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> 2. We also observe memory usable is pretty high >>>>>>>>>>>>>> without the limit clause (8gb vs 2gb), sometimes for big table >>>>>>>>>>>>>> it can reach >>>>>>>>>>>>>> 20gb without using limit clause. According to phoenix website >>>>>>>>>>>>>> description >>>>>>>>>>>>>> for upsert select “If auto commit is on, and both a) the target >>>>>>>>>>>>>> table >>>>>>>>>>>>>> matches the source table, and b) the select performs no >>>>>>>>>>>>>> aggregation, then >>>>>>>>>>>>>> the population of the target table will be done completely on the >>>>>>>>>>>>>> server-side (with constraint violations logged, but otherwise >>>>>>>>>>>>>> ignored).” >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> My question is If everything is done on >>>>>>>>>>>>>> server-side, how come we have such high memory usage on the >>>>>>>>>>>>>> client machine? >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> Thanks, >>>>>>>>>>>>>> >>>>>>>>>>>>>> Shawn >>>>>>>>>>>>>> >>>>>>>>>>>>>