Hi Vincent & William,
Below are the explain plans; both are executed in PARALLEL:

explain upsert into table1 select * from table2;

    UPSERT SELECT
    CLIENT 27-CHUNK 915799 ROWS 2831155510 BYTES PARALLEL 18-WAY ROUND ROBIN FULL SCAN OVER table2

explain upsert into table1 select * from table2 limit 2000000;

    UPSERT SELECT
    CLIENT 27-CHUNK 36000000 ROWS 481140000000 BYTES PARALLEL 18-WAY ROUND ROBIN FULL SCAN OVER table2
        SERVER 2000000 ROW LIMIT
    CLIENT 2000000 ROW LIMIT

Thanks,
Shawn

On Tue, Dec 18, 2018, 13:30 Vincent Poon <vincentp...@apache.org> wrote:

> Shawn,
>
> Can you do an "explain" to show what your two statements are doing? That
> might give some clues. Perhaps one is able to be run on the server for
> some reason and the other is not. Otherwise, I don't see why one would be
> substantially slower than the other.
>
> Vincent
>
> On Mon, Dec 17, 2018 at 9:14 PM Shawn Li <shawnli...@gmail.com> wrote:
>
>> Hi Jonathan,
>>
>> The single-threaded explanation sounds logical to me. Hopefully Vincent
>> can confirm it.
>>
>> Thanks,
>> Shawn
>>
>> On Mon, Dec 17, 2018 at 9:25 PM Jonathan Leech <jonat...@gmail.com>
>> wrote:
>>
>>> My guess is that in order to enforce the limit, it's effectively
>>> single-threaded in either the select or the upsert.
>>>
>>> On Dec 17, 2018, at 6:43 PM, Shawn Li <shawnli...@gmail.com> wrote:
>>>
>>> Hi Vincent,
>>>
>>> Thanks for explaining. That makes much more sense now, and it explains
>>> the high memory usage without the "limit" clause: because "upsert
>>> select" without "limit" upserts much more quickly, the memory usage on
>>> the client machine is much higher than with "limit".
>>>
>>> So back to the other question: can you explain the underlying Phoenix
>>> implementation of "upsert select" with "limit"? Why is it slower than
>>> the version without "limit" when inserting a huge number of rows (2M),
>>> as in "upsert into table2 select * from table1 limit 2000000;"?
>>> This is much slower than inserting the whole table
>>> (upsert into table2 select * from table1;).
>>>
>>> Thanks,
>>> Xiang
>>>
>>> On Mon, Dec 17, 2018 at 1:56 PM Vincent Poon <vincentp...@apache.org>
>>> wrote:
>>>
>>>> Shawn,
>>>> Your query *upsert into table2 select * from table1;* would not be run
>>>> on the server - the source and target table are different. It would
>>>> have to be something like:
>>>> *upsert into table1 select * from table1;*
>>>>
>>>> If you want to run a server-side upsert select on a target table that
>>>> is different from the source table, you need to set
>>>> "phoenix.client.enable.server.upsert.select" to true on your client.
>>>> There are some other restrictions: the table can't have any global
>>>> indexes, and the statement can't have a join or a where subquery. We
>>>> need to update the documentation with this information.
>>>>
>>>> The reason for all these hurdles is that it's generally not
>>>> recommended to do a server-side upsert select across different tables,
>>>> because that means you're doing cross-regionserver RPCs (e.g. reading
>>>> data from a region of sourcetable and writing to a region of
>>>> targettable on a different regionserver), potentially tying up
>>>> handlers in an unpredictable way.
>>>>
>>>> On Sun, Dec 16, 2018 at 7:12 PM Shawn Li <shawnli...@gmail.com> wrote:
>>>>
>>>>> Hi Jaanai,
>>>>>
>>>>> According to the Phoenix website: "If auto commit is on, and both a)
>>>>> the target table matches the source table, and b) the select performs
>>>>> no aggregation, then the population of the target table will be done
>>>>> completely on the server-side (with constraint violations logged, but
>>>>> otherwise ignored).
>>>>> Otherwise, data is buffered on the client and, if auto commit is on,
>>>>> committed in row batches as specified by the UpsertBatchSize
>>>>> connection property (or the phoenix.mutate.upsertBatchSize HBase
>>>>> config property, which defaults to 10000 rows)."
>>>>>
>>>>> And our sql statement is just *upsert into table2 select * from
>>>>> table1;*, which should match the first case, so all operations should
>>>>> happen on the server side. But the memory usage on the client machine
>>>>> is higher than with the "upsert select ... limit" statement. The
>>>>> memory usage was checked by running the 'top' command on Linux, so we
>>>>> are sure it is caused by the Phoenix "upsert select" and not by
>>>>> anything else, and we can't explain why there is such high memory
>>>>> usage on the client/gateway machine when all operations are supposed
>>>>> to happen on the server side.
>>>>>
>>>>> Thanks,
>>>>> Shawn
>>>>>
>>>>> On Thu, Dec 13, 2018 at 3:15 AM Jaanai Zhang <cloud.pos...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Shawn,
>>>>>>
>>>>>> The UPSERT SELECT will run in a coprocessor only if it has no limit
>>>>>> clause, queries only one table, performs no aggregation, uses no
>>>>>> sequences, and auto commit is on. Please check your SQL, and also
>>>>>> check whether some resources have not been released.
>>>>>>
>>>>>> ----------------------------------------
>>>>>> Jaanai Zhang
>>>>>> Best regards!
>>>>>>
>>>>>> Shawn Li <shawnli...@gmail.com> wrote on Thu, Dec 13, 2018 at
>>>>>> 12:10 PM:
>>>>>>
>>>>>>> Hi Jaanai,
>>>>>>>
>>>>>>> Thanks for sharing your thoughts. The behavior you describe is
>>>>>>> correct on the HBase region server side: the memory usage for the
>>>>>>> block cache and the memstore will be high under such high
>>>>>>> throughput. But our Phoenix client is on a gateway machine (with no
>>>>>>> HBase region server or any other HBase service on it), so we are
>>>>>>> not sure how to explain such high memory usage for upsert select
>>>>>>> without the "limit" clause.
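The client-side buffering that the docs excerpt above describes can be sketched in a few lines (a minimal Python model, not Phoenix internals; the function and variable names are invented, and the batch size simply mirrors the stated phoenix.mutate.upsertBatchSize default of 10000):

```python
# Toy model of the client-side UPSERT SELECT path from the docs excerpt:
# selected rows are buffered on the client and committed in fixed-size
# batches. Illustrative only - all names here are made up, not Phoenix code.

UPSERT_BATCH_SIZE = 10000  # mirrors the phoenix.mutate.upsertBatchSize default


def upsert_select_client_side(source_rows, commit):
    """Buffer rows on the client; commit in batches of UPSERT_BATCH_SIZE.

    Returns the number of commits issued. The client holds at most one
    batch in memory at a time in this model.
    """
    buffer = []
    commits = 0
    for row in source_rows:
        buffer.append(row)
        if len(buffer) >= UPSERT_BATCH_SIZE:
            commit(buffer)  # one round trip of mutations to the servers
            commits += 1
            buffer = []
    if buffer:  # flush the final partial batch
        commit(buffer)
        commits += 1
    return commits


# 2,000,000 rows -> 200 commits of 10,000 rows each
n_commits = upsert_select_client_side(iter(range(2_000_000)),
                                      commit=lambda batch: None)
print(n_commits)  # 200
```

In this model the client never holds more than one batch of rows at a time, so on its own it would not account for many gigabytes of client memory.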
>>>>>>> It looks as if all the select results were sent to the client
>>>>>>> machine, cached in the client machine's memory, and then inserted
>>>>>>> back into the target table - which is not the behavior that should
>>>>>>> happen, since all of this should be done on the server side because
>>>>>>> the table schemas are exactly the same. By the way, this happens on
>>>>>>> both Phoenix 4.7 and Phoenix 4.14.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Shawn
>>>>>>>
>>>>>>> On Wed, Dec 12, 2018 at 10:26 PM Jaanai Zhang
>>>>>>> <cloud.pos...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Shawn,
>>>>>>>>
>>>>>>>> The upsert without limit reads the source table and writes the
>>>>>>>> target table on the server side. I think the higher memory usage is
>>>>>>>> caused by the scan cache and the memstore under the higher
>>>>>>>> throughput.
>>>>>>>>
>>>>>>>> ----------------------------------------
>>>>>>>> Jaanai Zhang
>>>>>>>> Best regards!
>>>>>>>>
>>>>>>>> Shawn Li <shawnli...@gmail.com> wrote on Thu, Dec 13, 2018 at
>>>>>>>> 10:13 AM:
>>>>>>>>
>>>>>>>>> Hi Vincent,
>>>>>>>>>
>>>>>>>>> So you are saying that with limit, the results are sent to the
>>>>>>>>> client side and then written back to the server; that might
>>>>>>>>> explain why upsert with limit is slower than without limit. But it
>>>>>>>>> doesn't seem to explain the memory usage: the memory usage on the
>>>>>>>>> client machine is 8 GB (without "limit") vs 2 GB (with "limit"),
>>>>>>>>> and upsert without "limit" can even reach 20 GB for a big table.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Shawn
>>>>>>>>>
>>>>>>>>> On Wed, Dec 12, 2018 at 6:34 PM Vincent Poon
>>>>>>>>> <vincentp...@apache.org> wrote:
>>>>>>>>>
>>>>>>>>>> I think it's done client-side if you have LIMIT. If you have
>>>>>>>>>> e.g. LIMIT 1000, it would be incorrect for each regionserver to
>>>>>>>>>> upsert 1000 if you have more than one regionserver.
>>>>>>>>>> So instead, results are sent back to the client, where the LIMIT
>>>>>>>>>> is applied, and they are then written back to the server in the
>>>>>>>>>> UPSERT.
>>>>>>>>>>
>>>>>>>>>> On Wed, Dec 12, 2018 at 1:18 PM Shawn Li <shawnli...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Vincent,
>>>>>>>>>>>
>>>>>>>>>>> The table creation statement is similar to the one below. We
>>>>>>>>>>> have about 200 fields. The table is mutable and doesn't have any
>>>>>>>>>>> indexes.
>>>>>>>>>>>
>>>>>>>>>>> CREATE TABLE IF NOT EXISTS us_population (
>>>>>>>>>>>     state CHAR(2) NOT NULL,
>>>>>>>>>>>     city VARCHAR,
>>>>>>>>>>>     population BIGINT,
>>>>>>>>>>>     ...
>>>>>>>>>>>     CONSTRAINT my_pk PRIMARY KEY (state));
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Shawn
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Dec 12, 2018, 13:42 Vincent Poon
>>>>>>>>>>> <vincentp...@apache.org> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> For #2, can you provide the table definition and the statement
>>>>>>>>>>>> used? E.g. is the table immutable, or does it have indexes?
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Dec 11, 2018 at 6:08 PM Shawn/Xiang Li
>>>>>>>>>>>> <shawnli...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> 1.
>>>>>>>>>>>>> Want to check what Phoenix does under the hood for the limit
>>>>>>>>>>>>> clause used in the following upsert statement (does it involve
>>>>>>>>>>>>> any coprocessor working behind the scenes?):
>>>>>>>>>>>>>
>>>>>>>>>>>>> *upsert into table2 select * from table1 limit 3000000;*
>>>>>>>>>>>>> (table1 and table2 have the same schema)
>>>>>>>>>>>>>
>>>>>>>>>>>>> The above statement runs a lot slower than the same statement
>>>>>>>>>>>>> without the "limit" clause, shown below, even though the
>>>>>>>>>>>>> statement with "limit" upserts less data:
>>>>>>>>>>>>>
>>>>>>>>>>>>> *upsert into table2 select * from table1;*
>>>>>>>>>>>>>
>>>>>>>>>>>>> 2. We also observe that memory usage is pretty high without
>>>>>>>>>>>>> the limit clause (8 GB vs 2 GB); for a big table it can
>>>>>>>>>>>>> sometimes reach 20 GB without the limit clause. According to
>>>>>>>>>>>>> the Phoenix website's description of upsert select: "If auto
>>>>>>>>>>>>> commit is on, and both a) the target table matches the source
>>>>>>>>>>>>> table, and b) the select performs no aggregation, then the
>>>>>>>>>>>>> population of the target table will be done completely on the
>>>>>>>>>>>>> server-side (with constraint violations logged, but otherwise
>>>>>>>>>>>>> ignored)."
>>>>>>>>>>>>>
>>>>>>>>>>>>> My question is: if everything is done on the server side, how
>>>>>>>>>>>>> come we have such high memory usage on the client machine?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Shawn
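Vincent's point upthread about why LIMIT forces a client-side step can be illustrated with a toy model (plain Python with invented data; this is not Phoenix code). With more than one regionserver, each server can at best stop its scan early at the limit; only the client, seeing the merged results, can apply the final LIMIT, which matches the separate SERVER and CLIENT ROW LIMIT lines in the explain plan above:

```python
# Toy model: why a LIMIT cannot simply be applied per regionserver.
# Three "regions" of made-up row data; none of this is Phoenix code.

def scan_region(region_rows, limit=None):
    """Simulate one regionserver scan, optionally stopping at `limit` rows."""
    return region_rows[:limit] if limit is not None else region_rows


def per_server_limit_only(regions, limit):
    """Incorrect: each regionserver applies the LIMIT locally and nothing more."""
    out = []
    for rows in regions:
        out.extend(scan_region(rows, limit))
    return out


def client_side_limit(regions, limit):
    """Servers stop early at `limit` rows each (SERVER ... ROW LIMIT in the
    plan), then the client truncates the merged results to the final LIMIT
    (CLIENT ... ROW LIMIT in the plan)."""
    merged = []
    for rows in regions:
        merged.extend(scan_region(rows, limit))
    return merged[:limit]


regions = [list(range(0, 500)), list(range(500, 900)), list(range(900, 1600))]
limit = 1000

print(len(per_server_limit_only(regions, limit)))  # 1600 - exceeds the limit
print(len(client_side_limit(regions, limit)))      # 1000 - correct
```

The single merge-and-truncate step on the client is consistent with Jonathan's guess that the limited path is effectively single-threaded on one side, and with the rows flowing through the client in the UPSERT.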