Hi Jonathan, Running single-threaded on one side sounds logical to me. Hopefully Vincent can confirm it.
Thanks, Shawn On Mon, Dec 17, 2018 at 9:25 PM Jonathan Leech <jonat...@gmail.com> wrote: > My guess is that in order to enforce the limit, it’s effectively > single-threaded in either the select or the upsert. > > On Dec 17, 2018, at 6:43 PM, Shawn Li <shawnli...@gmail.com> wrote: > > Hi Vincent, > > Thanks for explaining. That makes much more sense now, and it explains the > high memory usage without the "limit" clause. Because it upserts much more > quickly when using "upsert select" without "limit", the memory usage on the > client machine is much higher than with "upsert select" with "limit". > > So back to the other question. Can you explain the underlying Phoenix > implementation of "upsert select limit"? Why is it slower than without > "limit" when inserting a huge number of rows (2M), as in "upsert into table2 > select * from table1 limit 2000000;"? This is much slower than inserting > the whole table (upsert into table2 select * from table1;). > > Thanks, > Xiang > > > On Mon, Dec 17, 2018 at 1:56 PM Vincent Poon <vincentp...@apache.org> > wrote: > >> Shawn, >> Your query *upsert into table2 select * from table1; *would not be run >> on the server - the source and target tables are different. It would have >> to be something like: >> *upsert into table1 select * from table1;* >> >> If you want to run server-side upsert select on a target table that is >> different from the source table, you need to set >> "phoenix.client.enable.server.upsert.select" to true on your client. >> There are some other restrictions: the table can't have any global indexes, >> and the statement can't have a join or a where subquery. We need to update >> the documentation with this information. >> >> The reason there are all these hurdles is that it's generally not >> recommended to do server-side upsert select across different tables, >> because that means you're doing cross-regionserver RPCs (e.g. 
read data >> from a region of sourcetable, and write to a region of targettable on a >> different regionserver), potentially tying up handlers in an unpredictable >> way. >> >> On Sun, Dec 16, 2018 at 7:12 PM Shawn Li <shawnli...@gmail.com> wrote: >> >>> Hi Jaanai, >>> >>> According to the Phoenix website, "If auto commit is on, and both a) the >>> target table matches the source table, and b) the select performs no >>> aggregation, then the population of the target table will be done >>> completely on the server-side (with constraint violations logged, but >>> otherwise ignored). Otherwise, data is buffered on the client and, if auto >>> commit is on, committed in row batches as specified by the UpsertBatchSize >>> connection property (or the phoenix.mutate.upsertBatchSize HBase config >>> property which defaults to 10000 rows)" >>> >>> And our SQL statement is just *upsert into table2 select * from >>> table1;*, which should match the first case, so all operations should be on >>> the server side. But the memory usage on the client machine is higher than with >>> the "upsert select limit" clause. The memory usage was checked by running the 'top' >>> command on Linux, so we are sure it is caused by the "upsert select" in >>> Phoenix and nothing else, and we can't explain why memory usage is so high >>> on the client/gateway machine when all operations are supposed to happen on the >>> server side. >>> >>> Thanks, >>> Shawn >>> >>> On Thu, Dec 13, 2018 at 3:15 AM Jaanai Zhang <cloud.pos...@gmail.com> >>> wrote: >>> >>>> Shawn, >>>> >>>> The UPSERT SELECT will run in a coprocessor only if it has no limit >>>> clause, queries only one table, performs no aggregation, uses no sequences, >>>> and auto commit is on. Please check your SQL ... and you can also check >>>> whether some resources have not been released. >>>> >>>> ---------------------------------------- >>>> Jaanai Zhang >>>> Best regards! 
>>>> >>>> >>>> Shawn Li <shawnli...@gmail.com> wrote on Thu, Dec 13, 2018 at 12:10 PM: >>>> >>>>> Hi Jaanai, >>>>> >>>>> Thanks for sharing your thoughts. The behavior you describe is correct >>>>> on the HBase region server side. The memory usage for blockcache and >>>>> memstore will be high under such high throughput. But our Phoenix client >>>>> is >>>>> on a gateway machine (no HBase region server sitting on it, nor any other HBase >>>>> service), so we're not sure how to explain such high memory usage for >>>>> upsert select without the "limit" clause. The high memory usage looks as if >>>>> all select results are sent to the client machine, cached in the client machine's >>>>> memory, and then inserted back into the target table, which is not the >>>>> behavior that should happen; all of this should be done on the server side >>>>> since the table schemas are exactly the same. By the way, this happens on both >>>>> Phoenix 4.7 and Phoenix 4.14. >>>>> >>>>> >>>>> Thanks, >>>>> Shawn >>>>> >>>>> On Wed, Dec 12, 2018 at 10:26 PM Jaanai Zhang <cloud.pos...@gmail.com> >>>>> wrote: >>>>> >>>>>> Shawn, >>>>>> >>>>>> >>>>>> The upsert without limit will read the source table and >>>>>> write the target table on the server side. I think the higher memory >>>>>> usage is caused by the scan cache and memstore under the higher >>>>>> throughput. >>>>>> >>>>>> ---------------------------------------- >>>>>> Jaanai Zhang >>>>>> Best regards! >>>>>> >>>>>> >>>>>> >>>>>> Shawn Li <shawnli...@gmail.com> wrote on Thu, Dec 13, 2018 at 10:13 AM: >>>>>> >>>>>>> Hi Vincent, >>>>>>> >>>>>>> So you're describing that with limit, results are sent to the client side and then >>>>>>> written back to the server; this might explain why upsert with limit is slower than >>>>>>> without >>>>>>> limit. But it doesn't seem to explain the memory usage: the memory >>>>>>> usage >>>>>>> on the client machine is 8 GB (without "limit") vs 2 GB (with limit), and sometimes >>>>>>> upsert without "limit" can even reach 20 GB for a big table. 
>>>>>>> >>>>>>> Thanks, >>>>>>> Shawn >>>>>>> >>>>>>> On Wed, Dec 12, 2018 at 6:34 PM Vincent Poon <vincentp...@apache.org> >>>>>>> wrote: >>>>>>> >>>>>>>> I think it's done client-side if you have LIMIT. If you have e.g. >>>>>>>> LIMIT 1000, it would be incorrect for each regionserver to upsert >>>>>>>> 1000, if >>>>>>>> you have more than one regionserver. So instead results are sent back >>>>>>>> to >>>>>>>> the client, where the LIMIT is applied and then written back to the >>>>>>>> server >>>>>>>> in the UPSERT. >>>>>>>> >>>>>>>> On Wed, Dec 12, 2018 at 1:18 PM Shawn Li <shawnli...@gmail.com> >>>>>>>> wrote: >>>>>>>>> >>>>>>>>> Hi Vincent, >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> The table creation statement is similar to the one below. We have about >>>>>>>>> 200 fields. The table is mutable and doesn't have any indexes. >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> CREATE TABLE IF NOT EXISTS us_population ( >>>>>>>>> >>>>>>>>> state CHAR(2) NOT NULL, >>>>>>>>> >>>>>>>>> city VARCHAR, >>>>>>>>> >>>>>>>>> population BIGINT, >>>>>>>>> >>>>>>>>> … >>>>>>>>> >>>>>>>>> CONSTRAINT my_pk PRIMARY KEY (state)); >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> Thanks, >>>>>>>>> >>>>>>>>> Shawn >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> On Wed, Dec 12, 2018, 13:42 Vincent Poon <vincentp...@apache.org >>>>>>>>> wrote: >>>>>>>>> >>>>>>>>>> For #2, can you provide the table definition and the statement >>>>>>>>>> used? e.g. Is the table immutable, or does it have indexes? >>>>>>>>>> >>>>>>>>>> On Tue, Dec 11, 2018 at 6:08 PM Shawn/Xiang Li < >>>>>>>>>> shawnli...@gmail.com> wrote: >>>>>>>>>> >>>>>>>>>>> Hi, >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> 1. 
Want to check what runs underneath for the limit >>>>>>>>>>> clause used in the following upsert statement (does it involve any >>>>>>>>>>> coprocessor working behind the scenes?): >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> * upsert into table2 select * >>>>>>>>>>> from table1 limit 3000000; * (table 1 and table 2 have the same >>>>>>>>>>> schema) >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> The above statement runs a lot slower than >>>>>>>>>>> without the “limit” clause, as shown in the following, even though the above >>>>>>>>>>> statement >>>>>>>>>>> upserts less data: >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> * upsert into table2 select * >>>>>>>>>>> from table1;* >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> 2. We also observe that memory usage is pretty high without >>>>>>>>>>> the limit clause (8 GB vs 2 GB); sometimes for a big table it can reach >>>>>>>>>>> 20 GB >>>>>>>>>>> without using the limit clause. According to the Phoenix website >>>>>>>>>>> description for >>>>>>>>>>> upsert select, “If auto commit is on, and both a) the target table >>>>>>>>>>> matches >>>>>>>>>>> the source table, and b) the select performs no aggregation, then >>>>>>>>>>> the >>>>>>>>>>> population of the target table will be done completely on the >>>>>>>>>>> server-side >>>>>>>>>>> (with constraint violations logged, but >>>>>>>>>>> otherwise ignored).” >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> My question is: if everything is done on the >>>>>>>>>>> server side, how come we have such high memory usage on the client >>>>>>>>>>> machine? >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> Thanks, >>>>>>>>>>> >>>>>>>>>>> Shawn >>>>>>>>>>
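[Editor's note: a minimal sketch summarizing the execution paths discussed in this thread, for future readers. The behavior comments restate what Vincent and Jaanai describe above (Phoenix 4.7/4.14 era); table names are the hypothetical table1/table2 from the thread, not a verified implementation reference.]

```sql
-- Runs entirely server-side in a coprocessor only when ALL of these hold:
-- auto commit is on, the target table matches the source table, and the
-- statement has no LIMIT, no aggregation, and no sequences:
UPSERT INTO table1 SELECT * FROM table1;

-- Different target table: by default rows are buffered on the client and
-- committed in batches of phoenix.mutate.upsertBatchSize (default 10000):
UPSERT INTO table2 SELECT * FROM table1;

-- With LIMIT, results are always pulled back to the client so the limit can
-- be applied globally across regionservers, then written back in the UPSERT:
UPSERT INTO table2 SELECT * FROM table1 LIMIT 2000000;
```

Per Vincent's message, server-side execution for the cross-table case can be opted into with the client-side property below, subject to the restrictions he lists (no global indexes on the table, no joins or WHERE subqueries), at the cost of cross-regionserver RPCs:

```
phoenix.client.enable.server.upsert.select=true
```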