My guess is that in order to enforce the limit, it's effectively single-threaded in either the select or the upsert.
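One way to check is to look at the query plan for the statement. Below is a minimal sketch using the standard Phoenix JDBC driver (the ZK quorum in the URL is a placeholder; the table names are the ones from this thread):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class ExplainUpsert {
        public static void main(String[] args) throws Exception {
            // Placeholder ZK quorum; adjust for your cluster.
            try (Connection conn =
                         DriverManager.getConnection("jdbc:phoenix:zkhost:2181");
                 Statement stmt = conn.createStatement();
                 // EXPLAIN works on UPSERT SELECT as well as plain SELECT
                 ResultSet rs = stmt.executeQuery(
                         "EXPLAIN UPSERT INTO table2 SELECT * FROM table1 LIMIT 2000000")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1)); // one plan step per row
                }
            }
        }
    }

A plan step that shows the row limit being applied on the client rather than on the server would be consistent with the behavior described below.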
> On Dec 17, 2018, at 6:43 PM, Shawn Li <shawnli...@gmail.com> wrote:
>
> Hi Vincent,
>
> Thanks for explaining. That makes much more sense now, and it explains the
> high memory usage without the "limit" clause. Because "upsert select"
> without "limit" upserts much more quickly, the memory usage on the client
> machine is much higher than with "upsert select ... limit".
>
> So back to the other question. Can you explain the underlying Phoenix
> implementation of "upsert select limit"? Why is it slower than without
> "limit" when inserting a huge number of rows (2M), as in "upsert into
> table2 select * from table1 limit 2,000,000;"? This is much slower than
> inserting the whole table (upsert into table2 select * from table1;).
>
> Thanks,
> Xiang
>
>> On Mon, Dec 17, 2018 at 1:56 PM Vincent Poon <vincentp...@apache.org> wrote:
>> Shawn,
>> Your query "upsert into table2 select * from table1;" would not be run on
>> the server - the source and target table are different. It would have to
>> be something like:
>> upsert into table1 select * from table1;
>>
>> If you want to run a server-side upsert select on a target table that is
>> different from the source table, you need to set
>> "phoenix.client.enable.server.upsert.select" to true on your client.
>> There are some other restrictions: the table can't have any global
>> indexes, and the statement can't have a join or a WHERE subquery. We need
>> to update the documentation with this information.
>>
>> The reason there are all these hurdles is that it's generally not
>> recommended to do a server-side upsert select across different tables,
>> because that means you're doing cross-regionserver RPCs (e.g. reading data
>> from a region of the source table and writing to a region of the target
>> table on a different regionserver), potentially tying up handlers in an
>> unpredictable way.
>>
>>> On Sun, Dec 16, 2018 at 7:12 PM Shawn Li <shawnli...@gmail.com> wrote:
>>> Hi Jaanai,
>>>
>>> According to the Phoenix website, "If auto commit is on, and both a) the
>>> target table matches the source table, and b) the select performs no
>>> aggregation, then the population of the target table will be done
>>> completely on the server-side (with constraint violations logged, but
>>> otherwise ignored). Otherwise, data is buffered on the client and, if
>>> auto commit is on, committed in row batches as specified by the
>>> UpsertBatchSize connection property (or the phoenix.mutate.upsertBatchSize
>>> HBase config property which defaults to 10000 rows)".
>>>
>>> And our SQL statement is just "upsert into table2 select * from table1;",
>>> which should match the first case, so all operations should happen on the
>>> server side. But the memory usage on the client machine is higher than
>>> with the "upsert select ... limit" version. The memory usage was checked
>>> by running the 'top' command under Linux, so we are sure it is caused by
>>> the Phoenix "upsert select" and not by something else, and we can't
>>> explain why there is such high memory usage on the client/gateway machine
>>> when all operations are supposed to happen on the server side.
>>>
>>> Thanks,
>>> Shawn
>>>
>>>> On Thu, Dec 13, 2018 at 3:15 AM Jaanai Zhang <cloud.pos...@gmail.com> wrote:
>>>> Shawn,
>>>>
>>>> The UPSERT SELECT will run in a coprocessor only if it has no LIMIT
>>>> clause, queries a single table, performs no aggregation, uses no
>>>> sequences, and auto commit is on. Please check your SQL ... and you can
>>>> also check whether some resources have not been released.
>>>>
>>>> ----------------------------------------
>>>>    Jaanai Zhang
>>>>    Best regards!
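For reference, a minimal sketch of setting the property Vincent mentions on a Phoenix JDBC connection (the ZK quorum in the URL is a placeholder; the table names are the ones used in this thread):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;
    import java.util.Properties;

    public class ServerSideUpsertSelect {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            // Property name from Vincent's email: allows server-side
            // UPSERT SELECT even when source and target tables differ.
            props.setProperty("phoenix.client.enable.server.upsert.select", "true");
            try (Connection conn =
                         DriverManager.getConnection("jdbc:phoenix:zkhost:2181", props);
                 Statement stmt = conn.createStatement()) {
                conn.setAutoCommit(true); // server-side execution also requires auto commit
                stmt.executeUpdate("UPSERT INTO table2 SELECT * FROM table1");
            }
        }
    }

Note that the other restrictions Vincent lists still apply: no global indexes on the table, and no join or WHERE subquery in the statement.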
>>>> On Thu, Dec 13, 2018 at 12:10 PM Shawn Li <shawnli...@gmail.com> wrote:
>>>>> Hi Jaanai,
>>>>>
>>>>> Thanks for sharing your thoughts. The behavior you describe is correct
>>>>> on the HBase region server side: the memory usage for the block cache
>>>>> and memstore will be high under such high throughput. But our Phoenix
>>>>> client is on a gateway machine (with no HBase region server or any
>>>>> other HBase service on it), so we are not sure how to explain such high
>>>>> memory usage for upsert select without the "limit" clause. The high
>>>>> memory usage looks as if all select results are sent to the client
>>>>> machine, cached in the client machine's memory, and then inserted back
>>>>> into the target table - which is not the behavior that should happen;
>>>>> all of this should be done on the server side, since the table schemas
>>>>> are exactly the same. By the way, this happens on both Phoenix 4.7 and
>>>>> Phoenix 4.14.
>>>>>
>>>>> Thanks,
>>>>> Shawn
>>>>>
>>>>>> On Wed, Dec 12, 2018 at 10:26 PM Jaanai Zhang <cloud.pos...@gmail.com> wrote:
>>>>>> Shawn,
>>>>>>
>>>>>> The upsert without limit will read the source table and write the
>>>>>> target table on the server side. I think the higher memory usage is
>>>>>> caused by the scan cache and memstore under the higher throughput.
>>>>>>
>>>>>> ----------------------------------------
>>>>>>    Jaanai Zhang
>>>>>>    Best regards!
>>>>>>
>>>>>> On Thu, Dec 13, 2018 at 10:13 AM Shawn Li <shawnli...@gmail.com> wrote:
>>>>>>> Hi Vincent,
>>>>>>>
>>>>>>> So you are describing that with "limit" the results are sent to the
>>>>>>> client side and then written back to the server; that might explain
>>>>>>> why upsert with "limit" is slower than without it. But it doesn't
>>>>>>> seem to explain the memory usage: the memory usage on the client
>>>>>>> machine is 8 GB (without "limit") vs 2 GB (with "limit"), and
>>>>>>> sometimes upsert without "limit" can even reach 20 GB for a big
>>>>>>> table.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Shawn
>>>>>>>
>>>>>>>> On Wed, Dec 12, 2018 at 6:34 PM Vincent Poon <vincentp...@apache.org> wrote:
>>>>>>>> I think it's done client-side if you have a LIMIT. If you have e.g.
>>>>>>>> LIMIT 1000, it would be incorrect for each regionserver to upsert
>>>>>>>> 1000 rows if you have more than one regionserver. So instead,
>>>>>>>> results are sent back to the client, where the LIMIT is applied and
>>>>>>>> then written back to the server in the UPSERT.
>>>>>>>>
>>>>>>>>> On Wed, Dec 12, 2018 at 1:18 PM Shawn Li <shawnli...@gmail.com> wrote:
>>>>>>>>> Hi Vincent,
>>>>>>>>>
>>>>>>>>> The table creation statement is similar to the one below. We have
>>>>>>>>> about 200 fields. The table is mutable and doesn't have any
>>>>>>>>> indexes.
>>>>>>>>>
>>>>>>>>> CREATE TABLE IF NOT EXISTS us_population (
>>>>>>>>>     state CHAR(2) NOT NULL,
>>>>>>>>>     city VARCHAR,
>>>>>>>>>     population BIGINT,
>>>>>>>>>     …
>>>>>>>>>     CONSTRAINT my_pk PRIMARY KEY (state));
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Shawn
>>>>>>>>>
>>>>>>>>>> On Wed, Dec 12, 2018, 13:42 Vincent Poon <vincentp...@apache.org> wrote:
>>>>>>>>>> For #2, can you provide the table definition and the statement
>>>>>>>>>> used? e.g. Is the table immutable, or does it have indexes?
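For reference, a minimal sketch of bounding client-side buffering via the UpsertBatchSize connection property from the documentation quoted above in the thread (the ZK quorum in the URL is a placeholder, and the batch size of 1000 is just an example value):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;
    import java.util.Properties;

    public class BatchedUpsertSelect {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            // Connection property from the Phoenix docs; defaults to 10000 rows.
            // A smaller batch should cap how many rows the client buffers per commit.
            props.setProperty("UpsertBatchSize", "1000");
            try (Connection conn =
                         DriverManager.getConnection("jdbc:phoenix:zkhost:2181", props);
                 Statement stmt = conn.createStatement()) {
                conn.setAutoCommit(true); // commit in row batches as rows stream back
                stmt.executeUpdate("UPSERT INTO table2 SELECT * FROM table1 LIMIT 2000000");
            }
        }
    }

This only matters for the client-side path (e.g. with LIMIT, or when the target table differs without the server-side property enabled); a fully server-side upsert select should not buffer rows on the client at all.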
>>>>>>>>>>
>>>>>>>>>>> On Tue, Dec 11, 2018 at 6:08 PM Shawn/Xiang Li <shawnli...@gmail.com> wrote:
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> 1. I want to check what runs underneath the limit clause used in
>>>>>>>>>>> the following upsert statement (does it involve any coprocessor
>>>>>>>>>>> working behind the scenes?):
>>>>>>>>>>>
>>>>>>>>>>>     upsert into table2 select * from table1 limit 3000000;
>>>>>>>>>>>     (table1 and table2 have the same schema)
>>>>>>>>>>>
>>>>>>>>>>> The above statement runs a lot slower than the version without
>>>>>>>>>>> the "limit" clause shown below, even though the above statement
>>>>>>>>>>> upserts less data:
>>>>>>>>>>>
>>>>>>>>>>>     upsert into table2 select * from table1;
>>>>>>>>>>>
>>>>>>>>>>> 2. We also observe that memory usage is pretty high without the
>>>>>>>>>>> limit clause (8 GB vs 2 GB); for a big table it can sometimes
>>>>>>>>>>> reach 20 GB without the limit clause. According to the Phoenix
>>>>>>>>>>> website's description of upsert select, "If auto commit is on,
>>>>>>>>>>> and both a) the target table matches the source table, and b)
>>>>>>>>>>> the select performs no aggregation, then the population of the
>>>>>>>>>>> target table will be done completely on the server-side (with
>>>>>>>>>>> constraint violations logged, but otherwise ignored)."
>>>>>>>>>>>
>>>>>>>>>>> My question is: if everything is done on the server side, how
>>>>>>>>>>> come we have such high memory usage on the client machine?
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Shawn