Shawn, that's correct. UPSERT SELECT with the LIMIT does it in a single thread, whereas without the LIMIT it is done in parallel across multiple threads.

Your CSV upsert will be faster because it doesn't need to SELECT - it already has the data. You can try increasing phoenix.mutate.maxSize and phoenix.mutate.maxSizeBytes to see if that helps.
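A minimal sketch of raising those two client-side mutation buffer properties from a JDBC client; the ZooKeeper quorum and the property values are placeholders to tune against your client heap, not recommendations:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;
    import java.util.Properties;

    public class MutationBufferTuning {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            // Client-side caps on how many rows/bytes may be buffered before a
            // commit is forced; the values below are illustrative placeholders.
            props.setProperty("phoenix.mutate.maxSize", "1000000");         // rows
            props.setProperty("phoenix.mutate.maxSizeBytes", "1073741824"); // 1 GB
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:phoenix:zk-host:2181", props);
                 Statement stmt = conn.createStatement()) {
                conn.setAutoCommit(true);
                stmt.executeUpdate(
                    "UPSERT INTO table2 SELECT * FROM table1 LIMIT 2000000");
            }
        }
    }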
On Thu, Dec 20, 2018 at 10:41 AM Shawn Li <shawnli...@gmail.com> wrote:

> Hi Vincent,

> Thanks for checking the source code. You mentioned that with LIMIT the UPSERT uses a serial insert; can I interpret this as a single-threaded process? So are you saying UPSERT without LIMIT is faster because it uses parallel, multi-threaded inserts? If so, then it makes sense now for both the speed and the memory usage.

> But the speed of UPSERT with LIMIT is still pretty slow for us. The insert rate is about 300 rows/sec vs 3000 rows/sec without LIMIT. Also, bulk CSV data loading via psql also uses a single thread (per the Phoenix site: "Single-threaded client loading tool for CSV formatted data via the psql <https://phoenix.apache.org/download.html#Loading-Data> command"), yet its insert rate can be 3000-5000 rows/sec for us. Is there any other reason UPSERT with LIMIT is so slow?

> Thanks,
> Shawn

> On Wed, Dec 19, 2018 at 5:04 PM Vincent Poon <vincentp...@apache.org> wrote:

>> Shawn,

>> Took a quick look, I think what is happening is the UPSERT is done serially when you have LIMIT.
>> Parallel scans are issued for the SELECT, which is why the explain plan shows PARALLEL, but then the results are concatenated via a single LimitingResultIterator, in order to apply the CLIENT LIMIT.
>> The upsert then reads from that iterator and does the mutations in batches.

>> To insert in parallel, we would need some sort of shared state between the writing threads to ensure we respect the limit, and I don't think we currently have something like that.

>> Vincent
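A toy sketch of the flow described above - an illustration, not the actual Phoenix source: the parallel scan results are drained through a single limiting iterator so the global row count can be enforced, which leaves one consumer to do the batched mutations serially.

    import java.util.ArrayList;
    import java.util.Iterator;
    import java.util.List;

    // Illustration only: one iterator enforces the global LIMIT over the
    // concatenated parallel scan results, so only a single thread can safely
    // consume it and write the mutations in batches.
    class LimitingIterator<R> implements Iterator<R> {
        private final Iterator<R> scans; // concatenated parallel scan results
        private long remaining;

        LimitingIterator(Iterator<R> scans, long limit) {
            this.scans = scans;
            this.remaining = limit;
        }

        @Override public boolean hasNext() { return remaining > 0 && scans.hasNext(); }
        @Override public R next() { remaining--; return scans.next(); }

        // Single-threaded consumer: buffer rows, flush one mutation batch at a time.
        static <R> void upsertSerially(Iterator<R> rows, int batchSize) {
            List<R> batch = new ArrayList<>();
            while (rows.hasNext()) {
                batch.add(rows.next());
                if (batch.size() == batchSize) { /* commit the batch */ batch.clear(); }
            }
            if (!batch.isEmpty()) { /* commit the final batch */ }
        }
    }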
>> On Tue, Dec 18, 2018 at 2:31 PM Vincent Poon <vincentp...@apache.org> wrote:

>>> Shawn, that sounds like a bug, I would file a JIRA.

>>> On Tue, Dec 18, 2018 at 12:33 PM Shawn Li <shawnli...@gmail.com> wrote:

>>>> Hi Vincent & William,

>>>> Below is the explain plan; both are executed PARALLEL in the plan:

>>>> explain upsert into table1 select * from table2;

>>>> UPSERT SELECT
>>>> CLIENT 27-CHUNK 915799 ROWS 2831155510 BYTES PARALLEL 18-WAY ROUND ROBIN FULL SCAN OVER table2

>>>> explain upsert into table1 select * from table2 limit 2000000;

>>>> UPSERT SELECT
>>>> CLIENT 27-CHUNK 36000000 ROWS 481140000000 BYTES PARALLEL 18-WAY ROUND ROBIN FULL SCAN OVER table2
>>>>     SERVER 2000000 ROW LIMIT
>>>> CLIENT 2000000 ROW LIMIT

>>>> Thanks,
>>>> Shawn

>>>> On Tue, Dec 18, 2018, 13:30 Vincent Poon <vincentp...@apache.org> wrote:

>>>>> Shawn,

>>>>> Can you do an "explain" to show what your two statements are doing? That might give some clues. Perhaps one is able to be run on the server for some reason and the other is not.
>>>>> Otherwise, I don't see why one would be substantially slower than the other.

>>>>> Vincent

>>>>> On Mon, Dec 17, 2018 at 9:14 PM Shawn Li <shawnli...@gmail.com> wrote:

>>>>>> Hi Jonathan,

>>>>>> Being single threaded on one side sounds logical to me. Hopefully Vincent can confirm it.

>>>>>> Thanks,
>>>>>> Shawn

>>>>>> On Mon, Dec 17, 2018 at 9:25 PM Jonathan Leech <jonat...@gmail.com> wrote:

>>>>>>> My guess is that in order to enforce the limit, it's effectively single threaded in either the select or the upsert.

>>>>>>> On Dec 17, 2018, at 6:43 PM, Shawn Li <shawnli...@gmail.com> wrote:

>>>>>>> Hi Vincent,

>>>>>>> Thanks for explaining. That makes much more sense now, and it explains the high memory usage without the "limit" clause: because "upsert select" without "limit" upserts much more quickly, the memory usage on the client machine is much higher than with "limit".

>>>>>>> So back to the other question: can you explain the underlying Phoenix implementation of "upsert select" with "limit"? Why is it slower than without "limit" when inserting a huge number of rows (2M), as in "upsert into table2 select * from table1 limit 2,000,000;"? This is much slower than inserting the whole table (upsert into table2 select * from table1;).

>>>>>>> Thanks,
>>>>>>> Xiang

>>>>>>> On Mon, Dec 17, 2018 at 1:56 PM Vincent Poon <vincentp...@apache.org> wrote:

>>>>>>>> Shawn,
>>>>>>>> Your query *upsert into table2 select * from table1;* would not be run on the server - the source and target table are different. It would have to be something like:
>>>>>>>> *upsert into table1 select * from table1;*

>>>>>>>> If you want to run a server-side upsert select on a target table that is different from the source table, you need to set "phoenix.client.enable.server.upsert.select" to true on your client. There are some other restrictions: the table can't have any global indexes, and the statement can't have a join or where subquery. We need to update the documentation with this information.

>>>>>>>> The reason there are all these hurdles is that it's generally not recommended to do server-side upsert select across different tables, because that means you're doing cross-regionserver RPCs (e.g. read data from a region of sourcetable, and write to a region of targettable on a different regionserver), potentially tying up handlers in an unpredictable way.
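A sketch of opting in to the server-side path described above, assuming a plain JDBC client (the ZooKeeper quorum is a placeholder, and the restrictions Vincent lists - no global indexes, no join or where subquery - still apply):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;
    import java.util.Properties;

    public class ServerSideUpsertSelect {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            // Opt in to server-side UPSERT SELECT across different tables.
            props.setProperty("phoenix.client.enable.server.upsert.select", "true");
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:phoenix:zk-host:2181", props);
                 Statement stmt = conn.createStatement()) {
                conn.setAutoCommit(true); // the server-side path requires auto commit
                stmt.executeUpdate("UPSERT INTO table2 SELECT * FROM table1");
            }
        }
    }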
>>>>>>>> On Sun, Dec 16, 2018 at 7:12 PM Shawn Li <shawnli...@gmail.com> wrote:

>>>>>>>>> Hi Jaanai,

>>>>>>>>> According to the Phoenix website, "If auto commit is on, and both a) the target table matches the source table, and b) the select performs no aggregation, then the population of the target table will be done completely on the server-side (with constraint violations logged, but otherwise ignored). Otherwise, data is buffered on the client and, if auto commit is on, committed in row batches as specified by the UpsertBatchSize connection property (or the phoenix.mutate.upsertBatchSize HBase config property which defaults to 10000 rows)."

>>>>>>>>> And our sql statement is just *upsert into table2 select * from table1;* which should match the first case, so all operations should be on the server side. But the memory usage on the client machine is higher than with the "upsert select ... limit" statement. The memory usage was checked by running the 'top' command under Linux, so we are sure it is caused by the "upsert select" in Phoenix and nothing else, and we can't explain why there is such high memory usage on the client/gateway machine when all operations are supposed to happen on the server side.

>>>>>>>>> Thanks,
>>>>>>>>> Shawn
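To make the client-side branch of that quote concrete, here is a rough hand-rolled equivalent of what it describes: rows stream back to the client and mutations are committed in batches. The batch size mirrors the 10000-row default quoted above, and the columns come from the us_population example shown later in this thread; this illustrates the behavior, it is not the actual Phoenix code path:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class ClientSideUpsertSelect {
        public static void main(String[] args) throws Exception {
            final int batchSize = 10000; // mirrors the phoenix.mutate.upsertBatchSize default
            try (Connection conn = DriverManager.getConnection("jdbc:phoenix:zk-host:2181")) {
                conn.setAutoCommit(false);
                try (Statement sel = conn.createStatement();
                     ResultSet rs = sel.executeQuery(
                         "SELECT state, city, population FROM table1");
                     PreparedStatement ups = conn.prepareStatement(
                         "UPSERT INTO table2 (state, city, population) VALUES (?, ?, ?)")) {
                    long n = 0;
                    while (rs.next()) {
                        ups.setString(1, rs.getString(1));
                        ups.setString(2, rs.getString(2));
                        ups.setLong(3, rs.getLong(3));
                        ups.executeUpdate();                      // buffered client-side
                        if (++n % batchSize == 0) conn.commit();  // flush one batch
                    }
                    conn.commit(); // flush the final partial batch
                }
            }
        }
    }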
>>>>>>>>> On Thu, Dec 13, 2018 at 3:15 AM Jaanai Zhang <cloud.pos...@gmail.com> wrote:

>>>>>>>>>> Shawn,

>>>>>>>>>> The UPSERT SELECT will run in a coprocessor only if it has no limit clause, queries only one table, performs no aggregation, uses no sequences, and auto commit is on. Please check your SQL ... and you can also check whether some resources have not been released.

>>>>>>>>>> ----------------------------------------
>>>>>>>>>> Jaanai Zhang
>>>>>>>>>> Best regards!

>>>>>>>>>> On Thu, Dec 13, 2018 at 12:10 PM Shawn Li <shawnli...@gmail.com> wrote:

>>>>>>>>>>> Hi Jaanai,

>>>>>>>>>>> Thanks for sharing your thoughts. The behavior you describe is correct on the HBase region server side: memory usage for the block cache and memstore will be high under such high throughput. But our Phoenix client is on a gateway machine (no HBase region server or any other HBase service sitting on it), so I am not sure how that explains such high memory usage for upsert select without the "limit" clause. The high memory usage looks as if all select results were sent to the client machine, cached in the client machine's memory, and then inserted back into the target table - which is not the behavior that should happen, since all of this should be done on the server side given that the table schemas are exactly the same. By the way, this happens on both Phoenix 4.7 and Phoenix 4.14.

>>>>>>>>>>> Thanks,
>>>>>>>>>>> Shawn

>>>>>>>>>>> On Wed, Dec 12, 2018 at 10:26 PM Jaanai Zhang <cloud.pos...@gmail.com> wrote:

>>>>>>>>>>>> Shawn,

>>>>>>>>>>>> The upsert without limit will read the source table and write the target table on the server side. I think the higher memory usage is caused by the scan cache and memstore under the higher throughput.

>>>>>>>>>>>> ----------------------------------------
>>>>>>>>>>>> Jaanai Zhang
>>>>>>>>>>>> Best regards!

>>>>>>>>>>>> On Thu, Dec 13, 2018 at 10:13 AM Shawn Li <shawnli...@gmail.com> wrote:

>>>>>>>>>>>>> Hi Vincent,

>>>>>>>>>>>>> So you are describing that with limit, results are sent to the client side and then written back to the server; that might explain why upsert with limit is slower than without limit. But it doesn't seem to explain the memory usage: the memory usage on the client machine is 8 GB (without "limit") vs 2 GB (with "limit"), and sometimes upsert without "limit" can even reach 20 GB for a big table.

>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Shawn

>>>>>>>>>>>>> On Wed, Dec 12, 2018 at 6:34 PM Vincent Poon <vincentp...@apache.org> wrote:

>>>>>>>>>>>>>> I think it's done client-side if you have LIMIT. With e.g. LIMIT 1000 and more than one regionserver, it would be incorrect for each regionserver to apply its own share of the limit independently. So instead results are sent back to the client, where the LIMIT is applied and then written back to the server in the UPSERT.

>>>>>>>>>>>>>> On Wed, Dec 12, 2018 at 1:18 PM Shawn Li <shawnli...@gmail.com> wrote:

>>>>>>>>>>>>>>> Hi Vincent,

>>>>>>>>>>>>>>> The table creation statement is similar to the one below. We have about 200 fields. The table is mutable and doesn't have any indexes on it.

>>>>>>>>>>>>>>> CREATE TABLE IF NOT EXISTS us_population (
>>>>>>>>>>>>>>>     state CHAR(2) NOT NULL,
>>>>>>>>>>>>>>>     city VARCHAR,
>>>>>>>>>>>>>>>     population BIGINT,
>>>>>>>>>>>>>>>     ...
>>>>>>>>>>>>>>>     CONSTRAINT my_pk PRIMARY KEY (state));

>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>> Shawn
>>>>>>>>>>>>>>> On Wed, Dec 12, 2018, 13:42 Vincent Poon <vincentp...@apache.org> wrote:

>>>>>>>>>>>>>>>> For #2, can you provide the table definition and the statement used? e.g. Is the table immutable, or does it have indexes?

>>>>>>>>>>>>>>>> On Tue, Dec 11, 2018 at 6:08 PM Shawn/Xiang Li <shawnli...@gmail.com> wrote:

>>>>>>>>>>>>>>>>> Hi,

>>>>>>>>>>>>>>>>> 1. I want to check what runs underneath the limit clause used in the following upsert statement (does it involve any coprocessor working behind the scenes?):

>>>>>>>>>>>>>>>>> *upsert into table2 select * from table1 limit 3000000;* (table1 and table2 have the same schema)

>>>>>>>>>>>>>>>>> The above statement runs a lot slower than the version without the "limit" clause shown below, even though it upserts less data:

>>>>>>>>>>>>>>>>> *upsert into table2 select * from table1;*

>>>>>>>>>>>>>>>>> 2. We also observe that memory usage is pretty high without the limit clause (8 GB vs 2 GB); sometimes for a big table it can reach 20 GB without the limit clause. According to the Phoenix website's description of upsert select, "If auto commit is on, and both a) the target table matches the source table, and b) the select performs no aggregation, then the population of the target table will be done completely on the server-side (with constraint violations logged, but otherwise ignored)."

>>>>>>>>>>>>>>>>> My question is: if everything is done on the server side, how come we have such high memory usage on the client machine?

>>>>>>>>>>>>>>>>> Thanks,

>>>>>>>>>>>>>>>>> Shawn
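Finally, since the thread turns on comparing the two plans, a small sketch of fetching them programmatically; this assumes a plain JDBC client and a placeholder ZooKeeper quorum, and relies on EXPLAIN returning its plan as ordinary result rows:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class ComparePlans {
        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection("jdbc:phoenix:zk-host:2181");
                 Statement stmt = conn.createStatement()) {
                for (String sql : new String[] {
                        "EXPLAIN UPSERT INTO table2 SELECT * FROM table1",
                        "EXPLAIN UPSERT INTO table2 SELECT * FROM table1 LIMIT 2000000"}) {
                    System.out.println(sql);
                    try (ResultSet rs = stmt.executeQuery(sql)) {
                        while (rs.next()) {
                            System.out.println("  " + rs.getString(1)); // one plan step per row
                        }
                    }
                }
            }
        }
    }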