Prabhu,

To move 3M rows in 10 minutes, you'll need to process 5,000 rows/second. During your 4-hour run, you were processing ~200 rows/second.
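Spelling out that arithmetic (a minimal sketch; the 3M-row count is my assumption from this thread, and 200 rows/second/thread is the observed rate):

```python
# Back-of-envelope throughput math for the flow discussed in this thread.
rows = 3_000_000
target_minutes = 10

required_rate = rows / (target_minutes * 60)   # rows/second needed to finish in 10 min
observed_rate = rows / (4 * 3600)              # rows/second over the reported 4-hour run
threads_needed = required_rate / 200           # threads at ~200 rows/second per thread

print(required_rate)          # 5000.0
print(round(observed_rate))   # 208
print(threads_needed)         # 25.0
```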
Without any new optimizations you'll need ~25 threads and sufficient memory to feed them. I agree with Mark: you should be able to get far more than 200 rows/second. I ran a quick test using your ExtractText regex on similar data and was able to process over 100,000 rows/minute through the ExtractText processor. The input data was a single row of 4 fields delimited by the "|" symbol. *You might be processing the entire .dat file (instead of a single row) for each record.* *Can you check the FlowFile attributes and content going into ExtractText?*

Here is the flow with some notes:

1. GetFile (a 30 MB .dat file consisting of 3M rows; each row is about 10 bytes)
2. SplitText -> SplitText (to break the 3M rows down to manageable chunks of 10,000 lines per flow file, then split again to 1 line per flow file)
3. ExtractText to extract the 4 fields
4. ReplaceText to generate JSON (you can alternatively use AttributesToJSON here)
5. ConvertJSONtoSQL
6. PutSQL (this should be the true bottleneck; index the DB well and use many threads)

If my assumptions are incorrect, please let me know.

Thanks,
Lee

On Thu, Oct 20, 2016 at 1:43 AM, Kevin Verhoeven <[email protected]> wrote:

> I'm not clear on how much data you are processing; does the data (.dat)
> file have 300,000 rows?
>
> Kevin
>
> *From:* prabhu Mahendran [mailto:[email protected]]
> *Sent:* Wednesday, October 19, 2016 2:05 AM
> *To:* [email protected]
> *Subject:* Re: How to increase the processing speed of the ExtractText
> and ReplaceText Processor?
>
> Mark,
>
> Thanks for the response.
>
> My sample input data (.dat) is like below:
>
> 1|2|3|4
> 6|7|8|9
> 11|12|13|14
>
> In ExtractText, I have added only the input-row property, in addition to
> the default properties, as in the screenshot below.
> [image: Inline image 1]
>
> In ReplaceText, I just replace the value like:
> {"data1":"${inputrow.1}","data2":"${inputrow.2}","data3":"${inputrow.3}","data4":"${inputrow.4}"}
>
> [image: Inline image 2]
>
> There are no bulletins indicating back pressure on the processors.
>
> Can I know the prerequisites needed to move the 300,000 rows into SQL
> Server within 10-20 minutes?
> How many CPUs are needed?
> How much heap and PermGen size do we need to set to move that data into
> SQL Server?
>
> Thanks
>
> On Tue, Oct 18, 2016 at 7:05 PM, Mark Payne <[email protected]> wrote:
>
> Prabhu,
>
> Thanks for the details. All of this seems fairly normal. Given that you
> have only a single core, I don't think multiple concurrent tasks will
> help you. Can you share your configuration for ExtractText and
> ReplaceText? Depending on the regexes being used, they can be extremely
> expensive to evaluate. The regex that you mentioned in the other email,
> "(.+)[|](.+)[|](.+)[|](.+)", is in fact extremely expensive. Any time
> that you have ".*" or ".+" in your regex, it is going to be extremely
> expensive, especially with longer FlowFile content.
>
> Also, do you see any bulletins indicating that the provenance repository
> is applying backpressure? Given that you are splitting your FlowFiles
> into individual lines, the provenance repository may be under a lot of
> pressure.
>
> Another thing to check is how much garbage collection is occurring. This
> can certainly destroy your performance quickly. You can get this
> information by going to the "Summary Table" in the top-right of the UI
> and then clicking the "System Diagnostics" link in the bottom-right
> corner of that Summary Table.
>
> Thanks
> -Mark
>
> On Oct 18, 2016, at 1:31 AM, prabhu Mahendran <[email protected]>
> wrote:
>
> Mark,
>
> Thanks for your response.
>
> Please find the responses to your questions below.
> ==> The first processor that you see that exhibits poor performance is
> ExtractText, correct?
> Yes, ExtractText exhibits poor performance.
>
> ==> How big is your Java heap?
> I have set 1 GB for the Java heap.
>
> ==> Do you have back pressure configured on the connection between
> ExtractText and ReplaceText?
> There is no back pressure configured between ExtractText and ReplaceText.
>
> ==> When you say that you specify concurrent tasks, what are you
> configuring the concurrent tasks to be?
> I have set Concurrent Tasks to 2 for the ExtractText processor due to its
> slow processing rate; this is specified in the Concurrent Tasks text box.
>
> ==> Have you changed the maximum number of concurrent tasks available to
> your dataflow?
> No, I haven't changed it.
>
> ==> How many CPUs are available on this machine?
> Only a single CPU is available on this machine: a Core i5 @ 2.20 GHz.
>
> ==> Are these the only processors in your flow, or do you have other
> dataflows going on in the same instance of NiFi?
> Yes, these are the only processors in the flow, and no other dataflows
> are running in this instance.
>
> Thanks
>
> On Mon, Oct 17, 2016 at 6:08 PM, Mark Payne <[email protected]> wrote:
>
> Prabhu,
>
> Certainly, the performance that you are seeing, taking 4-5 hours to move
> 3M rows into SQL Server, is far from ideal, but the good news is that it
> is also far from typical. You should be able to see far better results.
>
> To help us understand what is limiting the performance, and to make sure
> that we understand what you are seeing, I have a series of questions that
> would help us to understand what is going on.
>
> The first processor that you see that exhibits poor performance is
> ExtractText, correct?
> Can you share the configuration that you have for that processor?
>
> How big is your Java heap?
> This is configured in conf/bootstrap.conf; by default it is configured
> as:
>
> java.arg.2=-Xms512m
> java.arg.3=-Xmx512m
>
> Do you have backpressure configured on the connection between ExtractText
> and ReplaceText?
>
> Also, when you say that you specify concurrent tasks, what are you
> configuring the concurrent tasks to be? Have you changed the maximum
> number of concurrent tasks available to your dataflow? By default, NiFi
> will use only 10 threads max. How many CPUs are available on this machine?
>
> And finally, are these the only processors in your flow, or do you have
> other dataflows going on in the same instance of NiFi?
>
> Thanks
> -Mark
>
> On Oct 17, 2016, at 3:35 AM, prabhu Mahendran <[email protected]>
> wrote:
>
> Hi All,
>
> I have tried to perform the below operation:
>
> dat file (input) --> JSON --> SQL --> SQL Server
>
> GetFile --> SplitText --> SplitText --> ExtractText --> ReplaceText -->
> ConvertJsonToSQL --> PutSQL
>
> My input file (.dat) has 300,000 rows.
>
> *Objective:* Move the data from the '.dat' file into SQL Server.
>
> I am able to store the data in SQL Server using the combination of
> processors above, but it takes almost 4-5 hours to move the complete data
> into SQL Server.
>
> The combination of SplitTexts reads the data quickly, but ExtractText
> takes a long time to match the given data against the user-defined
> expression. Even if the input is 107 MB, it sends outputs only KB in
> size; the ReplaceText processor also processes data only KB at a time.
>
> This slow processing leads to more time taken to move the data into SQL
> Server.
>
> The ExtractText, ReplaceText, and ConvertJsonToSQL processors send
> outgoing FlowFiles in kilobytes only.
>
> If I specify concurrent tasks for ExtractText, ReplaceText, and
> ConvertJsonToSQL, they occupy 100% of the CPU and disk usage.
>
> It is just 30 MB of data, but the processors take 6 hours to move it into
> SQL Server.
>
> The problems faced are:
>
> 1.
It takes almost 6 hours to move the 3 lakh (300,000) rows into SQL Server.
> 2. ExtractText and ReplaceText take a long time to process the data (they
> send output FlowFiles only KB in size).
>
> Can anyone help me solve the below *requirement*?
>
> I need to reduce the time taken by the processors to move the lakhs of
> rows into SQL Server.
>
> If I have done anything wrong, please help me do it right.
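Mark's point above about `(.+)[|](.+)[|](.+)[|](.+)` being expensive can be illustrated with a minimal sketch: replacing each greedy `.+` group with `[^|]+` (anything except the delimiter) extracts the same four fields without the backtracking that `.+` forces. Python's `re` is used here for convenience; NiFi uses Java's regex engine, which backtracks similarly.

```python
import re

# The regex from the thread: greedy (.+) groups force heavy backtracking,
# since each group first swallows the whole rest of the line.
slow = re.compile(r"(.+)[|](.+)[|](.+)[|](.+)")

# A cheaper alternative: [^|]+ cannot cross a delimiter, so each group is
# found in a single left-to-right pass with no backtracking.
fast = re.compile(r"([^|]+)\|([^|]+)\|([^|]+)\|([^|]+)")

line = "1|2|3|4"
print(fast.match(line).groups())         # ('1', '2', '3', '4')

# The two regexes also disagree when extra delimiters appear: the greedy
# first group grabs as much as it can, swallowing "a|b".
print(slow.match("a|b|c|d|e").groups())  # ('a|b', 'c', 'd', 'e')
print(fast.match("a|b|c|d|e").groups())  # ('a', 'b', 'c', 'd')
```

This also suggests why ExtractText on a whole multi-line file (rather than a single split line) would be so slow: the greedy groups re-scan the entire content on every match attempt.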

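For reference, the heap settings Mark points to live in conf/bootstrap.conf; raising them from the 512 MB defaults means editing the two java.arg lines. A sketch (the 2g values below are illustrative, not a recommendation from this thread):

```properties
# conf/bootstrap.conf -- JVM heap settings (defaults are 512m, per Mark's email)
java.arg.2=-Xms2g
java.arg.3=-Xmx2g
```

NiFi must be restarted for bootstrap.conf changes to take effect.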