Prabhu,

To move 3M rows in 10 minutes, you'll need to process 5,000 rows/second. During your 4-hour run, you were processing ~200 rows/second.
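Spelling out that arithmetic (a minimal sketch; the 3M-row count is my assumption from this thread, and 200 rows/second/thread is the observed rate):

```python
# Back-of-envelope throughput math for the flow discussed in this thread.
rows = 3_000_000
target_minutes = 10

required_rate = rows / (target_minutes * 60)   # rows/second needed to finish in 10 min
observed_rate = rows / (4 * 3600)              # rows/second over the reported 4-hour run
threads_needed = required_rate / 200           # threads at ~200 rows/second per thread

print(required_rate)          # 5000.0
print(round(observed_rate))   # 208
print(threads_needed)         # 25.0
```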
Without any new optimizations you'll need ~25 threads and sufficient memory to feed them. I agree with Mark: you should be able to get far more than 200 rows/second. I ran a quick test using your ExtractText regex on similar data and was able to process over 100,000 rows/minute through the ExtractText processor. The input data was a single row of 4 fields delimited by the "|" symbol. *You might be processing the entire .dat file (instead of a single row) for each record.* *Can you check the FlowFile attributes and content going into ExtractText?*

Here is the flow with some notes:

1. GetFile (a 30 MB .dat file consisting of 3M rows; each row is about 10 bytes)
2. SplitText -> SplitText (to break the 3M rows down to manageable chunks of 10,000 lines per flow file, then split again to 1 line per flow file)
3. ExtractText to extract the 4 fields
4. ReplaceText to generate JSON (you can alternatively use AttributesToJSON here)
5. ConvertJSONtoSQL
6. PutSQL (this should be the true bottleneck; index the DB well and use many threads)

If my assumptions are incorrect, please let me know.

Thanks,
Lee

On Thu, Oct 20, 2016 at 1:43 AM, Kevin Verhoeven <[email protected]> wrote:

> I'm not clear on how much data you are processing; does the data (.dat)
> file have 300,000 rows?
>
> Kevin
>
> *From:* prabhu Mahendran [mailto:[email protected]]
> *Sent:* Wednesday, October 19, 2016 2:05 AM
> *To:* [email protected]
> *Subject:* Re: How to increase the processing speed of the ExtractText
> and ReplaceText Processor?
>
> Mark,
>
> Thanks for the response.
>
> My sample input data (.dat) is like below:
>
> 1|2|3|4
> 6|7|8|9
> 11|12|13|14
>
> In ExtractText, I have added only the input-row property, in addition to
> the default properties, as in the screenshot below.
> [image: Inline image 1]
>
> In ReplaceText, I just replace the value like:
> {"data1":"${inputrow.1}","data2":"${inputrow.2}","data3":"${inputrow.3}","data4":"${inputrow.4}"}
>
> [image: Inline image 2]
>
> There are no bulletins indicating back pressure on the processors.
>
> Can I know the prerequisites needed to move the 300,000 rows into SQL
> Server within 10-20 minutes?
> How many CPUs are needed?
> How much heap and PermGen size do we need to set to move that data into
> SQL Server?
>
> Thanks
>
> On Tue, Oct 18, 2016 at 7:05 PM, Mark Payne <[email protected]> wrote:
>
> Prabhu,
>
> Thanks for the details. All of this seems fairly normal. Given that you
> have only a single core, I don't think multiple concurrent tasks will
> help you. Can you share your configuration for ExtractText and
> ReplaceText? Depending on the regexes being used, they can be extremely
> expensive to evaluate. The regex that you mentioned in the other email,
> "(.+)[|](.+)[|](.+)[|](.+)", is in fact extremely expensive. Any time
> that you have ".*" or ".+" in your regex, it is going to be extremely
> expensive, especially with longer FlowFile content.
>
> Also, do you see any bulletins indicating that the provenance repository
> is applying backpressure? Given that you are splitting your FlowFiles
> into individual lines, the provenance repository may be under a lot of
> pressure.
>
> Another thing to check is how much garbage collection is occurring. This
> can certainly destroy your performance quickly. You can get this
> information by going to the "Summary Table" in the top-right of the UI
> and then clicking the "System Diagnostics" link in the bottom-right
> corner of that Summary Table.
>
> Thanks
> -Mark
>
> On Oct 18, 2016, at 1:31 AM, prabhu Mahendran <[email protected]>
> wrote:
>
> Mark,
>
> Thanks for your response.
>
> Please find the responses to your questions below.
> ==> The first processor that you see that exhibits poor performance is
> ExtractText, correct?
> Yes, ExtractText exhibits poor performance.
>
> ==> How big is your Java heap?
> I have set 1 GB for the Java heap.
>
> ==> Do you have back pressure configured on the connection between
> ExtractText and ReplaceText?
> There is no back pressure configured between ExtractText and ReplaceText.
>
> ==> When you say that you specify concurrent tasks, what are you
> configuring the concurrent tasks to be?
> I have set Concurrent Tasks to 2 for the ExtractText processor due to its
> slow processing rate; this is specified in the Concurrent Tasks text box.
>
> ==> Have you changed the maximum number of concurrent tasks available to
> your dataflow?
> No, I haven't changed it.
>
> ==> How many CPUs are available on this machine?
> Only a single CPU is available on this machine: a Core i5 @ 2.20 GHz.
>
> ==> Are these the only processors in your flow, or do you have other
> dataflows going on in the same instance of NiFi?
> Yes, these are the only processors in the flow, and no other dataflows
> are running in this instance.
>
> Thanks
>
> On Mon, Oct 17, 2016 at 6:08 PM, Mark Payne <[email protected]> wrote:
>
> Prabhu,
>
> Certainly, the performance that you are seeing, taking 4-5 hours to move
> 3M rows into SQL Server, is far from ideal, but the good news is that it
> is also far from typical. You should be able to see far better results.
>
> To help us understand what is limiting the performance, and to make sure
> that we understand what you are seeing, I have a series of questions that
> would help us to understand what is going on.
>
> The first processor that you see that exhibits poor performance is
> ExtractText, correct?
> Can you share the configuration that you have for that processor?
>
> How big is your Java heap?
> This is configured in conf/bootstrap.conf; by default it is configured
> as:
>
> java.arg.2=-Xms512m
> java.arg.3=-Xmx512m
>
> Do you have backpressure configured on the connection between ExtractText
> and ReplaceText?
>
> Also, when you say that you specify concurrent tasks, what are you
> configuring the concurrent tasks to be? Have you changed the maximum
> number of concurrent tasks available to your dataflow? By default, NiFi
> will use only 10 threads max. How many CPUs are available on this machine?
>
> And finally, are these the only processors in your flow, or do you have
> other dataflows going on in the same instance of NiFi?
>
> Thanks
> -Mark
>
> On Oct 17, 2016, at 3:35 AM, prabhu Mahendran <[email protected]>
> wrote:
>
> Hi All,
>
> I have tried to perform the below operation:
>
> dat file (input) --> JSON --> SQL --> SQL Server
>
> GetFile --> SplitText --> SplitText --> ExtractText --> ReplaceText -->
> ConvertJsonToSQL --> PutSQL
>
> My input file (.dat) has 300,000 rows.
>
> *Objective:* Move the data from the '.dat' file into SQL Server.
>
> I am able to store the data in SQL Server using the combination of
> processors above, but it takes almost 4-5 hours to move the complete data
> into SQL Server.
>
> The combination of SplitTexts reads the data quickly, but ExtractText
> takes a long time to match the given data against the user-defined
> expression. Even if the input is 107 MB, it sends outputs only KB in
> size; the ReplaceText processor also processes data only KB at a time.
>
> This slow processing leads to more time taken to move the data into SQL
> Server.
>
> The ExtractText, ReplaceText, and ConvertJsonToSQL processors send
> outgoing FlowFiles in kilobytes only.
>
> If I specify concurrent tasks for ExtractText, ReplaceText, and
> ConvertJsonToSQL, they occupy 100% of the CPU and disk usage.
>
> It is just 30 MB of data, but the processors take 6 hours to move it into
> SQL Server.
>
> The problems faced are:
>
> 1.
It takes almost 6 hours to move the 3 lakh (300,000) rows into SQL Server.
> 2. ExtractText and ReplaceText take a long time to process the data (they
> send output FlowFiles only KB in size).
>
> Can anyone help me solve the below *requirement*?
>
> I need to reduce the time taken by the processors to move the lakhs of
> rows into SQL Server.
>
> If I have done anything wrong, please help me do it right.
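Mark's point above about `(.+)[|](.+)[|](.+)[|](.+)` being expensive can be illustrated with a minimal sketch: replacing each greedy `.+` group with `[^|]+` (anything except the delimiter) extracts the same four fields without the backtracking that `.+` forces. Python's `re` is used here for convenience; NiFi uses Java's regex engine, which backtracks similarly.

```python
import re

# The regex from the thread: greedy (.+) groups force heavy backtracking,
# since each group first swallows the whole rest of the line.
slow = re.compile(r"(.+)[|](.+)[|](.+)[|](.+)")

# A cheaper alternative: [^|]+ cannot cross a delimiter, so each group is
# found in a single left-to-right pass with no backtracking.
fast = re.compile(r"([^|]+)\|([^|]+)\|([^|]+)\|([^|]+)")

line = "1|2|3|4"
print(fast.match(line).groups())         # ('1', '2', '3', '4')

# The two regexes also disagree when extra delimiters appear: the greedy
# first group grabs as much as it can, swallowing "a|b".
print(slow.match("a|b|c|d|e").groups())  # ('a|b', 'c', 'd', 'e')
print(fast.match("a|b|c|d|e").groups())  # ('a', 'b', 'c', 'd')
```

This also suggests why ExtractText on a whole multi-line file (rather than a single split line) would be so slow: the greedy groups re-scan the entire content on every match attempt.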

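For reference, the heap settings Mark points to live in conf/bootstrap.conf; raising them from the 512 MB defaults means editing the two java.arg lines. A sketch (the 2g values below are illustrative, not a recommendation from this thread):

```properties
# conf/bootstrap.conf -- JVM heap settings (defaults are 512m, per Mark's email)
java.arg.2=-Xms2g
java.arg.3=-Xmx2g
```

NiFi must be restarted for bootstrap.conf changes to take effect.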