Re: How to increase the processing speed of the ExtractText and ReplaceText Processor?

Lee Laim Mon, 17 Oct 2016 23:13:15 -0700

Prabhu, 

You might also try replacing ExtractText with a series of ExecuteStreamCommand 
processors that perform system calls (sed/awk/grep or the Windows equivalents) 
on the flowfile contents.  You can even write the result directly to a 
flowfile attribute.
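
As a rough sketch of the ExecuteStreamCommand idea, here is a small Python
script standing in for sed/awk (Python is a substitution on my part). It
assumes the processor pipes the flowfile content to the command's standard
input and captures its standard output. The delimiter, the field position,
and the function names are hypothetical, since the .dat layout was not posted.

```python
import sys

DELIMITER = "|"   # assumed delimiter; the real .dat layout was not posted
FIELD_INDEX = 1   # assumed zero-based position of the wanted field


def extract_field(line, index=FIELD_INDEX, delimiter=DELIMITER):
    """Return the field at `index` from one delimited line, or "" if missing."""
    fields = line.rstrip("\r\n").split(delimiter)
    return fields[index] if 0 <= index < len(fields) else ""


def extract_fields(lines, index=FIELD_INDEX, delimiter=DELIMITER):
    """Yield the chosen field for every input line."""
    for line in lines:
        yield extract_field(line, index, delimiter)


# When run as a command with piped input, read the flowfile content from
# stdin and write one extracted value per line to stdout.
if __name__ == "__main__" and not sys.stdin.isatty():
    for value in extract_fields(sys.stdin):
        print(value)
```

A plain string split avoids regular expressions entirely, which is the main
reason this kind of external command can be cheaper than a wildcard-heavy
pattern.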


I suspect there are wildcards in your ExtractText regex that are taking a while 
to buffer and compare.  
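
To illustrate the point about wildcards, here is a minimal Python sketch.
ExtractText itself uses Java regular expressions, and the sample row and its
pipe-delimited layout are assumptions, since the actual configuration was not
shared. It compares a pattern built from ".*" groups with one built from
negated character classes.

```python
import re

# Hypothetical pipe-delimited row; the real .dat layout was not posted.
line = "1001|John Smith|2016-10-17|42.50"

# Wildcard-heavy pattern: each greedy ".*" first consumes the rest of the
# line, then backtracks until the following "\|" can match.
greedy = re.compile(r"^(.*)\|(.*)\|(.*)\|(.*)$")

# Tighter pattern: "[^|]*" stops at the next delimiter, so the engine does
# not have to backtrack across fields.
tight = re.compile(r"^([^|]*)\|([^|]*)\|([^|]*)\|([^|]*)$")

greedy_fields = greedy.match(line).groups()
tight_fields = tight.match(line).groups()

# Both patterns capture the same four fields for this row.
print(tight_fields)
```

On well-formed rows the two patterns capture the same fields, but the second
one does less work per row, and that difference adds up over many rows.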

Lee 

On Oct 18, 2016, at 2:31 PM, prabhu Mahendran wrote:

> Mark,
> 
> Thanks for your response.
> 
> Please find the response for your questions.
> 
> ==>The first processor that you see that exhibits poor performance is 
> ExtractText, correct?
>     Yes, ExtractText exhibits poor performance.
> 
> ==>How big is your Java heap?
>     I have set the Java heap to 1 GB.
> 
> ==>Do you have back pressure configured on the connection between ExtractText 
> and ReplaceText?
>     There is no back pressure configured between ExtractText and 
> ReplaceText.
> 
> ==>when you say that you specify concurrent tasks, what are you configuring 
> the concurrent tasks
> to be?
>     I have set concurrent tasks to 2 for the ExtractText processor because 
> of its slow processing rate. This is set in the Concurrent Tasks text box.
> 
> ==>Have you changed the maximum number of concurrent tasks available to your 
> dataflow?
>     No, I haven't changed it.
> 
> ==>How many CPU's are available on this machine?
>     Only a single CPU is available on this machine, a Core i5 processor 
> @ 2.20 GHz.
> 
> ==> Are these the only processors in your flow, or do you have other 
> dataflows going on in the
> same instance as NiFi?
>     Yes, these are the only processors running in the workflow, and no 
> other instances are running.
> 
> Thanks
> 
>> On Mon, Oct 17, 2016 at 6:08 PM, Mark Payne wrote:
>> Prabhu,
>> 
>> Certainly, the performance that you are seeing, taking 4-5 hours to move 3M 
>> rows into SQLServer is far from
>> ideal, but the good news is that it is also far from typical. You should be 
>> able to see far better results.
>> 
>> To help us understand what is limiting the performance, and to make sure 
>> that we understand what you are seeing, 
>> I have a series of questions that would help us to understand what is going 
>> on.
>> 
>> The first processor that you see that exhibits poor performance is 
>> ExtractText, correct?
>> Can you share the configuration that you have for that processor?
>> 
>> How big is your Java heap? This is configured in conf/bootstrap.conf; by 
>> default it is configured as:
>> java.arg.2=-Xms512m
>> java.arg.3=-Xmx512m
>> 
>> Do you have backpressure configured on the connection between ExtractText 
>> and ReplaceText?
>> 
>> Also, when you say that you specify concurrent tasks, what are you 
>> configuring the concurrent tasks
>> to be? Have you changed the maximum number of concurrent tasks available to 
>> your dataflow? By default, NiFi will
>> use only 10 threads max. How many CPU's are available on this machine?
>> 
>> And finally, are these the only processors in your flow, or do you have 
>> other dataflows going on in the
>> same instance as NiFi?
>> 
>> Thanks
>> -Mark
>> 
>> 
>>> On Oct 17, 2016, at 3:35 AM, prabhu Mahendran wrote:
>>> 
>>> Hi All,
>>> 
>>> I have tried to perform the below operation.
>>> 
>>> dat file(input)-->JSON-->SQL-->SQLServer
>>> 
>>> 
>>> GetFile-->SplitText-->SplitText-->ExtractText-->ReplaceText-->ConvertJsonToSQL-->PutSQL.
>>> 
>>> My input file (.dat) has 3,00,000 rows (3 lakh).
>>> 
>>> Objective: Move the data from '.dat' file into SQLServer.
>>> 
>>> I am able to store the data in SQL Server using the combination of 
>>> processors above, but it takes almost 4-5 hrs to move the complete data 
>>> into SQL Server.
>>> 
>>> The SplitText processors read the data quickly, but ExtractText takes a 
>>> long time to match the data against the user-defined expression. Although 
>>> the input is 107 MB, ExtractText sends its output in KB sizes only, and the 
>>> ReplaceText processor also processes data in KB sizes only.
>>> 
>>> This slow processing increases the time taken to load the data into 
>>> SQL Server.
>>> 
>>> 
>>> The ExtractText, ReplaceText, and ConvertJsonToSQL processors send 
>>> outgoing flowfiles in kilobytes only.
>>> 
>>> If I specify concurrent tasks for ExtractText, ReplaceText, and 
>>> ConvertJsonToSQL, they use 100% of the CPU and disk.
>>> 
>>> It is just 30 MB of data, but the processors take 6 hrs to move it into 
>>> SQL Server.
>>> 
>>> The problems I face are:
>>> 
>>>        Almost 6 hrs are taken to move the 3 lakh rows into SQL Server.
>>>        ExtractText and ReplaceText take a long time to process data (they 
>>>        send output flowfiles in KB sizes only).
>>> 
>>> Can anyone help me with the requirement below?
>>> 
>>> I need to reduce the time taken by the processors to move lakhs of rows 
>>> into SQL Server.
>>> 
>>> 
>>> 
>>> If I have done anything wrong, please help me do it right.
>
