Re: How to increase the processing speed of the ExtractText and ReplaceText Processor?

prabhu Mahendran Mon, 17 Oct 2016 23:56:01 -0700

Lee,

Thanks for your idea.



I have one question about the ExecuteStreamCommand processor, which needs a
Command Path and an Argument Delimiter.

I have set this regex, (.+)[|](.+)[|](.+)[|](.+), in the ExtractText processor.
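
As a rough illustration outside NiFi (plain Python, not processor
configuration), the sketch below shows what this pattern captures on a
pipe-delimited line and compares it with a variant built on negated character
classes. The sample row and the BOUNDED pattern are assumptions, and Python's
re engine is not the java.util.regex engine that ExtractText uses, so treat
the behaviour as indicative only.

```python
import re

# The pattern from the message: four greedy groups separated by literal pipes.
GREEDY = re.compile(r"(.+)[|](.+)[|](.+)[|](.+)")

# Assumed alternative: each of the first three fields stops at the next pipe,
# so the engine has very little to backtrack over.
BOUNDED = re.compile(r"([^|]+)[|]([^|]+)[|]([^|]+)[|](.+)")


def fields(pattern, line):
    """Return the captured groups as a tuple, or None if the line does not match."""
    match = pattern.fullmatch(line)
    return match.groups() if match else None


sample = "101|widget|4|2016-10-17"  # hypothetical row
print(fields(GREEDY, sample))
print(fields(BOUNDED, sample))
print(tuple(sample.split("|")))  # plain split, no regex at all
```

On a row with exactly four fields both patterns and the plain split agree.
With a fifth field the greedy pattern folds the extra pipe into the first
group, while the bounded one leaves it in the last group.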

How can I pass this regex to the ExecuteStreamCommand processor?

Or is there another processor that offers the same functionality as the
ExtractText processor?
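
One possible shape for the ExecuteStreamCommand route, sketched under
assumptions: Command Path would point at a Python interpreter and Command
Arguments at a script like the one below, which would read the flowfile
content from stdin and write JSON to stdout, standing in for ExtractText plus
ReplaceText. The regex itself is not handed to the processor; the script does
the splitting. The column names, the helper names and the
one-JSON-object-per-line output are all hypothetical.

```python
import io
import json

# Hypothetical column names; replace them with the real table columns.
COLUMNS = ("id", "name", "qty", "created")


def line_to_json(line, columns=COLUMNS):
    """Convert one pipe-delimited line into a flat JSON object string."""
    values = line.rstrip("\r\n").split("|", len(columns) - 1)
    if len(values) != len(columns):
        raise ValueError("expected %d fields, got %d" % (len(columns), len(values)))
    return json.dumps(dict(zip(columns, values)))


def convert(src, dst):
    """Copy every non-blank line from src to dst as one JSON object per line."""
    for line in src:
        if line.strip():
            dst.write(line_to_json(line) + "\n")


# Demo with in-memory streams. Under ExecuteStreamCommand the call would be
# convert(sys.stdin, sys.stdout), assuming the processor pipes the flowfile
# content to stdin and takes stdout as the new content.
demo_in = io.StringIO("101|widget|4|2016-10-17\n")
demo_out = io.StringIO()
convert(demo_in, demo_out)
print(demo_out.getvalue(), end="")
```

Splitting on a fixed delimiter avoids regex evaluation entirely, which may
matter if the pattern matching is where the time is going.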

Thanks

On Tue, Oct 18, 2016 at 11:42 AM, Lee Laim <[email protected]> wrote:

>
> Prabhu,
>
> You might also try to replace ExtractText with a series of
> ExecuteStreamCommand processors that perform system calls (sed/awk/grep or
> the Windows equivalents) on the flowfile's contents.  You can even write the
> result directly to a flowfile attribute.
>
> I suspect there are wildcards in your ExtractText regex that are taking a
> while to buffer and compare.
>
> Lee
>
> On Oct 18, 2016, at 2:31 PM, prabhu Mahendran <[email protected]>
> wrote:
>
> Mark,
>
> Thanks for your response.
>
> Please find the response for your questions.
>
> ==>The first processor that you see that exhibits poor performance is
> ExtractText, correct?
>     Yes, ExtractText exhibits poor performance.
>
> ==>How big is your Java heap?
>     I have set the Java heap to 1 GB.
>
> ==>Do you have back pressure configured on the connection between
> ExtractText and ReplaceText?
>     There is no back pressure configured between ExtractText and
>     ReplaceText.
>
> ==>when you say that you specify concurrent tasks, what are you
> configuring the concurrent tasks
> to be?
>     I have set concurrent tasks to 2 for the ExtractText processor
>     because of its slow processing rate. This is set in the
>     Concurrent Tasks text box.
>
> ==>Have you changed the maximum number of concurrent tasks available to
> your dataflow?
>     No, I haven't changed it.
>
> ==>How many CPUs are available on this machine?
>     Only a single CPU is available on this machine, a Core i5
>     processor @ 2.20 GHz.
>
> ==> Are these the only processors in your flow, or do you have other
> dataflows going on in the
> same instance of NiFi?
>     Yes, this is the only flow that is running, and no other instances
>     are running.
>
> Thanks
>
> On Mon, Oct 17, 2016 at 6:08 PM, Mark Payne <[email protected]> wrote:
>
>> Prabhu,
>>
>> Certainly, the performance that you are seeing, taking 4-5 hours to move
>> 3M rows into SQLServer is far from
>> ideal, but the good news is that it is also far from typical. You should
>> be able to see far better results.
>>
>> To help us understand what is limiting the performance, and to make sure
>> that we understand what you are seeing,
>> I have a series of questions that would help us to understand what is
>> going on.
>>
>> The first processor that you see that exhibits poor performance is
>> ExtractText, correct?
>> Can you share the configuration that you have for that processor?
>>
>> How big is your Java heap? This is configured in conf/bootstrap.conf; by
>> default it is configured as:
>> java.arg.2=-Xms512m
>> java.arg.3=-Xmx512m
>>
>> Do you have backpressure configured on the connection between ExtractText
>> and ReplaceText?
>>
>> Also, when you say that you specify concurrent tasks, what are you
>> configuring the concurrent tasks
>> to be? Have you changed the maximum number of concurrent tasks available
>> to your dataflow? By default, NiFi will
>> use only 10 threads max. How many CPU's are available on this machine?
>>
>> And finally, are these the only processors in your flow, or do you have
>> other dataflows going on in the
>> same instance as NiFi?
>>
>> Thanks
>> -Mark
>>
>>
>> On Oct 17, 2016, at 3:35 AM, prabhu Mahendran <[email protected]>
>> wrote:
>>
>> Hi All,
>>
>> I have tried to perform the below operation.
>>
>> dat file(input)-->JSON-->SQL-->SQLServer
>>
>>
>> GetFile-->SplitText-->SplitText-->ExtractText-->ReplaceText-->
>> ConvertJsonToSQL-->PutSQL.
>>
>> My input file (.dat) has 3,00,000 rows (3 lakh, i.e. 300,000).
>>
>> *Objective:* Move the data from the '.dat' file into SQL Server.
>>
>> I am able to store the data in SQL Server by using the combination of
>> processors above, but it takes almost 4-5 hrs to move all of the data
>> into SQL Server.
>>
>> The combination of SplitText processors reads the data quickly, but
>> ExtractText takes a long time to match the incoming data against the
>> user-defined expression. Although the input is 107 MB, it emits output
>> only in KB-sized amounts, and the ReplaceText processor likewise
>> processes data only in KB-sized amounts.
>>
>> This slow processing increases the time taken to get the data into
>> SQL Server.
>>
>>
>> The ExtractText, ReplaceText, and ConvertJsonToSQL processors send
>> outgoing flowfiles in kilobytes only.
>>
>> If I specify concurrent tasks for ExtractText, ReplaceText, and
>> ConvertJsonToSQL, they occupy 100% of the CPU and disk.
>>
>> It is just 30 MB of data, but the processors take 6 hrs to move it into
>> SQL Server.
>>
>> The problems I face are:
>>
>>
>>    1. It takes almost 6 hrs to move the 3 lakh rows into SQL Server.
>>    2. ExtractText and ReplaceText take a long time to process the data
>>       (they send output flowfiles in KB sizes only).
>>
>> Can anyone help me with the *requirement* below?
>>
>> I need to reduce the time the processors take to move lakhs of rows
>> into SQL Server.
>>
>>
>>
>> If I have done anything wrong, please help me do it right.
