RE: ConvertCSVToAvro taking a lot of time when passing single record as an input.

Mohit Mon, 02 Apr 2018 06:23:18 -0700

Hi Mark,


The main reason for using this flow is to track bad records. The records that 
don’t get converted should be tracked somewhere. For that purpose I’m using 
the split-merge approach.

 

Meanwhile, I was able to improve performance by increasing ‘Concurrent 
Tasks’ to 1000. ConvertCSVToAvro now converts 6-7k records per second, which 
is not optimal but much better than 45-50 records per second.
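
To put those rates in perspective, here is a rough back-of-the-envelope sketch 
in Python. It assumes the 64000-record file from my original message and a 
steady conversion rate; the helper name seconds_to_convert is made up for 
this illustration.

```python
def seconds_to_convert(records: int, rate_per_sec: float) -> float:
    """Time needed to convert `records` at a steady `rate_per_sec`."""
    return records / rate_per_sec


RECORDS = 64_000  # size of the test file mentioned in the thread

# (label, slowest observed rate, fastest observed rate) in records per second
observed_rates = [
    ("1 concurrent task", 45, 50),
    ("1000 concurrent tasks", 6_000, 7_000),
]

for label, low, high in observed_rates:
    best = seconds_to_convert(RECORDS, high)
    worst = seconds_to_convert(RECORDS, low)
    print(f"{label}: roughly {best:.0f} to {worst:.0f} seconds")
```

So the file went from over twenty minutes to around ten seconds, which is 
still slower than the few seconds the flow takes without SplitText.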

 

Is there any other improvement I can make?

 

Mohit

 

From: Mark Payne <[email protected]> 
Sent: 02 April 2018 18:30
To: [email protected]
Subject: Re: ConvertCSVToAvro taking a lot of time when passing single record 
as an input.

 

Mohit, 

 

I agree that 45-50 records per second is quite slow. I'm not very familiar with 
the implementation of ConvertCSVToAvro, but it may well be that it must perform 
some sort of initialization for each FlowFile that it receives, which would 
explain why it's fast for a single incoming FlowFile and slow for a large number.

 

Additionally, when you start splitting the data like that, you're generating a 
lot more FlowFiles, which means a lot more updates to both the FlowFile 
Repository and the Provenance Repository. As a result, you're basically taxing 
the NiFi framework far more than if you keep the data as a single FlowFile. On 
my laptop, though, I would expect more than 45-50 FlowFiles per second through 
most processors, but I don't know what kind of hardware you are running on.

 

In general, though, it is best to keep data together instead of splitting it 
apart. Since ConvertCSVToAvro can handle many CSV records, is there a reason to 
split the data to begin with? Also, I would recommend you look at using the 
Record-based processors [1][2] such as ConvertRecord instead of the 
ConvertABCtoXYZ processors. Those are older processors and often don't work as 
well. The Record-oriented processors often allow you to keep data together as a 
single FlowFile throughout your entire flow, which makes the performance far 
better and makes the flow much easier to design.
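
To illustrate the general idea of keeping the data together while still 
tracking bad records, here is a minimal Python sketch. It is not NiFi code and 
does not reflect how ConvertRecord is implemented. The function name 
partition_records and the column-count check are assumptions chosen only for 
illustration; a real flow would validate against its own schema.

```python
import csv
import io


def partition_records(csv_text, expected_columns):
    """Read all rows in one pass and separate good rows from bad ones.

    A row counts as "bad" here if it does not have the expected number of
    columns (an illustrative rule, not NiFi's validation logic).
    """
    good, bad = [], []
    reader = csv.reader(io.StringIO(csv_text))
    for row in reader:
        if len(row) == expected_columns:
            good.append(row)
        else:
            # Keep the source line number so the bad record can be traced.
            bad.append((reader.line_num, row))
    return good, bad


sample = "a,b,c\n1,2\nx,y,z\n"
good, bad = partition_records(sample, expected_columns=3)
print(len(good), "good rows;", len(bad), "bad rows:", bad)
```

The whole input is handled as one unit, and the bad rows come out in their own 
collection, so nothing has to be split into one record per file in order to 
find them.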

 

Thanks

-Mark

 

 

 

[1] https://blogs.apache.org/nifi/entry/record-oriented-data-with-nifi

[2] https://bryanbende.com/development/2017/06/20/apache-nifi-records-and-schema-registries

 





On Apr 2, 2018, at 8:49 AM, Mohit <[email protected]> wrote:

 

Hi,

 

I’m trying to capture bad records from the ConvertCSVToAvro processor. For that, 
I’m using two SplitText processors in a row: the first creates chunks and the 
second puts each record in its own FlowFile.

 

My flow is: ListFile -> FetchFile -> SplitText(10000 records) -> SplitText(1 
record) -> ConvertCSVToAvro -> *(further processing)

 

I have a 10 MB file with 15 columns per row and 64000 records. The normal flow 
(without SplitText) completes in a few seconds. But when I use the above 
flow, the ConvertCSVToAvro processor is drastically slower (45-50 rec/sec).

I can’t work out what I’m doing wrong in the flow. 

 

I’m using NiFi 1.5.0.

 

Any quick input would be appreciated.

 

 

 

Thanks,

Mohit
