Hi Mike,

 

I intentionally did this, just to check how the processor handles invalid records.

 

Thanks,

Mohit

From: Mike Thomsen <[email protected]> 
Sent: 03 April 2018 18:17
To: [email protected]
Subject: Re: ConvertCSVToAvro taking a lot of time when passing single record 
as an input.

 

Mohit,

 

Looking at your schema: 

 

{
    "type": "record",
    "name": "test",
    "namespace": "test",
    "fields": [{
        "name": "name",
        "type": ["null", "int"],
        "default": null
    }, {
        "name": "age",
        "type": ["null", "string"],
        "default": null
    }]
}

 

It looks like you have your fields' types backwards (name should be string, age should be int).
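
For reference, here's the same schema with just those two types swapped:

{
    "type": "record",
    "name": "test",
    "namespace": "test",
    "fields": [{
        "name": "name",
        "type": ["null", "string"],
        "default": null
    }, {
        "name": "age",
        "type": ["null", "int"],
        "default": null
    }]
}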

 

On Tue, Apr 3, 2018 at 8:41 AM, Mohit <[email protected]> wrote:

Pierre,

Thanks for the information. This would be really helpful if NiFi 1.6.0 is released this week. There are a lot of pending tasks dependent on it. 😊

 

Mohit

 

From: Pierre Villard <[email protected]>
Sent: 03 April 2018 18:03


To: [email protected] <mailto:[email protected]> 
Subject: Re: ConvertCSVToAvro taking a lot of time when passing single record 
as an input.

 

Hi Mohit,

This has been fixed with https://issues.apache.org/jira/browse/NIFI-4955.

Besides, what Mark suggested with NIFI-4883 <https://issues.apache.org/jira/browse/NIFI-4883> is now merged in master.

Both will be in NiFi 1.6.0 to be released (hopefully this week).

Pierre

 

 

2018-04-03 12:37 GMT+02:00 Mohit <[email protected]>:

Hi,

I am using the ValidateRecord processor, but it seems to change the order of the data. When I set the Include Header Line property to true, I found that the record isn’t corrupted but the field order is changed.

 

For example- 

 

Actual order –

bsc,cell_name,site_id,site_name,longitude,latitude,status,region,districts_216,area,town_city,cgi,cell_id,new_id,azimuth,cell_type,territory_name

ATEBSC2,AC0139B,0139,0139LA_PALM,-0.14072,5.56353,Operational,Greater Accra 
Region,LA DADE-KOTOPON 
MUNICIPAL,LA_PALM,LABADI,62001-152-1392,1392,2332401392,60,2G,ACCRA METRO MAIN

 

Order after converting using CSVRecordSetWriter- 

cgi,latitude,territory_name,azimuth,cell_type,cell_id,longitude,cell_name,area,new_id,districts_216,site_name,town_city,bsc,site_id,region,status

62001-152-1392,5.56353,ACCRA METRO 
MAIN,60,2G,1392,-0.14072,AC0139B,LA_PALM,2332401392,LA DADE-KOTOPON 
MUNICIPAL,0139LA_PALM,LABADI,ATEBSC2,0139,Greater Accra Region,Operational

 

Is there any way to maintain the order of the record? 
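
In the meantime, one workaround I'm considering (I'm not sure it gets around the issue in 1.5.0) is to give the CSVRecordSetWriter an explicit schema with the fields declared in the original order. Something like the sketch below, treating every column as a string just for this test; the record name 'cells' is only a placeholder:

{
  "type": "record",
  "name": "cells",
  "namespace": "test",
  "fields": [
    {"name": "bsc", "type": "string"},
    {"name": "cell_name", "type": "string"},
    {"name": "site_id", "type": "string"},
    {"name": "site_name", "type": "string"},
    {"name": "longitude", "type": "string"},
    {"name": "latitude", "type": "string"},
    {"name": "status", "type": "string"},
    {"name": "region", "type": "string"},
    {"name": "districts_216", "type": "string"},
    {"name": "area", "type": "string"},
    {"name": "town_city", "type": "string"},
    {"name": "cgi", "type": "string"},
    {"name": "cell_id", "type": "string"},
    {"name": "new_id", "type": "string"},
    {"name": "azimuth", "type": "string"},
    {"name": "cell_type", "type": "string"},
    {"name": "territory_name", "type": "string"}
  ]
}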

 

Thanks,

Mohit

 

From: Mark Payne <[email protected]>
Sent: 02 April 2018 20:23


To: [email protected] <mailto:[email protected]> 
Subject: Re: ConvertCSVToAvro taking a lot of time when passing single record 
as an input.

 

Mohit,

 

I see. I think this is an issue because the Avro Writer expects the data to be in the proper schema, or else it will throw an Exception when trying to write the data. To address this, we should update ValidateRecord to support a different Record Writer for valid data vs. invalid data. There already is a JIRA [1] for this improvement.

 

In the meantime, it probably makes sense to use a CSV Reader and a CSV Writer for the ValidateRecord processor, then use ConvertRecord only for the valid records. Or, since you're running into this issue, it may make sense for your use case to continue with the ConvertCSVToAvro processor for now. But splitting the records up to run against that processor may result in lower performance, as you've noted.
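
In other words, a rough sketch of what I mean (assuming the standard CSVReader, CSVRecordSetWriter, and AvroRecordSetWriter controller services):

ListFile -> FetchFile -> ValidateRecord (CSVReader in, CSVRecordSetWriter out) -> 'valid' -> ConvertRecord (CSVReader in, AvroRecordSetWriter out) -> *(further processing)

with the 'invalid' relationship routed off to wherever you want to capture the bad records.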

 

Thanks

-Mark

 

[1] https://issues.apache.org/jira/browse/NIFI-4883

 

 

On Apr 2, 2018, at 10:26 AM, Mohit <[email protected]> wrote:

 

Mark,

 

Error:- 

ValidateRecord[id=5a9c3616-ab7c-17c1-ffff-ffffe6c2fc5d] 
ValidateRecord[id=5a9c3616-ab7c-17c1-ffff-ffffe6c2fc5d] failed to process due 
to org.apache.nifi.serialization.record.util.IllegalTypeConversionException: 
Cannot convert value mohit of type class java.lang.String because no compatible 
types exist in the UNION for field name; rolling back session: Cannot convert 
value mohit of type class java.lang.String because no compatible types exist in 
the UNION for field name

 

I have a file with only one record: mohit,25

Just to check how it works, I’ve given an incorrect schema (int for the string field):

{"type":"record","name":"test","namespace":"test","fields":[{"name":"name","type":["null","int"],"default":null},{"name":"age","type":["null","string"],"default":null}]}

 

It doesn’t pass the record to the invalid relationship. Instead, it keeps the file in the queue prior to the ValidateRecord processor.

 

Mohit

 

 

From: Mark Payne <[email protected]>
Sent: 02 April 2018 19:53
To: [email protected] <mailto:[email protected]> 
Subject: Re: ConvertCSVToAvro taking a lot of time when passing single record 
as an input.

 

What is the error that you're seeing? 

 

 

On Apr 2, 2018, at 10:22 AM, Mohit <[email protected]> wrote:

 

Hi Mark, 

 

I tried the ValidateRecord processor, and it converts the flowfile if it is valid. But if the records are not valid, they are not passed to the invalid relationship. Instead, it keeps throwing bulletins and keeps the flowfile in the queue.

 

Any suggestion?

 

Mohit

 

From: Mark Payne <[email protected]>
Sent: 02 April 2018 19:02
To: [email protected]
Subject: Re: ConvertCSVToAvro taking a lot of time when passing single record 
as an input.

 

Mohit,

 

You can certainly dial back that number of Concurrent Tasks. Setting that to something like 10 is a pretty big number. Setting it to a thousand means that you'll likely starve out other processors that are waiting on a thread and will generally perform a lot worse, because you have 1,000 different threads competing with each other to try to pull the next FlowFile.

 

You can use the ValidateRecord processor and configure a schema that indicates what you expect the data to look like. Then you can route any invalid records to one route and valid records to another. This will ensure that all data that goes to the 'valid' relationship is routed one way and any other data is routed to the 'invalid' relationship.
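
Roughly:

ValidateRecord -> 'valid' -> (continue the flow as normal)
ValidateRecord -> 'invalid' -> (capture the bad records, e.g. with PutFile)

That is just a sketch, of course; where you send the 'invalid' records is up to your use case.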

 

Thanks

-Mark

 

 

On Apr 2, 2018, at 9:22 AM, Mohit <[email protected]> wrote:

 

Hi Mark,

 

The main intention of using such a flow is to track bad records. The records which don’t get converted should be tracked somewhere. For that purpose, I’m using the Split-Merge approach.

 

Meanwhile, I was able to improve the performance by increasing the ‘Concurrent Tasks’ to 1000. Now ConvertCSVToAvro is able to convert 6-7k records per second, which, though not optimal, is much better than 45-50 records per second.

 

Is there any other improvement I can do?

 

Mohit

 

From: Mark Payne <[email protected]>
Sent: 02 April 2018 18:30
To: [email protected]
Subject: Re: ConvertCSVToAvro taking a lot of time when passing single record 
as an input.

 

Mohit, 

 

I agree that 45-50 records per second is quite slow. I'm not very familiar with the implementation of ConvertCSVToAvro, but it may well be that it must perform some sort of initialization for each FlowFile that it receives, which would explain why it's fast for a single incoming FlowFile and slow for a large number.

 

Additionally, when you start splitting the data like that, you're generating a lot more FlowFiles, which means a lot more updates to both the FlowFile Repository and the Provenance Repository. As a result, you're basically taxing the NiFi framework far more than if you keep the data as a single FlowFile. On my laptop, though, I would expect more than 45-50 FlowFiles per second through most processors, but I don't know what kind of hardware you are running on.

 

In general, though, it is best to keep data together instead of splitting it apart. Since ConvertCSVToAvro can handle many CSV records, is there a reason to split the data to begin with? Also, I would recommend you look at using the Record-based processors [1][2], such as ConvertRecord, instead of the ConvertABCtoXYZ processors. Those are older processors that often don't work as well, and the Record-oriented processors allow you to keep data together as a single FlowFile throughout your entire flow, which makes performance far better and the flow much easier to design.
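
For instance, a record-oriented version of your flow could be as simple as this sketch (again assuming the standard CSVReader and AvroRecordSetWriter controller services):

ListFile -> FetchFile -> ConvertRecord (CSVReader -> AvroRecordSetWriter) -> *(further processing)

with no splitting or merging at all.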

 

Thanks

-Mark

 

 

 

[1] https://blogs.apache.org/nifi/entry/record-oriented-data-with-nifi

[2] https://bryanbende.com/development/2017/06/20/apache-nifi-records-and-schema-registries

 

On Apr 2, 2018, at 8:49 AM, Mohit <[email protected]> wrote:

 

Hi,

 

I’m trying to capture bad records from the ConvertCSVToAvro processor. For that, I’m using two SplitText processors in a row to create chunks and then one record per flow file.

 

My flow is: ListFile -> FetchFile -> SplitText(10000 records) -> SplitText(1 record) -> ConvertCSVToAvro -> *(further processing)

 

I have a 10 MB file with 15 columns per row and 64,000 records. The normal flow (without SplitText) completes in a few seconds. But when I use the above flow, the ConvertCSVToAvro processor works drastically slower (45-50 records/sec).

I’m not able to figure out what I’m doing wrong in the flow.

 

I’m using NiFi 1.5.0.

 

Any quick input would be appreciated.

 

 

 

Thanks,

Mohit

 

 

 
