Mark,
Error:
ValidateRecord[id=5a9c3616-ab7c-17c1-ffff-ffffe6c2fc5d] failed to process due to
org.apache.nifi.serialization.record.util.IllegalTypeConversionException: Cannot
convert value mohit of type class java.lang.String because no compatible types
exist in the UNION for field name; rolling back session: Cannot convert value
mohit of type class java.lang.String because no compatible types exist in the
UNION for field name
I have a file with only one record: mohit,25
Just to check how it works, I’ve deliberately given an incorrect schema (int for the string field):
{"type":"record","name":"test","namespace":"test","fields":[{"name":"name","type":["null","int"],"default":null},{"name":"age","type":["null","string"],"default":null}]}
It doesn’t pass the record to the 'invalid' relationship. Instead, it keeps the file in the queue prior to the ValidateRecord processor.
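For comparison, a schema that should actually match the record above (name as string, age as int) would be:

{
  "type": "record",
  "name": "test",
  "namespace": "test",
  "fields": [
    {"name": "name", "type": ["null", "string"], "default": null},
    {"name": "age", "type": ["null", "int"], "default": null}
  ]
}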
Mohit
From: Mark Payne <[email protected]>
Sent: 02 April 2018 19:53
To: [email protected]
Subject: Re: ConvertCSVToAvro taking a lot of time when passing single record
as an input.
What is the error that you're seeing?
On Apr 2, 2018, at 10:22 AM, Mohit <[email protected]> wrote:
Hi Mark,
I tried the ValidateRecord processor; it converts the flowfile if it is valid. But if the records are not valid, it does not pass them to the 'invalid' relationship. Instead, it keeps throwing bulletins and keeps the flowfile in the queue.
Any suggestion?
Mohit
From: Mark Payne <[email protected]>
Sent: 02 April 2018 19:02
To: [email protected] <mailto:[email protected]>
Subject: Re: ConvertCSVToAvro taking a lot of time when passing single record
as an input.
Mohit,
You can certainly dial back that number of Concurrent Tasks. Even something like
10 is a pretty big number. Setting it to a thousand means that you'll likely
starve out other processors that are waiting on a thread, and the flow will
generally perform a lot worse because you have 1,000 different threads competing
with each other to try to pull the next FlowFile.
You can use the ValidateRecord processor and configure a schema that indicates
what you expect the data to look like. Then you can route invalid records one way
and valid records another: every record that matches the schema goes to the
'valid' relationship, and any other data is routed to the 'invalid' relationship.
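For example, a minimal setup might look like this (CSVReader and AvroRecordSetWriter are the standard controller services; configure the reader with your expected schema):

ValidateRecord
  Record Reader:  CSVReader (configured with your expected schema)
  Record Writer:  AvroRecordSetWriter
  'valid'    -> downstream processing
  'invalid'  -> logging / quarantine for the bad records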
Thanks
-Mark
On Apr 2, 2018, at 9:22 AM, Mohit <[email protected]> wrote:
Hi Mark,
The main intention behind such a flow is to track bad records. The records which
don’t get converted should be tracked somewhere. For that purpose, I’m using a
Split-Merge approach.
Meanwhile, I’ve been able to improve the performance by increasing the ‘Concurrent
Tasks’ to 1000. Now ConvertCSVToAvro is able to convert 6-7k records per second,
which, though not optimal, is much better than 45-50 records per second.
Is there any other improvement I can do?
Mohit
From: Mark Payne <[email protected]>
Sent: 02 April 2018 18:30
To: [email protected]
Subject: Re: ConvertCSVToAvro taking a lot of time when passing single record
as an input.
Mohit,
I agree that 45-50 records per second is quite slow. I'm not very familiar with
the implementation of
ConvertCSVToAvro, but it may well be that it must perform some sort of
initialization for each FlowFile
that it receives, which would explain why it's fast for a single incoming
FlowFile and slow for a large number.
Additionally, when you start splitting the data like that, you're generating a
lot more FlowFiles, which means
a lot more updates to both the FlowFile Repository and the Provenance
Repository. As a result, you're basically
taxing the NiFi framework far more than if you keep the data as a single
FlowFile. On my laptop, though, I would
expect more than 45-50 FlowFiles per second through most processors, but I
don't know what kind of hardware
you are running on.
In general, though, it is best to keep data together instead of splitting it
apart. Since the ConvertCSVToAvro can
handle many CSV records, is there a reason to split the data to begin with?
Also, I would recommend you look at using the Record-based processors [1][2] such
as ConvertRecord instead of the ConvertABCtoXYZ processors, as those are older
processors and often don't work as well. The Record-oriented processors allow you
to keep data together as a single FlowFile throughout your entire flow, which
makes the performance far better and the flow much easier to design.
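For example, the whole flow could collapse to something like this (a sketch; CSVReader and AvroRecordSetWriter are the standard record reader/writer controller services):

ListFile -> FetchFile -> ConvertRecord -> *(further processing)

where ConvertRecord is configured with Record Reader = CSVReader and Record Writer = AvroRecordSetWriter.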
Thanks
-Mark
[1] https://blogs.apache.org/nifi/entry/record-oriented-data-with-nifi
[2] https://bryanbende.com/development/2017/06/20/apache-nifi-records-and-schema-registries
On Apr 2, 2018, at 8:49 AM, Mohit <[email protected]> wrote:
Hi,
I’m trying to capture bad records from the ConvertCSVToAvro processor. For that,
I’m using two SplitText processors in a row: first to create chunks, then one
record per FlowFile.
My flow is - ListFile -> FetchFile -> SplitText (10000 records) -> SplitText (1
record) -> ConvertCSVToAvro -> *(further processing)
I have a 10 MB file with 15 columns per row and 64,000 records. The normal flow
(without SplitText) completes in a few seconds. But with the above flow, the
ConvertCSVToAvro processor works drastically slowly (45-50 records/sec), which
works out to roughly 21-24 minutes for the whole file.
I’m not able to figure out what I’m doing wrong in the flow.
I’m using NiFi 1.5.0.
Any quick input would be appreciated.
Thanks,
Mohit