Hi Mohit,

This has been fixed with https://issues.apache.org/jira/browse/NIFI-4955. In addition, what Mark suggested with NIFI-4883 <https://issues.apache.org/jira/browse/NIFI-4883> is now merged into master. Both will be in NiFi 1.6.0, to be released (hopefully this week).

Pierre
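Until 1.6.0 is out, one possible workaround for the column-order question below (a sketch only, not something confirmed in this thread) is to give the CSVRecordSetWriter an explicit schema, e.g. via its Schema Text property, since record writers emit fields in the order the schema declares them. Using the header from the example below, such a schema might begin like this (the record name and the numeric types for longitude/latitude are assumptions):

  {
    "type": "record",
    "name": "cell_sites",
    "fields": [
      { "name": "bsc",       "type": ["null", "string"], "default": null },
      { "name": "cell_name", "type": ["null", "string"], "default": null },
      { "name": "site_id",   "type": ["null", "string"], "default": null },
      { "name": "site_name", "type": ["null", "string"], "default": null },
      { "name": "longitude", "type": ["null", "double"], "default": null },
      { "name": "latitude",  "type": ["null", "double"], "default": null }
    ]
  }

...and so on for the remaining columns, listed in the original header order.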
2018-04-03 12:37 GMT+02:00 Mohit <[email protected]>:

> Hi,
>
> I am using the ValidateRecord processor, but it seems like it changes the
> order of the data. When I set the property Include Header Line to true, I
> found that the records aren't corrupted but the column order is changed.
>
> For example -
>
> Actual order:
>
> bsc,cell_name,site_id,site_name,longitude,latitude,status,region,districts_216,area,town_city,cgi,cell_id,new_id,azimuth,cell_type,territory_name
> ATEBSC2,AC0139B,0139,0139LA_PALM,-0.14072,5.56353,Operational,Greater Accra Region,LA DADE-KOTOPON MUNICIPAL,LA_PALM,LABADI,62001-152-1392,1392,2332401392,60,2G,ACCRA METRO MAIN
>
> Order after converting using CSVRecordSetWriter:
>
> cgi,latitude,territory_name,azimuth,cell_type,cell_id,longitude,cell_name,area,new_id,districts_216,site_name,town_city,bsc,site_id,region,status
> 62001-152-1392,5.56353,ACCRA METRO MAIN,60,2G,1392,-0.14072,AC0139B,LA_PALM,2332401392,LA DADE-KOTOPON MUNICIPAL,0139LA_PALM,LABADI,ATEBSC2,0139,Greater Accra Region,Operational
>
> Is there any way to maintain the order of the records?
>
> Thanks,
> Mohit
>
> From: Mark Payne <[email protected]>
> Sent: 02 April 2018 20:23
> To: [email protected]
> Subject: Re: ConvertCSVToAvro taking a lot of time when passing single record as an input.
>
> Mohit,
>
> I see. I think this is an issue because the Avro Writer expects that the
> data must be in the proper schema, or else it will throw an Exception when
> trying to write the data. To address this, we should update ValidateRecord
> to support a different Record Writer to use for valid data vs. invalid
> data. There already is a JIRA [1] for this improvement.
>
> In the meantime, it probably makes sense to use a CSV Reader and a CSV
> Writer for the ValidateRecord processor, then use ConvertRecord only for
> the valid records. Or, since you're running into this issue, it may make
> sense for your use case to continue with the ConvertCSVToAvro processor
> for now. But splitting the records up to run against that processor may
> result in lower performance, as you've noted.
>
> Thanks
> -Mark
>
> [1] https://issues.apache.org/jira/browse/NIFI-4883
>
> On Apr 2, 2018, at 10:26 AM, Mohit <[email protected]> wrote:
>
> Mark,
>
> Error:
>
> ValidateRecord[id=5a9c3616-ab7c-17c1-ffff-ffffe6c2fc5d] ValidateRecord[id=5a9c3616-ab7c-17c1-ffff-ffffe6c2fc5d]
> failed to process due to org.apache.nifi.serialization.record.util.IllegalTypeConversionException:
> Cannot convert value mohit of type class java.lang.String because no compatible types exist in the
> UNION for field name; rolling back session: Cannot convert value mohit of type class java.lang.String
> because no compatible types exist in the UNION for field name
>
> I have a file with only one record: mohit,25
>
> Just to check how it works, I've given it an incorrect schema (int for the string field):
>
> {"type":"record","name":"test","namespace":"test","fields":[{"name":"name","type":["null","int"],"default":null},{"name":"age","type":["null","string"],"default":null}]}
>
> It doesn't pass the record to the invalid relationship; instead it keeps the
> file in the queue prior to the ValidateRecord processor.
>
> Mohit
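That exception is consistent with the schema above: the value "mohit" is a String, but the union declared for the name field only allows null or int, so the Avro writer has no compatible type to convert to. For reference, a schema that would match the mohit,25 record would look something like this (a sketch; the types are inferred from that single sample row):

  {
    "type": "record",
    "name": "test",
    "namespace": "test",
    "fields": [
      { "name": "name", "type": ["null", "string"], "default": null },
      { "name": "age",  "type": ["null", "int"],    "default": null }
    ]
  }

The rollback-instead-of-routing-to-invalid behaviour is what NIFI-4883, referenced above, aims to address: with a separate Record Writer for invalid data, a record would not have to be writable under the strict Avro schema just to reach the 'invalid' relationship.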
> From: Mark Payne <[email protected]>
> Sent: 02 April 2018 19:53
> To: [email protected]
> Subject: Re: ConvertCSVToAvro taking a lot of time when passing single record as an input.
>
> What is the error that you're seeing?
>
> On Apr 2, 2018, at 10:22 AM, Mohit <[email protected]> wrote:
>
> Hi Mark,
>
> I tried the ValidateRecord processor, and it converts the flowfile if it
> is valid. But if the records are not valid, it does not pass them to the
> invalid relationship; instead it keeps throwing bulletins and leaves the
> flowfile in the queue.
>
> Any suggestion?
>
> Mohit
>
> From: Mark Payne <[email protected]>
> Sent: 02 April 2018 19:02
> To: [email protected]
> Subject: Re: ConvertCSVToAvro taking a lot of time when passing single record as an input.
>
> Mohit,
>
> You can certainly dial back that number of Concurrent Tasks. Setting that
> to something like 10 is already a pretty big number. Setting it to a
> thousand means that you'll likely starve out other processors that are
> waiting on a thread, and it will generally perform a lot worse because you
> have 1,000 different threads competing with each other to try to pull the
> next FlowFile.
>
> You can use the ValidateRecord processor and configure a schema that
> indicates what you expect the data to look like. Then you can route any
> invalid records to one route and valid records to another: all data that
> matches the schema goes to the 'valid' relationship and any other data is
> routed to the 'invalid' relationship.
>
> Thanks
> -Mark
>
> On Apr 2, 2018, at 9:22 AM, Mohit <[email protected]> wrote:
>
> Hi Mark,
>
> The main intention of using such a flow is to track bad records. The
> records which don't get converted should be tracked somewhere. For that
> purpose I'm using the split-merge approach.
>
> Meanwhile, I was able to improve the performance by increasing 'Concurrent
> Tasks' to 1000. Now ConvertCSVToAvro is able to convert 6-7k records per
> second, which, though not optimal, is much better than 45-50 records per
> second.
>
> Is there any other improvement I can make?
>
> Mohit
>
> From: Mark Payne <[email protected]>
> Sent: 02 April 2018 18:30
> To: [email protected]
> Subject: Re: ConvertCSVToAvro taking a lot of time when passing single record as an input.
>
> Mohit,
>
> I agree that 45-50 records per second is quite slow. I'm not very familiar
> with the implementation of ConvertCSVToAvro, but it may well be that it
> must perform some sort of initialization for each FlowFile that it
> receives, which would explain why it's fast for a single incoming FlowFile
> and slow for a large number.
>
> Additionally, when you start splitting the data like that, you're
> generating a lot more FlowFiles, which means a lot more updates to both
> the FlowFile Repository and the Provenance Repository. As a result, you're
> basically taxing the NiFi framework far more than if you keep the data as
> a single FlowFile. On my laptop, though, I would expect more than 45-50
> FlowFiles per second through most processors, but I don't know what kind
> of hardware you are running on.
>
> In general, though, it is best to keep data together instead of splitting
> it apart. Since ConvertCSVToAvro can handle many CSV records, is there a
> reason to split the data to begin with?
> Also, I would recommend you look at using the Record-based processors
> [1][2], such as ConvertRecord, instead of the ConvertABCtoXYZ processors.
> Those are older processors and often don't work as well, and the
> Record-oriented processors often allow you to keep the data together as a
> single FlowFile throughout your entire flow, which makes the performance
> far better and the flow much easier to design.
>
> Thanks
> -Mark
>
> [1] https://blogs.apache.org/nifi/entry/record-oriented-data-with-nifi
> [2] https://bryanbende.com/development/2017/06/20/apache-nifi-records-and-schema-registries
>
> On Apr 2, 2018, at 8:49 AM, Mohit <[email protected]> wrote:
>
> Hi,
>
> I'm trying to capture bad records from the ConvertCSVToAvro processor. For
> that, I'm using two SplitText processors in a row: the first to create
> chunks and the second to emit one record per flow file.
>
> My flow is: ListFile -> FetchFile -> SplitText (10000 records) ->
> SplitText (1 record) -> ConvertCSVToAvro -> (further processing)
>
> I have a 10 MB file with 15 columns per row and 64,000 records. The normal
> flow (without SplitText) completes in a few seconds, but with the above
> flow the ConvertCSVToAvro processor is drastically slow (45-50 records per
> second). I'm not able to work out what I'm doing wrong in the flow.
>
> I'm using NiFi 1.5.0.
>
> Any quick input would be appreciated.
>
> Thanks,
> Mohit
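Pulling the suggestions in this thread together, a record-oriented version of the flow above would drop both SplitText steps and keep the file as a single FlowFile. This is only a sketch of what Mark and Pierre describe (the reader/writer names are the standard controller services; exact properties and the schema source are left out):

  ListFile -> FetchFile
           -> ValidateRecord (CSVReader + CSVRecordSetWriter, schema describing the expected columns)
                valid   -> ConvertRecord (CSVReader + AvroRecordSetWriter) -> further processing
                invalid -> (route/log the bad records)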
