Hi Mohit,

This has been fixed with https://issues.apache.org/jira/browse/NIFI-4955. In addition, what Mark suggested with NIFI-4883 <https://issues.apache.org/jira/browse/NIFI-4883> is now merged into master. Both will be in NiFi 1.6.0, to be released (hopefully this week).

Pierre
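Until 1.6.0 is out, one possible workaround for the column-order question below (a sketch only, not something confirmed in this thread) is to give the CSVRecordSetWriter an explicit schema, e.g. via its Schema Text property, since record writers emit fields in the order the schema declares them. Using the header from the example below, such a schema might begin like this (the record name and the numeric types for longitude/latitude are assumptions):

  {
    "type": "record",
    "name": "cell_sites",
    "fields": [
      { "name": "bsc",       "type": ["null", "string"], "default": null },
      { "name": "cell_name", "type": ["null", "string"], "default": null },
      { "name": "site_id",   "type": ["null", "string"], "default": null },
      { "name": "site_name", "type": ["null", "string"], "default": null },
      { "name": "longitude", "type": ["null", "double"], "default": null },
      { "name": "latitude",  "type": ["null", "double"], "default": null }
    ]
  }

...and so on for the remaining columns, listed in the original header order.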
2018-04-03 12:37 GMT+02:00 Mohit <[email protected]>:

> Hi,
>
> I am using the ValidateRecord processor, but it seems like it changes the
> order of the data. When I set the property Include Header Line to true, I
> found that the records aren't corrupted but the column order is changed.
>
> For example -
>
> Actual order:
>
> bsc,cell_name,site_id,site_name,longitude,latitude,status,region,districts_216,area,town_city,cgi,cell_id,new_id,azimuth,cell_type,territory_name
> ATEBSC2,AC0139B,0139,0139LA_PALM,-0.14072,5.56353,Operational,Greater Accra Region,LA DADE-KOTOPON MUNICIPAL,LA_PALM,LABADI,62001-152-1392,1392,2332401392,60,2G,ACCRA METRO MAIN
>
> Order after converting using CSVRecordSetWriter:
>
> cgi,latitude,territory_name,azimuth,cell_type,cell_id,longitude,cell_name,area,new_id,districts_216,site_name,town_city,bsc,site_id,region,status
> 62001-152-1392,5.56353,ACCRA METRO MAIN,60,2G,1392,-0.14072,AC0139B,LA_PALM,2332401392,LA DADE-KOTOPON MUNICIPAL,0139LA_PALM,LABADI,ATEBSC2,0139,Greater Accra Region,Operational
>
> Is there any way to maintain the order of the records?
>
> Thanks,
> Mohit
>
> From: Mark Payne <[email protected]>
> Sent: 02 April 2018 20:23
> To: [email protected]
> Subject: Re: ConvertCSVToAvro taking a lot of time when passing single record as an input.
>
> Mohit,
>
> I see. I think this is an issue because the Avro Writer expects that the
> data must be in the proper schema, or else it will throw an Exception when
> trying to write the data. To address this, we should update ValidateRecord
> to support a different Record Writer to use for valid data vs. invalid
> data. There already is a JIRA [1] for this improvement.
>
> In the meantime, it probably makes sense to use a CSV Reader and a CSV
> Writer for the ValidateRecord processor, then use ConvertRecord only for
> the valid records. Or, since you're running into this issue, it may make
> sense for your use case to continue with the ConvertCSVToAvro processor
> for now. But splitting the records up to run against that processor may
> result in lower performance, as you've noted.
>
> Thanks
> -Mark
>
> [1] https://issues.apache.org/jira/browse/NIFI-4883
>
> On Apr 2, 2018, at 10:26 AM, Mohit <[email protected]> wrote:
>
> Mark,
>
> Error:
>
> ValidateRecord[id=5a9c3616-ab7c-17c1-ffff-ffffe6c2fc5d] ValidateRecord[id=5a9c3616-ab7c-17c1-ffff-ffffe6c2fc5d]
> failed to process due to org.apache.nifi.serialization.record.util.IllegalTypeConversionException:
> Cannot convert value mohit of type class java.lang.String because no compatible types exist in the
> UNION for field name; rolling back session: Cannot convert value mohit of type class java.lang.String
> because no compatible types exist in the UNION for field name
>
> I have a file with only one record: mohit,25
>
> Just to check how it works, I've given it an incorrect schema (int for the string field):
>
> {"type":"record","name":"test","namespace":"test","fields":[{"name":"name","type":["null","int"],"default":null},{"name":"age","type":["null","string"],"default":null}]}
>
> It doesn't pass the record to the invalid relationship; instead it keeps the
> file in the queue prior to the ValidateRecord processor.
>
> Mohit
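That exception is consistent with the schema above: the value "mohit" is a String, but the union declared for the name field only allows null or int, so the Avro writer has no compatible type to convert to. For reference, a schema that would match the mohit,25 record would look something like this (a sketch; the types are inferred from that single sample row):

  {
    "type": "record",
    "name": "test",
    "namespace": "test",
    "fields": [
      { "name": "name", "type": ["null", "string"], "default": null },
      { "name": "age",  "type": ["null", "int"],    "default": null }
    ]
  }

The rollback-instead-of-routing-to-invalid behaviour is what NIFI-4883, referenced above, aims to address: with a separate Record Writer for invalid data, a record would not have to be writable under the strict Avro schema just to reach the 'invalid' relationship.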
> From: Mark Payne <[email protected]>
> Sent: 02 April 2018 19:53
> To: [email protected]
> Subject: Re: ConvertCSVToAvro taking a lot of time when passing single record as an input.
>
> What is the error that you're seeing?
>
> On Apr 2, 2018, at 10:22 AM, Mohit <[email protected]> wrote:
>
> Hi Mark,
>
> I tried the ValidateRecord processor, and it converts the flowfile if it
> is valid. But if the records are not valid, it does not pass them to the
> invalid relationship; instead it keeps throwing bulletins and leaves the
> flowfile in the queue.
>
> Any suggestion?
>
> Mohit
>
> From: Mark Payne <[email protected]>
> Sent: 02 April 2018 19:02
> To: [email protected]
> Subject: Re: ConvertCSVToAvro taking a lot of time when passing single record as an input.
>
> Mohit,
>
> You can certainly dial back that number of Concurrent Tasks. Setting that
> to something like 10 is already a pretty big number. Setting it to a
> thousand means that you'll likely starve out other processors that are
> waiting on a thread, and it will generally perform a lot worse because you
> have 1,000 different threads competing with each other to try to pull the
> next FlowFile.
>
> You can use the ValidateRecord processor and configure a schema that
> indicates what you expect the data to look like. Then you can route any
> invalid records to one route and valid records to another: all data that
> matches the schema goes to the 'valid' relationship and any other data is
> routed to the 'invalid' relationship.
>
> Thanks
> -Mark
>
> On Apr 2, 2018, at 9:22 AM, Mohit <[email protected]> wrote:
>
> Hi Mark,
>
> The main intention of using such a flow is to track bad records. The
> records which don't get converted should be tracked somewhere. For that
> purpose I'm using the split-merge approach.
>
> Meanwhile, I was able to improve the performance by increasing 'Concurrent
> Tasks' to 1000. Now ConvertCSVToAvro is able to convert 6-7k records per
> second, which, though not optimal, is much better than 45-50 records per
> second.
>
> Is there any other improvement I can make?
>
> Mohit
>
> From: Mark Payne <[email protected]>
> Sent: 02 April 2018 18:30
> To: [email protected]
> Subject: Re: ConvertCSVToAvro taking a lot of time when passing single record as an input.
>
> Mohit,
>
> I agree that 45-50 records per second is quite slow. I'm not very familiar
> with the implementation of ConvertCSVToAvro, but it may well be that it
> must perform some sort of initialization for each FlowFile that it
> receives, which would explain why it's fast for a single incoming FlowFile
> and slow for a large number.
>
> Additionally, when you start splitting the data like that, you're
> generating a lot more FlowFiles, which means a lot more updates to both
> the FlowFile Repository and the Provenance Repository. As a result, you're
> basically taxing the NiFi framework far more than if you keep the data as
> a single FlowFile. On my laptop, though, I would expect more than 45-50
> FlowFiles per second through most processors, but I don't know what kind
> of hardware you are running on.
>
> In general, though, it is best to keep data together instead of splitting
> it apart. Since ConvertCSVToAvro can handle many CSV records, is there a
> reason to split the data to begin with?
> Also, I would recommend you look at using the Record-based processors
> [1][2], such as ConvertRecord, instead of the ConvertABCtoXYZ processors.
> Those are older processors and often don't work as well, and the
> Record-oriented processors often allow you to keep the data together as a
> single FlowFile throughout your entire flow, which makes the performance
> far better and the flow much easier to design.
>
> Thanks
> -Mark
>
> [1] https://blogs.apache.org/nifi/entry/record-oriented-data-with-nifi
> [2] https://bryanbende.com/development/2017/06/20/apache-nifi-records-and-schema-registries
>
> On Apr 2, 2018, at 8:49 AM, Mohit <[email protected]> wrote:
>
> Hi,
>
> I'm trying to capture bad records from the ConvertCSVToAvro processor. For
> that, I'm using two SplitText processors in a row: the first to create
> chunks and the second to emit one record per flow file.
>
> My flow is: ListFile -> FetchFile -> SplitText (10000 records) ->
> SplitText (1 record) -> ConvertCSVToAvro -> (further processing)
>
> I have a 10 MB file with 15 columns per row and 64,000 records. The normal
> flow (without SplitText) completes in a few seconds, but with the above
> flow the ConvertCSVToAvro processor is drastically slow (45-50 records per
> second). I'm not able to work out what I'm doing wrong in the flow.
>
> I'm using NiFi 1.5.0.
>
> Any quick input would be appreciated.
>
> Thanks,
> Mohit
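Pulling the suggestions in this thread together, a record-oriented version of the flow above would drop both SplitText steps and keep the file as a single FlowFile. This is only a sketch of what Mark and Pierre describe (the reader/writer names are the standard controller services; exact properties and the schema source are left out):

  ListFile -> FetchFile
           -> ValidateRecord (CSVReader + CSVRecordSetWriter, schema describing the expected columns)
                valid   -> ConvertRecord (CSVReader + AvroRecordSetWriter) -> further processing
                invalid -> (route/log the bad records)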
