Hi Austin,

I've extracted the CSV header line as an attribute and hashed that attribute to shorten the list of headers (I shortened the checksum even further by taking the first 6 characters). Using the hashed attribute, I routed files with identical CSV headers to the same location for further processing. This helped when the incoming schemas were not known ahead of time.
Lee

> On Mar 16, 2017, at 3:50 PM, Pierre Villard <[email protected]> wrote:
>
> Hi Austin,
>
> I believe the RouteOnContent processor won't try to only match the first line
> of your flow file but will try to match the whole content. Since your flow
> file contains multiple lines, the regular expression you are using won't work.
>
> I'd do the following: use ExtractText to get the header as an attribute, and
> then use RouteOnAttribute and expression language to route the files with the
> regular expression you suggested. Here is a gist [1] with a template I used
> to test the approach.
>
> [1] https://gist.github.com/pvillard31/d69f5bada8e8b66ae09969ee2cce7e5f
>
> Hope this helps,
> Pierre
>
>
> 2017-03-16 21:47 GMT+01:00 Austin Heyne <[email protected]>:
>> Sure, I've stripped things down to eliminate some variables and I think I
>> have the problem cornered. I created a CSV with a header of "csvA" and another
>> with "csvB". I'm matching with regex "^csvA$" and "^csvB$" respectively. You
>> may see what the problem is right away. NiFi doesn't seem to like '$' as an
>> end of line marker. I've tried both "csvB" and "^csvB" which work fine and
>> "csvB$" fails, however, this isn't strict enough for our purposes. We may
>> have cases where one file is "col1, col2" and another is "col1,col2,col3".
>> This could cause duplicate ingests later, eating resources. Is there a way
>> to mark the end of line in the regex or am I going to have to do a nested
>> regex filter or something else?
>>
>> Thanks for the help,
>>
>> Austin
>>> On 03/16/2017 02:27 PM, James Wing wrote:
>>> Austin,
>>>
>>> I think you are on the right track with RouteOnContent. Any chance you can
>>> share a sample CSV header, the settings of your RouteOnContent processor,
>>> including the regex?
>>>
>>> Thanks,
>>>
>>> James
>>>
>>>> On Thu, Mar 16, 2017 at 11:14 AM, Austin Heyne <[email protected]> wrote:
>>>> Hi,
>>>>
>>>> I have a set of CSV files with headers that utilize various schemas. I'd
>>>> like to route the CSV files to processors based on the schema set in the
>>>> header. I've tried using the RouteOnContent processor to sort the files
>>>> based on "content must contain match" and a regex statement that matches
>>>> the first line (header). However, this is throwing an 'unmatched' on every
>>>> file I send through.
>>>>
>>>> I've also looked at the ValidateCsv processor but it doesn't appear that
>>>> it works with the header but rather just validates data types. Unfortunately
>>>> this won't work as columns with the same data type could be in a different
>>>> order.
>>>>
>>>> Is there a ready made solution for this problem that I missed or perhaps a
>>>> more clever way to approach it?
>>>>
>>>> Thanks,
>>>>
>>>> Austin Heyne
>>>>
>>>
>>
>
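
For anyone reading this thread later, here is a rough Python sketch of the idea Pierre describes: pull out the header line first, then match the anchored pattern against that line only. This is not NiFi code. The names `header_line`, `route`, and `ROUTES` and the sample file contents are made up for illustration. Python's `$` (without `re.MULTILINE`) behaves much like the default in Java regex, which is why an anchored pattern fails against multi-line content.

```python
import re
from typing import Mapping


def header_line(content: str) -> str:
    """Return the first line of the content, without its line terminator."""
    lines = content.splitlines()
    return lines[0] if lines else ""


def route(content: str, routes: Mapping[str, str]) -> str:
    """Return the name of the first route whose pattern matches the whole header."""
    header = header_line(content)
    for name, pattern in routes.items():
        # fullmatch requires the pattern to cover the entire header,
        # so "col1,col2" does not accept "col1,col2,col3".
        if re.fullmatch(pattern, header):
            return name
    return "unmatched"


# Hypothetical schemas and sample files, for illustration only.
ROUTES = {"schemaA": r"col1,col2", "schemaB": r"col1,col2,col3"}
file_a = "col1,col2\n1,2\n3,4\n"
file_b = "col1,col2,col3\n1,2,3\n"

# Anchoring against the whole multi-line content finds nothing,
# because "$" only matches at the very end of the input here.
whole_content_match = re.search(r"^col1,col2$", file_a)

print(route(file_a, ROUTES))  # schemaA
print(route(file_b, ROUTES))  # schemaB
```

Matching the full header rather than a prefix is what keeps "col1,col2" and "col1,col2,col3" on separate routes.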

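
The hashing approach from Lee's reply could be sketched along these lines. Again this is plain Python rather than a NiFi flow, and several details are assumptions: the thread does not say which hash was used, so SHA-256 is picked arbitrarily, and `header_key` and `group_by_header` are hypothetical names. Keeping only 6 hex characters makes the key short but raises the chance of two different headers colliding, so it is probably only suitable when the number of distinct schemas is small.

```python
import hashlib
from typing import Dict, List, Mapping


def header_key(content: str, length: int = 6) -> str:
    """Hash the first line of the content and keep the first `length` hex characters."""
    lines = content.splitlines()
    header = lines[0] if lines else ""
    # SHA-256 is an assumption; the original message only mentions a checksum.
    digest = hashlib.sha256(header.encode("utf-8")).hexdigest()
    return digest[:length]


def group_by_header(files: Mapping[str, str]) -> Dict[str, List[str]]:
    """Group file names by the shortened hash of their header line."""
    groups: Dict[str, List[str]] = {}
    for name, content in files.items():
        groups.setdefault(header_key(content), []).append(name)
    return groups


# Hypothetical input files: two share a header, one has an extra column.
files = {
    "a.csv": "col1,col2\n1,2\n",
    "b.csv": "col1,col2\n3,4\n",
    "c.csv": "col1,col2,col3\n5,6,7\n",
}
groups = group_by_header(files)
```

Files with byte-identical headers end up under the same key without any schema being declared up front. Headers that differ only in whitespace, such as "col1, col2" and "col1,col2", hash differently unless they are normalized first.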