Hi Austin, 
I've extracted the CSV header line as an attribute and hashed that attribute to 
shorten the list of headers (I shortened the checksum even further by taking the 
first 6 characters).  Using the hashed attribute, I routed files with identical 
CSV headers to the same location for further processing.  This helped when the 
incoming schemas were not known ahead of time. 
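
As a rough sketch of the idea outside NiFi, something like the following 
Python shows the grouping logic. The hash algorithm (MD5 here) and the names 
header_key and group_by_header are assumptions made up for illustration; they 
are not NiFi APIs, and the actual flow used NiFi processors and attributes. 

```python
import hashlib
from collections import defaultdict
from typing import Dict, List


def header_key(csv_text: str, length: int = 6) -> str:
    """Return a short routing key derived from the first line of a CSV.

    Assumption: MD5 is used as the checksum; any stable hash would do.
    """
    lines = csv_text.splitlines()
    header = lines[0].strip() if lines else ""
    digest = hashlib.md5(header.encode("utf-8")).hexdigest()
    # Keep only the first few hex characters to shorten the key.
    return digest[:length]


def group_by_header(files: Dict[str, str]) -> Dict[str, List[str]]:
    """Group file names by the shortened hash of their header line."""
    groups = defaultdict(list)
    for name, content in files.items():
        groups[header_key(content)].append(name)
    return dict(groups)


files = {
    "a.csv": "col1,col2\n1,2\n",
    "b.csv": "col1,col2\n3,4\n",
    "c.csv": "col1,col2,col3\n5,6,7\n",
}
groups = group_by_header(files)
```

Because the whole header line is hashed, "col1,col2" and "col1,col2,col3" get 
different keys without any regex anchoring. A 6-character prefix can collide 
in principle, so a longer prefix is safer if there are many distinct schemas. 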

Lee


> On Mar 16, 2017, at 3:50 PM, Pierre Villard <[email protected]> 
> wrote:
> 
> Hi Austin,
> 
> I believe the RouteOnContent processor won't try to only match the first line 
> of your flow file but will try to match the whole content. Since your flow 
> file contains multiple lines, the regular expression you are using won't work.
> 
> I'd do the following: use ExtractText to get the header as an attribute, and 
> then use RouteOnAttribute and expression language to route the files with the 
> regular expression you suggested. Here is a gist [1] with a template I used 
> to test the approach.
> 
> [1] https://gist.github.com/pvillard31/d69f5bada8e8b66ae09969ee2cce7e5f
> 
> Hope this helps,
> Pierre
> 
> 
> 2017-03-16 21:47 GMT+01:00 Austin Heyne <[email protected]>:
>> Sure, I've stripped things down to eliminate some variables and I think I 
>> have the problem cornered. I created a CSV with a header of "csvA" and 
>> another with "csvB". I'm matching with regex "^csvA$" and "^csvB$" 
>> respectively. You may see what the problem is right away. NiFi doesn't seem 
>> to like '$' as an end of line marker. I've tried both "csvB" and "^csvB", 
>> which work fine, while "csvB$" fails. However, dropping the end anchor isn't 
>> strict enough for our purposes. We may have cases where one file is 
>> "col1, col2" and another is "col1,col2,col3". This could cause duplicate 
>> ingests later, eating resources. Is there a way to mark the end of line in 
>> the regex or am I going to have to do a nested regex filter or something 
>> else?
>> 
>> Thanks for the help,
>> 
>> Austin
>>> On 03/16/2017 02:27 PM, James Wing wrote:
>>> Austin,
>>> 
>>> I think you are on the right track with RouteOnContent.  Any chance you can 
>>> share a sample CSV header, the settings of your RouteOnContent processor, 
>>> including the regex?
>>> 
>>> Thanks,
>>> 
>>> James
>>> 
>>>> On Thu, Mar 16, 2017 at 11:14 AM, Austin Heyne <[email protected]> wrote:
>>>> Hi,
>>>> 
>>>> I have a set of CSV files with headers that utilize various schemas. I'd 
>>>> like to route the CSV files to processors based on the schema set in the 
>>>> header. I've tried using the RouteOnContent processor to sort the files 
>>>> based on "content must contain match" and a regex statement that matches 
>>>> the first line (header). However, this is throwing an 'unmatched' on every 
>>>> file I send through.
>>>> 
>>>> I've also looked at the ValidateCsv processor but it doesn't appear that 
>>>> it works with the header but rather just validates data types. 
>>>> Unfortunately this won't work as columns with the same data type could be 
>>>> in a different order.
>>>> 
>>>> Is there a ready-made solution for this problem that I missed or perhaps a 
>>>> more clever way to approach it?
>>>> 
>>>> Thanks,
>>>> 
>>>> Austin Heyne
>>>> 
>>> 
>> 
> 
