Re: Routing File based on CSV header schema

Lee Laim Thu, 16 Mar 2017 15:55:13 -0700

Hi Austin, 
I've extracted the CSV header line as an attribute and hashed that attribute 
to shorten the list of headers (I shortened the checksum even further by taking 
the first 6 characters).  Using the hashed attribute, I routed files with 
identical CSV headers to the same location for further processing.  This helped 
when the incoming schemas were not known ahead of time.
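
Outside of NiFi, the idea can be sketched roughly like this in Python. This is 
only an illustration: the function name header_key is made up, and SHA-256 is 
an assumption, since any stable hash or checksum would do.

```python
import hashlib


def header_key(csv_text: str, length: int = 6) -> str:
    """Return a short routing key derived from the CSV header line.

    Hypothetical helper: SHA-256 is an assumed choice of hash, and
    `length` is the number of hex characters kept from the digest.
    """
    lines = csv_text.splitlines()
    header = lines[0] if lines else ""
    digest = hashlib.sha256(header.encode("utf-8")).hexdigest()
    # Keep only the first few hex characters as a compact key.
    return digest[:length]


# Files with the same header get the same key, whatever their rows contain.
key_a = header_key("col1,col2\n1,2\n")
key_b = header_key("col1,col2\n3,4\n5,6\n")
key_c = header_key("col1,col2,col3\n1,2,3\n")
```

Truncating the digest makes the key easier to handle but raises the chance of 
two different headers sharing a key, so a longer prefix is safer when there are 
many distinct schemas.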


Lee


> On Mar 16, 2017, at 3:50 PM, Pierre Villard <[email protected]> 
> wrote:
> 
> Hi Austin,
> 
> I believe the RouteOnContent processor won't match only the first line of 
> your flow file but will try to match the whole content. Since your flow file 
> contains multiple lines, the regular expression you are using won't work.
> 
> I'd do the following: use ExtractText to get the header as an attribute, and 
> then use RouteOnAttribute and expression language to route the files with the 
> regular expression you suggested. Here is a gist [1] with a template I used 
> to test the approach.
> 
> [1] https://gist.github.com/pvillard31/d69f5bada8e8b66ae09969ee2cce7e5f
> 
> Hope this helps,
> Pierre
> 
> 
> 2017-03-16 21:47 GMT+01:00 Austin Heyne <[email protected]>:
>> Sure, I've stripped things down to eliminate some variables and I think I 
>> have the problem cornered. I created a CSV with a header of "csvA" and 
>> another with "csvB". I'm matching with regex "^csvA$" and "^csvB$" 
>> respectively. You may see what the problem is right away. NiFi doesn't seem 
>> to like '$' as an end of line marker. I've tried both "csvB" and "^csvB", 
>> which work fine, while "csvB$" fails. However, the patterns that work aren't 
>> strict enough for our purposes. We may have cases where one file is 
>> "col1, col2" and another is "col1,col2,col3". This could cause duplicate 
>> ingests later, eating resources. Is there a way to mark the end of line in 
>> the regex or am I going to have to do a nested regex filter or something 
>> else?
>> 
>> Thanks for the help,
>> 
>> Austin
>>> On 03/16/2017 02:27 PM, James Wing wrote:
>>> Austin,
>>> 
>>> I think you are on the right track with RouteOnContent.  Any chance you can 
>>> share a sample CSV header, the settings of your RouteOnContent processor, 
>>> including the regex?
>>> 
>>> Thanks,
>>> 
>>> James
>>> 
>>>> On Thu, Mar 16, 2017 at 11:14 AM, Austin Heyne <[email protected]> wrote:
>>>> Hi,
>>>> 
>>>> I have a set of CSV files with headers that utilize various schemas. I'd 
>>>> like to route the CSV files to processors based on the schema set in the 
>>>> header. I've tried using the RouteOnContent processor to sort the files 
>>>> based on "content must contain match" and a regex statement that matches 
>>>> the first line (header). However, this is throwing an 'unmatched' on every 
>>>> file I send through.
>>>> 
>>>> I've also looked at the ValidateCsv processor but it doesn't appear that 
>>>> it works with the header; rather, it just validates data types. 
>>>> Unfortunately this won't work as columns with the same data type could be 
>>>> in a different order.
>>>> 
>>>> Is there a ready made solution for this problem that I missed or perhaps a 
>>>> more clever way to approach it?
>>>> 
>>>> Thanks,
>>>> 
>>>> Austin Heyne
>>>> 
>>> 
>> 
>
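
For anyone reading this thread later, the behaviour discussed above can be 
reproduced outside NiFi. The sketch below uses Python's re module, on the 
assumption that it treats '^' and '$' the same way as the Java regex engine 
does in this case; the schema names and the route function are hypothetical. 
It shows why "^csvB$" fails against multi-line content, and how pulling out the 
header first allows a strict whole-line match.

```python
import re

content = "csvB\n1,2,3\n4,5,6\n"

# Against the whole content, '$' (without a multi-line flag) only matches at
# the very end of the input, so anchoring the header on both sides fails.
anchored_both = re.search(r"^csvB$", content)   # no match
anchored_start = re.search(r"^csvB", content)   # matches at the start

# Hypothetical schema table: name -> pattern that must match the full header.
SCHEMAS = {
    "two_cols": r"col1,col2",
    "three_cols": r"col1,col2,col3",
}


def route(csv_text: str) -> str:
    """Return the name of the schema whose pattern matches the entire header."""
    lines = csv_text.splitlines()
    header = lines[0] if lines else ""
    for name, pattern in SCHEMAS.items():
        # fullmatch requires the whole header to match, so "col1,col2"
        # cannot be mistaken for a prefix of "col1,col2,col3".
        if re.fullmatch(pattern, header):
            return name
    return "unmatched"
```

With this, route("col1,col2,col3\n1,2,3\n") selects the three-column schema 
rather than the two-column one, which is the strictness asked for earlier in 
the thread.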
