Hi Mark,

I was missing this bit! Thanks a lot, "Correlation Attribute Name" is indeed what I wanted!

Giovanni

> -----Original Message-----
> From: Mark Payne [mailto:[email protected]]
> Sent: Tuesday, November 29, 2016 4:16 PM
> To: [email protected]
> Subject: Re: Keep attributes when merging
>
> Giovanni,
>
> In the scenario that you laid out here, the merged FlowFile will not have a
> 'dt' attribute because there are conflicting values for the 'dt' attribute.
> As a result, the attribute is not carried through.
>
> If it is important to you that this attribute be carried through, you can
> set the "Correlation Attribute Name" property to 'dt'. This will cause the
> processor to only bin together FlowFiles that have the same value for the
> 'dt' attribute. As a result, since there will be no conflicting values for
> the attribute, the merged FlowFile will also have this attribute.
>
> Thanks
> -Mark
>
> > On Nov 29, 2016, at 9:34 AM, Giovanni Lanzani <[email protected]> wrote:
> >
> > Hi Joe,
> >
> > I still have trouble following you.
> >
> > Let's assume I have the MergeContent processor with the "Keep only
> > common Attributes" strategy. The flow files are coming in like so:
> >
> > ff_1 (attribute dt = 20161120)
> > ff_2 (attribute dt = 20161120)
> > ff_3 (attribute dt = 20161121)
> > ff_4 (attribute dt = 20161120)
> >
> > If my Minimum Number of Entries in MergeContent is set to 4, what dt
> > attribute will the flow file coming out of the MergeContent processor
> > have? 20161120 or 20161121?
> >
> > Or is NiFi capable of waiting to have enough flow files with each
> > unique value of dt before merging? If so, I think the docs could use
> > some help :)
> >
> > From what I could see, that dt attribute was gone after the merge, but
> > maybe I'm doing it wrong.
> >
> > Cheers,
> >
> > Giovanni
> >
> >> -----Original Message-----
> >> From: Joe Witt [mailto:[email protected]]
> >> Sent: Tuesday, November 29, 2016 3:25 PM
> >> To: [email protected]
> >> Subject: Re: Keep attributes when merging
> >>
> >> Giovanni
> >>
> >> You can definitely do this. The file pulling should be retaining the
> >> key path information as flow file attributes.
> >>
> >> The merge process has a property to control what happens with attributes.
> >> The default is to only copy over matching attributes and is likely
> >> what you'll want. Take a look at "Attribute Strategy". Now you want
> >> to retain some key values of course and that would be the parts of
> >> the timestamp you'd want to group on. You could do this with an
> >> UpdateAttribute processor before MergeContent. Use that to create an
> >> attribute such as "base-timestamp" or something which just pulls out the
> >> common part of the timestamp you want.
> >> In MergeContent then you can correlate on this value and since it
> >> will be the same it will also be there for you afterwards. You can
> >> then use this when writing to HDFS.
> >>
> >> This is a pretty common use case so we can definitely help you get
> >> where you want to go with this.
> >>
> >> Thanks
> >> Joe
> >>
> >> On Tue, Nov 29, 2016 at 9:14 AM, Giovanni Lanzani
> >> <[email protected]> wrote:
> >>> Hi all,
> >>>
> >>> I have the following use case:
> >>>
> >>> I'm reading xml from a folder with subfolders using the following schema:
> >>>
> >>> /my_folder/20161120/many xml's inside
> >>> /my_folder/20161121/many xml's inside
> >>> /my_folder/201611.../many xml's inside
> >>>
> >>> The current pipeline involves: XML -> JSON -> Avro -> HDFS
> >>>
> >>> where the HDFS folder structure is
> >>>
> >>> /my_folder/column=20161120/many avro's inside
> >>> /my_folder/column=20161121/many avro's inside
> >>> /my_folder/column=201611.../many avro's inside
> >>>
> >>> (each column= subfolder is a Hive partition)
> >>>
> >>> In order to reduce the number of avro's in HDFS, I'd love to merge 'em
> >>> all.
> >>>
> >>> However, as NiFi just reads files from the source folders without any
> >>> assumption on from which folders they're taken, even if I extract the
> >>> date from the folder name (or file), this gets lost when using
> >>> MergeContent. Using the Defragment strategy does not seem like an
> >>> option, as I don't know in advance how many files I'll see.
> >>>
> >>> That said: isn't there any way to accomplish what I want to do?
> >>>
> >>> Current strategy is to simply merge the files "manually" using
> >>> avro-tools and bash scripting.
> >>>
> >>> An alternative (although this is forcing what we want to do), is to
> >>> partition by import date. Then I'd only need to take care of the
> >>> midnight issue, for example by scheduling NiFi to fetch from the source
> >>> every 10 minutes, but by doing a MergeContent every 5.
> >>>
> >>> If something isn't clear, please let me know.
> >>>
> >>> Thanks,
> >>>
> >>> Giovanni
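
For anyone finding this thread later, the behaviour discussed above can be illustrated with a small Python simulation. This is only a sketch and not NiFi code: it calls no NiFi API, the function names (merge_common_attributes, bin_by_correlation) and the extra "format" attribute are made up for illustration, and it assumes that "keep only common attributes" means an attribute survives only when every FlowFile in the bin carries it with the same value. It also ignores Minimum Number of Entries and the other bin thresholds.

```python
from collections import defaultdict


def merge_common_attributes(flowfiles):
    """Keep an attribute only if every FlowFile has it with the same value."""
    if not flowfiles:
        return {}
    merged = dict(flowfiles[0])
    for attrs in flowfiles[1:]:
        for key in list(merged):
            # Drop the attribute as soon as one FlowFile lacks it or disagrees.
            if key not in attrs or attrs[key] != merged[key]:
                del merged[key]
    return merged


def bin_by_correlation(flowfiles, correlation_attribute):
    """Group FlowFiles by the value of one attribute (hypothetical helper)."""
    bins = defaultdict(list)
    for attrs in flowfiles:
        bins[attrs.get(correlation_attribute)].append(attrs)
    return dict(bins)


# The four FlowFiles from the example in this thread; "format" is an
# illustrative attribute that all of them share.
flowfiles = [
    {"dt": "20161120", "format": "avro"},  # ff_1
    {"dt": "20161120", "format": "avro"},  # ff_2
    {"dt": "20161121", "format": "avro"},  # ff_3
    {"dt": "20161120", "format": "avro"},  # ff_4
]

# No correlation attribute: one bin, conflicting 'dt' values, so 'dt' is lost.
without_correlation = merge_common_attributes(flowfiles)

# Correlating on 'dt': one bin per value, so 'dt' survives in each merge.
with_correlation = {
    value: merge_common_attributes(group)
    for value, group in bin_by_correlation(flowfiles, "dt").items()
}

print(without_correlation)
print(with_correlation)
```

In the first case the merged attributes contain only "format"; in the second, each bin keeps its own "dt", which is what makes it usable later when building the HDFS path.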
