Giovanni

You can definitely do this.  The processor that picks up the files
should be retaining the path information as flow file attributes.

MergeContent has a property, "Attribute Strategy", that controls what
happens to attributes on merge.  The default keeps only the attributes
that have the same value on every merged flow file, which is likely
what you want.  You do want to retain some key values, of course: the
part of the timestamp you want to group on.  You can do that with an
UpdateAttribute processor before MergeContent.  Use it to create an
attribute such as "base-timestamp" that pulls out just the common part
of the timestamp.  Then have MergeContent correlate on that attribute
(the "Correlation Attribute Name" property).  Since the value is the
same for every flow file in a bundle, it will still be there after the
merge, and you can use it when writing to HDFS.
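
To make the idea concrete, here is a rough sketch of the logic in plain
Python.  This is not NiFi code, only a simulation of what the
UpdateAttribute + MergeContent combination would do under the default
attribute strategy.  The "source.path" attribute name, the helper
function names and the sample paths are made up for illustration; in a
real flow the attribute holding the path depends on the processor that
picks up the files.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def base_timestamp(path):
    """Return the name of the folder that directly contains the file,
    e.g. '/my_folder/20161120/a.xml' -> '20161120'."""
    return PurePosixPath(path).parent.name


def merge_by_timestamp(flow_files):
    """Simulate correlating on 'base-timestamp' and keeping only the
    attributes whose value is identical across a merged bundle.

    flow_files is a list of (attributes, content) pairs; 'source.path'
    is a hypothetical attribute holding the full source path.
    """
    bins = defaultdict(list)
    for attrs, content in flow_files:
        # Stand-in for UpdateAttribute: derive the grouping key.
        attrs = {**attrs, "base-timestamp": base_timestamp(attrs["source.path"])}
        # Stand-in for correlation: one bin per key value.
        bins[attrs["base-timestamp"]].append((attrs, content))

    merged = []
    for key in sorted(bins):
        members = bins[key]
        first_attrs = members[0][0]
        # Keep an attribute only if every member has the same value for it.
        common = {
            name: value
            for name, value in first_attrs.items()
            if all(a.get(name) == value for a, _ in members)
        }
        merged.append((common, [content for _, content in members]))
    return merged


def hdfs_directory(attrs):
    """Build the partition folder from the surviving attribute."""
    return "/my_folder/column=" + attrs["base-timestamp"]


example = [
    ({"source.path": "/my_folder/20161120/a.xml"}, "a"),
    ({"source.path": "/my_folder/20161120/b.xml"}, "b"),
    ({"source.path": "/my_folder/20161121/c.xml"}, "c"),
]
bundles = merge_by_timestamp(example)
```

In the first bundle the per-file path differs between the two members,
so it is dropped, while "base-timestamp" is identical and therefore
survives the merge.  That surviving value is what you would reference
when building the HDFS directory.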

This is a pretty common use case so we can definitely help you get
where you want to go with this.

Thanks
Joe

On Tue, Nov 29, 2016 at 9:14 AM, Giovanni Lanzani wrote:
> Hi all,
>
> I have the following use case:
>
> I'm reading xml from a folder with subfolders using the following schema:
>
> /my_folder/20161120/many xml's inside
> /my_folder/20161121/many xml's inside
> /my_folder/201611.../many xml's inside
>
> The current pipeline involves: XML -> JSON -> Avro -> HDFS
>
> where the HDFS folder structure is
>
> /my_folder/column=20161120/many avro's inside
> /my_folder/column=20161121/many avro's inside
> /my_folder/column=201611.../many avro's inside
>
> (each column= subfolder is a Hive partition)
>
> In order to reduce the number of avro's in HDFS, I'd love to merge 'em all.
>
> However, as NiFi just reads files from the source folders without any 
> assumption on from which folders they're taken, even if I extract the date 
> from the folder name (or file), this gets lost when using MergeContent. Using 
> the Defragment strategy does not seem like an option, as I don't know in
> advance how many files I'll see.
>
> That said: isn't there any way to accomplish what I want to do?
>
> Current strategy is to simply merge the files "manually" using avro-tools and 
> bash scripting.
>
> An alternative (although this is forcing what we want to do), is to partition 
> by import date. Then I'd only need to take care of the midnight issue, for 
> example by scheduling NiFi to fetch from the source every 10 minutes, but by 
> doing a MergeContent every 5.
>
> If something isn't clear, please let me know.
>
> Thanks,
>
> Giovanni
