Hi all, I have the following use case:
I'm reading XML from a folder with subfolders, laid out like this:

/my_folder/20161120/many XMLs inside
/my_folder/20161121/many XMLs inside
/my_folder/201611.../many XMLs inside

The current pipeline is XML -> JSON -> Avro -> HDFS, where the HDFS folder structure is:

/my_folder/column=20161120/many Avros inside
/my_folder/column=20161121/many Avros inside
/my_folder/column=201611.../many Avros inside

(Each column= subfolder is a Hive partition.)

To reduce the number of Avro files in HDFS, I'd love to merge them all. However, since NiFi just reads files from the source folders without any notion of which folder they came from, even if I extract the date from the folder name (or file name), that information is lost when using MergeContent. The Defragment strategy does not seem like an option either, since I don't know in advance how many files I'll see.

That said: is there any way to accomplish what I want to do?

My current strategy is to simply merge the files "manually" using avro-tools and bash scripting. An alternative (although this forces what we want to do) is to partition by import date instead. Then I'd only need to handle the midnight edge case, for example by scheduling NiFi to fetch from the source every 10 minutes but run MergeContent every 5.

If something isn't clear, please let me know.

Thanks,
Giovanni
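P.S. For reference, a minimal sketch of the "manual" avro-tools merge mentioned above, assuming the avro-tools jar is available locally and the partition folders are on the local filesystem (the jar name and the /merged output path are illustrative):

```shell
#!/bin/bash
# Merge all Avro files in each Hive partition folder into a single file.
# avro-tools "concat" combines Avro container files that share a schema.
for dir in /my_folder/column=*/; do
  part=$(basename "$dir")     # e.g. column=20161120
  java -jar avro-tools.jar concat \
    "$dir"*.avro "/merged/${part}.avro"
done
```

Note that concat requires all input files to share the same schema, which should hold here since every file in a partition comes from the same pipeline.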
