Hi all,

I have the following use case:

I'm reading xml from a folder with subfolders using the following schema:

/my_folder/20161120/many xml's inside
/my_folder/20161121/many xml's inside
/my_folder/201611.../many xml's inside

The current pipeline involves: XML -> JSON -> Avro -> HDFS 

where the HDFS folder structure is

/my_folder/column=20161120/many avro's inside
/my_folder/column=20161121/many avro's inside
/my_folder/column=201611.../many avro's inside

(each column= subfolder is a Hive partition)

In order to reduce the number of avro's in HDFS, I'd love to merge 'em all. 

However, NiFi just reads files from the source folders without any notion of 
which folder they came from, so even if I extract the date from the folder name 
(or file name) into an attribute, that grouping is lost when using MergeContent. 
The Defragment strategy does not seem like an option either, as I don't know in 
advance how many files I'll see per date.

That said: isn't there any way to accomplish what I want to do?

Current strategy is to simply merge the files "manually" using avro-tools and 
bash scripting.
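For reference, the manual merge is roughly the following sketch (paths and the avro-tools jar location are assumptions; it prints the commands instead of running them, so drop the `echo` to execute for real):

```shell
#!/usr/bin/env bash
# Sketch: for each Hive partition folder (column=YYYYMMDD) build the
# avro-tools "concat" command that merges its files into a single Avro file.
# AVRO_TOOLS_JAR and the base path are placeholders for illustration.
merge_partitions() {
  local base="$1"
  local jar="${AVRO_TOOLS_JAR:-avro-tools.jar}"
  for dir in "$base"/column=*/; do
    local part out
    part=$(basename "$dir")                  # e.g. column=20161120
    out="${dir}merged-${part#column=}.avro"  # one output file per partition
    # "avro-tools concat" appends all input Avro files into one output file
    echo java -jar "$jar" concat "$dir"*.avro "$out"
  done
}

merge_partitions /my_folder
```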

An alternative (although it distorts what we actually want) is to partition by 
import date instead. Then I'd only need to handle the midnight boundary, for 
example by scheduling NiFi to fetch from the source every 10 minutes but run 
MergeContent every 5.

If something isn't clear, please let me know.

Thanks,

Giovanni
