I'm pulling data from API endpoints every five minutes and writing it into
HDFS. This leaves me with quite a few small files: 288 files per day (one
every five minutes), times however many endpoints I'm reading. My current
approach is to land the small files in a staging directory under each
endpoint's directory. ListHDFS and FetchHDFS processors then pull them back
into NiFi so that I can merge them by size. This way the files stay in HDFS
while they wait to be merged, so they can still be queried at any time.
Once a batch approaches the HDFS block size, I merge it into an archive
directory and delete the small files that were merged.
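To make the batching idea concrete, here is a minimal sketch in plain Python of the size-based grouping I'm describing — group small files into batches whose total size approaches the HDFS block size, and leave the last partial batch in staging until it fills up. The function name and the greedy strategy are just illustrative, not part of my actual NiFi flow (NiFi's MergeContent does the real work there):

```python
# Illustrative sketch only: greedy size-based merge planning.
# `files` is a list of (path, size_bytes) tuples; names are made up.

HDFS_BLOCK_SIZE = 128 * 1024 * 1024  # a common default dfs.blocksize

def plan_merges(files, target=HDFS_BLOCK_SIZE):
    """Greedily group files into batches whose total size approaches
    (but does not exceed) the target block size.

    Returns (batches_ready_to_merge, leftover_files); the leftover
    batch stays in staging until enough new files arrive to fill it."""
    batches, current, current_size = [], [], 0
    # Largest-first packing tends to fill batches more evenly.
    for path, size in sorted(files, key=lambda f: f[1], reverse=True):
        if current and current_size + size > target:
            batches.append(current)      # this batch is "close enough"
            current, current_size = [], 0
        current.append(path)
        current_size += size
    return batches, current
```

A NiFi MergeContent processor configured with a minimum bundle size near the block size achieves the same effect inside the flow; the sketch just shows the grouping decision in isolation.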
My biggest problem with this is that the files pulled into NiFi can sit
there for extended periods waiting to be merged. I believe this is related
to the issue described in NIFI-3376: my content repository grows unbounded
and fills up my disk.
I was wondering what patterns other people are using for this sort of
ingestion.