Hello, I'm pulling data from API endpoints every five minutes and writing it into HDFS, which leaves me with quite a few small files: 288 files per day per endpoint, multiplied by however many endpoints I'm reading. My current approach is to land the small files in a staging directory under each endpoint's directory. I then have ListHDFS and FetchHDFS processors pulling them back into NiFi so I can merge them by size. This way the files stay in HDFS while they wait to be merged, so they can be queried at any time. Once a batch approaches the HDFS block size, I merge it into an archive directory and delete the small files that were merged.
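For concreteness, here's roughly the merge-by-size logic I'm after, sketched as a standalone script against WebHDFS using the Python hdfs client. The NameNode address, paths, and output naming are placeholders for my per-endpoint layout; this is just a sketch of the compaction step, not literally what my NiFi flow does:

```python
import time
from hdfs import InsecureClient  # HdfsCLI's WebHDFS client

BLOCK_SIZE = 128 * 1024 * 1024             # target: just under one HDFS block
STAGING_DIR = '/data/endpoint_a/staging'   # placeholder per-endpoint layout
ARCHIVE_DIR = '/data/endpoint_a/archive'

client = InsecureClient('http://namenode:9870', user='etl')  # adjust for your cluster

# Collect the small files in staging along with their sizes.
files = sorted((name, status['length'])
               for name, status in client.list(STAGING_DIR, status=True)
               if status['type'] == 'FILE')

# Greedily bin files until a bin approaches the block size.
bins, current, current_size = [], [], 0
for name, size in files:
    if current and current_size + size > BLOCK_SIZE:
        bins.append(current)
        current, current_size = [], 0
    current.append(name)
    current_size += size
# The last (partial) bin stays in staging so it remains queryable until full.

# Concatenate each full bin into one archive file, then delete the originals.
for i, bin_files in enumerate(bins):
    target = f'{ARCHIVE_DIR}/merged_{int(time.time())}_{i}.data'
    with client.write(target) as writer:
        for name in bin_files:
            with client.read(f'{STAGING_DIR}/{name}') as reader:
                writer.write(reader.read())
    for name in bin_files:
        client.delete(f'{STAGING_DIR}/{name}')
```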
My biggest problem with this is that I have to pull the files into NiFi, where they may sit for extended periods waiting to be merged. I believe this is related to the issue described in NIFI-3376: my content repository grows unbounded and eventually fills my disk. I was wondering what other patterns people are using for this kind of workload. Thanks!