Hi Sudhindra,

I think the current implementation of ListHDFS already gives you the required functionality. I'll assume for a moment that your "success" markers are just additional files with the same (or partial) name as a data file, but with a different extension, like "*.fin", "*.done" or "*.success".

ListHDFS has a regex filter on the file name. Setting it to "^.+\.success$" will always return the new files (since the last listing) that have the ".success" extension (e.g. 201808050012.success). If you schedule the ListHDFS processor to run daily (using a timer of 1 day, or a cron expression for a specific hour), it will wake up only once a day and find all the success files for that day. Your flow can then find the data files that correspond to the success markers and upload them to S3 using the PutS3Object processor.
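
To illustrate what that filter selects, here is a rough sketch in plain Python (not NiFi itself). The file names in the listing and the helper name completed_batches are made up for the example, and it relies on the assumption above that each marker is named after its batch. The pattern is simple enough that Python's re module should treat it the same way as the Java regex used by the processor.

```python
import re

# Same pattern as suggested for the ListHDFS file filter.
SUCCESS_FILTER = re.compile(r"^.+\.success$")
SUFFIX = ".success"


def completed_batches(file_names):
    """Return the batch name (marker name minus '.success') for every marker found."""
    return [
        name[: -len(SUFFIX)]
        for name in file_names
        if SUCCESS_FILTER.match(name)
    ]


# Hypothetical listing: two finished batches, one batch still in progress,
# and a bare ".success" that the ".+" part of the pattern rejects.
listing = [
    "201808050012.success",
    "201808050012.dat",
    "201808050013.dat",
    "201808050014.success",
    ".success",
]

print(completed_batches(listing))
```

Only the batches that come back from the filter would then be looked up and sent on to S3; 201808050013 is skipped until its marker appears.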
Directories are another story. If you need a listing of directories, you could use GetHDFSFileInfo (it can work recursively and has separate filters for directories and for files). But this processor doesn't maintain state, so you will need to maintain it yourself (in ZooKeeper, HBase, or even a distributed map cache).

Regards,
Ed.

On Mon, Jul 30, 2018 at 6:34 PM Sudhindra Tirupati Nagaraj <[email protected]> wrote:

> Hi,
>
> We just came across NIFI as a possible option for backing up our data lake
> periodically into S3. We have our pipelines that dump batches of data at
> some granularity. For example, our one-minute dumps are of the form
> “201807210617”, “201807210618”, “201807210619” etc. We are looking for a
> simple configuration based solution that reads these incoming batches
> periodically and creates a workflow for backing these up. Also, these
> batches have a “success” marker inside them that indicates that the batches
> are full and ready to be backed up. We came across the ListHDFS processor
> that can do this, without duplication, but we are not sure if it has the
> ability to only copy batches that have a particular state (that is, like
> having a success marker in them). We are not sure if it also works on
> “folders” and not files directly.
>
> Can I get some recommendations on whether NIFI can be used at for such a
> ingestion use-case/alternative? Thank you.
>
> Kind Regards,
>
> Sudhindra.
