I'm looking for assistance in configuring a set of processors so that I only retrieve 'new' files. My setup:
- A GetSFTP processor that executes on a daily basis.
- The GetSFTP processor has read-only access to the remote site.
- Large (multi-GB) files are added to the remote site daily.
- Naming of the files is unpredictable.
- Files are rotated (removed) from the site after approximately one week.

Currently, I have to transfer ALL of the files every day and then use a PutHDFS processor, which ignores (discards) any duplicates. Re-transferring files I already have is very inefficient, especially given the large file sizes.

Does anyone know of a pattern to:

1) Retrieve a list of files,
2) Compare each file against HDFS, and
3) Retrieve any 'missing' files?

I tried building this with ListSFTP, but ran into the problem that GetSFTP does not accept the ListSFTP results as an input.

Thanks for the help!

Michael
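Edit: to make the comparison step concrete, here's the logic I'm after as a minimal sketch, outside of NiFi. The names (`missing_files`, the example file names) are illustrative only, not any NiFi or HDFS API; the comparison itself is just a set difference between the remote listing and the files already landed:

```python
def missing_files(remote_names, known_names):
    """Return the remote file names not yet present on our side.

    remote_names: names from a daily listing of the SFTP site
    known_names:  names already stored (e.g. already in HDFS)
    """
    return sorted(set(remote_names) - set(known_names))


# Example: only 'c.gz' still needs to be transferred.
remote = ["a.gz", "b.gz", "c.gz"]
already_have = ["a.gz", "b.gz"]
print(missing_files(remote, already_have))  # ['c.gz']
```

In other words, step 2 of the pattern only needs file names, so ideally the multi-GB content would never be touched for files we already hold.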
