I'm looking for assistance in configuring a set of processors so that
I only retrieve 'new' files:

- A GetSFTP processor executes on a daily basis.
- The GetSFTP processor has read-only access to the remote site.
- Large (multi-GB) files are added to the remote site daily.
- Naming of the files is unpredictable.
- Files are rotated (removed) from the site after approximately 1 week.

Currently, I have to transfer ALL of the files on a daily basis and
then use a PutHDFS processor, which ignores (discards) any duplicates.
Re-transferring files I already have is very inefficient, especially
given the large file sizes.

Does anyone know of a pattern to:

1) Retrieve a list of files
2) Compare each file against HDFS and
3) Retrieve any 'missing' files?
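In other words, something like this sketch of the compare step (the
helper function and file names are hypothetical, just to illustrate
the logic I'm after):

```python
# Sketch of the "compare and fetch only what's missing" step.
# remote_names would come from listing the SFTP site; stored_names
# from listing the target HDFS directory.

def missing_files(remote_names, stored_names):
    """Return the remote files not yet present in the store."""
    return sorted(set(remote_names) - set(stored_names))

remote = ["a.dat", "b.dat", "c.dat"]   # hypothetical listing
in_hdfs = ["a.dat", "b.dat"]           # hypothetical HDFS contents
print(missing_files(remote, in_hdfs))  # only 'c.dat' needs fetching
```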

I tried building this with ListSFTP, but ran into the problem that
GetSFTP does not accept the ListSFTP results as an input.

Thanks for the help!

Michael
