Michael,

If the file timestamps on the SFTP server are reliable, you can simply use the ListSFTP processor, which feeds into a FetchSFTP processor to fetch the files one by one. ListSFTP persists the timestamp of its last listing, so only files with newer timestamps are emitted from it.
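As a rough sketch, the two processors are typically wired like this (the hostname and path below are placeholders, and the property/attribute names are from memory of NiFi's expression-language defaults, so double-check them against your NiFi version's documentation):

```
ListSFTP  (scheduled to run daily)
  Hostname:    sftp.example.com        # placeholder host
  Remote Path: /data/incoming          # placeholder path
    -> success -> FetchSFTP

FetchSFTP
  Hostname:    ${sftp.remote.host}     # attributes written by ListSFTP
  Port:        ${sftp.remote.port}
  Remote File: ${path}/${filename}
  Completion Strategy: None            # leave files in place (read-only site)
    -> success -> PutHDFS
```

Because ListSFTP keeps state about the latest timestamp it has seen, FetchSFTP should end up downloading each file only once.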
Caution: if there are a lot of files, e.g., tens of thousands, the ListSFTP processor does not work well. Every time it runs, it fetches the metadata of all files and loops through them to check the timestamps.

Huagen

On Fri, Jun 24, 2016 at 1:07 PM, Michael Dyer <[email protected]> wrote:

> I'm looking for assistance in how to configure a set of processors so
> that I only retrieve 'new' files:
>
> - A GetSFTP processor that executes on a daily basis.
> - The GetSFTP processor has read-only access to the remote site.
> - Large (multi-GB) files are added to the remote site daily.
> - Naming of the files is unpredictable.
> - Files are rotated (removed) from the site after approximately 1 week.
>
> Currently, I'm having to transfer ALL of the files on a daily basis, and
> then I use a PutHDFS processor which ignores (discards) any duplicates.
> Having to re-transfer files I already have is very inefficient, especially
> given the large file sizes.
>
> Does anyone know of a pattern to:
>
> 1) Retrieve a list of files,
> 2) Compare each file against HDFS, and
> 3) Retrieve any 'missing' files?
>
> I tried building this with ListSFTP, but then ran into the problem that
> GetSFTP does not let me use the ListSFTP results as an input.
>
> Thanks for the help!
>
> Michael
