Michael,

Let me walk through an example to make sure I understand your requirement.

1. Let's start with no files on the SFTP server. Then file A.txt lands, with
   timestamp 1467042217000.
2. When ListSFTP runs at timestamp 1467042218000, it finds A.txt and saves
   the timestamp 1467042218000. The ensuing FetchSFTP processor fetches
   A.txt, and the rest of the dataflow saves it to HDFS.
3. Then file B.txt lands, with timestamp 1467042219000. When ListSFTP runs
   again, it finds both A.txt and B.txt, but after checking the file
   timestamps it outputs only B.txt, because B.txt's timestamp
   1467042219000 is greater than the saved timestamp 1467042218000.
4. The rest of the dataflow then processes only B.txt.

Does that answer your question? The key point is that if the file
timestamps on the SFTP server are reliable, you don't need to check
whether a file already exists in HDFS.

Huagen

On Mon, Jun 27, 2016 at 11:54 AM, Michael Dyer <[email protected]> wrote:

> After using ListSFTP to obtain the initial list of files, how can I check to
> see if that specific file exists on HDFS? ListHDFS won't take a flowfile
> as an input.
>
> I can use a DetectDuplicate processor to keep track of files written to
> HDFS, but it seems that there should be a way of directly checking without
> having to involve a cache.
>
> On Fri, Jun 24, 2016 at 1:07 PM, Michael Dyer <[email protected]>
> wrote:
>
>> I'm looking for assistance in how to configure a set of processors so
>> that I only retrieve 'new' files:
>>
>> - A GetSFTP processor that executes on a daily basis.
>> - The GetSFTP processor has read-only access to the remote site.
>> - Large (multi-GB) files are added to the remote site daily.
>> - Naming of the files is unpredictable.
>> - Files are rotated (removed) from the site after approximately 1 week.
>>
>> Currently, I'm having to transfer ALL of the files on a daily basis and
>> then I use a PutHDFS processor which ignores (discards) any duplicates.
>> Having to re-transfer files I already have is very inefficient, especially
>> given the large file sizes.
>>
>> Does anyone know of a pattern to:
>>
>> 1) Retrieve a list of files
>> 2) Compare each file against HDFS and
>> 3) Retrieve any 'missing' files?
>>
>> I tried building this with ListSFTP, but then ran into a problem that
>> GetSFTP does not allow me to use the ListSFTP results as an input.
>>
>> Thanks for the help!
>>
>> Michael
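
P.S. In case it helps, the timestamp comparison in the numbered example at
the top of this message can be sketched in a few lines of Python. This is
only an illustration of the idea, not NiFi's actual implementation: the
function name list_new_files, the dict standing in for the SFTP directory,
and the second run timestamp (1467042220000) are all made up for the sketch.

```python
from typing import Dict, List, Tuple


def list_new_files(
    remote_files: Dict[str, int], saved_ts: int, run_ts: int
) -> Tuple[List[str], int]:
    """Hypothetical helper mirroring the example in this email.

    remote_files maps file name -> modification timestamp (epoch millis).
    Returns the names whose timestamp is greater than saved_ts, plus the
    timestamp to remember for the next run (here, the time of this run).
    """
    new_files = sorted(
        name for name, mtime in remote_files.items() if mtime > saved_ts
    )
    return new_files, run_ts


# Step 1: A.txt lands on an empty server.
server = {"A.txt": 1467042217000}
saved = 0  # nothing has been listed yet

# Step 2: first listing run picks up A.txt and remembers the run time.
first, saved = list_new_files(server, saved, run_ts=1467042218000)

# Step 3: B.txt lands; the next run (assumed time) emits only B.txt.
server["B.txt"] = 1467042219000
second, saved = list_new_files(server, saved, run_ts=1467042220000)

print(first)   # files emitted by the first run
print(second)  # files emitted by the second run
```

A.txt is skipped on the second run purely because its timestamp is not
greater than the saved one, so no lookup against HDFS is involved.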
