Not sure I'm following you on "So, the DMC is just so you won't duplicate
fetches if you're listing faster than you're fetching... got it". :)

Let's say the DMC is here to store the state of the List processor
across the cluster, in case the node goes down and a new primary node is
elected. But this is not really related to the Fetch processor (I may have
been misleading in my previous answer). Thanks to that state (timestamp
based, IIRC), the List processor won't list the same file twice, which
ensures that you won't get duplicates.
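To illustrate the idea, here is a minimal sketch of timestamp-based listing
state (an illustration only, not NiFi's actual ListSFTP implementation; the
function and field names are hypothetical):

```python
# Hypothetical sketch: a listing processor that remembers the newest
# modification timestamp it has seen, so re-running the listing (or a
# new primary node reading the same shared state) emits no duplicates.

def list_new_files(entries, state):
    """Return only entries newer than the stored timestamp, then
    advance the timestamp so later runs skip those entries."""
    last_ts = state.get("last_listed_timestamp", 0)
    new_entries = [e for e in entries if e["mtime"] > last_ts]
    if new_entries:
        state["last_listed_timestamp"] = max(e["mtime"] for e in new_entries)
    return new_entries

# Shared state -- in a NiFi cluster, this is the role the DMC plays.
state = {}
listing = [{"name": "a.txt", "mtime": 100}, {"name": "b.txt", "mtime": 200}]
first = list_new_files(listing, state)   # both files are new
second = list_new_files(listing, state)  # same listing again: nothing emitted
```

The second call returns an empty list because the stored timestamp has
already advanced past both files, which is why a failover to a new primary
node does not re-list files that were already seen.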

The fact that we are using the DMC instead of the state management provided
by the NiFi framework is probably because this processor was developed more
than a year ago (and state management appeared about 11 months ago).
ListFile, for example, also stores a state but does not need a DMC. Maybe
someone else can confirm or correct me if I'm wrong.

In fact, I think this processor could be improved to get rid of the need
for a DMC and rely on the NiFi framework to store the state of the
processor.

Pierre



2016-12-15 22:39 GMT+01:00 Nicholas Hughes <[email protected]>:

> Pierre,
>
> Thank you for the quick response. So, the DMC is just so you won't
> duplicate fetches if you're listing faster than you're fetching... got it.
> The usage documentation is kinda vague about that, so I made it out to be
> more magical than it is. Thanks for pointing me in the right direction!
>
> -Nick
>
>
> On Thu, Dec 15, 2016 at 4:21 PM, Pierre Villard <
> [email protected]> wrote:
>
>> Hi Nicholas,
>>
>> You need to configure your ListSFTP processor to only run on the primary
>> node (scheduling strategy in the processor configuration), then send the
>> flow files to an RPG that points to an input port in the cluster itself (so
>> that flow files are distributed over the cluster and do not stay only on
>> the primary node), and the FetchSFTP processor will take care of
>> downloading the files. The ListSFTP, with its state (DistributedCache),
>> ensures that you don't download the same file twice, and a given file won't
>> be downloaded by two nodes at the same time.
>>
>> Hope this helps,
>> Pierre.
>>
2016-12-15 22:13 GMT+01:00 Nicholas Hughes <[email protected]>:
>>
>>> I'm testing a simple List/Fetch setup on a 3 node cluster. I created a
>>> DistributedMapCacheServer controller service with the default settings (no
>>> SSL) and then created a DistributedMapCacheClientService that points at
>>> one of the cluster hostnames. The ListSFTP processor is set to use the
>>> Distributed Cache Service that I created.
>>>
>>> The ListSFTP processor lists the same 100 source files from the remote
>>> system on each node, and sends 300 Flow Files downstream to the FetchSFTP
>>> processor. I thought that the map cache allowed the cluster nodes to
>>> determine which files had already been listed by other cluster nodes...
>>> maybe I'm missing something.
>>>
>>> Any assistance is appreciated.
>>>
>>> NiFi version 1.0.0 in HDF 2.0.1
>>>
>>>
>>> -Nick
>>>
>>>
>>
>
