Hi Yari,

Thanks for the great question. I looked at the DistributedMapCacheClient/Server code briefly, and there is no high-availability support within a NiFi cluster. As you figured out, the client service has to point at one specific node's IP address so that all nodes in the cluster share the same cache storage, and if that node goes down, the client processors stop working until the node recovers or the client service configuration is updated.
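To make that single point of failure concrete, here is a rough, simplified sketch (this is not the actual NiFi client code; the hostname is made up, and I believe the default DistributedMapCacheServer port is 4557). The values correspond to the client service's "Server Hostname" and "Server Port" properties, and every node opens its connection to that one configured host:

    import java.io.IOException;
    import java.net.InetSocketAddress;
    import java.net.Socket;

    public class CacheClientSketch {
        public static void main(String[] args) {
            // Every node must point at the SAME concrete node (not "localhost",
            // otherwise each node talks to its own local cache and dedup breaks).
            String serverHostname = "nifi-node-1.example.com"; // made-up hostname
            int serverPort = 4557; // default DistributedMapCacheServer port, I believe

            try (Socket socket = new Socket()) {
                socket.connect(new InetSocketAddress(serverHostname, serverPort), 5000);
                System.out.println("Cache server reachable; client processors can keep working.");
            } catch (IOException e) {
                // If that single node goes down, every client processor on every node
                // ends up here until it recovers or the client service is reconfigured.
                System.out.println("Cache server unreachable: " + e.getMessage());
            }
        }
    }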
Although it has 'Distributed' in its name, the cache storage itself is not distributed; it just supports multiple client nodes, and it does not implement any coordination logic that works nicely with a NiFi cluster.

I think we might be able to use the primary node to improve its availability: add an option to DistributedMapCacheServer to run only on the primary node, and also add an option to the client service to point at the primary node (without specifying a specific IP address or hostname), then let ZooKeeper and the NiFi cluster handle the fail-over scenario. All cache entries would be invalidated when a fail-over happens, so things like DetectDuplicate won't work correctly right after a fail-over (see the small sketch at the bottom of this mail). This idea only helps the NiFi cluster and data flow recover automatically without human intervention.

We could also implement cache replication between nodes to provide higher availability, or hashing to utilize the resources of every node, but I think that's overkill for NiFi. If one needs that level of availability, I'd recommend using another NoSQL database instead.

What do you think about that? I'd like to hear from others, too, to see if it's worth trying.

Thanks,
Koji

On Fri, Nov 4, 2016 at 5:38 PM, Yari Marchetti <[email protected]> wrote:
> Hello,
> I'm running a 3-node cluster and I've been trying to implement a
> deduplication workflow using DetectDuplicate but, on my first try, I
> noticed that there were always 3 messages marked as non-duplicates. After
> some investigation I tracked this issue down to a configuration I had made
> for the DistributedMapCache server address, which was set to
> localhost: if instead I set it to the IP of one of the nodes, then
> everything works as expected.
>
> My concern with this approach is reliability: if that specific node goes
> down, then the workflow will not work properly. I know I could implement it
> using some other kind of storage, but I wanted to check first whether I got it
> right and what the suggested approach is.
>
> Thanks,
> Yari
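P.S. Here is the small sketch of the fail-over point mentioned above. It is not the NiFi API; a plain in-memory map stands in for the cache server, but the atomic put-if-absent check is, I believe, the same semantics DetectDuplicate relies on via the client's getAndPutIfAbsent call. After a fail-over to an empty cache, every key looks "new" again, so duplicates slip through until the cache is repopulated:

    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;

    public class DedupFailoverSketch {
        // Put-if-absent semantics: the key is a duplicate only if it was already cached.
        static boolean isDuplicate(ConcurrentMap<String, String> cache, String key) {
            return cache.putIfAbsent(key, "seen") != null;
        }

        public static void main(String[] args) {
            ConcurrentMap<String, String> cache = new ConcurrentHashMap<>(); // stands in for the cache server

            System.out.println(isDuplicate(cache, "msg-1")); // false: first time seen
            System.out.println(isDuplicate(cache, "msg-1")); // true: detected as duplicate

            // Fail-over to a node with an empty cache: previously seen keys are forgotten.
            cache = new ConcurrentHashMap<>();
            System.out.println(isDuplicate(cache, "msg-1")); // false again: the duplicate slips through
        }
    }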
