Jon,

Node Failure: Generally speaking, you have to care about two things: first, flow execution, and second, data in flight. For flow execution, NiFi clustering will take care of re-assigning the primary node and cluster coordinator as needed. For data, we do not at present offer distributed data durability. The current model is predicated on using reliable storage such as RAID, EBS, etc. There is a very clear and awesome-looking K8S-based path, though, that will make this work really nicely with persistent volumes and elastic scaling. There is no clear timeline, but I hope to start or participate in discussions/JIRAs/contributions soon.

How scalable is the NiFi scaling model: NiFi clusters usually range from a few nodes to maybe 10-20 or so. Some have been larger, but generally if you need that much flow management, it often makes more sense to have clusters dedicated along various domains of expertise anyway. So say 3-10 nodes, each handling 100,000 events per second at around 100MB per second (conservatively), and you can see why a single fairly small cluster can handle pretty massive volumes.

RPGs feeding back: This caused issues previously, but I believe it has improved significantly in recent releases.

UI Actions Causing Issues: There have been reports similar to this, especially for some of the really massive flows we've seen in terms of number of components and concurrent users. These JIRAs, when sorted, will help a lot: NIFI-950, NIFI-5064, NIFI-5066 (links below).

Heterogeneous cluster nodes: This should work quite well, actually, and is a major reason why NiFi and the S2S protocol support/honor backpressure. Nodes that can take on more work take on more work, and nodes that cannot push back. You also want to ensure you're using good, scalable protocols to source data into the cluster. If you find you're using a lot of protocols that require many data sourcing steps to run 'primary node only', then the primary node will have to do more work than the others, and I have seen uneven behavior in such cases. Yes, you can then route using S2S/RPG, which we recommend, but still... try to design away from 'primary node only' when possible.
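
As a rough illustration of the sizing numbers in the scaling answer above, here is a back-of-envelope sketch in Python. The per-node figures are the conservative ones quoted there. The function name and the assumption that throughput scales linearly with node count are mine, for illustration only, not a measured result.

```python
# Back-of-envelope aggregate throughput for a small NiFi cluster.
# Assumption: throughput scales linearly with node count (illustrative only).
EVENTS_PER_NODE_PER_SEC = 100_000   # conservative per-node figure quoted above
MB_PER_NODE_PER_SEC = 100           # conservative per-node figure quoted above
SECONDS_PER_DAY = 86_400


def cluster_capacity(nodes):
    """Return (events/s, MB/s, TB/day) for a cluster of `nodes` nodes."""
    events_per_sec = nodes * EVENTS_PER_NODE_PER_SEC
    mb_per_sec = nodes * MB_PER_NODE_PER_SEC
    tb_per_day = mb_per_sec * SECONDS_PER_DAY / 1_000_000  # decimal terabytes
    return events_per_sec, mb_per_sec, tb_per_day


for n in (3, 10):
    events, mb, tb = cluster_capacity(n)
    print(f"{n} nodes: {events:,} events/s, {mb:,} MB/s, ~{tb:.1f} TB/day")
```

Under those assumptions, a 3-node cluster lands around 26 TB/day and a 10-node cluster around 86 TB/day, which is why a fairly small cluster goes a long way.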

Thanks
Joe

https://issues.apache.org/jira/browse/NIFI-950
https://issues.apache.org/jira/browse/NIFI-5064
https://issues.apache.org/jira/browse/NIFI-5066

On Fri, Apr 13, 2018 at 5:49 PM, Jon Logan <jmlo...@buffalo.edu> wrote:
> All, I had a few general questions regarding Clustering, and was looking for
> any sort of advice or best-practices information --
>
> - documentation discusses failure handling primarily from a NiFi crash
>   scenario, but I don't recall seeing any information on entire node-failure
>   scenarios. Is there a way that this is supposed to be handled?
> - at what point should we expect pain in scaling? I am particularly
>   concerned about the all-to-all relationship that seems to exist if you
>   connect a cluster RPG to itself, as all nodes need to distribute all data to
>   all other nodes. We have also been having some issues when things are
>   not as responsive as NiFi would like -- namely, the UI seems to get very
>   upset and crash
> - do UI actions (incl read-only) require delegation to all nodes underneath?
>   I suspect this is the case as otherwise you wouldn't be able to determine
>   queue sizes?
> - is there a way to have a cluster with heterogeneous node sizes?
>
>
> Thanks in advance!