Re: Clustering Questions
Hi Jon,

Just as a note for your unrelated question: I opened NIFI-4026 [1] a few months ago but haven't had time to work on it so far.

[1] https://issues.apache.org/jira/browse/NIFI-4026

2018-04-17 20:34 GMT+02:00 Jon Logan:
> And an unrelated question to the previous -- is there a way to skew or influence how a RPG distributes its tasks? Say, you wanted to do a group-by type distribution?
Re: Clustering Questions
Thanks Joe, just a few follow-up questions:

re: durability -- is this something that people have just been accepting as a risk and hoping for the best? Or is this something people build their applications around -- i.e. handling durability outside of the NiFi system boundary by pushing data into a database, etc.?

re: heterogeneous -- you can join nodes of differing hardware specs, but it seems like you will end up causing your lighter-weight nodes to explode, as there's no way to configure the number of tasks and the amount of data "in-flight" on one node differently from the other nodes. I.e. if I know my large nodes can handle 3 of a CPU-intensive task, that same setting is going to cause issues for smaller nodes. This is an even bigger problem for differing memory sizes.

And an unrelated question to the previous -- is there a way to skew or influence how an RPG distributes its tasks? Say, you wanted to do a group-by type distribution?

Thanks again!
Jon

On Fri, Apr 13, 2018 at 2:17 PM, Joe Witt wrote:
> For data we do not at present offer distributed data durability. The current model is predicated on using reliable storage such as RAID, EBS, etc.
>
> Heterogeneous cluster nodes:
> - This should work quite well actually and is a major reason why NiFi and the S2S protocol support/honor backpressure. Nodes that can take on more work take on more work, and nodes that cannot push back.
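
As a rough illustration of the group-by type distribution asked about above: the usual way to get that behaviour is to hash a grouping key and use the hash to pick the destination node, so that equal keys always land on the same node. The sketch below is plain Python, not NiFi code and not a description of how an RPG works. `pick_node`, `distribute`, the node names and the `customer` field are all made up for the example, and it assumes a fixed, known list of nodes.

```python
import hashlib
from typing import Dict, List


def pick_node(group_key: str, nodes: List[str]) -> str:
    """Map a group key to one node; equal keys always map to the same node.

    Hypothetical helper for illustration only. A stable hash (SHA-256) is
    used instead of Python's built-in hash(), which is randomized per process.
    """
    digest = hashlib.sha256(group_key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(nodes)
    return nodes[index]


def distribute(records: List[dict], key_field: str, nodes: List[str]) -> Dict[str, List[dict]]:
    """Group records by the node chosen for each record's key field."""
    assignment: Dict[str, List[dict]] = {node: [] for node in nodes}
    for record in records:
        target = pick_node(str(record[key_field]), nodes)
        assignment[target].append(record)
    return assignment


if __name__ == "__main__":
    # Assumed node names and record layout, purely for the example.
    nodes = ["node-1", "node-2", "node-3"]
    records = [
        {"customer": name, "seq": i}
        for i, name in enumerate(["a", "b", "c", "a", "b", "a"])
    ]
    for node, recs in distribute(records, "customer", nodes).items():
        print(node, recs)
```

One caveat with a plain modulo scheme like this: when the node list changes, most keys get remapped to a different node. Consistent hashing is the usual refinement if that matters.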
Re: Clustering Questions
Jon,

Node Failure:
You have to care about two things, generally speaking. First is the flow execution and second is data in-flight.
For flow execution, NiFi clustering will take care of re-assigning the primary node and cluster coordinator as needed.
For data, we do not at present offer distributed data durability. The current model is predicated on using reliable storage such as RAID, EBS, etc.
There is a very clear and awesome-looking K8S-based path, though, that will make this work really nicely with persistent volumes and elastic scaling. There is no clear timeline, but I hope to start or participate in discussions/JIRAs/contributions soon.

How scalable is the NiFi scaling model:
Usually NiFi clusters are a few nodes to maybe 10-20 or so. Some have been larger, but generally if you need that much flow management then it often makes more sense to have clusters dedicated along various domains of expertise anyway. So say 3-10 nodes, with each handling 100,000 events per second at around 100MB per second (conservatively), and you can see why a single fairly small cluster can handle pretty massive volumes.

RPGs feeding back:
- This caused issues previously, but I believe it has improved significantly in recent releases.

UI Actions Causing issues:
There have been reports similar to this, especially for some of the really massive flows we've seen in terms of number of components and concurrent users. These JIRAs, once sorted out, will help a lot [1], [2], [3].

Heterogeneous cluster nodes:
- This should work quite well actually and is a major reason why NiFi and the S2S protocol support/honor backpressure. Nodes that can take on more work take on more work, and nodes that cannot push back.
You also want to ensure you're using good and scalable protocols to source data into the cluster. If you find you're using a lot of protocols that require you to run many data sourcing steps as 'primary node only', then that primary node will have to do more work than the others, and I have seen uneven behavior in such cases. Yes, you can then route using S2S/RPG, which we recommend, but still... try to design away from 'primary node only' when possible.

Thanks
Joe

[1] https://issues.apache.org/jira/browse/NIFI-950
[2] https://issues.apache.org/jira/browse/NIFI-5064
[3] https://issues.apache.org/jira/browse/NIFI-5066

On Fri, Apr 13, 2018 at 5:49 PM, Jon Logan wrote:
> All, I had a few general questions regarding Clustering, and was looking for any sort of advice or best-practices information --
>
> - documentation discusses failure handling primarily from a NiFi crash scenario, but I don't recall seeing any information on entire node-failure scenarios. Is there a way that this is supposed to be handled?
> - at what point should we expect pain in scaling? I am particularly concerned about the all-to-all relationship that seems to exist if you connect a cluster RPG to itself, as all nodes need to distribute all data to all other nodes. We have also been having some issues when things are not as responsive as NiFi would like -- namely, the UI seems to get very upset and crash
> - do UI actions (incl read-only) require delegation to all nodes underneath? I suspect this is the case as otherwise you wouldn't be able to determine queue sizes?
> - is there a way to have a cluster with heterogeneous node sizes?
>
> Thanks in advance!
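
To make the backpressure point in the reply above a bit more concrete, here is a toy model of it, offered as a sketch only. It is plain Python, not NiFi code, and everything in it is an assumption made for illustration: the `simulate` function, the node names, the per-tick processing rates and the single shared queue threshold. A sender offers items round-robin, a node whose queue has reached the threshold refuses more (pushes back), and each node then drains its queue at its own rate. It says nothing about per-node concurrent task settings or memory, which is the other half of the concern raised in this thread.

```python
from typing import Dict


def simulate(rates: Dict[str, int], threshold: int,
             offered_per_tick: int, ticks: int) -> Dict[str, int]:
    """Toy backpressure model (hypothetical, not NiFi's implementation).

    rates:            items each node can process per tick
    threshold:        queue size at which a node stops accepting items
    offered_per_tick: items the sender tries to hand out each tick
    Returns the number of items each node processed.
    """
    queued = {node: 0 for node in rates}
    done = {node: 0 for node in rates}

    for _ in range(ticks):
        remaining = offered_per_tick
        accepted_any = True
        # Offer items round-robin; full nodes are skipped (they push back).
        while remaining > 0 and accepted_any:
            accepted_any = False
            for node in rates:
                if remaining == 0:
                    break
                if queued[node] < threshold:
                    queued[node] += 1
                    remaining -= 1
                    accepted_any = True
        # Items nobody accepted stay with the sender; this model drops them
        # from the count instead of carrying them over to the next tick.

        # Each node drains its queue at its own rate.
        for node, rate in rates.items():
            worked = min(rate, queued[node])
            queued[node] -= worked
            done[node] += worked

    return done


if __name__ == "__main__":
    # Assumed numbers: one node three times as fast as the other,
    # same queue threshold on both, sender offering more than both can take.
    result = simulate({"large": 3, "small": 1}, threshold=5,
                      offered_per_tick=10, ticks=100)
    print(result)
```

Under these assumptions the work each node ends up doing follows its own processing rate rather than an even split, even though both nodes share the same threshold setting.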