This is basically what I do. I want a highly available setup with a minimum of components, since each component can (and will) fail. Traffic reaches all nodes, but only a single node should read it.

On Wed, 5 Jun 2019 at 18:07, James Srinivasan <[email protected]> wrote:

> In our case the stream is UDP broadcast, so available to all nodes anyway.
> I've been meaning to test UDP multicast but not got round to it yet.
>
> On Wed, 5 Jun 2019, 17:03 Bryan Bende, <[email protected]> wrote:
>
>> That is probably a valid point, but how about putting a load balancer in front to handle that?
>>
>> On Wed, Jun 5, 2019 at 11:30 AM James Srinivasan <[email protected]> wrote:
>> >
>> > Presumably you'd want to mirror the stream to all nodes for when the primary node changes?
>> >
>> > On Wed, 5 Jun 2019, 13:46 Bryan Bende, <[email protected]> wrote:
>> >>
>> >> The processor is started on all nodes, but the onTrigger method is only executed on the primary node.
>> >>
>> >> This is something we've discussed trying to improve before, but the real question is why are you sending data to the other nodes if you don't expect the processor to execute there?
>> >>
>> >> On Wed, Jun 5, 2019 at 7:04 AM Erik-Jan <[email protected]> wrote:
>> >> >
>> >> > I figured it out after further testing. The processor runs on all nodes, despite the explicit "run on primary node only" option that I selected. But the queue is only processed on the primary node. On the other nodes the queue fills up until the maximum is reached, after which the error message starts appearing. What I missed before is that the message is coming from the other, non-primary nodes.
>> >> > I'm not sure if this is intended behavior or if it is a bug though! For me it's a bug, since I really want this processor to run on the primary only.
>> >> >
>> >> > On Tue, 4 Jun 2019 at 16:34, Erik-Jan <[email protected]> wrote:
>> >> >>
>> >> >> Hi Bryan,
>> >> >>
>> >> >> Yes, I have considerably increased the numbers in the controller settings.
>> >> >> I don't mind getting my hands dirty; increasing the timeout is worth a try.
>> >> >>
>> >> >> The errors seem to appear after quite a while. Usually I see these messages the next morning, so testing and experimenting with this error takes a lot of time.
>> >> >>
>> >> >> Today I've been trying to reproduce this on a virtual machine with the same OS, NiFi and Java versions, but to no avail. The difference is that this VM is not a cluster and has limited memory and CPU, yet it is still able to handle much more UDP data, with the error appearing only a few times so far after hours of running. It leads me to think there must be something in the configuration of the cluster that's causing this. I will also try a vanilla NiFi install on one of the nodes without clustering to see if my configuration and cluster setup is somehow the cause.
>> >> >>
>> >> >> On Tue, 4 Jun 2019 at 16:14, Bryan Bende <[email protected]> wrote:
>> >> >>>
>> >> >>> Hi Erik,
>> >> >>>
>> >> >>> It sounds like you have tried most of the common tuning options that can be done. I would have expected batching + increasing concurrent tasks from 1 to 3-5 to be the biggest improvement.
>> >> >>>
>> >> >>> Have you increased the number of threads in your overall thread pool according to your hardware? (from the top right menu, controller settings)
>> >> >>>
>> >> >>> I would be curious what happens if you did some tests increasing the timeout where it attempts to place the message in the queue from 100ms to 200ms, and then maybe 500ms if it still happens.
>> >> >>>
>> >> >>> I know this requires a code change since that timeout is hard-coded, but it sounds like you already went down that path with trying a different queue :)
>> >> >>>
>> >> >>> -Bryan
>> >> >>>
>> >> >>> On Tue, Jun 4, 2019 at 4:28 AM Erik-Jan <[email protected]> wrote:
>> >> >>> >
>> >> >>> > Hi,
>> >> >>> >
>> >> >>> > I'm experimenting with a locally installed 3-node NiFi cluster.
>> >> >>> > This cluster receives UDP packets on the primary node.
>> >> >>> > These nodes are pretty powerful, have a good network connection, and have lots of memory and SSD disks. I gave NiFi 24G of Java heap (Xms and Xmx).
>> >> >>> >
>> >> >>> > I have configured a ListenUDP processor that listens on a UDP port, and it receives somewhere between 20000 and 50000 packets per 5 minutes. Its "Max Size of Message Queue" is large enough (1M), I gave it 5 concurrent tasks, and it's running on the primary node only.
>> >> >>> >
>> >> >>> > The problem: after running for a while, I get the following error: "internal queue at maximum capacity, could not queue event."
>> >> >>> >
>> >> >>> > I have reviewed the source code and understand when this happens. It happens when the processor tries to store an event in a Java LinkedBlockingQueue and that queue has reached its maximum capacity. The offer() method has a 100ms timeout in which it waits for space to free up; then it fails and the event gets dropped. In the logs I see exactly 10 of these error messages per second (10 x 100ms is 1 second). Despite these errors, I still get a very good rate of events that get through to the next processors. Actually, it seems pretty much all of the other events get through, since the message rates in ListenUDP and the follow-up processor are very much alike. The follow-up processors can easily handle the load and there are no full queues, congestion or anything like that.
>> >> >>> >
>> >> >>> > What I have tried so far:
>> >> >>> >
>> >> >>> > Increasing the "Max Size of Message Queue" setting helps, but only delays the errors. They eventually return.
>> >> >>> >
>> >> >>> > Increasing heap space is a suggestion I read in a past post: I think 24G is more than enough actually? Perhaps even too much?
>> >> >>> >
>> >> >>> > Increasing parallelism: setting concurrent tasks to 5 or 10 does not help.
>> >> >>> >
>> >> >>> > I modified the code to use an ArrayBlockingQueue instead of the LinkedBlockingQueue, thinking it was some kind of garbage collection issue. This didn't help.
>> >> >>> >
>> >> >>> > I increased "Receive Buffer Size" and "Max Size of Socket Buffer", but to no avail.
>> >> >>> >
>> >> >>> > I tried batching. This helps a bit but, like increasing the "Max Size of Message Queue", it only seems to delay the eventual error messages.
>> >> >>> >
>> >> >>> > I reproduced this on my local workstation. I installed NiFi, did no OS tuning at all, and set the heap size to 4GB. I generate 1.3M UDP packets per 5 minutes (the max I can reach with a simple Python script). With "Max Size of Message Queue" set to only 100, the error soon appears. In the ListenUDP processor I see 1.34M events out; on the follow-up processor I see 1.34M events incoming. The error is not as frequent as on the cluster though, only a few every couple of minutes, while the data rate is much higher and the queue much smaller. I'm a bit desperate and hope someone can help me out. Why am I getting this error on a relatively quiet cluster with not that much load?
>> >> >>> >
>> >> >>> > Best regards,
>> >> >>> > Erik-Jan van Baaren
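
The drop behaviour discussed in this thread can be sketched in isolation. What follows is a minimal Python analogue, not NiFi code: it uses Python's `queue.Queue` in place of Java's `LinkedBlockingQueue`, and the names `offer`, `simulate_undrained_node` and `OFFER_TIMEOUT_S` are hypothetical. It assumes a bounded queue that nothing drains (as reported for the non-primary nodes) and a 100 ms wait per enqueue attempt.

```python
import queue
import time

# Assumption: mirrors the 100 ms wait described in the thread.
OFFER_TIMEOUT_S = 0.1


def offer(q, event, timeout=OFFER_TIMEOUT_S):
    """Try to enqueue; return False if the queue stays full for `timeout` seconds."""
    try:
        q.put(event, timeout=timeout)
        return True
    except queue.Full:
        # This is the point where the "could not queue event" error would be logged.
        return False


def simulate_undrained_node(capacity, n_events):
    """Feed n_events into a bounded queue that no consumer ever reads from."""
    q = queue.Queue(maxsize=capacity)
    dropped = 0
    start = time.monotonic()
    for i in range(n_events):
        if not offer(q, i):
            dropped += 1
    elapsed = time.monotonic() - start
    return q.qsize(), dropped, elapsed


if __name__ == "__main__":
    queued, dropped, elapsed = simulate_undrained_node(capacity=100, n_events=105)
    print(f"queued={queued} dropped={dropped} elapsed={elapsed:.2f}s")
```

Once the queue is full, every further attempt blocks for about 100 ms before failing, so a single receiving thread cannot report more than roughly 10 failures per second. That is consistent with the rate of error messages reported in the logs, and with the errors coming from nodes where the queue is filled but never processed.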
