I figured it out after further testing. The processor runs on all nodes, despite the explicit "run on primary node only" option that I selected. But the queue is only processed on the primary node. On the other nodes the queue fills up until the maximum is reached, after which the error message starts appearing. What I missed before is that the message comes from the other, non-primary nodes. I'm not sure whether this is intended behavior or a bug, though! To me it's a bug, since I really want this processor to run on the primary node only.
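To illustrate what seems to be happening on a non-primary node, here is a minimal, self-contained sketch (my own class and method names, not NiFi code): the listening thread keeps offering events into a bounded internal queue, but since nothing drains it, offer() eventually times out and events get dropped.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class NonPrimaryNodeSketch {

    // Simulates `packets` arrivals against a bounded queue of `capacity`;
    // returns how many events were dropped. `isPrimary` controls whether
    // anything drains the queue, mimicking the observed cluster behavior.
    static int simulate(boolean isPrimary, int capacity, int packets)
            throws InterruptedException {
        LinkedBlockingQueue<String> events = new LinkedBlockingQueue<>(capacity);
        int dropped = 0;
        for (int i = 0; i < packets; i++) {
            // The listener offers with a 100 ms timeout, like ListenUDP does.
            if (!events.offer("packet-" + i, 100, TimeUnit.MILLISECONDS)) {
                dropped++; // "internal queue at maximum capacity, could not queue event."
            }
            if (isPrimary) {
                events.poll(); // only the primary node processes the queue
            }
        }
        return dropped;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("primary dropped:     " + simulate(true, 5, 10));
        System.out.println("non-primary dropped: " + simulate(false, 5, 10));
    }
}
```

With the queue capped at 5 and 10 packets arriving, the "primary" run drops nothing while the "non-primary" run drops 5, which matches the pattern above: the queue fills until max capacity and then every further offer fails.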
On Tue, Jun 4, 2019, 16:34 Erik-Jan <[email protected]> wrote:

> Hi Bryan,
>
> Yes, I have considerably increased the numbers in the controller settings. I don't mind getting my hands dirty; increasing the timeout is worth a try.
>
> The errors seem to appear after quite a while. Usually I see these messages the next morning, so testing and experimenting with this error takes a lot of time.
>
> Today I've been trying to reproduce this on a virtual machine with the same OS, NiFi, and Java versions, but to no avail. The difference is that this VM is not a cluster and has limited memory and CPU, yet it is still able to handle much more UDP data, with the error appearing only a few times so far after hours of running. It leads me to think there must be something in the configuration of the cluster that's causing this. I will also try a vanilla NiFi install on one of the nodes without clustering to see if my configuration and cluster setup is somehow the cause.
>
> On Tue, Jun 4, 2019 at 16:14 Bryan Bende <[email protected]> wrote:
>
>> Hi Erik,
>>
>> It sounds like you have tried most of the common tuning options that can be done. I would have expected batching + increasing concurrent tasks from 1 to 3-5 to be the biggest improvement.
>>
>> Have you increased the number of threads in your overall thread pool according to your hardware? (from the top right menu, controller settings)
>>
>> I would be curious what happens if you did some tests increasing the timeout where it attempts to place the message in the queue from 100 ms to 200 ms, and then maybe 500 ms if it still happens.
>>
>> I know this requires a code change since that timeout is hard-coded, but it sounds like you already went down that path with trying a different queue :)
>>
>> -Bryan
>>
>> On Tue, Jun 4, 2019 at 4:28 AM Erik-Jan <[email protected]> wrote:
>> >
>> > Hi,
>> >
>> > I'm experimenting with a locally installed 3-node NiFi cluster.
>> > This cluster receives UDP packets on the primary node. These nodes are pretty powerful, have a good network connection, and have lots of memory and SSD disks. I gave NiFi 24 GB of Java heap (Xms and Xmx).
>> >
>> > I have configured a ListenUDP processor that listens on a UDP port and receives somewhere between 20,000 and 50,000 packets per 5 minutes. Its "Max Size of Message Queue" is large enough (1M), I gave it 5 concurrent tasks, and it is running on the primary node only.
>> >
>> > The problem: after running for a while, I get the following error: "internal queue at maximum capacity, could not queue event."
>> >
>> > I have reviewed the source code and understand when this happens. It happens when the processor tries to store an event in a Java LinkedBlockingQueue and that queue has reached its maximum capacity. The offer() method has a 100 ms timeout in which it waits for space to free up, and then it fails and the event gets dropped. In the logs I see exactly 10 of these error messages per second (10 x 100 ms is 1 second). Despite these errors, I still get a very good rate of events that get through to the next processors. Actually, it seems pretty much all of the other events get through, since the message rates in ListenUDP and the follow-up processor are very much alike. The follow-up processors can easily handle the load and there are no full queues, congestion, or anything like that.
>> >
>> > What I have tried so far:
>> >
>> > Increasing the "Max Size of Message Queue" setting helps, but only delays the errors. They eventually return.
>> >
>> > Increasing heap space is a suggestion I read in a past post: I think 24 GB is more than enough, actually? Perhaps even too much?
>> >
>> > Increasing parallelism: setting concurrent tasks to 5 or 10 does not help.
>> >
>> > I modified the code to use an ArrayBlockingQueue instead of the LinkedBlockingQueue, thinking it was some kind of garbage collection issue. This didn't help.
>> > I increased "Receive Buffer Size" and "Max Size of Socket Buffer", but to no avail.
>> >
>> > I tried batching. This helps a bit; like increasing the "Max Size of Message Queue", it only seems to delay the eventual error messages, though.
>> >
>> > I reproduced this on my local workstation. I installed NiFi, did no OS tuning at all, and set the heap size to 4 GB. I generate 1.3M UDP packets per 5 minutes (the max I can reach with a simple Python script). With "Max Size of Message Queue" set to only 100, the error soon appears. In the ListenUDP processor I see 1.34M events out; on the follow-up processor I see 1.34M events incoming. The error is not as frequent as on the cluster, though: only a few every couple of minutes, while the data rate is much higher and the queue much smaller. I'm a bit desperate and hope anyone can help me out. Why am I getting this error on a relatively quiet cluster with not that much load?
>> >
>> > Best regards,
>> > Erik-Jan van Baaren
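For what it's worth, the "exactly 10 error messages per second" observation in the quoted post falls straight out of the hard-coded 100 ms offer() timeout: once the queue is full and nothing drains it, each failed offer blocks for 100 ms before the event is dropped and logged, so a single listener thread can produce at most about ten drop messages per second. A standalone sketch (my own class and method names, not NiFi code):

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class OfferTimeoutSketch {

    // Offers into a full, never-drained queue for roughly one second and
    // returns how many offers timed out. With a 100 ms timeout per offer,
    // the result is about 10 -- one drop (and one log line) per 100 ms.
    static int failedOffersInOneSecond() throws InterruptedException {
        LinkedBlockingQueue<byte[]> queue = new LinkedBlockingQueue<>(1);
        queue.put(new byte[0]); // fill the queue; no consumer ever drains it

        long deadline = System.nanoTime() + TimeUnit.SECONDS.toNanos(1);
        int failures = 0;
        while (System.nanoTime() < deadline) {
            // Mirrors the hard-coded wait: block up to 100 ms, then give up.
            if (!queue.offer(new byte[0], 100, TimeUnit.MILLISECONDS)) {
                failures++;
            }
        }
        return failures;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("failed offers in ~1s: " + failedOffersInOneSecond());
    }
}
```

This also suggests why Bryan's proposed experiment (raising the timeout to 200 ms or 500 ms) would change the logging rate: the cap on drop messages per thread per second is simply 1000 ms divided by the timeout.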

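As an aside, for anyone trying to reproduce this: the "simple Python script" mentioned in the thread isn't included, but an equivalent UDP load generator is easy to write. Here is one possible Java version; the host, port, and payload are placeholders, so point it at whatever port your ListenUDP processor listens on:

```java
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;

public class UdpLoadGenerator {

    // Sends `count` small UDP datagrams to host:port and returns how many
    // were handed to the OS. UDP is fire-and-forget, so delivery is not
    // guaranteed -- which is exactly the situation ListenUDP has to handle.
    static int sendPackets(String host, int port, int count) throws Exception {
        byte[] payload = "test-event".getBytes("UTF-8");
        InetAddress addr = InetAddress.getByName(host);
        int sent = 0;
        try (DatagramSocket socket = new DatagramSocket()) {
            for (int i = 0; i < count; i++) {
                socket.send(new DatagramPacket(payload, payload.length, addr, port));
                sent++;
            }
        }
        return sent;
    }

    public static void main(String[] args) throws Exception {
        // Placeholder target -- replace with your ListenUDP node and port.
        int sent = sendPackets("127.0.0.1", 5005, 100_000);
        System.out.println("sent " + sent + " datagrams");
    }
}
```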