Hi Bryan,

Yes, I have considerably increased the thread numbers in the controller settings. I don't mind getting my hands dirty; increasing the timeout is worth a try.
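To make sure I change the right thing: as far as I can tell the enqueue path boils down to the pattern below. This is a simplified sketch with illustrative names, not the actual NiFi source, but it matches the behavior we discussed: each event is offered to a bounded queue and dropped if no space frees up within the hard-coded timeout.

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;
    import java.util.concurrent.TimeUnit;

    class EventQueueSketch {
        // bounded by the "Max Size of Message Queue" property (1M in my case)
        private final BlockingQueue<byte[]> events =
                new LinkedBlockingQueue<>(1_000_000);

        void enqueue(byte[] event) throws InterruptedException {
            // the hard-coded 100 ms; the experiment is 200 ms, then 500 ms
            if (!events.offer(event, 100, TimeUnit.MILLISECONDS)) {
                // the message that floods my logs at exactly 10 per second
                System.err.println("internal queue at maximum capacity, could not queue event");
            }
        }
    }

I'll start with 200 ms and move to 500 ms if the drops keep coming.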
The errors seem to appear only after quite a while. Usually I see these messages the next morning, so testing and experimenting with this error takes a lot of time. Today I have been trying to reproduce it on a virtual machine with the same OS, NiFi, and Java versions, but to no avail. The difference is that this VM is not a cluster and has limited memory and CPU, yet it is able to handle much more UDP data, with the error appearing only a few times so far after hours of running. This leads me to think there must be something in the configuration of the cluster that is causing this. I will also try a vanilla NiFi install on one of the nodes, without clustering, to see if my configuration and cluster setup is somehow the cause.

In case anyone wants to reproduce this, I have added a sketch of the load generator I use at the bottom of this mail.

On Tue, Jun 4, 2019 at 16:14, Bryan Bende <[email protected]> wrote:

> Hi Erik,
>
> It sounds like you have tried most of the common tuning options that can be done. I would have expected batching plus increasing concurrent tasks from 1 to 3-5 to be the biggest improvement.
>
> Have you increased the number of threads in your overall thread pool according to your hardware? (From the top-right menu: Controller Settings.)
>
> I would be curious what happens if you did some tests increasing the timeout where it attempts to place the message in the queue, from 100 ms to 200 ms, and then maybe 500 ms if it still happens.
>
> I know this requires a code change since that timeout is hard-coded, but it sounds like you already went down that path with trying a different queue :)
>
> -Bryan
>
> On Tue, Jun 4, 2019 at 4:28 AM Erik-Jan <[email protected]> wrote:
> >
> > Hi,
> >
> > I'm experimenting with a locally installed 3-node NiFi cluster. This cluster receives UDP packets on the primary node. The nodes are pretty powerful: they have a good network connection, lots of memory, and SSD disks. I gave NiFi 24 GB of Java heap (Xms and Xmx).
> >
> > I have configured a ListenUDP processor that listens on a UDP port and receives somewhere between 20,000 and 50,000 packets per 5 minutes. Its "Max Size of Message Queue" is large enough (1M), I gave it 5 concurrent tasks, and it runs on the primary node only.
> >
> > The problem: after running for a while, I get the following error: "internal queue at maximum capacity, could not queue event."
> >
> > I have reviewed the source code and understand when this happens: the processor tries to store an event in a Java LinkedBlockingQueue and that queue has reached its maximum capacity. The offer() method waits up to 100 ms for space to free up; then it fails and the event gets dropped. In the logs I see exactly 10 of these error messages per second (10 x 100 ms is 1 second). Despite these errors, I still get a very good rate of events through to the next processors. Actually, it seems pretty much all of the other events get through, since the message rates in ListenUDP and the follow-up processor are very much alike. The follow-up processors can easily handle the load, and there are no full queues, congestion, or anything like that.
> >
> > What I have tried so far:
> >
> > Increasing the "Max Size of Message Queue" setting helps, but only delays the errors. They eventually return.
> >
> > Increasing heap space is a suggestion I read in a past post; I think 24 GB is more than enough actually? Perhaps even too much?
> >
> > Increasing parallelism: concurrent tasks set to 5 or 10 does not help.
> > I modified the code to use an ArrayBlockingQueue instead of the LinkedBlockingQueue, thinking it was some kind of garbage collection issue. This didn't help.
> >
> > I increased "Receive Buffer Size" and "Max Size of Socket Buffer", but to no avail.
> >
> > I tried batching. This helps a bit, but like increasing the "Max Size of Message Queue" it only seems to delay the eventual error messages.
> >
> > I reproduced this on my local workstation. I installed NiFi, did no OS tuning at all, and set the heap size to 4 GB. I generate 1.3M UDP packets per 5 minutes (the maximum I can reach with a simple Python script). With "Max Size of Message Queue" set to only 100, the error soon appears. In the ListenUDP processor I see 1.34M events out; on the follow-up processor I see 1.34M events incoming. The error is not as frequent as on the cluster though, only a few every couple of minutes, while the data rate is much higher and the queue much smaller. I'm a bit desperate and hope anyone can help me out. Why am I getting this error on a relatively quiet cluster with not that much load?
> >
> > Best regards,
> > Erik-Jan van Baaren

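PS: For completeness, here is roughly what my load generator does. My actual script is Python, but a Java equivalent is sketched below; the target host and port are placeholders for wherever ListenUDP is listening. It just sends the same small datagram in a tight loop, and since UDP gives the sender no backpressure, it never slows down.

    import java.net.DatagramPacket;
    import java.net.DatagramSocket;
    import java.net.InetAddress;
    import java.nio.charset.StandardCharsets;

    public class UdpFlood {
        public static void main(String[] args) throws Exception {
            InetAddress target = InetAddress.getByName("127.0.0.1"); // placeholder host
            int port = 5000;                                         // placeholder ListenUDP port
            byte[] payload = "test event".getBytes(StandardCharsets.UTF_8);
            try (DatagramSocket socket = new DatagramSocket()) {
                DatagramPacket packet =
                        new DatagramPacket(payload, payload.length, target, port);
                // fire-and-forget: send as fast as the loop allows
                for (long i = 0; i < 10_000_000L; i++) {
                    socket.send(packet);
                }
            }
        }
    }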