Hi Erik,

It sounds like you have tried most of the common tuning options. I would have
expected batching plus increasing concurrent tasks from 1 to 3-5 to give the
biggest improvement.
Have you increased the number of threads in your overall thread pool according
to your hardware? (From the top-right menu: Controller Settings.)

I would be curious what happens if you ran some tests increasing the timeout
where it attempts to place the message in the queue, from 100ms to 200ms, and
then maybe 500ms if it still happens. I know this requires a code change since
that timeout is hard-coded, but it sounds like you already went down that path
with trying a different queue :)

-Bryan

On Tue, Jun 4, 2019 at 4:28 AM Erik-Jan <[email protected]> wrote:
>
> Hi,
>
> I'm experimenting with a locally installed 3-node nifi cluster. This cluster
> receives UDP packets on the primary node.
> These nodes are pretty powerful, have a good network connection, and have
> lots of memory and SSD disks. I gave nifi 24G of Java heap (xms and xmx).
>
> I have configured a ListenUDP processor that listens on a UDP port and
> receives somewhere between 20000 and 50000 packets per 5 minutes. Its "Max
> Size of Message Queue" is large enough (1M), I gave it 5 concurrent tasks,
> and it's running on the primary node only.
>
> The problem: after running for a while, I get the following error:
> "internal queue at maximum capacity, could not queue event."
>
> I have reviewed the source code and understand when this happens: the
> processor tries to store an event in a Java LinkedBlockingQueue and that
> queue has reached its maximum capacity. The offer() method has a 100ms
> timeout in which it waits for space to free up; then it fails and the event
> gets dropped. In the logs I see exactly 10 of these error messages per
> second (10 x 100ms is 1 second). Despite these errors, I still get a very
> good rate of events that get through to the next processors. Actually, it
> seems pretty much all of the other events get through, since the message
> rates in ListenUDP and the follow-up processor are very much alike.
> The follow-up processors can easily handle the load, and there are no full
> queues, congestion, or anything like that.
>
> What I have tried so far:
>
> Increasing the "Max Size of Message Queue" setting helps, but only delays
> the errors. They eventually return.
>
> Increasing heap space is a suggestion I read in a past post; I think 24G is
> more than enough actually? Perhaps even too much?
>
> Increasing parallelism: concurrent tasks set to 5 or 10 does not help.
>
> I modified the code to use an ArrayBlockingQueue instead of the
> LinkedBlockingQueue, thinking it was some kind of garbage collection issue.
> This didn't help.
>
> I increased "Receive Buffer Size" and "Max Size of Socket Buffer", but to
> no avail.
>
> I tried batching. This helps a bit, but like increasing the "Max Size of
> Message Queue" it only seems to delay the eventual error messages.
>
> I reproduced this on my local workstation. I installed nifi, did no OS
> tuning at all, and set the heap size to 4GB. I generate 1.3M UDP packets
> per 5 minutes (the max I can reach with a simple Python script). With "Max
> Size of Message Queue" set to only 100, the error soon appears. In the
> ListenUDP processor I see 1.34M events out; on the follow-up processor I
> see 1.34M events incoming. The error is not as frequent as on the cluster
> though, only a few every couple of minutes, while the data rate is much
> higher and the queue much smaller. I'm a bit desperate and hope anyone can
> help me out. Why am I getting this error on a relatively quiet cluster with
> not that much load?
>
> Best regards,
> Erik-Jan van Baaren
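[Editor's note] The drop behavior described in the thread (offer() waiting up
to 100ms for space, then discarding the event) can be sketched in plain Java.
This is a minimal standalone illustration, not NiFi's actual ListenUDP code;
the queue capacity, event count, consumer delay, and timeout values below are
made up for the demo:

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class QueueDropDemo {

    // Produce 'events' items into a bounded queue drained by a deliberately
    // slow consumer, and return how many items were dropped because offer()
    // timed out before a slot opened up.
    public static int countDrops(int events, int capacity,
                                 long consumerDelayMs, long offerTimeoutMs)
            throws InterruptedException {
        LinkedBlockingQueue<String> queue = new LinkedBlockingQueue<>(capacity);

        // Slow consumer: drains one event, then pauses, standing in for a
        // downstream step that cannot quite keep up with bursts.
        Thread consumer = new Thread(() -> {
            try {
                while (true) {
                    queue.take();
                    Thread.sleep(consumerDelayMs);
                }
            } catch (InterruptedException e) {
                // shut down
            }
        });
        consumer.setDaemon(true);
        consumer.start();

        int dropped = 0;
        for (int i = 0; i < events; i++) {
            // offer() waits up to the timeout for space; on timeout the
            // event is discarded (this is the point at which ListenUDP
            // logs "internal queue at maximum capacity").
            if (!queue.offer("event-" + i, offerTimeoutMs, TimeUnit.MILLISECONDS)) {
                dropped++;
            }
        }
        consumer.interrupt();
        return dropped;
    }

    public static void main(String[] args) throws InterruptedException {
        // The consumer frees a slot only every 150ms while offer() waits at
        // most 100ms, so once the queue fills, some events must be dropped.
        System.out.println("dropped: " + countDrops(30, 5, 150, 100));
    }
}
```

Raising the offer() timeout, as suggested above, widens the window in which a
slot can open up; it trades a longer stall in the receiving thread for fewer
dropped events.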

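[Editor's note] Erik-Jan mentions generating the test load with a simple
Python script. For completeness, an equivalent self-contained load generator
can be written in Java with DatagramSocket; the host, port, and packet count
below are illustrative (in a real test it would be pointed at the node
running ListenUDP). The main method just fires packets at a local receiver as
a sanity check:

```java
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;
import java.nio.charset.StandardCharsets;

public class UdpFlood {

    // Send 'count' small UDP packets to host:port and return how many
    // sends were issued (UDP gives no delivery guarantee).
    public static int flood(String host, int port, int count) throws Exception {
        InetAddress addr = InetAddress.getByName(host);
        try (DatagramSocket socket = new DatagramSocket()) {
            for (int i = 0; i < count; i++) {
                byte[] payload = ("event " + i).getBytes(StandardCharsets.UTF_8);
                socket.send(new DatagramPacket(payload, payload.length, addr, port));
            }
        }
        return count;
    }

    public static void main(String[] args) throws Exception {
        // Local sanity check: bind a receiver on an ephemeral port, flood
        // it over loopback, and count what actually arrives.
        try (DatagramSocket receiver = new DatagramSocket(0)) {
            receiver.setSoTimeout(500);
            int sent = 200;
            Thread sender = new Thread(() -> {
                try {
                    flood("127.0.0.1", receiver.getLocalPort(), sent);
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            });
            sender.start();

            int received = 0;
            byte[] buf = new byte[1500];
            try {
                while (received < sent) {
                    receiver.receive(new DatagramPacket(buf, buf.length));
                    received++;
                }
            } catch (java.net.SocketTimeoutException e) {
                // stop counting once the stream goes quiet
            }
            sender.join();
            System.out.println("received " + received + " of " + sent);
        }
    }
}
```

A burst generator like this, pointed at a ListenUDP with a small "Max Size of
Message Queue", should reproduce the "internal queue at maximum capacity"
error quickly.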