Hi, I'm experimenting with a locally installed 3-node NiFi cluster. This cluster receives UDP packets on the primary node. The nodes are pretty powerful: they have a good network connection, lots of memory, and SSD disks. I gave NiFi 24 GB of Java heap (Xms and Xmx).
I have configured a ListenUDP processor that listens on a UDP port and receives somewhere between 20,000 and 50,000 packets per 5 minutes. Its "Max Size of Message Queue" is large enough (1M), I gave it 5 concurrent tasks, and it runs on the primary node only.

The problem: after running for a while, I get the following error: "internal queue at maximum capacity, could not queue event". I have reviewed the source code and understand when this happens: the processor tries to store an event in a Java LinkedBlockingQueue, and that queue has reached its maximum capacity. The offer() call waits up to 100 ms for space to free up, then fails, and the event is dropped. In the logs I see exactly 10 of these error messages per second (10 x 100 ms is 1 second).

Despite these errors, I still get a very good rate of events through to the next processors. In fact, it seems pretty much all other events get through, since the message rates in ListenUDP and the follow-up processor are nearly identical. The follow-up processors can easily handle the load; there are no full queues, congestion, or anything like that.

What I have tried so far:
- Increasing the "Max Size of Message Queue" setting. This helps, but only delays the errors; they eventually return.
- Increasing heap space, a suggestion I read in a past post. I think 24 GB is more than enough, perhaps even too much?
- Increasing parallelism: setting concurrent tasks to 5 or 10 does not help.
- Modifying the code to use an ArrayBlockingQueue instead of the LinkedBlockingQueue, thinking it was some kind of garbage collection issue. This didn't help.
- Increasing "Receive Buffer Size" and "Max Size of Socket Buffer", but to no avail.
- Batching. This helps a bit, but like increasing the "Max Size of Message Queue" it only seems to delay the eventual error messages.

I also reproduced this on my local workstation: I installed NiFi, did no OS tuning at all, and set the heap size to 4 GB.
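To make the failure mode above concrete, here is a minimal plain-Java sketch of the offer()-with-timeout behaviour I'm describing. This is a stand-in I wrote for illustration, not the actual ListenUDP source; the 100 ms timeout matches what I saw in the code, but the queue size and messages here are made up.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class QueueDropDemo {
    public static void main(String[] args) throws InterruptedException {
        // Tiny bounded queue so the "maximum capacity" case is easy to hit.
        LinkedBlockingQueue<String> events = new LinkedBlockingQueue<>(2);

        events.offer("packet-1");
        events.offer("packet-2"); // queue is now at capacity

        // Like the processor: wait up to 100 ms for space, then give up.
        boolean queued = events.offer("packet-3", 100, TimeUnit.MILLISECONDS);
        if (!queued) {
            // This is the point where the event gets dropped and logged.
            System.out.println("internal queue at maximum capacity, could not queue event");
        }

        // As soon as a consumer drains an element, offer() succeeds again.
        events.take();
        System.out.println("requeued: " + events.offer("packet-3", 100, TimeUnit.MILLISECONDS));
    }
}
```

The 100 ms wait in offer() is also why I see exactly 10 of these messages per second: each failed attempt blocks for 100 ms before logging and retrying.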
I generate 1.3M UDP packets per 5 minutes (the max I can reach with a simple Python script). With "Max Size of Message Queue" set to only 100, the error soon appears. In the ListenUDP processor I see 1.34M events out, and on the follow-up processor I see 1.34M events coming in. The error is not as frequent as on the cluster, though: only a few every couple of minutes, even though the data rate is much higher and the queue much smaller.

I'm a bit desperate and hope someone can help me out. Why am I getting this error on a relatively quiet cluster with not that much load?

Best regards,
Erik-Jan van Baaren
