Hi,

I'm experimenting with a locally installed 3-node NiFi cluster. The
cluster receives UDP packets on its primary node. The nodes are fairly
powerful, with a good network connection, plenty of memory, and SSD
disks. I gave NiFi 24 GB of Java heap (Xms and Xmx).

I have configured a ListenUDP processor that listens on a UDP port and
receives somewhere between 20,000 and 50,000 packets per 5 minutes. Its
"Max Size of Message Queue" is large enough (1M), it has 5 concurrent
tasks, and it runs on the primary node only.

The problem: after running for a while, I get the following error:
"internal queue at maximum capacity, could not queue event."

I have reviewed the source code and understand when this happens: the
processor tries to store an event in a Java LinkedBlockingQueue that has
reached its maximum capacity. The offer() method waits up to 100 ms for
space to free up; then it gives up and the event is dropped. In the logs
I see exactly 10 of these error messages per second (10 x 100 ms is 1
second). Despite these errors, I still get a very good rate of events
through to the next processors. In fact, pretty much all other events
seem to get through, since the message rates in ListenUDP and the
follow-up processor are nearly identical. The follow-up processors can
easily handle the load, and there are no full queues, congestion, or
anything like that.
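I haven't pasted NiFi's actual code here, but the failing pattern boils
down to something like this minimal sketch (the queue capacity and
message strings are made up for illustration):

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class QueueDropDemo {
    public static void main(String[] args) throws InterruptedException {
        // Tiny capacity to force the full-queue condition immediately.
        BlockingQueue<String> events = new LinkedBlockingQueue<>(2);
        events.offer("event-1");
        events.offer("event-2");

        // The queue is now full: offer() waits up to 100 ms for space,
        // then returns false, and the caller logs the error and drops
        // the event.
        boolean queued = events.offer("event-3", 100, TimeUnit.MILLISECONDS);
        if (!queued) {
            System.out.println(
                "internal queue at maximum capacity, could not queue event");
        }
    }
}
```

So each failed offer() costs at least 100 ms of waiting, which matches
the ceiling of 10 error messages per second I see in the logs.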

What I have tried so far:

Increasing the "Max Size of Message Queue" setting helps, but only
delays the errors; they eventually return.

Increasing heap space is a suggestion I read in a past post, but I think
24 GB is more than enough; perhaps even too much?

Increasing parallelism: setting concurrent tasks to 5 or 10 does not help.

I modified the code to use an ArrayBlockingQueue instead of the
LinkedBlockingQueue, thinking it might be a garbage-collection issue.
This didn't help.

I increased "Receive Buffer Size" and "Max Size of Socket Buffer", but
to no avail.
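One thing I'm not sure about (this is an assumption, not something I've
verified yet): the OS may silently cap whatever socket buffer size is
requested, e.g. at net.core.rmem_max on Linux, which would make those
settings ineffective without kernel tuning. A small check of the
effective buffer size:

```java
import java.net.DatagramSocket;

public class SocketBufferCheck {
    public static void main(String[] args) throws Exception {
        try (DatagramSocket socket = new DatagramSocket()) {
            // Request a 16 MB receive buffer; the kernel may grant less.
            socket.setReceiveBufferSize(16 * 1024 * 1024);
            System.out.println("effective SO_RCVBUF: "
                + socket.getReceiveBufferSize());
        }
    }
}
```

If the printed value is much smaller than requested, the buffer settings
I changed never actually took effect.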

I tried batching. It helps a bit, but like increasing the "Max Size of
Message Queue", it only seems to delay the eventual error messages.

I reproduced this on my local workstation. I installed NiFi, did no OS
tuning at all, and set the heap size to 4 GB. I generate 1.3M UDP
packets per 5 minutes (the max I can reach with a simple Python script).
With "Max Size of Message Queue" set to only 100, the error soon
appears. The ListenUDP processor shows 1.34M events out, and the
follow-up processor shows 1.34M events in. The error is less frequent
than on the cluster, though: only a few every couple of minutes, even
though the data rate is much higher and the queue much smaller. I'm a
bit desperate and hope someone can help me out. Why am I getting this
error on a relatively quiet cluster with so little load?
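For anyone who wants to reproduce this without my Python script, a
roughly equivalent generator looks like the sketch below (the host,
port, payload, and packet count are placeholders; point it at wherever
your ListenUDP is listening):

```java
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;

public class UdpFlood {
    public static void main(String[] args) throws Exception {
        // Hypothetical target: ListenUDP on localhost:8888.
        InetAddress target = InetAddress.getByName("127.0.0.1");
        int port = 8888;
        byte[] payload = "test event\n".getBytes();

        try (DatagramSocket socket = new DatagramSocket()) {
            // Fire-and-forget: UDP sends don't block on the receiver,
            // so this loop pushes packets as fast as the sender allows.
            for (int i = 0; i < 100_000; i++) {
                socket.send(new DatagramPacket(
                    payload, payload.length, target, port));
            }
        }
    }
}
```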

Best regards,
Erik-Jan van Baaren
