Presumably you'd want to mirror the stream to all nodes for when the primary node changes?
On Wed, 5 Jun 2019, 13:46 Bryan Bende, <[email protected]> wrote:

> The processor is started on all nodes, but the onTrigger method is only executed on the primary node.
>
> This is something we've discussed trying to improve before, but the real question is: why are you sending data to the other nodes if you don't expect the processor to execute there?
>
> On Wed, Jun 5, 2019 at 7:04 AM Erik-Jan <[email protected]> wrote:
> >
> > I figured it out after further testing. The processor runs on all nodes, despite the explicit "run on primary node only" option that I selected, but the queue is processed only on the primary node. On the other nodes the queue fills up until the maximum is reached, after which the error message starts appearing. What I missed before is that the message was coming from the other, non-primary nodes.
> >
> > I'm not sure whether this is intended behavior or a bug! For me it's a bug, since I really want this processor to run on the primary node only.
> >
> > On Tue, 4 Jun 2019 at 16:34, Erik-Jan <[email protected]> wrote:
> > >
> > > Hi Bryan,
> > >
> > > Yes, I have considerably increased the numbers in the controller settings. I don't mind getting my hands dirty; increasing the timeout is worth a try.
> > >
> > > The errors seem to appear after quite a while. Usually I see these messages the next morning, so testing and experimenting with this error takes a lot of time.
> > >
> > > Today I've been trying to reproduce this on a virtual machine with the same OS, NiFi, and Java versions, but to no avail. The difference is that this VM is not a cluster and has limited memory and CPU, yet it can still handle much more UDP data, with the error appearing only a few times so far after hours of running. It leads me to think there must be something in the configuration of the cluster that's causing this. I will also try a vanilla NiFi install on one of the nodes, without clustering, to see if my configuration and cluster setup is somehow the cause.
> > >
> > > On Tue, 4 Jun 2019 at 16:14, Bryan Bende <[email protected]> wrote:
> > > >
> > > > Hi Erik,
> > > >
> > > > It sounds like you have tried most of the common tuning options that can be done. I would have expected batching plus increasing concurrent tasks from 1 to 3-5 to be the biggest improvement.
> > > >
> > > > Have you increased the number of threads in your overall thread pool according to your hardware? (From the top-right menu: controller settings.)
> > > >
> > > > I would be curious what happens if you did some tests increasing the timeout where it attempts to place the message in the queue from 100ms to 200ms, and then maybe 500ms if it still happens.
> > > >
> > > > I know this requires a code change since that timeout is hard-coded, but it sounds like you already went down that path with trying a different queue :)
> > > >
> > > > -Bryan
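For illustration, a minimal sketch of the experiment Bryan suggests. The class and field names below are hypothetical stand-ins, not NiFi's actual source; the point is the BlockingQueue.offer(...) call with a timeout, which is where the hard-coded 100 ms lives and where a change to 200 or 500 ms would go:

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;
    import java.util.concurrent.TimeUnit;

    // Hypothetical stand-in for the enqueue path of a listen-style processor.
    class EventEnqueuer {
        // 1M mirrors the "Max Size of Message Queue" setting from the thread.
        // Erik-Jan's experiment swaps in new ArrayBlockingQueue<>(1_000_000)
        // here; the offer() semantics are identical for both queue types.
        private final BlockingQueue<byte[]> events = new LinkedBlockingQueue<>(1_000_000);

        // The value under test: hard-coded to 100 ms in NiFi; Bryan suggests
        // trying 200 ms, then 500 ms.
        private static final long OFFER_TIMEOUT_MS = 200;

        boolean enqueue(byte[] event) throws InterruptedException {
            // offer() waits up to the timeout for space to free up, then gives
            // up; a false return is what logs "internal queue at maximum
            // capacity, could not queue event" and drops the event.
            return events.offer(event, OFFER_TIMEOUT_MS, TimeUnit.MILLISECONDS);
        }
    }

This also shows why the errors arrive at exactly 10 per second in Erik-Jan's logs: a single dispatcher thread blocked on a full queue fails at most once per 100 ms.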
> It's "Max size of message queue" is large enough (1M), I gave it 5 > concurrent tasks, it's running on the primary node only. > >>> > > >>> > The problem: after running for a while, I get the following error: > "internal queue at maximum capacity, could not queue event." > >>> > > >>> > I have reviewed the source code and understand when this happens. It > happens when the processor tries to store an event in a java > LinkedBlockingQueue and that queue reached its maximum capacity. The > offer() method has a 100ms timeout in which it waits for space to free up > and then it fails and the event gets dropped. In the logs I see exactly 10 > of these error messages per second (10 x 100ms is 1 second). Despite these > errors, I still get a very good rate of events that get through to the next > processors. Actually, it seems pretty much all of the other events get > through since the message rate in ListenUDP and the followup processor are > very much alike. The followup processors can easily handle the load and > there are no full queues, congestions or anything like that. > >>> > > >>> > What I have tried so far: > >>> > > >>> > Increasing the "Max Size of Message Queue" setting helps, but only > delays the errors. They eventually return. > >>> > > >>> > Increasing heap space is a suggestion I read from a past post: I > think 24G is more than enough actually? Perhaps even too much? > >>> > > >>> > Increasing parallelism: concurrent tasks set to 5 or 10 does not > help. > >>> > > >>> > I modified the code to use an ArrayBlockingQueue instead of the > LinkedBlockingQueue, thinking it was some kind of garbage collection. This > didn't help. > >>> > > >>> > I increased "Receive Buffer Size", "Max Size of Socket Buffer" but > to no avail. > >>> > > >>> > I tried batching. This helps a bit, like increasing the "Max Size of > Message Queue" it only seems to delay the eventual error messages though. > >>> > > >>> > I reproduced this on my local workstation. I installed nifi, did no > OS tuning at all, set the heap size to 4GB. I generate 1.3M UDP packets per > 5 minutes (the max I can reach with a simple python script). With "Max Size > of Message Queue" set to only 100, soon the error appears. In the ListenUDP > processor I see 1.34M events out, on the followup processor I see 1.34M > events incoming. The error is not as frequent as on the cluster though, > only a few every couple of minutes while the data rate is much higher and > the queue much smaller. I'm a bit desperate and hope anyone can help me > out. Why am I getting this error on a relatively quiet cluster with not > that much load? > >>> > > >>> > Best regards, > >>> > Erik-Jan van Baaren >
