Bryan, I can't gauge the difficulty of implementing that.
I could give it a try myself, but I'm afraid I can't correctly estimate the
implications of such a change for other processors. It might fix this but
break other things.
Perhaps it could instead be fixed in just this type of processor, maybe by
using lazy instantiation or something along those lines?
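
To make that concrete, here is roughly what I mean by lazy instantiation
(just a sketch; listenerStarted and startListener() are names I made up,
not the actual ListenUDP internals):

    import java.util.concurrent.atomic.AtomicBoolean;

    // Sketch: open the UDP socket on the first onTrigger call instead of
    // in @OnScheduled, so nodes that are never triggered (the non-primary
    // nodes under "primary node only") never start listening.
    private final AtomicBoolean listenerStarted = new AtomicBoolean(false);

    @Override
    public void onTrigger(final ProcessContext context, final ProcessSession session) {
        // compareAndSet makes the lazy start race-free across concurrent tasks
        if (listenerStarted.compareAndSet(false, true)) {
            startListener(context); // made-up helper that opens the socket
        }
        // ... existing logic that drains queued events into flow files ...
    }

The listener would of course still need to be closed in @OnStopped.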

On Wed, 5 Jun 2019, 18:59 Bryan Bende <[email protected]> wrote:

> Here is the existing JIRA:
>
> https://issues.apache.org/jira/browse/NIFI-2592
>
> If we implemented that then the OnScheduled of ListenUDP would never
> get called on the non-primary nodes, which would then never start the
> listener.
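>
> Roughly, the lifecycle today looks like this (simplified sketch, not the
> literal ListenUDP code):
>
>     @OnScheduled
>     public void onScheduled(final ProcessContext context) {
>         // Runs on every node when the processor is started, so the UDP
>         // socket gets opened everywhere, even though onTrigger is later
>         // invoked on the primary node only.
>         startListener(context); // simplified stand-in for the real setup
>     }
>
>     @OnStopped
>     public void onStopped() {
>         stopListener(); // simplified stand-in for the real teardown
>     }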
>
> On Wed, Jun 5, 2019 at 12:33 PM Joe Witt <[email protected]> wrote:
> >
> > ...this feels like a bug to me. I think Erik-Jan's expectation that
> nothing would have begun for ListenUDP, given a primary-node-only config, is
> fair. I also think our current position of 'just not calling onTrigger' is
> fair too, but it is less intuitive for users.
> >
> > What do y'all think?
> >
> > On Wed, Jun 5, 2019 at 12:10 PM Erik-Jan <[email protected]> wrote:
> >>
> >> This is basically what I do. I want a highly available setup with a
> minimum of components, since each component can (and will) fail. Traffic
> reaches all nodes, but only a single node should read it.
> >>
> >> On Wed, 5 Jun 2019, 18:07 James Srinivasan <[email protected]> wrote:
> >>>
> >>> In our case the stream is UDP broadcast, so it is available to all nodes
> anyway. I've been meaning to test UDP multicast but haven't got round to it yet.
> >>>
> >>>
> >>> On Wed, 5 Jun 2019, 17:03 Bryan Bende, <[email protected]> wrote:
> >>>>
> >>>> That is probably a valid point, but how about putting a load balancer
> >>>> in front to handle that?
> >>>>
> >>>> On Wed, Jun 5, 2019 at 11:30 AM James Srinivasan
> >>>> <[email protected]> wrote:
> >>>> >
> >>>> > Presumably you'd want to mirror the stream to all nodes for when
> the primary node changes?
> >>>> >
> >>>> > On Wed, 5 Jun 2019, 13:46 Bryan Bende, <[email protected]> wrote:
> >>>> >>
> >>>> >> The processor is started on all nodes, but the onTrigger method is
> >>>> >> only executed on the primary node.
> >>>> >>
> >>>> >> This is something we've discussed trying to improve before, but the
> >>>> >> real question is why are you sending data to the other nodes if you
> >>>> >> don't expect the processor to execute there?
> >>>> >>
> >>>> >> On Wed, Jun 5, 2019 at 7:04 AM Erik-Jan <[email protected]> wrote:
> >>>> >> >
> >>>> >> > I figured it out after further testing. The processor runs on
> all nodes, despite the explicit "run on primary node only" option that I
> selected, but the queue is only processed on the primary node. On the other
> nodes the queue fills up until the maximum is reached, after which the error
> message starts appearing. What I missed before is that the message was
> coming from the other, non-primary nodes.
> >>>> >> > I'm not sure whether this is intended behavior or a bug, though!
> For me it's a bug, since I really want this processor to run on the primary
> node only.
> >>>> >> >
> >>>> >> >> On Tue, 4 Jun 2019, 16:34 Erik-Jan <[email protected]> wrote:
> >>>> >> >>
> >>>> >> >> Hi Bryan,
> >>>> >> >>
> >>>> >> >> Yes, I have considerably increased the numbers in the controller
> settings.
> >>>> >> >> I don't mind getting my hands dirty; increasing the timeout is
> worth a try.
> >>>> >> >>
> >>>> >> >> The errors seem to appear after quite a while. Usually I see
> these messages the next morning, so testing and experimenting with this
> error takes a lot of time.
> >>>> >> >>
> >>>> >> >> Today I've been trying to reproduce this on a virtual machine
> with the same OS, NiFi and Java versions, but to no avail. The difference is
> that this VM is not a cluster and has limited memory and CPU, yet it is able
> to handle much more UDP data, with the error appearing only a few times so
> far after hours of running. This leads me to think there must be something
> in the configuration of the cluster that's causing this. I will also try a
> vanilla NiFi install on one of the nodes, without clustering, to see if my
> configuration and cluster setup is somehow the cause.
> >>>> >> >>
> >>>> >> >> On Tue, 4 Jun 2019 at 16:14, Bryan Bende <[email protected]> wrote:
> >>>> >> >>>
> >>>> >> >>> Hi Erik,
> >>>> >> >>>
> >>>> >> >>> It sounds like you have tried most of the common tuning
> options that
> >>>> >> >>> can be done. I would have expected batching + increasing
> concurrent
> >>>> >> >>> tasks from 1 to 3-5 to be the biggest improvement.
> >>>> >> >>>
> >>>> >> >>> Have you increased the number of threads in your overall
> thread pool
> >>>> >> >>> according to your hardware? (under controller settings in the
> >>>> >> >>> top-right menu)
> >>>> >> >>>
> >>>> >> >>> I would be curious what happens if you did some tests
> increasing the
> >>>> >> >>> timeout where it attempts to place the message in the queue
> from 100ms
> >>>> >> >>> to 200ms and then maybe 500ms if it still happens.
> >>>> >> >>>
> >>>> >> >>> I know this requires a code change since that timeout is
> hard-coded,
> >>>> >> >>> but it sounds like you already went down that path with trying
> a
> >>>> >> >>> different queue :)
> >>>> >> >>>
> >>>> >> >>> -Bryan
> >>>> >> >>>
> >>>> >> >>> On Tue, Jun 4, 2019 at 4:28 AM Erik-Jan <[email protected]>
> wrote:
> >>>> >> >>> >
> >>>> >> >>> > Hi,
> >>>> >> >>> >
> >>>> >> >>> > I'm experimenting with a locally installed 3-node NiFi
> cluster. This cluster receives UDP packets on the primary node.
> >>>> >> >>> > These nodes are pretty powerful: they have a good network
> connection, lots of memory, and SSD disks. I gave NiFi 24 GB of Java heap
> (Xms and Xmx).
> >>>> >> >>> >
> >>>> >> >>> > I have configured a ListenUDP processor that listens on a
> UDP port and receives somewhere between 20,000 and 50,000 packets per 5
> minutes. Its "Max size of message queue" is large enough (1M), I gave it 5
> concurrent tasks, and it's running on the primary node only.
> >>>> >> >>> >
> >>>> >> >>> > The problem: after running for a while, I get the following
> error: "internal queue at maximum capacity, could not queue event."
> >>>> >> >>> >
> >>>> >> >>> > I have reviewed the source code and understand when this
> happens: it happens when the processor tries to store an event in a Java
> LinkedBlockingQueue and that queue has reached its maximum capacity. The
> offer() method has a 100 ms timeout in which it waits for space to free up;
> then it fails and the event gets dropped. In the logs I see exactly 10
> of these error messages per second (10 x 100 ms is 1 second). Despite these
> errors, I still get a very good rate of events that get through to the next
> processors. Actually, it seems pretty much all of the other events get
> through, since the message rates in ListenUDP and the follow-up processor are
> very much alike. The follow-up processors can easily handle the load, and
> there are no full queues, congestion or anything like that.
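> >>>> >> >>> >
> >>>> >> >>> > In code terms, the pattern is something like this (my
> paraphrase of what I read, not the literal NiFi source; "event" stands in
> for the actual event type):
>
>     import java.util.concurrent.BlockingQueue;
>     import java.util.concurrent.LinkedBlockingQueue;
>     import java.util.concurrent.TimeUnit;
>
>     // Bounded queue; "Max size of message queue" sets the capacity.
>     BlockingQueue<Object> events = new LinkedBlockingQueue<>(maxQueueSize);
>
>     // offer() waits up to 100 ms for space, then gives up; a false return
>     // means the event is dropped and the error below is logged. (offer()
>     // can also throw InterruptedException, which I leave out here.)
>     if (!events.offer(event, 100, TimeUnit.MILLISECONDS)) {
>         logger.error("internal queue at maximum capacity, could not queue event");
>     }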
> >>>> >> >>> >
> >>>> >> >>> > What I have tried so far:
> >>>> >> >>> >
> >>>> >> >>> > Increasing the "Max Size of Message Queue" setting helps,
> but only delays the errors. They eventually return.
> >>>> >> >>> >
> >>>> >> >>> > Increasing heap space is a suggestion I read in a past
> post: I think 24 GB is more than enough actually? Perhaps even too much?
> >>>> >> >>> >
> >>>> >> >>> > Increasing parallelism: concurrent tasks set to 5 or 10 does
> not help.
> >>>> >> >>> >
> >>>> >> >>> > I modified the code to use an ArrayBlockingQueue instead of
> the LinkedBlockingQueue, thinking it was some kind of garbage-collection
> issue. This didn't help.
> >>>> >> >>> >
> >>>> >> >>> > I increased "Receive Buffer Size" and "Max Size of Socket
> Buffer", but to no avail.
> >>>> >> >>> >
> >>>> >> >>> > I tried batching. This helps a bit, though like increasing
> the "Max Size of Message Queue" it only seems to delay the eventual error
> messages.
> >>>> >> >>> >
> >>>> >> >>> > I reproduced this on my local workstation. I installed NiFi,
> did no OS tuning at all, and set the heap size to 4 GB. I generate 1.3M UDP
> packets per 5 minutes (the max I can reach with a simple Python script; a
> rough sketch of such a generator follows below). With "Max Size of Message
> Queue" set to only 100, the error soon appears. In the ListenUDP processor
> I see 1.34M events out, and on the follow-up processor I see 1.34M events
> incoming. The error is not as frequent as on the cluster, though: only a
> few every couple of minutes, while the data rate is much higher and the
> queue much smaller. I'm a bit desperate and hope someone can help me out.
> Why am I getting this error on a relatively quiet cluster with not that
> much load?
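> >>>> >> >>> >
> >>>> >> >>> > For completeness, the generator just blasts datagrams in a
> loop; a minimal Java equivalent of my Python script looks like this (the
> target address, port and payload are placeholders):
>
>     import java.net.DatagramPacket;
>     import java.net.DatagramSocket;
>     import java.net.InetAddress;
>     import java.nio.charset.StandardCharsets;
>
>     public class UdpFlood {
>         public static void main(String[] args) throws Exception {
>             final byte[] payload = "test event".getBytes(StandardCharsets.UTF_8);
>             final InetAddress target = InetAddress.getByName("127.0.0.1"); // placeholder
>             try (DatagramSocket socket = new DatagramSocket()) {
>                 // Same packet every time; send as fast as the loop allows.
>                 final DatagramPacket packet =
>                         new DatagramPacket(payload, payload.length, target, 8888); // placeholder port
>                 while (true) {
>                     socket.send(packet);
>                 }
>             }
>         }
>     }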
> >>>> >> >>> >
> >>>> >> >>> > Best regards,
> >>>> >> >>> > Erik-Jan van Baaren
>
