Here is the existing JIRA: https://issues.apache.org/jira/browse/NIFI-2592
If we implemented that, then the @OnScheduled method of ListenUDP would never get called on the non-primary nodes, which would then never start the listener.

On Wed, Jun 5, 2019 at 12:33 PM Joe Witt <joe.w...@gmail.com> wrote:

> ...this feels like a bug to me. I think Erik-Jan's expectation that nothing would have begun for ListenUDP, given the primary-node-only config, is fair. I also think our current position of "just not calling onTrigger" is fair too, but it is less intuitive for users.
>
> What do y'all think?

On Wed, Jun 5, 2019 at 12:10 PM Erik-Jan <erik...@gmail.com> wrote:

> This is basically what I do. I want a highly available setup with a minimum of components, since each component can (and will) fail. Traffic reaches all nodes, but only a single node should read it.

On Wed, Jun 5, 2019 at 18:07 James Srinivasan <james.sriniva...@gmail.com> wrote:

> In our case the stream is UDP broadcast, so it is available to all nodes anyway. I've been meaning to test UDP multicast but haven't got round to it yet.

On Wed, Jun 5, 2019 at 17:03 Bryan Bende <bbe...@gmail.com> wrote:

> That is probably a valid point, but how about putting a load balancer in front to handle that?

On Wed, Jun 5, 2019 at 11:30 AM James Srinivasan <james.sriniva...@gmail.com> wrote:

> Presumably you'd want to mirror the stream to all nodes for when the primary node changes?

On Wed, Jun 5, 2019 at 13:46 Bryan Bende <bbe...@gmail.com> wrote:

> The processor is started on all nodes, but the onTrigger method is only executed on the primary node.
>
> This is something we've discussed trying to improve before, but the real question is: why are you sending data to the other nodes if you don't expect the processor to execute there?

On Wed, Jun 5, 2019 at 7:04 AM Erik-Jan <erik...@gmail.com> wrote:

> I figured it out after further testing. The processor runs on all nodes, despite the explicit "run on primary node only" option that I selected, but the queue is processed only on the primary node. On the other nodes the queue fills up until the maximum is reached, after which the error message starts appearing. What I missed before is that the message is coming from the other, non-primary nodes.
>
> I'm not sure if this is intended behavior or a bug, though! For me it's a bug, since I really want this processor to run on the primary only.
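To make Bryan's point above concrete, here is a minimal sketch of the processor lifecycle. The class name and port are illustrative; this is not NiFi's actual ListenUDP source. The framework calls @OnScheduled on every node when a processor is started, and "run on primary node only" merely suppresses the calls to onTrigger() on the non-primary nodes:

    import java.io.IOException;
    import java.net.DatagramSocket;

    import org.apache.nifi.annotation.lifecycle.OnScheduled;
    import org.apache.nifi.annotation.lifecycle.OnStopped;
    import org.apache.nifi.processor.AbstractProcessor;
    import org.apache.nifi.processor.ProcessContext;
    import org.apache.nifi.processor.ProcessSession;
    import org.apache.nifi.processor.exception.ProcessException;

    public class ListenLikeProcessor extends AbstractProcessor {

        private volatile DatagramSocket socket;

        @OnScheduled // invoked on EVERY node when the processor is started
        public void bindSocket(final ProcessContext context) throws IOException {
            // the listener (and the internal message queue it fills) starts
            // here, regardless of the "primary node only" scheduling setting
            socket = new DatagramSocket(5140); // illustrative port
        }

        @Override // invoked by the framework on the primary node only
        public void onTrigger(final ProcessContext context, final ProcessSession session)
                throws ProcessException {
            // drains the internal queue into FlowFiles; on non-primary nodes
            // this never runs, so their queue fills until events are dropped
        }

        @OnStopped
        public void closeSocket() {
            if (socket != null) {
                socket.close();
            }
        }
    }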
On Tue, Jun 4, 2019 at 16:34 Erik-Jan <erik...@gmail.com> wrote:

> Hi Bryan,
>
> Yes, I have considerably increased the numbers in the controller settings. I don't mind getting my hands dirty; increasing the timeout is worth a try.
>
> The errors seem to appear after quite a while. Usually I see these messages the next morning, so testing and experimenting with this error takes a lot of time.
>
> Today I've been trying to reproduce this on a virtual machine with the same OS, NiFi, and Java versions, but to no avail. The difference is that this VM is not a cluster and has limited memory and CPU, yet it is still able to handle much more UDP data, with the error appearing only a few times so far after hours of running. It leads me to think there must be something in the configuration of the cluster that's causing this. I will also try a vanilla NiFi install on one of the nodes, without clustering, to see if my configuration and cluster setup is somehow the cause.

On Tue, Jun 4, 2019 at 16:14 Bryan Bende <bbe...@gmail.com> wrote:

> Hi Erik,
>
> It sounds like you have tried most of the common tuning options. I would have expected batching plus increasing concurrent tasks from 1 to 3-5 to give the biggest improvement.
>
> Have you increased the number of threads in your overall thread pool according to your hardware? (Controller Settings, from the top-right menu.)
>
> I would be curious what happens if you ran some tests increasing the timeout where it attempts to place the message in the queue, from 100 ms to 200 ms, and then maybe to 500 ms if it still happens.
>
> I know this requires a code change since that timeout is hard-coded, but it sounds like you already went down that path when you tried a different queue :)
>
> -Bryan
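The hard-coded timeout Bryan mentions is the standard BlockingQueue offer-with-timeout pattern. Here is a self-contained sketch (class name, capacity, and payload are illustrative, not NiFi's actual source) showing why a queue that is never drained produces at most ten of these errors per second per receiving thread:

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;
    import java.util.concurrent.TimeUnit;

    public class QueueTimeoutDemo {
        // bounded queue standing in for ListenUDP's "Max Size of Message Queue"
        private static final BlockingQueue<byte[]> EVENTS = new LinkedBlockingQueue<>(100);

        static void enqueue(final byte[] event) throws InterruptedException {
            // offer() waits at most 100 ms for space to free up; while nothing
            // drains the queue, every attempt blocks the full 100 ms and then
            // fails, so one reader thread logs at most 10 errors per second
            if (!EVENTS.offer(event, 100, TimeUnit.MILLISECONDS)) {
                System.err.println("internal queue at maximum capacity, could not queue event.");
            }
        }

        public static void main(String[] args) throws InterruptedException {
            // with no consumer, the first 100 offers succeed immediately and
            // the next 100 each time out after 100 ms (~10 seconds of errors)
            for (int i = 0; i < 200; i++) {
                enqueue(new byte[] {42});
            }
        }
    }

Note that ArrayBlockingQueue honors the same offer() contract, which is consistent with the queue swap described below making no difference.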
On Tue, Jun 4, 2019 at 4:28 AM Erik-Jan <erik...@gmail.com> wrote:

> Hi,
>
> I'm experimenting with a locally installed three-node NiFi cluster. This cluster receives UDP packets on the primary node. The nodes are pretty powerful, have a good network connection, and have lots of memory and SSD disks. I gave NiFi 24 GB of Java heap (Xms and Xmx).
>
> I have configured a ListenUDP processor that listens on a UDP port and receives somewhere between 20,000 and 50,000 packets per 5 minutes. Its "Max Size of Message Queue" is large enough (1M), I gave it 5 concurrent tasks, and it's running on the primary node only.
>
> The problem: after running for a while, I get the following error: "internal queue at maximum capacity, could not queue event."
>
> I have reviewed the source code and understand when this happens: the processor tries to store an event in a Java LinkedBlockingQueue and that queue has reached its maximum capacity. The offer() method has a 100 ms timeout in which it waits for space to free up; then it fails and the event gets dropped. In the logs I see exactly 10 of these error messages per second (10 x 100 ms is 1 second). Despite these errors, I still get a very good rate of events that get through to the next processors. Actually, it seems pretty much all of the other events get through, since the message rates in ListenUDP and the follow-up processor are very much alike. The follow-up processors can easily handle the load, and there are no full queues, congestion, or anything like that.
>
> What I have tried so far:
>
> - Increasing the "Max Size of Message Queue" setting helps, but only delays the errors. They eventually return.
> - Increasing heap space is a suggestion I read in a past post; I think 24 GB is more than enough actually? Perhaps even too much?
> - Increasing parallelism: setting concurrent tasks to 5 or 10 does not help.
> - I modified the code to use an ArrayBlockingQueue instead of the LinkedBlockingQueue, thinking it was some kind of garbage collection issue. This didn't help.
> - I increased "Receive Buffer Size" and "Max Size of Socket Buffer", but to no avail.
> - I tried batching. This helps a bit, but like increasing the "Max Size of Message Queue" it only seems to delay the eventual error messages.
>
> I reproduced this on my local workstation. I installed NiFi, did no OS tuning at all, and set the heap size to 4 GB. I generate 1.3M UDP packets per 5 minutes (the max I can reach with a simple Python script). With "Max Size of Message Queue" set to only 100, the error soon appears. In the ListenUDP processor I see 1.34M events out, and on the follow-up processor I see 1.34M events incoming. The error is not as frequent as on the cluster, though: only a few every couple of minutes, while the data rate is much higher and the queue much smaller.
>
> I'm a bit desperate and hope anyone can help me out. Why am I getting this error on a relatively quiet cluster with not that much load?
>
> Best regards,
> Erik-Jan van Baaren
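For reference, a load generator roughly equivalent to the "simple Python script" mentioned above (the script itself isn't shown in the thread; the target address, port, and payload here are illustrative), written in Java to match the other sketches:

    import java.net.DatagramPacket;
    import java.net.DatagramSocket;
    import java.net.InetAddress;
    import java.nio.charset.StandardCharsets;

    public class UdpLoadGenerator {
        public static void main(String[] args) throws Exception {
            final InetAddress target = InetAddress.getByName("127.0.0.1"); // ListenUDP host
            final int port = 5140;                                         // ListenUDP port
            final byte[] payload = "test event\n".getBytes(StandardCharsets.UTF_8);
            try (DatagramSocket socket = new DatagramSocket()) {
                // send as fast as the loop allows; UDP is fire-and-forget,
                // so any drops show up on the receiving side, not here
                while (true) {
                    socket.send(new DatagramPacket(payload, payload.length, target, port));
                }
            }
        }
    }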