That is probably a valid point, but how about putting a load balancer in front to handle that?
On Wed, Jun 5, 2019 at 11:30 AM James Srinivasan <[email protected]> wrote:
>
> Presumably you'd want to mirror the stream to all nodes for when the primary node changes?
>
> On Wed, 5 Jun 2019, 13:46 Bryan Bende, <[email protected]> wrote:
>>
>> The processor is started on all nodes, but the onTrigger method is only executed on the primary node.
>>
>> This is something we've discussed trying to improve before, but the real question is: why are you sending data to the other nodes if you don't expect the processor to execute there?
>>
>> On Wed, Jun 5, 2019 at 7:04 AM Erik-Jan <[email protected]> wrote:
>> >
>> > I figured it out after further testing. The processor runs on all nodes, despite the explicit "run on primary node only" option that I selected. But the queue is only processed on the primary node. On the other nodes the queue gets filled until the max is reached, after which the error message starts appearing. What I missed before is that the message is coming from the other, non-primary nodes.
>> > I'm not sure if this is intended behavior or if it is a bug, though! For me it's a bug, since I really want this processor to run on the primary only.
>> >
>> > On Tue, Jun 4, 2019 at 4:34 PM Erik-Jan <[email protected]> wrote:
>> >>
>> >> Hi Bryan,
>> >>
>> >> Yes, I have considerably increased the numbers in the controller settings. I don't mind getting my hands dirty; increasing the timeout is worth a try.
>> >>
>> >> The errors seem to appear after quite a while. Usually I see these messages the next morning, so testing and experimenting with this error takes a lot of time.
>> >>
>> >> Today I've been trying to reproduce this on a virtual machine with the same OS, NiFi, and Java versions, but to no avail. The difference is that this VM is not a cluster and has limited memory and CPU, yet it can still handle much more UDP data, with the error appearing only a few times so far after hours of running. It leads me to think there must be something in the configuration of the cluster that's causing this. I will also try a vanilla NiFi install on one of the nodes, without clustering, to see if my configuration and cluster setup is somehow the cause.
>> >>
>> >> On Tue, Jun 4, 2019 at 4:14 PM Bryan Bende <[email protected]> wrote:
>> >>>
>> >>> Hi Erik,
>> >>>
>> >>> It sounds like you have tried most of the common tuning options. I would have expected batching plus increasing concurrent tasks from 1 to 3-5 to be the biggest improvement.
>> >>>
>> >>> Have you increased the number of threads in your overall thread pool according to your hardware? (From the top-right menu: Controller Settings.)
>> >>>
>> >>> I would be curious what happens if you did some tests increasing the timeout where it attempts to place the message in the queue, from 100ms to 200ms, and then maybe 500ms if it still happens.
>> >>>
>> >>> I know this requires a code change since that timeout is hard-coded, but it sounds like you already went down that path with trying a different queue :)
>> >>>
>> >>> -Bryan
>> >>>
>> >>> On Tue, Jun 4, 2019 at 4:28 AM Erik-Jan <[email protected]> wrote:
>> >>> >
>> >>> > Hi,
>> >>> >
>> >>> > I'm experimenting with a locally installed 3-node NiFi cluster. This cluster receives UDP packets on the primary node. These nodes are pretty powerful: they have a good network connection, lots of memory, and SSD disks. I gave NiFi 24G of Java heap (Xms and Xmx).
>> >>> >
>> >>> > I have configured a ListenUDP processor that listens on a UDP port and receives somewhere between 20,000 and 50,000 packets per 5 minutes. Its "Max Size of Message Queue" is large enough (1M), I gave it 5 concurrent tasks, and it's running on the primary node only.
>> >>> >
>> >>> > The problem: after running for a while, I get the following error: "internal queue at maximum capacity, could not queue event."
>> >>> >
>> >>> > I have reviewed the source code and understand when this happens: the processor tries to store an event in a Java LinkedBlockingQueue that has reached its maximum capacity. The offer() method has a 100ms timeout in which it waits for space to free up; if none does, it fails and the event gets dropped. In the logs I see exactly 10 of these error messages per second (10 x 100ms is 1 second). Despite these errors, I still get a very good rate of events that get through to the next processors. Actually, it seems pretty much all of the other events get through, since the message rates in ListenUDP and the follow-up processor are very much alike. The follow-up processors can easily handle the load, and there are no full queues, congestion, or anything like that.
>> >>> >
>> >>> > What I have tried so far:
>> >>> >
>> >>> > Increasing the "Max Size of Message Queue" setting helps, but only delays the errors. They eventually return.
>> >>> >
>> >>> > Increasing heap space is a suggestion I read in a past post: I think 24G is more than enough, actually? Perhaps even too much?
>> >>> >
>> >>> > Increasing parallelism: concurrent tasks set to 5 or 10 does not help.
>> >>> >
>> >>> > I modified the code to use an ArrayBlockingQueue instead of the LinkedBlockingQueue, thinking it was some kind of garbage collection issue. This didn't help.
>> >>> >
>> >>> > I increased "Receive Buffer Size" and "Max Size of Socket Buffer", but to no avail.
>> >>> >
>> >>> > I tried batching. This helps a bit, but like increasing the "Max Size of Message Queue", it only seems to delay the eventual error messages.
>> >>> >
>> >>> > I reproduced this on my local workstation. I installed NiFi, did no OS tuning at all, and set the heap size to 4GB. I generate 1.3M UDP packets per 5 minutes (the max I can reach with a simple Python script). With "Max Size of Message Queue" set to only 100, the error soon appears. In the ListenUDP processor I see 1.34M events out; on the follow-up processor I see 1.34M events incoming. The error is not as frequent as on the cluster, though, only a few every couple of minutes, while the data rate is much higher and the queue much smaller. I'm a bit desperate and hope anyone can help me out. Why am I getting this error on a relatively quiet cluster with not that much load?
>> >>> >
>> >>> > Best regards,
>> >>> > Erik-Jan van Baaren
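For anyone digging into this later: the behavior described above boils down to a timed offer() on a bounded BlockingQueue. A minimal sketch of that pattern, not the actual ListenUDP source (the class and method names here are made up for illustration):

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;
    import java.util.concurrent.TimeUnit;

    public class TimedOfferSketch {
        // Stand-in for the processor's internal event queue; the capacity
        // plays the role of "Max Size of Message Queue".
        private final BlockingQueue<byte[]> events = new LinkedBlockingQueue<>(1_000_000);

        // Called once per received datagram. offer() waits up to 100 ms for
        // space to free up; if the queue is still full after that, the event
        // is dropped and an error like the one quoted above is logged.
        void enqueue(byte[] datagram) throws InterruptedException {
            if (!events.offer(datagram, 100, TimeUnit.MILLISECONDS)) {
                System.err.println("internal queue at maximum capacity, could not queue event");
            }
        }
    }

This pattern also explains the "exactly 10 error messages per second" observation: a single enqueueing thread that blocks 100 ms per failed offer() can fail at most 10 times per second.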

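To reproduce the load without the original tooling, here is a rough Java equivalent of the "simple python script" mentioned above; the host, port, payload, and packet count are placeholders to adjust for your setup:

    import java.net.DatagramPacket;
    import java.net.DatagramSocket;
    import java.net.InetAddress;

    public class UdpFlood {
        public static void main(String[] args) throws Exception {
            InetAddress target = InetAddress.getByName("127.0.0.1"); // assumed test host
            int port = 7777;                                         // assumed ListenUDP port
            byte[] payload = "test event\n".getBytes();
            try (DatagramSocket socket = new DatagramSocket()) {
                // Roughly the 1.3M packets mentioned above; a tight loop
                // like this will typically send them in well under 5 minutes.
                for (int i = 0; i < 1_300_000; i++) {
                    socket.send(new DatagramPacket(payload, payload.length, target, port));
                }
            }
        }
    }

Pointing this at a ListenUDP processor with a small "Max Size of Message Queue" should make the error easy to trigger.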