...this feels like a bug to me. I think Erik-Jan's expectation that nothing would have begun for ListenUDP given primary-node-only config is fair. I also think our current position of 'just not calling onTrigger' is fair too, but less intuitive for users.

What do y'all think?
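To make the behaviour concrete, here is a minimal, self-contained Java sketch of what the thread below describes. The class and variable names are illustrative only, not NiFi's actual implementation: the receiving side starts on every node regardless of the primary-node-only setting, but only the primary node drains the internal queue in onTrigger, so on the other nodes the bounded queue fills up and further offers time out.

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.TimeUnit;

    // Illustrative simulation only; these are not NiFi's real classes.
    public class PrimaryNodeOnlyDemo {

        static final int MAX_QUEUE_SIZE = 100;      // stand-in for "Max Size of Message Queue"
        static final long OFFER_TIMEOUT_MS = 100;   // the hard-coded offer timeout discussed below

        public static void main(String[] args) throws Exception {
            boolean isPrimary = args.length > 0 && args[0].equals("primary");
            BlockingQueue<byte[]> events = new ArrayBlockingQueue<>(MAX_QUEUE_SIZE);

            // The receiving thread is started on every node, whatever the scheduling strategy.
            Thread reader = new Thread(() -> {
                try {
                    while (true) {
                        Thread.sleep(1);                   // simulate ~1,000 packets per second
                        byte[] datagram = new byte[64];    // stand-in for a received UDP packet
                        if (!events.offer(datagram, OFFER_TIMEOUT_MS, TimeUnit.MILLISECONDS)) {
                            // Each failed offer blocks for the full timeout, so at 100 ms this
                            // prints at most ~10 times per second, matching the logs.
                            System.err.println("internal queue at maximum capacity, could not queue event.");
                        }
                    }
                } catch (InterruptedException e) {
                    // stop receiving
                }
            });
            reader.setDaemon(true);
            reader.start();

            if (isPrimary) {
                // Only the primary node "runs onTrigger" and drains the queue.
                while (true) {
                    byte[] event = events.poll(1, TimeUnit.SECONDS);
                    if (event != null) {
                        // hand the event to the next processor...
                    }
                }
            } else {
                // A non-primary node never drains the queue, so it fills within seconds.
                Thread.sleep(Long.MAX_VALUE);
            }
        }
    }

Run it with the argument "primary" to simulate the primary node (no errors); run it without arguments and the capacity error starts repeating once the queue is full.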
On Wed, Jun 5, 2019 at 12:10 PM Erik-Jan <erik...@gmail.com> wrote:

> This is what I do, basically. I want a highly available setup with a minimum of components, since each component can (and will) fail. Traffic reaches all nodes, but only a single node should read it.
>
> On Wed, Jun 5, 2019 at 18:07, James Srinivasan <james.sriniva...@gmail.com> wrote:
>
>> In our case the stream is UDP broadcast, so available to all nodes anyway. I've been meaning to test UDP multicast but haven't got round to it yet.
>>
>> On Wed, 5 Jun 2019, 17:03, Bryan Bende <bbe...@gmail.com> wrote:
>>
>>> That is probably a valid point, but how about putting a load balancer in front to handle that?
>>>
>>> On Wed, Jun 5, 2019 at 11:30 AM James Srinivasan <james.sriniva...@gmail.com> wrote:
>>>
>>> > Presumably you'd want to mirror the stream to all nodes for when the primary node changes?
>>> >
>>> > On Wed, 5 Jun 2019, 13:46, Bryan Bende <bbe...@gmail.com> wrote:
>>> >
>>> >> The processor is started on all nodes, but the onTrigger method is only executed on the primary node.
>>> >>
>>> >> This is something we've discussed trying to improve before, but the real question is: why are you sending data to the other nodes if you don't expect the processor to execute there?
>>> >>
>>> >> On Wed, Jun 5, 2019 at 7:04 AM Erik-Jan <erik...@gmail.com> wrote:
>>> >>
>>> >> > I figured it out after further testing. The processor runs on all nodes, despite the explicit "run on primary node only" option that I selected, but only on the primary node is the queue processed. On the other nodes the queue gets filled until the max is reached, after which the error message starts appearing. What I missed before is that the message is coming from the other, non-primary nodes.
>>> >> > I'm not sure if this is intended behavior or if it is a bug, though! For me it's a bug, since I really want this processor to run on the primary only.
>>> >> >
>>> >> > On Tue, Jun 4, 2019 at 16:34, Erik-Jan <erik...@gmail.com> wrote:
>>> >> >
>>> >> >> Hi Bryan,
>>> >> >>
>>> >> >> Yes, I have considerably increased the numbers in the controller settings. I don't mind getting my hands dirty; increasing the timeout is worth a try.
>>> >> >>
>>> >> >> The errors seem to appear after quite a while. Usually I see these messages the next morning, so testing and experimenting with this error takes a lot of time.
>>> >> >>
>>> >> >> Today I've been trying to reproduce this on a virtual machine with the same OS, NiFi, and Java versions, but to no avail. The difference is that this VM is not a cluster and has limited memory and CPU, yet it is still able to handle much more UDP data, with the error appearing only a few times so far after hours of running. It leads me to think there must be something in the configuration of the cluster that's causing this. I will also try a vanilla NiFi install on one of the nodes without clustering, to see if my configuration and cluster setup is somehow the cause.
>>> >> >>
>>> >> >> On Tue, Jun 4, 2019 at 16:14, Bryan Bende <bbe...@gmail.com> wrote:
>>> >> >>
>>> >> >>> Hi Erik,
>>> >> >>>
>>> >> >>> It sounds like you have tried most of the common tuning options that can be done. I would have expected batching plus increasing concurrent tasks from 1 to 3-5 to be the biggest improvement.
>>> >> >>>
>>> >> >>> Have you increased the number of threads in your overall thread pool according to your hardware? (From the top-right menu, Controller Settings.)
>>> >> >>>
>>> >> >>> I would be curious what happens if you did some tests increasing the timeout where it attempts to place the message in the queue from 100 ms to 200 ms, and then maybe 500 ms if it still happens.
>>> >> >>>
>>> >> >>> I know this requires a code change since that timeout is hard-coded, but it sounds like you already went down that path with trying a different queue :)
>>> >> >>>
>>> >> >>> -Bryan
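The change Bryan describes is essentially a one-line edit to a hard-coded value. A rough sketch of the pattern, assuming the timeout lives in a constant; the class and field names here are invented and only the offer-with-timeout behaviour matches what the thread describes, not the actual ListenUDP source:

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;
    import java.util.concurrent.TimeUnit;

    // Illustrative sketch only; names are invented, not the real ListenUDP code.
    public class OfferTimeoutSketch {

        // Hard-coded at 100 ms today; the experiment is to rebuild with 200 ms, then 500 ms.
        private static final long QUEUE_OFFER_TIMEOUT_MS = 100;

        private final BlockingQueue<byte[]> events;

        OfferTimeoutSketch(int maxQueueSize) {
            this.events = new LinkedBlockingQueue<>(maxQueueSize);   // "Max Size of Message Queue"
        }

        boolean queueEvent(byte[] event) throws InterruptedException {
            // Wait up to the timeout for space to free up; otherwise drop the event and log.
            if (events.offer(event, QUEUE_OFFER_TIMEOUT_MS, TimeUnit.MILLISECONDS)) {
                return true;
            }
            System.err.println("internal queue at maximum capacity, could not queue event.");
            return false;
        }

        public static void main(String[] args) throws InterruptedException {
            OfferTimeoutSketch sketch = new OfferTimeoutSketch(2);   // tiny capacity to force a drop
            for (int i = 0; i < 3; i++) {
                System.out.println("queued: " + sketch.queueEvent(new byte[64]));
            }
        }
    }

Each failed offer blocks the receiving thread for the full timeout, which is why the logs show at most ten of these errors per second at 100 ms; at 500 ms that ceiling drops to two per second, at the cost of the receiver stalling longer whenever the queue is full.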
>>> >> >>> On Tue, Jun 4, 2019 at 4:28 AM Erik-Jan <erik...@gmail.com> wrote:
>>> >> >>>
>>> >> >>> > Hi,
>>> >> >>> >
>>> >> >>> > I'm experimenting with a locally installed 3-node NiFi cluster. This cluster receives UDP packets on the primary node. These nodes are pretty powerful, have a good network connection, lots of memory, and SSD disks. I gave NiFi 24G of Java heap (Xms and Xmx).
>>> >> >>> >
>>> >> >>> > I have configured a ListenUDP processor that listens on a UDP port and receives somewhere between 20,000 and 50,000 packets per 5 minutes. Its "Max Size of Message Queue" is large enough (1M), I gave it 5 concurrent tasks, and it's running on the primary node only.
>>> >> >>> >
>>> >> >>> > The problem: after running for a while, I get the following error: "internal queue at maximum capacity, could not queue event."
>>> >> >>> >
>>> >> >>> > I have reviewed the source code and understand when this happens. It happens when the processor tries to store an event in a Java LinkedBlockingQueue and that queue has reached its maximum capacity. The offer() method has a 100 ms timeout in which it waits for space to free up; then it fails and the event gets dropped. In the logs I see exactly 10 of these error messages per second (10 x 100 ms is 1 second). Despite these errors, I still get a very good rate of events that get through to the next processors. Actually, it seems pretty much all of the other events get through, since the message rates in ListenUDP and the follow-up processor are very much alike. The follow-up processors can easily handle the load and there are no full queues, congestion, or anything like that.
>>> >> >>> >
>>> >> >>> > What I have tried so far:
>>> >> >>> >
>>> >> >>> > Increasing the "Max Size of Message Queue" setting helps, but only delays the errors. They eventually return.
>>> >> >>> >
>>> >> >>> > Increasing heap space is a suggestion I read in a past post: I think 24G is more than enough actually? Perhaps even too much?
>>> >> >>> >
>>> >> >>> > Increasing parallelism: concurrent tasks set to 5 or 10 does not help.
>>> >> >>> >
>>> >> >>> > I modified the code to use an ArrayBlockingQueue instead of the LinkedBlockingQueue, thinking it was some kind of garbage collection issue. This didn't help.
>>> >> >>> >
>>> >> >>> > I increased "Receive Buffer Size" and "Max Size of Socket Buffer", but to no avail.
>>> >> >>> >
>>> >> >>> > I tried batching. This helps a bit; like increasing the "Max Size of Message Queue", it only seems to delay the eventual error messages though.
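On the "Receive Buffer Size" and "Max Size of Socket Buffer" attempt above: as far as I understand, those properties end up requesting a larger OS-level receive buffer on the UDP socket, and the OS may grant less than asked for unless kernel limits (for example net.core.rmem_max on Linux) are raised. A plain-Java sketch, not NiFi code, of requesting a buffer and checking what was actually granted:

    import java.net.InetSocketAddress;
    import java.net.StandardSocketOptions;
    import java.nio.channels.DatagramChannel;

    // Plain-Java illustration of a UDP receive buffer request; not NiFi code.
    public class ReceiveBufferCheck {
        public static void main(String[] args) throws Exception {
            try (DatagramChannel channel = DatagramChannel.open()) {
                // Ask for a 4 MB socket receive buffer (the kind of thing
                // "Max Size of Socket Buffer" controls).
                channel.setOption(StandardSocketOptions.SO_RCVBUF, 4 * 1024 * 1024);
                channel.bind(new InetSocketAddress(0));   // any free port, for demonstration

                // The OS may grant less than requested if kernel limits have not been raised.
                int granted = channel.getOption(StandardSocketOptions.SO_RCVBUF);
                System.out.println("requested 4194304 bytes, OS granted " + granted + " bytes");
            }
        }
    }

If the granted value comes back much smaller than requested, the processor-level setting alone will not have the intended effect until the OS limit is raised.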
>>> >> >>> > I reproduced this on my local workstation. I installed NiFi, did no OS tuning at all, and set the heap size to 4GB. I generate 1.3M UDP packets per 5 minutes (the max I can reach with a simple Python script). With "Max Size of Message Queue" set to only 100, the error soon appears. In the ListenUDP processor I see 1.34M events out; on the follow-up processor I see 1.34M events incoming. The error is not as frequent as on the cluster, though: only a few every couple of minutes, while the data rate is much higher and the queue much smaller. I'm a bit desperate and hope anyone can help me out. Why am I getting this error on a relatively quiet cluster with not that much load?
>>> >> >>> >
>>> >> >>> > Best regards,
>>> >> >>> > Erik-Jan van Baaren
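For anyone who wants to reproduce the load test, here is a minimal UDP sender along the lines of the "simple Python script" mentioned above, written in Java to match the other sketches in this thread; the target host, port, payload, and packet count are placeholders:

    import java.net.DatagramPacket;
    import java.net.DatagramSocket;
    import java.net.InetAddress;
    import java.nio.charset.StandardCharsets;

    // Minimal UDP load generator; host, port, payload, and count are placeholders.
    public class UdpFlood {
        public static void main(String[] args) throws Exception {
            InetAddress target = InetAddress.getByName(args.length > 0 ? args[0] : "127.0.0.1");
            int port = args.length > 1 ? Integer.parseInt(args[1]) : 8881;   // ListenUDP port
            byte[] payload = "test message\n".getBytes(StandardCharsets.UTF_8);

            try (DatagramSocket socket = new DatagramSocket()) {
                // ~1.3M packets, roughly the 5-minute volume described above, sent as fast as possible.
                for (int i = 0; i < 1_300_000; i++) {
                    socket.send(new DatagramPacket(payload, payload.length, target, port));
                }
            }
            System.out.println("done");
        }
    }

Pointing this at a ListenUDP processor with a small "Max Size of Message Queue" should reproduce the occasional "internal queue at maximum capacity" error described above.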