Thanks for your reply Mark.

The flow was in sync between nodes; no one had edited it, since it was started 
from a Docker image and left running for about a month. A restart didn't clear 
the queue. Only one node had an issue; the others were clear. The flow file 
repository swap directory was about 630 GB on the full node. It's running on 
CentOS 7.5.

Below is just a bit of history, if you're interested; otherwise skip it.

The cluster is running CentOS 7.5 on those 3 nodes. NiFi was configured with 
4 GB of heap in the bootstrap, running on a partition with 1 TB of free space 
(16-thread, 64 GB RAM nodes). It had been running for almost a month before 
something happened, and then a backlog built up for about a week before someone 
noticed something was up. The partition was totally full on one node, but NiFi 
was still running there (not processing anything, of course). The second node 
was running normally, and the third had lost its network connection on one 
card, which I think precipitated the problem, so that node was not connected 
to the cluster.

I could see the queue was about 210 million in the UI before I shut NiFi down 
to fix things. I cleared out the log folder of the full node (around 200 GB of 
NiFi app logs; for some reason it's not rolling them correctly on this node, 
but the other nodes are fine) and restarted, but the large-queue node was 
giving OOM errors on Jetty startup, so I increased the heap to 24 GB on all 
nodes to get things started. It could then run, and the queue count showed as 
correct (I have encountered queues clearing on restart before, with small 
queues).
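In case it helps anyone else, the heap change goes in conf/bootstrap.conf; a sketch of the edit (the java.arg numbering below is from a stock install and may differ in yours):

```properties
# conf/bootstrap.conf -- JVM memory settings (argument numbers may vary per install)
# Before: 4 GB heap
# java.arg.2=-Xms4g
# java.arg.3=-Xmx4g

# After: 24 GB heap, so the swapped FlowFiles could be loaded back in at startup
java.arg.2=-Xms24g
java.arg.3=-Xmx24g
```

NiFi has to be restarted for the bootstrap change to take effect.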

It began processing the queue, so I left it for 2 days to recover while 
clearing out the log folder periodically to keep some drive space available 
(it was generating about 40 GB of logs every few hours). The flow file 
repository swap folder started off at about 640 GB (normally it's just a few 
MB while running). But I noticed the node would stop processing after a short 
period: an UpdateAttribute feeding a funnel would show a partially full queue 
of 4,000, and the whole flow would hang with zero in/out everywhere I checked. 
Each time I restarted NiFi those small queues would clear, but the same thing 
would happen again.
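On the log growth: if anyone else hits the same rolling problem, the app-log appender lives in conf/logback.xml, and adding a total size cap can keep the partition from filling while you investigate. A sketch, using the appender and property names from a stock 1.6 install (worth verifying against your own file):

```xml
<!-- conf/logback.xml: cap nifi-app.log growth (sketch; names per a stock install) -->
<appender name="APP_FILE" class="ch.qos.logback.core.rolling.RollingFileAppender">
    <file>${org.apache.nifi.bootstrap.config.log.dir}/nifi-app.log</file>
    <rollingPolicy class="ch.qos.logback.core.rolling.TimeBasedRollingPolicy">
        <fileNamePattern>${org.apache.nifi.bootstrap.config.log.dir}/nifi-app_%d{yyyy-MM-dd_HH}.%i.log</fileNamePattern>
        <timeBasedFileNamingAndTriggeringPolicy class="ch.qos.logback.core.rolling.SizeAndTimeBasedFNATP">
            <maxFileSize>100MB</maxFileSize>
        </timeBasedFileNamingAndTriggeringPolicy>
        <maxHistory>10</maxHistory>
        <!-- added: hard cap on total retained app logs -->
        <totalSizeCap>50GB</totalSizeCap>
    </rollingPolicy>
    <encoder>
        <pattern>%date %level [%thread] %logger{40} %msg%n</pattern>
    </encoder>
</appender>
```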

The large queue is not critical this time, so I started clearing it from the 
NCM. It's dropping about 75k flow files per minute, so I'll leave it running 
over the weekend to see how far it gets, with everything else still running to 
clear the other parallel queues on that node.
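A quick back-of-the-envelope check on how long the drop should take at that rate (using the 206M backlog and 75k/min figures from above):

```python
# Rough estimate of how long clearing the backlog takes at the observed drop rate.
queued_flowfiles = 206_000_000   # backlog on the affected node
drop_rate_per_min = 75_000       # observed drop-request rate

minutes = queued_flowfiles / drop_rate_per_min
hours = minutes / 60
days = hours / 24
print(f"{minutes:.0f} min ~= {hours:.1f} h ~= {days:.1f} days")
# -> 2747 min ~= 45.8 h ~= 1.9 days
```

So roughly two days, which is why I'm leaving it over the weekend.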

Other than the large queue on that one node, it is still running, and the 
other nodes are working fine now. No new data is streaming in until Tuesday, 
so I hope to clear the backlog on the one node by then.

Regards,

Jeremy


On 31 Aug 2019, at 03:19, Mark Payne <[email protected]> wrote:

Jeremy,

I'm not sure of any bugs off the top of my head that would necessarily cause 
this, but version 1.6.0 is getting fairly old, so there may well be something 
that I've forgotten about. That being said, there are two "types of bugs" that 
I think are most probable here: (1) There isn't really that much data queued up 
and NiFi is actually reporting the wrong size for the queue; or (2) perhaps one 
node in the cluster got out of sync in terms of the flow and is actually 
configured without backpressure.

So there are two things that I would recommend checking out to help diagnose 
what is going on here. Firstly, is the huge backlog spread across all nodes or 
just on one node in the cluster? To determine this, you can go to the "Global 
menu" / Hamburger menu, and go to the Summary Page. From there, if you go to 
the Connections tab and find the connection in there (should be easy if you 
sort the table based on queue size), you can click the button on the far-right 
that shows the Cluster view, which will break down the size of the connection 
per-node, so you know if all nodes in the cluster have a huge queue size or 
just one.
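The same per-node breakdown is also available from the REST API (GET /nifi-api/flow/connections/{id}/status?nodewise=true). A sketch of tallying the per-node counts from a response; the JSON shape below is abbreviated and the field names are from memory of the 1.x API, so double-check against your instance:

```python
# Sketch: find which node holds the backlog from a nodewise connection status.
# sample_status mimics (in abbreviated, assumed form) the response of
# GET /nifi-api/flow/connections/{id}/status?nodewise=true
sample_status = {
    "connectionStatus": {
        "nodeSnapshots": [
            {"address": "node1", "statusSnapshot": {"flowFilesQueued": 206_000_000}},
            {"address": "node2", "statusSnapshot": {"flowFilesQueued": 1_200}},
            {"address": "node3", "statusSnapshot": {"flowFilesQueued": 0}},
        ]
    }
}

# Map each node to its queued FlowFile count, then pick the worst offender.
per_node = {
    snap["address"]: snap["statusSnapshot"]["flowFilesQueued"]
    for snap in sample_status["connectionStatus"]["nodeSnapshots"]
}
worst = max(per_node, key=per_node.get)
print(per_node)
print("backlog is on:", worst)
```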

Secondly, I would be curious to know what happens if you restart the node(s) 
with the huge backlog? Do the FlowFiles magically disappear on restart, with 
the queue showing a small number (indicative of the queue size just being 
wrong), or are they still there (indicative of the queue size being correct)?

Also, what operating system are you running? There was a bug recently about 
data not being properly swapped back in on Windows but I think that was 
introduced after 1.6.0 and then fixed quickly.

This should help to know where to focus energy on finding the problem.

Thanks
-Mark

> On Aug 30, 2019, at 12:24 PM, Jeremy Pemberton-Pigott <[email protected]> 
> wrote:
> 
> Yes, there is one, but not near the output port of the SplitJson processor; 
> it's shortly after the input port of a child PG. The output is actually 
> connected to 3 child PGs, and each of those has an UpdateAttribute processor 
> on its output port. The other PG input port on the left is connected to a 
> RouteOnAttribute processor inside it. 
> 
> Queue of PG1 input-> input port to processors -> connection to 3 child PGs -> 
> each PG has split json after input port -> processors -> update attribute -> 
> queue to output port of child PG -> queue to output port of PG1 -> queue to 
> PG2 input (100s of millions in queue) -> input port to route on attribute -> 
> ...
> 
> Regards,
> 
> Jeremy
> 
> 
> On 30 Aug 2019, at 20:45, Bryan Bende <[email protected]> wrote:
> 
> Can you show what is happening inside the first process group? Is there a 
> SplitText processor with line count of 1? 
> 
>> On Fri, Aug 30, 2019 at 4:21 AM Jeremy Pemberton-Pigott 
>> <[email protected]> wrote:
>> Hi Pierre,
>> 
>> I'm using Nifi version 1.6.0.
>> 04/03/2018 08:16:22 UTC
>> 
>> Tagged nifi-1.6.0-RC3
>> 
>> From 7c0ee01 on branch NIFI-4995-RC3
>> 
>> FlowFile expiration = 0
>> Back pressure object threshold = 20000
>> Back pressure data size threshold = 1GB
>> 
>> The connection is just from the output port of 1 PG to the input port of 
>> another PG.  Inside the PG all the connections are using the same settings 
>> between processors.
>> 
>> Regards,
>> 
>> Jeremy
>> 
>>> On Fri, Aug 30, 2019 at 4:14 PM Pierre Villard 
>>> <[email protected]> wrote:
>>> Hi Jeremy,
>>> 
>>> It seems very weird that you get 200M flow files in a relationship that 
>>> should have backpressure set at 20k flow files. While backpressure is not a 
>>> hard limit, you should not get to such numbers. Can you give us more 
>>> details? What version of NiFi are you using? What's the configuration of 
>>> your relationship between your two process groups?
>>> 
>>> Thanks,
>>> Pierre
>>> 
>>>> Le ven. 30 août 2019 à 07:46, Jeremy Pemberton-Pigott 
>>>> <[email protected]> a écrit :
>>>> Hi,
>>>> 
>>>> I have a 3 node Nifi 1.6.0 cluster.  It ran out of disk space when there 
>>>> was a log jam of flow files (from slow HBase lookups).  My queue is 
>>>> configured for 20,000 but 1 node has over 206 million flow files stuck in 
>>>> the queue.  I managed to clear up some disk space to get things going 
>>>> again but it seems that after a few mins of processing all the processors 
>>>> in the Log Parser process group will stop processing and show zero in/out.
>>>> 
>>>> Is this a bug fixed in a later version?
>>>> 
>>>> Each time I have to tear down the Docker containers running NiFi and 
>>>> restart it to process a few tens of thousands, repeating every few mins.  
>>>> Any idea what I should do to keep it processing the data (nifi-app.log 
>>>> doesn't show me anything unusual about the stop or delay) until the 1 node 
>>>> can clear the backlog?
>>>> 
>>>> <image.png>
>>>> 
>>>> Regards,
>>>> 
>>>> Jeremy
> -- 
> Sent from Gmail Mobile
