Jeremy,

Thanks for the details & history - I am indeed interested :) Do you see any 
errors in the logs, particularly around a failure to update the FlowFile 
Repository? I am now thinking that you may be running into NIFI-5997 [1], which 
appears to have affected at least all 1.x versions prior to 1.9. When a queue 
reaches a certain size (20,000 FlowFiles by default), NiFi swaps FlowFiles out 
to disk, in batches of 10,000, to avoid running out of memory. Before NIFI-5997 
was addressed, if there was a problem updating the FlowFile Repository, the 
swapped data would be written to a swap file but would also remain in the queue. 
The next time a FlowFile came in, the same thing would happen again. So you'd 
quickly see the queue become huge and a lot of data written to swap files, with 
many duplicates of the data.
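
To make the failure mode concrete, here is a rough, self-contained Java sketch. 
It is NOT NiFi's actual code: the class and method names are made up, and the 
always-failing repository update is simulated; only the 20,000 / 10,000 figures 
come from the defaults described above.

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Toy illustration of the pre-1.9 swap-out failure mode, not NiFi's implementation.
public class SwapOutSketch {

    static final int SWAP_THRESHOLD = 20_000;  // default queue size that triggers swapping
    static final int SWAP_BATCH_SIZE = 10_000; // FlowFiles written per swap file

    final Deque<String> queue = new ArrayDeque<>();
    long swappedOut = 0;

    // Stand-in for the FlowFile Repository update; simulate a persistent failure.
    boolean updateFlowFileRepository(List<String> batch) {
        return false;
    }

    // Stand-in for writing a swap file to disk.
    void writeSwapFile(List<String> batch) {
        swappedOut += batch.size();
    }

    void enqueue(String flowFile) {
        queue.addLast(flowFile);
        if (queue.size() >= SWAP_THRESHOLD) {
            // Collect the oldest 10,000 FlowFiles for the swap file, without removing them yet.
            List<String> batch = new ArrayList<>();
            int i = 0;
            for (String ff : queue) {
                if (i++ == SWAP_BATCH_SIZE) break;
                batch.add(ff);
            }
            writeSwapFile(batch);
            if (updateFlowFileRepository(batch)) {
                // Only on success are the swapped FlowFiles dropped from the in-memory queue.
                for (int j = 0; j < SWAP_BATCH_SIZE; j++) queue.pollFirst();
            }
            // Failure case (the bug): the batch stays in the queue, so the same data now
            // exists in the swap file AND in the queue, and the very next enqueue swaps it again.
        }
    }

    public static void main(String[] args) {
        SwapOutSketch q = new SwapOutSketch();
        for (int i = 0; i < 20_050; i++) {
            q.enqueue("flowfile-" + i);
        }
        System.out.println("queue size: " + q.queue.size());              // never shrinks
        System.out.println("flowfiles written to swap: " + q.swappedOut); // keeps growing, duplicated
    }
}

Running the sketch, the in-memory queue never shrinks while the amount "written 
to swap" keeps growing with duplicates, which matches the huge swap directory 
and queue count you're describing.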

Not 100% sure that this is what you're hitting, but it's my best hunch at the 
moment. Please do see if you have any errors logged around updating the 
FlowFile Repository.

Thanks
-Mark


[1] https://issues.apache.org/jira/browse/NIFI-5997



On Aug 30, 2019, at 11:20 PM, Jeremy Pemberton-Pigott <[email protected]> wrote:

Thanks for your reply Mark.

The flow was in sync between nodes; no one edited it, as it was started from a 
Docker image and left running. It had been running for about a month. A restart 
didn't clear the queue. Only 1 node had an issue; the others were clear. The 
FlowFile Repository swap directory was about 630 GB in size on the full node. 
It's running on CentOS 7.5.

Below is just a bit of history if you're interested; otherwise skip it.

The cluster is running CentOS 7.5 on those 3 nodes. NiFi was configured with 
4 GB of heap in the bootstrap. It runs on a partition with 1 TB of free space 
(16-thread, 64 GB RAM nodes). It had been running for almost a month before 
something happened, and the backlog then built up for about a week before 
someone noticed something was up. The partition was totally full on 1 node, but 
NiFi was still running (not processing anything, of course, on the full node); 
another node was running normally; and the 3rd had lost its network connection 
on 1 card, which I think precipitated the problem, so that node was not 
connected to the cluster.

I could see the queue was about 210 million in the UI before I shut NiFi down 
to fix things. I cleared out the log folder of the full node (around 200 GB of 
NiFi app logs; for some reason it isn't rolling them correctly in this case, but 
the other nodes are fine) and restarted, but the large-queue node was giving OOM 
errors on Jetty start-up, so I increased the heap to 24 GB on all nodes to get 
things started. It could then run, and the queue still showed the same count, 
which suggests it is correct (I have encountered queues clearing on restart 
before, with small queues).

It began processing the queue, so I left it for 2 days to recover while clearing 
out the log folder periodically to keep some drive space available (it was 
generating about 40 GB of logs every few hours); the FlowFile Repository swap 
folder started off at about 640 GB (normally it's just a few MB when running). 
But I noticed that the node would stop processing after a short period of time, 
with an UpdateAttribute processor showing a partially full queue of 4,000 going 
into a funnel and the whole flow hanging with zero in/out everywhere I checked. 
Each time I restarted NiFi those small queues would clear, but the same thing 
would happen again.

The data in the large queue is not critical this time, so I started clearing the 
queue from the NCM; it's going at a rate of about 75k FlowFiles per minute, so 
I'll leave it running over the weekend to see how far it gets while everything 
else keeps running to clear the other parallel queues on that node.

Other than having the large queue, that node is still running, and the other 
nodes are working fine now. No new data is streaming in until Tuesday, so I hope 
to clear the backlog on that node by then.

Regards,

Jeremy


On 31 Aug 2019, at 03:19, Mark Payne <[email protected]> wrote:

Jeremy,

I'm not sure of any bugs off the top of my head that would necessarily cause 
this, but version 1.6.0 is getting fairly old, so there may well be something 
that I've forgotten about. That being said, there are two "types of bugs" that 
I think are most probable here: (1) There isn't really that much data queued up 
and NiFi is actually reporting the wrong size for the queue; or (2) perhaps one 
node in the cluster got out of sync in terms of the flow and one node actually 
is configured without backpressure being applied?

So there are two things that I would recommend checking out to help diagnose 
what is going on here. Firstly, is the huge backlog spread across all nodes or 
just on one node in the cluster? To determine this, you can go to the "Global 
menu" / Hamburger menu, and go to the Summary Page. From there, if you go to 
the Connections tab and find the connection in there (should be easy if you 
sort the table based on queue size), you can click the button on the far-right 
that shows the Cluster view, which will break down the size of the connection 
per-node, so you know if all nodes in the cluster have a huge queue size or 
just one.

Secondly, I would be curious to know what happens if you restart the node(s) 
with the huge backlog? Do the FlowFiles magically disappear on restart, with 
the queue showing a small number (indicative of the queue size just being 
wrong), or are they still there (indicative of the queue size being correct)?

Also, what operating system are you running? There was a bug recently about 
data not being properly swapped back in on Windows but I think that was 
introduced after 1.6.0 and then fixed quickly.

This should help to know where to focus energy on finding the problem.

Thanks
-Mark

On Aug 30, 2019, at 12:24 PM, Jeremy Pemberton-Pigott <[email protected]> wrote:

Yes, there is one, but it's not near the output port; the SplitJson processor is 
shortly after the input port of a child PG. The output is actually connected to 
3 child PGs, and each of those has an UpdateAttribute processor on its output 
port. The other PG input port on the left is connected to a RouteOnAttribute 
processor inside it.

Queue of PG1 input-> input port to processors -> connection to 3 child PGs -> 
each PG has split json after input port -> processors -> update attribute -> 
queue to output port of child PG -> queue to output port of PG1 -> queue to PG2 
input (100s of millions in queue) -> input port to route on attribute -> ...

Regards,

Jeremy


On 30 Aug 2019, at 20:45, Bryan Bende <[email protected]> wrote:

Can you show what is happening inside the first process group? Is there a 
SplitText processor with line count of 1?

On Fri, Aug 30, 2019 at 4:21 AM Jeremy Pemberton-Pigott <[email protected]> wrote:
Hi Pierre,

I'm using Nifi version 1.6.0.

04/03/2018 08:16:22 UTC

Tagged nifi-1.6.0-RC3

From 7c0ee01 on branch NIFI-4995-RC3

FlowFile expiration = 0
Back pressure object threshold = 20000
Back pressure data size threshold = 1GB

The connection is just from the output port of 1 PG to the input port of 
another PG.  Inside the PG all the connections are using the same settings 
between processors.

Regards,

Jeremy

On Fri, Aug 30, 2019 at 4:14 PM Pierre Villard <[email protected]> wrote:
Hi Jeremy,

It seems very weird that you'd get 200M FlowFiles in a relationship that should 
have backpressure set at 20k FlowFiles. While backpressure is not a hard limit, 
you should not get to such numbers. Can you give us more details? What version 
of NiFi are you using? What's the configuration of the connection between your 
two process groups?
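
For context on why the threshold can be exceeded at all, here is a toy Java 
sketch. It is purely illustrative and is not NiFi's scheduler: backpressure is 
checked before a component runs, so a single run that emits many FlowFiles (for 
example a split of one large input) can overshoot the configured object 
threshold, but only by the output of that one run, nowhere near 200M.

import java.util.ArrayDeque;
import java.util.Deque;

// Illustration only: why an object threshold behaves as a soft limit.
public class BackPressureSketch {

    static final int OBJECT_THRESHOLD = 20_000;

    public static void main(String[] args) {
        Deque<String> downstream = new ArrayDeque<>();

        // The queue is below the threshold, so the upstream component is allowed to run...
        if (downstream.size() < OBJECT_THRESHOLD) {
            // ...but this single run splits one large input into 150,000 children.
            for (int i = 0; i < 150_000; i++) {
                downstream.addLast("split-" + i);
            }
        }

        // Backpressure now prevents further runs, but the queue already holds
        // far more than the configured 20,000 objects.
        System.out.println("queued FlowFiles: " + downstream.size());
    }
}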

Thanks,
Pierre

On Fri, 30 Aug 2019 at 07:46, Jeremy Pemberton-Pigott <[email protected]> wrote:
Hi,

I have a 3-node NiFi 1.6.0 cluster. It ran out of disk space when there was a 
logjam of FlowFiles (from slow HBase lookups). My queue is configured for 
20,000, but 1 node has over 206 million FlowFiles stuck in the queue. I managed 
to clear up some disk space to get things going again, but it seems that after a 
few minutes of processing, all the processors in the Log Parser process group 
stop processing and show zero in/out.

Is this a bug fixed in a later version?

Each time, I have to tear down the Docker containers running NiFi and restart 
them to process a few tens of thousands, then repeat every few minutes. Any idea 
what I should do to keep it processing the data (nifi-app.log doesn't show me 
anything unusual about the stop or delay) until the 1 node can clear the backlog?


Regards,

Jeremy
--
Sent from Gmail Mobile

