I checked the logs that I could find, but nothing useful remains; most of
them were removed as part of the recovery process, and before that the
drive had filled up with earlier logs of errors about not being able to
write or update any files.  What I do notice is that if a processor
receives a large number of files from a list processor, for example
300,000 files, it isn't swapping the files back in after swapping them out
of the queue.  The queue will show files to be processed, and the next
processor in the flow will show a very high task count, in the millions,
but it isn't processing anything.  I experience the same problem with
1.9.2, and I'm trying to build a replicable flow to help clarify the
problem.  The issue between groups could be a similar problem, but I've
not been able to replicate it yet as I'm a little busy at the moment.
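
For reference, the swap behaviour I'm describing should be governed by
these nifi.properties entries (values below are the documented defaults;
I'm assuming nothing in our config has overridden them):

    nifi.queue.swap.threshold=20000
    nifi.swap.in.period=5 sec
    nifi.swap.in.threads=1
    nifi.swap.out.period=5 sec
    nifi.swap.out.threads=4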

Regards,

Jeremy

On Sat, Aug 31, 2019 at 10:58 PM Mark Payne <marka...@hotmail.com> wrote:

> Jeremy,
>
> Thanks for the details & history - I am indeed interested :) Do you see
> any errors in the logs? Particularly around a failure to update the
> FlowFile Repository? I am now thinking that you may be running into
> NIFI-5997 [1]. This appears to have affected at least all 1.x versions
> prior to 1.9. When a queue reaches a certain size (20,000 FlowFiles by
> default), NiFi will swap out the flowfiles to disk to avoid running out of
> memory. It does this in batches of 10,000 FlowFiles at a time. Before
> NIFI-5997 was addressed, if there was a problem updating the FlowFile
> Repository, the data would be written to a swap file while also remaining
> in the queue. The next time a FlowFile came in, the same thing would
> happen again, so you'd quickly see the queue become huge and a lot of data
> written to swap files, with many duplicates of the data.
>
> Not 100% sure that this is what you're hitting, but it's my best hunch at
> the moment. Please do see if you have any errors logged around updating the
> FlowFile Repository.
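>
> (A quick way to look, assuming the default log layout under logs/, is
> something along these lines:
>
>     grep -iE "error|fail" logs/nifi-app.log* | grep -i "flowfile repo"
>
> The exact wording of the message varies by version, so treat that pattern
> as a rough first pass rather than the definitive error text.)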
>
> Thanks
> -Mark
>
>
> [1] https://issues.apache.org/jira/browse/NIFI-5997
>
>
>
> On Aug 30, 2019, at 11:20 PM, Jeremy Pemberton-Pigott <
> fuzzych...@gmail.com> wrote:
>
> Thanks for your reply Mark.
>
> The flow was in sync between nodes; no one edited it, as it was started
> from a Docker image and left running. It had been running for about a
> month. A restart didn't clear the queue. Only 1 node had an issue; the
> others were clear. The flow file repository swap directory was about 630
> GB in size on the full node. It's running on CentOS 7.5.
>
> Below is just a bit of history, if you're interested; otherwise skip it.
>
> The cluster is running CentOS 7.5 on those 3 nodes. Nifi was configured
> with 4GB of heap in the bootstrap. It's run on a partition with 1TB of free
> space (16 thread and 64GB RAM nodes). It had been running for almost a
> month before something happened and then started a backlog for about 1 week
> before someone noticed something was up. The partition was totally full on
> 1 node but Nifi was still running (not processing anything, of course, on
> the full node), another node was running normally, and the 3rd had lost
> its network connection on 1 card, which I think precipitated the problem,
> so that node was not connected to the cluster.
>
> I could see the queue was about 210 million in the UI before I shut Nifi
> down to fix things. I cleared out the log folder of the full node (around
> 200 GB of Nifi app logs; for some reason the logs aren't rolling correctly
> in this case, though the other nodes are fine) and restarted, but the
> large-queue node was giving OOM errors on Jetty startup, so I increased
> the heap to 24 GB on all nodes to get things started. After that it could
> run, and the queue size shown was correct (I have encountered queues
> clearing on restart before, with small queues).
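>
> (For reference, the heap change was just the usual conf/bootstrap.conf
> edit, roughly the following, assuming a standard install layout:
>
>     # conf/bootstrap.conf
>     java.arg.2=-Xms24g
>     java.arg.3=-Xmx24g
> )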
>
> It began processing the queue, so I left it for 2 days to recover while
> clearing out the log folder periodically to keep some drive space
> available (it was generating about 40 GB of logs every few hours). The
> flow file repository swap folder started off at about 640 GB (normally
> it's just a few MB when it's running). But I noticed that the node would
> stop processing after a short period of time, with an UpdateAttribute
> showing a partially full queue of 4,000 going into a funnel and the whole
> flow hanging with zero in/out everywhere I checked. Each time I restarted
> Nifi those small queues would clear, but the same thing would happen again.
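>
> (Side note: to stop the app log from filling the disk again while this
> recovers, one option would be to put a hard cap on the app-log appender in
> conf/logback.xml. A rough sketch only; the policy class and file name
> pattern in the shipped file may differ by version:
>
>     <rollingPolicy class="ch.qos.logback.core.rolling.SizeAndTimeBasedRollingPolicy">
>         <fileNamePattern>${org.apache.nifi.bootstrap.config.log.dir}/nifi-app_%d{yyyy-MM-dd_HH}.%i.log</fileNamePattern>
>         <maxFileSize>100MB</maxFileSize>
>         <maxHistory>30</maxHistory>
>         <totalSizeCap>20GB</totalSizeCap>
>     </rollingPolicy>
> )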
>
> The data in the large queue is not critical this time, so I started
> clearing the queue from the NCM. It's going at a rate of about 75k flow
> files per minute, so I'll leave it running over the weekend to see how far
> it gets while everything else keeps running to clear the other parallel
> queues on that node.
>
> Other than having the large queue, that one node is still running, and the
> other nodes are working fine now. No new data is streaming in until Tuesday
> so I hope to clear the backlog on the one node by then.
>
> Regards,
>
> Jeremy
>
>
> On 31 Aug 2019, at 03:19, Mark Payne <marka...@hotmail.com> wrote:
>
> Jeremy,
>
> I'm not sure of any bugs off the top of my head that would necessarily
> cause this, but version 1.6.0 is getting fairly old, so there may well be
> something that I've forgotten about. That being said, there are two "types
> of bugs" that I think are most probable here: (1) There isn't really that
> much data queued up and NiFi is actually reporting the wrong size for the
> queue; or (2) perhaps one node in the cluster got out of sync in terms of
> the flow, and that node is actually configured without backpressure being
> applied.
>
> So there are two things that I would recommend checking out to help
> diagnose what is going on here. Firstly, is the huge backlog spread across
> all nodes or just on one node in the cluster? To determine this, you can go
> to the "Global menu" / Hamburger menu, and go to the Summary Page. From
> there, if you go to the Connections tab and find the connection in there
> (should be easy if you sort the table based on queue size), you can click
> the button on the far-right that shows the Cluster view, which will break
> down the size of the connection per-node, so you know if all nodes in the
> cluster have a huge queue size or just one.
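>
> (If scripting it is easier than clicking through the UI, I believe the
> same per-node breakdown is exposed by the REST API; something along these
> lines, though double-check the path and the connection id against your
> version:
>
>     curl -s "http://<node-host>:8080/nifi-api/flow/connections/<connection-id>/status?nodewise=true"
> )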
>
> Secondly, I would be curious to know what happens if you restart the
> node(s) with the huge backlog? Do the FlowFiles magically disappear on
> restart, with the queue showing a small number (indicative of the queue
> size just being wrong), or are they still there (indicative of the queue
> size being correct)?
>
> Also, what operating system are you running? There was a bug recently
> about data not being properly swapped back in on Windows but I think that
> was introduced after 1.6.0 and then fixed quickly.
>
> This should help to know where to focus energy on finding the problem.
>
> Thanks
> -Mark
>
> On Aug 30, 2019, at 12:24 PM, Jeremy Pemberton-Pigott <
> fuzzych...@gmail.com> wrote:
>
> Yes, there is one, but it's not near the output port of the SplitJson
> processor; it's shortly after the input port of a child PG. The output is
> actually connected to 3 child PGs, and each of those has an UpdateAttribute
> processor on its output port. The other PG input port on the left is
> connected to a RouteOnAttribute processor inside it.
>
> Queue of PG1 input -> input port to processors -> connection to 3 child
> PGs -> each PG has SplitJson after input port -> processors ->
> UpdateAttribute -> queue to output port of child PG -> queue to output
> port of PG1 -> queue to PG2 input (100s of millions in queue) -> input
> port to RouteOnAttribute -> ...
>
> Regards,
>
> Jeremy
>
>
> On 30 Aug 2019, at 20:45, Bryan Bende <bbe...@gmail.com> wrote:
>
> Can you show what is happening inside the first process group? Is there a
> SplitText processor with line count of 1?
>
> On Fri, Aug 30, 2019 at 4:21 AM Jeremy Pemberton-Pigott <
> fuzzych...@gmail.com> wrote:
>
>> Hi Pierre,
>>
>> I'm using Nifi version 1.6.0.
>>
>> 04/03/2018 08:16:22 UTC
>>
>> Tagged nifi-1.6.0-RC3
>>
>> From 7c0ee01 on branch NIFI-4995-RC3
>> FlowFile expiration = 0
>> Back pressure object threshold = 20000
>> Back pressure data size threshold = 1GB
>>
>> The connection is just from the output port of 1 PG to the input port of
>> another PG.  Inside the PG all the connections are using the same settings
>> between processors.
>>
>> Regards,
>>
>> Jeremy
>>
>> On Fri, Aug 30, 2019 at 4:14 PM Pierre Villard <
>> pierre.villard...@gmail.com> wrote:
>>
>>> Hi Jeremy,
>>>
>>> It seems very weird that you get 200M flow files in a relationship that
>>> should have backpressure set at 20k flow files. While backpressure is not a
>>> hard limit, you should not get to such numbers. Can you give us more
>>> details? What version of NiFi are you using? What's the configuration of
>>> your relationship between your two process groups?
>>>
>>> Thanks,
>>> Pierre
>>>
>>> On Fri, Aug 30, 2019 at 07:46, Jeremy Pemberton-Pigott <
>>> fuzzych...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> I have a 3 node Nifi 1.6.0 cluster.  It ran out of disk space when
>>>> there was a log jam of flow files (from slow HBase lookups).  My queue is
>>>> configured for 20,000 but 1 node has over 206 million flow files stuck in
>>>> the queue.  I managed to clear up some disk space to get things going again
>>>> but it seems that after a few mins of processing, all the processors in the
>>>> Log Parser process group will stop processing and show zero in/out.
>>>>
>>>> Is this a bug fixed in a later version?
>>>>
>>>> Each time I have to tear down the Docker containers running Nifi and
>>>> restart them to process a few 10,000s more, repeating every few mins.  Any
>>>> idea what I should do to keep it processing the data (nifi-app.log doesn't
>>>> show me anything unusual about the stop or delay) until the 1 node can
>>>> clear the backlog?
>>>>
>>>> <image.png>
>>>>
>>>> Regards,
>>>>
>>>> Jeremy
>>>>
>
>
>
>
