Hi Mark,

It was because the main disk was filling up! We increased the disk size to 128GB and speed improved!

Thanks,
Eric
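For anyone landing on this thread with the same symptom: a quick way to confirm the condition is to check how full the volume holding the repositories is and count the archive-backpressure messages in the app log. This is only an illustrative sketch; the /opt/nifi/nifi-current path is taken from the tail output in the log excerpt quoted below (the default Docker image layout), so adjust it if your install lives elsewhere.

    # How full is the volume holding the NiFi repositories?
    df -h /opt/nifi/nifi-current
    # How often is the content repository refusing writes while waiting for archive cleanup?
    grep -c "archive file size constraints" /opt/nifi/nifi-current/logs/nifi-app.log

A count that keeps climbing while the flow is slow matches the backpressure behaviour Mark describes further down the thread.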
On Wed., Nov. 25, 2020, 12:34 p.m. Eric Secules, <esecu...@gmail.com> wrote:

> Hi Mark,
>
> Thanks for the quick response, I grepped the logs and did find several hits! I will try increasing the disk space from 30GB to 128GB and hopefully that will speed things up.
>
> 2020-11-25 19:33:50,416 INFO [Timer-Driven Process Thread-4] o.a.n.c.repository.FileSystemRepository Unable to write to container default due to archive file size constraints; waiting for archive cleanup
> 2020-11-25 19:37:10,649 INFO [Timer-Driven Process Thread-9] o.a.n.c.repository.FileSystemRepository Unable to write to container default due to archive file size constraints; waiting for archive cleanup
> 2020-11-25 19:37:10,877 INFO [Timer-Driven Process Thread-3] o.a.n.c.repository.FileSystemRepository Unable to write to container default due to archive file size constraints; waiting for archive cleanup
> 2020-11-25 19:37:20,376 INFO [Timer-Driven Process Thread-2] o.a.n.c.repository.FileSystemRepository Unable to write to container default due to archive file size constraints; waiting for archive cleanup
> 2020-11-25 19:50:11,195 INFO [Timer-Driven Process Thread-3] o.a.n.c.repository.FileSystemRepository Unable to write to container default due to archive file size constraints; waiting for archive cleanup
> 2020-11-25 19:50:22,974 INFO [Timer-Driven Process Thread-6] o.a.n.c.repository.FileSystemRepository Unable to write to container default due to archive file size constraints; waiting for archive cleanup
> 2020-11-25 19:50:30,002 INFO [Timer-Driven Process Thread-7] o.a.n.c.repository.FileSystemRepository Unable to write to container default due to archive file size constraints; waiting for archive cleanup
> 2020-11-25 19:53:31,591 INFO [Timer-Driven Process Thread-6] o.a.n.c.repository.FileSystemRepository Unable to write to container default due to archive file size constraints; waiting for archive cleanup
> 2020-11-25 19:54:00,707 INFO [Timer-Driven Process Thread-4] o.a.n.c.repository.FileSystemRepository Unable to write to container default due to archive file size constraints; waiting for archive cleanup
> 2020-11-25 19:54:10,016 INFO [Timer-Driven Process Thread-2] o.a.n.c.repository.FileSystemRepository Unable to write to container default due to archive file size constraints; waiting for archive cleanup
> 2020-11-25 19:54:11,148 INFO [Timer-Driven Process Thread-3] o.a.n.c.repository.FileSystemRepository Unable to write to container default due to archive file size constraints; waiting for archive cleanup
> 2020-11-25 19:54:26,104 INFO [Timer-Driven Process Thread-5] o.a.n.c.repository.FileSystemRepository Unable to write to container default due to archive file size constraints; waiting for archive cleanup
> tail: '/opt/nifi/nifi-current/logs/nifi-app.log' has been replaced; following new file
> 2020-11-25 20:01:27,376 INFO [Timer-Driven Process Thread-10] o.a.n.c.repository.FileSystemRepository Unable to write to container default due to archive file size constraints; waiting for archive cleanup
> 2020-11-25 20:08:06,446 INFO [Timer-Driven Process Thread-6] o.a.n.c.repository.FileSystemRepository Unable to write to container default due to archive file size constraints; waiting for archive cleanup
> 2020-11-25 20:08:06,485 INFO [Timer-Driven Process Thread-8] o.a.n.c.repository.FileSystemRepository Unable to write to container default due to archive file size constraints; waiting for archive cleanup
> 2020-11-25 20:08:07,354 INFO [Timer-Driven Process Thread-3] o.a.n.c.repository.FileSystemRepository Unable to write to container default due to archive file size constraints; waiting for archive cleanup
> 2020-11-25 20:08:10,001 INFO [Timer-Driven Process Thread-2] o.a.n.c.repository.FileSystemRepository Unable to write to container default due to archive file size constraints; waiting for archive cleanup
> 2020-11-25 20:08:10,816 INFO [Timer-Driven Process Thread-10] o.a.n.c.repository.FileSystemRepository Unable to write to container default due to archive file size constraints; waiting for archive cleanup
> 2020-11-25 20:14:11,145 INFO [Timer-Driven Process Thread-9] o.a.n.c.repository.FileSystemRepository Unable to write to container default due to archive file size constraints; waiting for archive cleanup
> 2020-11-25 20:14:11,150 INFO [Timer-Driven Process Thread-6] o.a.n.c.repository.FileSystemRepository Unable to write to container default due to archive file size constraints; waiting for archive cleanup
> 2020-11-25 20:14:11,157 INFO [Timer-Driven Process Thread-4] o.a.n.c.repository.FileSystemRepository Unable to write to container default due to archive file size constraints; waiting for archive cleanup
> 2020-11-25 20:14:20,002 INFO [Timer-Driven Process Thread-3] o.a.n.c.repository.FileSystemRepository Unable to write to container default due to archive file size constraints; waiting for archive cleanup
> 2020-11-25 20:17:06,638 INFO [Timer-Driven Process Thread-9] o.a.n.c.repository.FileSystemRepository Unable to write to container default due to archive file size constraints; waiting for archive cleanup
> 2020-11-25 20:17:06,807 INFO [Timer-Driven Process Thread-4] o.a.n.c.repository.FileSystemRepository Unable to write to container default due to archive file size constraints; waiting for archive cleanup
> 2020-11-25 20:17:06,909 INFO [Timer-Driven Process Thread-8] o.a.n.c.repository.FileSystemRepository Unable to write to container default due to archive file size constraints; waiting for archive cleanup
> 2020-11-25 20:17:25,955 INFO [Timer-Driven Process Thread-1] o.a.n.c.repository.FileSystemRepository Unable to write to container default due to archive file size constraints; waiting for archive cleanup
> 2020-11-25 20:21:18,652 INFO [Timer-Driven Process Thread-10] o.a.n.c.repository.FileSystemRepository Unable to write to container default due to archive file size constraints; waiting for archive cleanup
> 2020-11-25 20:21:20,002 INFO [Timer-Driven Process Thread-5] o.a.n.c.repository.FileSystemRepository Unable to write to container default due to archive file size constraints; waiting for archive cleanup
> 2020-11-25 20:22:47,868 INFO [Timer-Driven Process Thread-7] o.a.n.c.repository.FileSystemRepository Unable to write to container default due to archive file size constraints; waiting for archive cleanup
> 2020-11-25 20:22:48,224 INFO [Timer-Driven Process Thread-2] o.a.n.c.repository.FileSystemRepository Unable to write to container default due to archive file size constraints; waiting for archive cleanup
> 2020-11-25 20:22:48,225 INFO [Timer-Driven Process Thread-8] o.a.n.c.repository.FileSystemRepository Unable to write to container default due to archive file size constraints; waiting for archive cleanup
> 2020-11-25 20:22:49,451 INFO [Timer-Driven Process Thread-1] o.a.n.c.repository.FileSystemRepository Unable to write to container default due to archive file size constraints; waiting for archive cleanup
> 2020-11-25 20:26:13,592 INFO [Timer-Driven Process Thread-7] o.a.n.c.repository.FileSystemRepository Unable to write to container default due to archive file size constraints; waiting for archive cleanup
> 2020-11-25 20:26:13,752 INFO [Timer-Driven Process Thread-3] o.a.n.c.repository.FileSystemRepository Unable to write to container default due to archive file size constraints; waiting for archive cleanup
> 2020-11-25 20:26:14,363 INFO [Timer-Driven Process Thread-2] o.a.n.c.repository.FileSystemRepository Unable to write to container default due to archive file size constraints; waiting for archive cleanup
> 2020-11-25 20:26:20,003 INFO [Timer-Driven Process Thread-4] o.a.n.c.repository.FileSystemRepository Unable to write to container default due to archive file size constraints; waiting for archive cleanup
>
> -Eric
>
> On Wed, Nov 25, 2020 at 12:28 PM Mark Payne <marka...@hotmail.com> wrote:
>
>> Eric,
>>
>> As I was reading through your response, I was going to ask about how much of the storage space is used and what the value of nifi.content.repository.archive.max.usage.percentage is set to. If you look in the logs I'm guessing you'll see some logs like "Unable to write to container XYZ due to archive file size constraints; waiting for archive cleanup".
>>
>> If you're seeing that, what it's basically telling you is that the Content Repository is applying backpressure to prevent you from running out of disk space. If you set the "nifi.content.repository.archive.max.usage.percentage" property to, say, "90%" you'll probably see the performance improve and avoid the sporadic conditions that you're seeing. But depending on the burstiness of your data, at 90% you could potentially risk running out of disk space. Of course, if you're already sitting at 79% you may also want to just increase the amount of disk space that you have.
>>
>> Thanks
>> -Mark
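For reference, the knobs Mark is describing live in nifi.properties. The values below are a rough sketch of the stock settings (the exact defaults can vary by NiFi version; 50% is also the value Eric reports further down this thread):

    # nifi.properties - content repository archiving
    nifi.content.repository.archive.max.retention.period=12 hours
    nifi.content.repository.archive.max.usage.percentage=50%
    nifi.content.repository.archive.enabled=true

Raising the max usage percentage, as suggested, postpones the backpressure but leaves less headroom before the disk genuinely fills; increasing the disk size (as Eric ultimately did) attacks the same problem from the other side.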
>> > On Nov 25, 2020, at 3:21 PM, Eric Secules <esecu...@gmail.com> wrote:
>> >
>> > Thanks for the tips Mark!
>> >
>> > I looked at the summary and there are a fair number of processors at the top of the list which create flowfiles from an outside source, but there are also several anomalies. For example, there are some processors that are mid-flow which usually process each flowfile in less than a second, but sometimes average execution time balloons to several seconds and doesn't correlate with flowfile size. I see this behavior on network-IO-bound processors (most IO is done to docker containers on the same host), and even on data processors like ReplaceText (full text regex), where I saw execution time go up to 30 seconds even though the input file size and content is always the same (22.5KB). It usually takes a couple milliseconds to process the same file.
>> >
>> > I see no backpressure in the flow, but depending on when I look at the summary I don't always see the anomalies I mentioned above; sometimes it looks totally acceptable. But other times I see processors like ReplaceText (which only has one concurrent task) be active for 4 of the past 5 minutes.
>> >
>> > I tried looking at the disk metrics in Azure, and it says we aren't near our quota of 120 IOPS and we do have burst capacity of up to 3100 IOPS. During testing we are steady at about 20 IOPS.
>> >
>> > All 3 file repositories (content, provenance, flow file) are stored on the OS disk which is at 79% capacity. Could constant pruning of the content repo be the cause of the intermittent slowness? We have the following setting: nifi.content.repository.archive.max.usage.percentage=50%
>> >
>> > Thanks,
>> > Eric
>> >
>> > On Wed, Nov 25, 2020 at 6:31 AM Mark Payne <marka...@hotmail.com> wrote:
>> > Eric,
>> >
>> > There's nothing that I know of that went into 1.12.1 that would cause the dataflow to be any slower. And I'd expect to have heard about it from others if there were. There is a chance, though, that a particular processor that you're using is slower, due to a newer library perhaps, or code changes. Of note, I don't think it's necessarily more CPU intensive if you're still seeing a load average of only 3.5 - that's quite low.
>> >
>> > My recommendation would be to run your test suite. Give it a good 15 minutes or so to get into the thick of things. Then look at two things to determine where the bottleneck is. You'll want to look for any backpressure first (the label on the Connection between processors would become red). That's a dead giveaway of where the bottleneck is, if that's kicking in. The next thing to check is the summary table (global menu, aka hamburger menu, and then Summary). Go to the Processors tab and sort by task time. This will tell you which processors are taking the most time to run.
>> >
>> > In general, though, if backpressure is being applied, the destination of that connection is the bottleneck. If multiple connections in sequence have backpressure applied, look to the last one in the chain, as it's causing the backpressure to propagate back. If there is no backpressure applied, then that means that your dataflow is able to keep up with the rate of data that's coming in. So that would imply that the source processor is not able to bring the data in as fast as you'd like. That could be due to NiFi (which would imply your disk is probably not fast enough, since clearly your CPU has more to give) or that the source of the data is not providing the data fast enough, etc. You could also try increasing the number of Concurrent Tasks on the source processor; perhaps using a second thread will improve the performance.
>> >
>> > Thanks
>> > -Mark
>> >
>> >> On Nov 24, 2020, at 5:40 PM, Eric Secules <esecu...@gmail.com> wrote:
>> >>
>> >> Hi Mark,
>> >>
>> >> Watching the video now, and will plan to watch more of the series. Thanks! As for questions:
>> >>
>> >> I have NiFi running in docker on my macbook pro (I give the Docker VM 10 of my 12 cores) and on a smaller test environment; I am seeing performance issues in both places. In my test environment we run it on a Standard D4s v3 (4 vcpus, 16 GiB memory) VM with a single 30GB Premium SSD (120 IOPS, 25 Mbps). NiFi also runs in a docker container there. Right now we use the standard number of thread pool threads (10). At any given time, even if I increase the number of threads in the pool, I don't see the number of active processors go above 10, so I don't think increasing the size of the pool will help. My test VM has 4 cores and a load average of 3.5 over the past minute, and Azure monitoring shows me that the VM doesn't go above 50% average CPU usage while NiFi is under load. The disk is currently 70% full. Up until last month a full test suite would take about 30-40 minutes, and now it's pushing 4 hours. We started noticing tests taking a while shortly after upgrading NiFi to 1.12.1 from 1.11.4.
>> >>
>> >> We don't configure ridiculous numbers of concurrent tasks on our processors. Is it possible that between 1.11.4 and 1.12.1 NiFi became a lot more CPU intensive?
>> >>
>> >> Thanks,
>> >> Eric
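A simple way to cross-check the Azure-side disk metrics from inside the VM while a test suite runs is iostat from the sysstat package. This assumes sysstat is installed (or can be installed) on the host; it is only a sketch of one way to watch the disk, not a NiFi-specific tool.

    # Print extended device statistics every 5 seconds; watch r/s and w/s (IOPS)
    # and %util on the device backing the NiFi repositories.
    iostat -x 5

Sustained high %util on that device while CPU sits around 50% would point at the disk rather than the flow itself.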
>> >> On Tue, Nov 24, 2020 at 6:55 AM Mark Payne <marka...@hotmail.com> wrote:
>> >> Eric,
>> >>
>> >> I don't think there's really any metric that exposes the specific numbers you're looking for. Certainly you could run a Java profiler and look at the results to see where all of the time is being spent. But that may get into more details than you're comfortable sorting through, depending on your knowledge of Java, profilers, and nifi internals.
>> >>
>> >> The nifi.bored.yield.duration is definitely an important property when you've got lots of processors that aren't really doing anything. You can increase that if you are okay adding potential latency into your dataflow. That said, 10 milliseconds is the default and generally works quite well, even with many thousands of processors. Of course, it also depends on how many cpu cores you have, etc.
>> >>
>> >> As for whether or not increasing the number of timer-driven threads will help, that very much depends on several things. How many threads are being used? How many CPU cores do you have? How many are being used? There is a series of videos on YouTube where I've discussed nifi anti-patterns. One of those [1] discusses how to tune the Timer-Driven Thread Pool, which may be helpful to you.
>> >>
>> >> Thanks
>> >> -Mark
>> >>
>> >> [1] https://www.youtube.com/watch?v=pZq0EbfDBy4
>> >>
>> >>> On Nov 23, 2020, at 7:55 PM, Eric Secules <esecu...@gmail.com> wrote:
>> >>>
>> >>> Hello everyone,
>> >>>
>> >>> I was wondering if there was a metric for the amount of time timer-driven processors spend in a queue, ready and waiting to be run. I use NiFi in an atypical way and my flow has over 2000 processors running on a single node, but there are usually fewer than 10 connections that have one or more flowfiles in them at any given time.
>> >>>
>> >>> I have a theory that the number of processors in use is slowing down the system overall. But I would need to see some more metrics to know whether that's the case and to tell whether anything I am doing is helping. Are there some logs that I could look for, or internal stats I could poke at with a debugger?
>> >>>
>> >>> Should I be able to see increased throughput by increasing the number of timer-driven threads, or is there a different mechanism responsible for going through all the runnable processors to see whether they have input to process? I also noticed "nifi.bored.yield.duration"; would it be good to increase the yield duration in this setting?
>> >>>
>> >>> Thanks,
>> >>> Eric
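To make the bored-yield suggestion above concrete: the property lives in nifi.properties, and the default is the 10 ms Mark quotes. The higher value shown below is purely illustrative (an assumption for the example, not a recommendation); the right number depends on how much added latency the flow can tolerate.

    # nifi.properties
    # Default: a processor with no work to do yields this long before being scheduled again.
    nifi.bored.yield.duration=10 millis
    # Example of trading a little latency for fewer no-op scheduling passes:
    # nifi.bored.yield.duration=25 millis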