Hi Adam,

Thank you very much! This is exactly the kind of context and experience I was 
hoping for.

The scenarios you describe are exactly what had me stumped. Most NiFi deployments I 
manage have a mix of fairly static high-volume sensor data flows and rapidly 
developing data transformation flows. It would make sense to split those up 
into a scaling set of standalone NiFi or MiNiFi containers plus a fixed cluster 
for the developers to work their transformation magic, but that drastically 
increases the complexity compared to the current approach of generously sized 
VM clusters, even if it could save on resources.

Also, thanks for bringing up MiNiFi. We abandoned it some years ago when it 
fell too far behind the NiFi versions to be easily managed, but now that it has 
been rolled into the NiFi codebase and is a really good fit for containers, I 
should give it another try.

Kind regards,

Isha

From: Adam Taft <[email protected]>
Sent: Wednesday, February 8, 2023 23:17
To: [email protected]
Subject: Re: How do you use container-based NiFi ?

Isha,

Just some perspective from the field. I have had success with containerized 
NiFi and generally get along with it. That being said, I think there are a few 
caveats and issues you might find going down this road.

Standalone NiFi in a container works pretty much the way you would want and 
expect. You do need to be careful about where you are mounting your NiFi 
configuration directories, though. e.g. content_repository, 
database_repository, flowfile_repository, provenance_repository, state, logs 
and work. All of these directories are actively written by NiFi, and it's good 
to have them exported as bind mounts, external to the container.
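
To make this concrete, here is a minimal sketch of those mounts using the 
official apache/nifi image; the host paths under /data/nifi are placeholders, 
and the in-container paths assume the image's default /opt/nifi/nifi-current 
layout:

  # Standalone NiFi with repositories and state kept outside the container
  docker run -d --name nifi \
    -p 8443:8443 \
    -v /data/nifi/content_repository:/opt/nifi/nifi-current/content_repository \
    -v /data/nifi/database_repository:/opt/nifi/nifi-current/database_repository \
    -v /data/nifi/flowfile_repository:/opt/nifi/nifi-current/flowfile_repository \
    -v /data/nifi/provenance_repository:/opt/nifi/nifi-current/provenance_repository \
    -v /data/nifi/state:/opt/nifi/nifi-current/state \
    -v /data/nifi/logs:/opt/nifi/nifi-current/logs \
    -v /data/nifi/work:/opt/nifi/nifi-current/work \
    apache/nifi:latest

With this, you can destroy and recreate the container without losing queued 
data or provenance history.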

You will definitely want to bind mount the flow.xml.gz and flow.json.gz files 
as well, or you will lose your live dataflow configuration changes as you use 
NiFi. Any change to your nifi canvas gets written into flow.xml.gz, which means 
you need to keep a copy of it outside of your container. And there are 
potentially other files in the conf folder that you also want to keep around. 
NiFi unfortunately doesn't consolidate all of these into a single location by 
default, so you kind of have to reconfigure and/or bind mount a lot of 
different paths.
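
One sketch of how to handle that, assuming the same apache/nifi image layout: 
bind mount the whole conf directory rather than the individual files. NiFi 
rewrites flow.xml.gz by writing a temp file and renaming it, which tends to 
break single-file bind mounts (the mount pins the old inode), so mounting the 
parent directory is safer:

  # Keep the whole conf directory (flow.xml.gz, flow.json.gz, archives,
  # nifi.properties, ...) outside the container
  docker run -d --name nifi \
    -v /data/nifi/conf:/opt/nifi/nifi-current/conf \
    apache/nifi:latest

Note that you would want to seed the host directory from a pristine container 
first, since an empty bind mount shadows the default configuration files 
shipped in the image.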

I have found NiFi clustering in a dockerized environment to be less 
desirable. The primary problem is that the definition of cluster nodes is 
mostly hard-coded into the nifi.properties file. Usually in a containerized 
environment, you want the ability to dynamically bring nodes up/down as needed 
(with dynamic IP/network configuration), especially in container orchestration 
frameworks like Kubernetes. There have been a lot of experiments and possibly 
even some reasonable solutions coming out to help with containerized clusters, 
but generally you're going to find you have to crack your knuckles a little 
bit to get this to work. If you're content with a mostly statically defined, 
non-elastic cluster configuration, then clustered NiFi on Docker is possible.
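
For reference, a statically defined node boils down to entries like these in 
nifi.properties (hostnames and ports below are made-up placeholders; the 
property names are the standard NiFi 1.x clustering settings):

  # nifi.properties excerpt for one statically addressed cluster node
  nifi.cluster.is.node=true
  nifi.cluster.node.address=nifi-0.example.internal
  nifi.cluster.node.protocol.port=11443
  nifi.zookeeper.connect.string=zk-0:2181,zk-1:2181,zk-2:2181

Every node needs its own nifi.cluster.node.address baked in, which is exactly 
what fights with dynamically scheduled containers.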

As an option, if you stick with standalone deployments, what you can do 
instead is front your individual NiFi node instances with a load balancer. This 
may be a poor-man's approach to load distribution, but it works reasonably well 
and I've seen it in action on large volume flows. If you have the option that 
your data source can deliver to a load balancer, then you can have the load 
balancer round-robin (or similar) to your underlying standalone nodes. In a 
container orchestration environment, you can imagine Kubernetes spinning 
containerized nodes up and down to handle demand, and managing the load 
balancer configuration as those nodes come up. It's all possible, but will 
require some work.
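
As an illustration, the load balancer side can be as simple as this HAProxy 
excerpt (defaults and timeouts omitted); it assumes each standalone node 
exposes a listener such as ListenHTTP on port 9090, which is purely a 
placeholder:

  # Round-robin incoming data across three standalone NiFi nodes
  frontend ingest
      bind *:9090
      mode tcp
      default_backend nifi_nodes

  backend nifi_nodes
      mode tcp
      balance roundrobin
      server nifi1 nifi-1:9090 check
      server nifi2 nifi-2:9090 check
      server nifi3 nifi-3:9090 check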

Of course, doing anything with multiple standalone nodes means that you have 
to propagate changes from one NiFi canvas to all your nodes manually. This is a 
huge pain and not really scalable. So the load balancer approach is only good 
if your dataflow configurations are very static and don't change day-to-day 
with operations.

This gets at one of the core issues with containerized NiFi: what to do with the flow 
configuration itself. On the one hand, you kind of want to "burn in" your flow 
configuration into your docker image. e.g. the flow.xml.gz and/or flow.json.gz 
would be included as part of your image itself. This enables your NiFi system 
to come up with a fully configured set of processors ready to accept 
connections.
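
A sketch of that burn-in approach, again assuming the apache/nifi image layout 
(the conf directory is where NiFi 1.x looks for the flow by default):

  # Illustrative Dockerfile baking a flow definition into the image
  FROM apache/nifi:latest
  COPY --chown=nifi:nifi flow.xml.gz flow.json.gz /opt/nifi/nifi-current/conf/

Every container started from this image then comes up with an identical, 
ready-to-run flow.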

But part of the fun with NiFi is being able to make dataflow and processor 
configuration changes on the fly as needed based on operational conditions. For 
example, maybe you need to temporarily stop data moving to one location and 
have it transported to another. This "live" and dynamic way to manage NiFi is a 
powerful feature, but it kind of goes against the grain of a containerized or 
static deployment approach. e.g. new nodes coming online will not necessarily 
have the latest configuration changes that your operational staff has added 
recently. The NiFi Registry can somewhat help here.

Finally, to give a shout out, you may want to consider using a dockerized 
MiNiFi cluster instead of traditional NiFi. MiNiFi is maybe slightly more 
aligned with a containerized clustering approach, as MiNiFi more directly 
supports this concept of a "burned in" processor configuration. In this way, 
MiNiFi nodes can be spun up or down based on demand, without too much fuss. 
E.g. MiNiFi isn't really cluster aware and each node acts independently, 
making it an easier fit for containerized or dynamic deployments.
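
A sketch of what that could look like; the apache/nifi-minifi image name and 
the paths here are assumptions to check against whatever MiNiFi distribution 
you actually use. MiNiFi (Java) reads its whole flow from conf/config.yml, so 
baking in that one file gives every container an identical, immutable dataflow:

  # Illustrative Dockerfile for an immutable MiNiFi worker image;
  # config.yml is a flow exported from NiFi and converted with the
  # MiNiFi toolkit
  FROM apache/nifi-minifi:latest
  COPY config.yml /opt/minifi/minifi-current/conf/config.yml

Scaling out is then just running more replicas of the same image behind 
whatever distributes the data.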

Hope this gives you some thoughts. There are definitely a lot of recipes and 
approaches to containerized NiFi, so do some searching to find one that matches 
what you're after. Almost any configuration can be made to work, depending on your needs.

/Adam



On Fri, Jan 27, 2023 at 3:15 AM Isha Lamboo 
<[email protected]<mailto:[email protected]>> wrote:
Hi all,

I’m looking for some perspectives from people using NiFi deployed in containers 
(Docker or otherwise).

It seems to me that the NiFi architecture benefits from having a lot of compute 
resources to share for all flows, especially with large batches arriving 
periodically. On the other hand, it’s hard to prevent badly tuned flows from 
impacting others, and more and more IT operations are moving to containerized 
environments, so I’m exploring the options for containerized NiFi as an 
alternative to our current VM-based approach.

Do you deploy a few large containers similar in capacity to a VM to run all 
flows together or many small ones with only a few flows on each? And do you 
deploy them clustered or standalone?

Thanks,

Isha
