Julian,

It is a great question. If you have not already read them, the overview [1] and life of a flow file [2] docs will probably help orient things. One is really high level and the other is much lower level, so they might not provide enough clarity on their own.
For your question about what happens to data while it is in NiFi's control, the in-depth doc [2] is the key one to follow.

I'm going to lay out a sort of 'architecture stack' for IoT systems from an end-to-end view and explain where NiFi fits in terms of sweet spot and overlap, going from Edge through Core/Cloud.

Edge:
- Device/Sensor
- Data Processing (simple, complex *)
- Data Routing
- Gateway

* When I say simple/complex here I mean it in the classical sense of simple event processing versus complex event processing. These terms have become problematic and seem to have fallen out of favor, but for my purposes in this email the difference really is whether a discrete object/event can be operated on in its own right based on its data, the context/configuration of the system, and the configuration of the flow itself (think of a flow like a finite state machine, where being at some state implies prior context). Complex means a given object/data/event is operated on against some sort of rolling window of knowledge/state/observation to support cool things like temporal/spatial correlation, etc. There are probably better definitions out there, and indeed the above is more of a continuum than a yes/no thing.

In your examples you ask: "how long was this bit set" and "notify me when this signal is below a certain threshold for more than 30s". Both of those require retaining and tracking state apart from the event itself. If we consider that a sensor reading is an event, then to know, across a series of events, how long a given reading was consistent or when it changed, we need some sort of database where we keep this information, and indeed when the state changes we want to fire/detect that as an event itself. For the 30s case we want events generated based on that same sort of knowledge, but with an added time-specific trigger. They're both of the same variety.
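To make the stateful flavor concrete, here is a minimal sketch (plain Python of my own invention, not NiFi or MiNiFi code) of tracking how long a signal has been below a threshold across a stream of events, which covers both of your examples:

```python
# Hypothetical sketch of 'complex' event processing: track state across a
# stream of (timestamp_seconds, value) events, and emit a derived event when
# the value stayed below a threshold for at least min_duration seconds.
# Illustrative only -- not NiFi code.

def detect_low_signal(events, threshold=10.0, min_duration=30.0):
    """Yield (start, end) windows where value stayed below threshold
    for at least min_duration seconds. Requires state kept apart from
    any single event (here, the timestamp we entered the 'low' state)."""
    low_since = None
    for ts, value in events:
        if value < threshold:
            if low_since is None:
                low_since = ts          # state transition: entered 'low'
        else:
            if low_since is not None and ts - low_since >= min_duration:
                yield (low_since, ts)   # fire derived event: condition held
            low_since = None            # state transition: left 'low'
    # note: a window still open at end-of-stream is not reported in this sketch

events = [(0, 12), (5, 8), (20, 7), (40, 9), (50, 15), (55, 14)]
print(list(detect_low_signal(events)))  # [(5, 50)] -- low from t=5 to t=50
```

"How long was this bit set" is the same shape: retain the timestamp of the last transition and compute the duration when the next transition arrives.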
Regional Data Center (Cloud, On-Prem, etc.):
- Device Management
- Messaging
- Data Flow Integration

Core Data Center (Cloud, On-Prem, etc.):
- Messaging
- Data Flow Integration
- Global/Centralized Command and Control of DataFlows
- Data Processing (stream/batch/etc.)
- System Orchestration

NiFi as a project was fundamentally designed to tackle the end (where data is created) to end (all points of consumption) flow management of data. The fundamental understanding is that systems which generate data and systems which consume data are often not designed to talk to each other in advance, and even if they were, there are important separations of concerns to consider, plan for, and enforce to have a healthy architecture. I talk about this part quite a bit in [3]. In short, though, for two systems to talk reliably there are many things which have to agree - always. Some of them are format, schema, relevance, priority, size, rate, etc. A related set of problems is how to manage that for end-to-end systems where the components come and go, get moved around, get upgraded, use different security, etc. I talk about messaging versus data flow management in the OSCON talk.

Now, having sort of established (this is a much longer topic) the point of dataflow management, let me distinguish NiFi from other systems like Camel, and then from processing systems. NiFi differs from most systems in the dataflow management space in several important ways. One key way is that it takes responsibility for the safety of data and is designed to handle tiny objects (a few bytes) and large objects (several GB) at once. It exposes a streaming API to extension writers that never forces byte[] usage, which is usually what makes other systems difficult to use and scale in this space. It provides built-in data provenance capabilities which make tracking the origin and attribution of data in an end-to-end sense almost a solvable problem :) (among other cool things).
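As a rough illustration of why that streaming point matters (a Python analogy of my own; NiFi's actual extension API is Java, built around InputStream/OutputStream callbacks on the process session), a transform that works a bounded chunk at a time uses constant memory whether the payload is a few bytes or several GB:

```python
# Rough analogy (not NiFi's actual API) of a streaming extension point:
# the transform only ever holds one bounded chunk in memory, so it scales
# to multi-GB payloads where a byte[]-based API would fall over.
import io

def transform_stream(in_stream, out_stream, chunk_size=64 * 1024):
    """Copy in_stream to out_stream, uppercasing each chunk.
    Never materializes the whole payload as one byte array."""
    while True:
        chunk = in_stream.read(chunk_size)
        if not chunk:
            break
        out_stream.write(chunk.upper())

src = io.BytesIO(b"hello nifi")
dst = io.BytesIO()
transform_stream(src, dst)
print(dst.getvalue())  # b'HELLO NIFI'
```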
It provides the ability to see in real time what is flowing through the system and to interactively modify a running flow, impacting only the parts of the flow under change. It supports complex directed graphs of processing. Now, with the NiFi Registry, it supports a powerful SDLC model for CI/CD-style work across environments while still keeping the earlier point about efficiently impacting live flows/change-sets.

Also, NiFi strongly differs in that you don't put 'NiFi into your application'. People do that with Camel all the time, as it helps give their applications dataflow capabilities. NiFi is a central data broker in and of itself. You run NiFi as an application/cluster/etc. and use it to capture/source data from producers using the protocols those systems were built for, in either a push or pull sourcing mode. You route/transform data as needed, either within NiFi or by making service calls. And finally, you use NiFi to get data to consuming systems, again using their protocol of choice and again in either a push or pull fashion. NiFi handles the safety of the data using its repositories and transactional behavior at each exchange point. Tons of power in that.

NiFi is used all the time for 'data processing', often referred to as 'transformations'. If you talk to stream processing people, they'll say processing is about windowing and any system that doesn't have that isn't a stream processing system. Well, NiFi doesn't have that, but it's used for all kinds of processing/transformations all the time. If you talk to people that do 'ETL', they'll say transformations are all about relational transforms, and if NiFi doesn't have those or make them easy then it isn't an ETL system. Well, NiFi is used in ETL all the time and it doesn't have those things. The point I'm trying to make is that NiFi is designed for routing, transformation, and mediation of data between systems. Routing and mediation are obviously easy, so let's stay on transformation.
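To make one common transformation pattern concrete, here is a toy sketch (my own Python, not NiFi's API; within NiFi this sort of thing is typically done with merge-style processors) of bundling many small events into one larger event before handing it downstream, without computing any aggregate/summary over them:

```python
# Toy sketch (not NiFi code) of 'merge' style transformation: bundle
# individual events into larger batch events of a fixed size, the way one
# might combine many small records before delivery to a consumer.

def bundle(events, batch_size=3):
    """Group events into lists of up to batch_size items; each list
    stands in for one larger 'merged' event. No summary is computed."""
    batch = []
    for event in events:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # flush the final, possibly partial, batch

readings = [{"sensor": "s1", "seq": i} for i in range(7)]
merged = list(bundle(readings))
print([len(m) for m in merged])  # [3, 3, 1]
```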
What types of things is NiFi often used for in this realm? Things like: enrichment of events; format/schema transformation; filtering; aggregation where the intent is to combine a series of events into a larger event (not so much aggregation where the intent is to look at a series of events and produce a single 'aggregate/summary', though on occasion that too); splitting, where we take a series of events and split them apart; etc.

That comment about aggregation is a key one. For the summary-style cases you want what the batch/stream processing systems do. Why not have all that in a single system? The actors, needs, and resulting APIs and user experiences tend to be skewed toward data processing and data flow management as different things. I see projects/companies all the time trying to use one for the other. So my advice is: be careful. Make sure you use the one you need, or pick a tool for each category and use them both together, letting each do its part well.

So, after all that rambling, let me answer your fundamental ask, which was: "What I am looking for is a framework which does some analysis of data streams coming from controllers". NiFi is less about the 'analysis' of data streams and more about the management (capture, transformation, mediation) of those data streams. Have NiFi be what talks with the devices/gateways, messaging systems, etc. and feeds data to/through the analytic/processing system, whatever that may be. You can also use NiFi for the analytic execution itself, but it won't be 'as good' at that.

Some quick comments to round this out:

1) MiNiFi is for the first-mile edge collection problem. It is designed to live directly on a sensor/device or, failing that, on the nearby gateway, to act as a data flow system. It would relay data to NiFi, via MQTT, or to a Kafka topic in some other location.
It's about having a vast distribution of data flow agents, each operating independently but under some centralized asynchronous command and control model.

2) NiFi is for regional/core data center data flow management. It supports clustering, live/interactive flow management, etc.

3) The NiFi Registry is a centralized registry to store version-controlled flows which can be instantiated as many times as needed in a single cluster, environment, etc. It serves as a store for well-designed/tested/certified/compliance-approved flows, or as a tool to migrate from dev to staging to prod, or to replicate prod in another environment for quick troubleshooting, etc.

NiFi can be used as a service activator. If that is your primary interest you might also want to look at Apache Airflow. That project seems to focus on system orchestration more so than data flow management. At first blush they appear similar, but in reality they solve really different problems.

[1] https://nifi.apache.org/docs/nifi-docs/html/overview.html
[2] https://nifi.apache.org/docs/nifi-docs/html/nifi-in-depth.html
[3] https://www.youtube.com/watch?v=sQCgtCoZyFQ

Thanks
Joe

On Mon, Oct 1, 2018 at 9:06 AM Otto Fowler <[email protected]> wrote:
>
> I think you might want to look at Apache Metron, and it’s profile and
> alerting capabilities if you are going to look for SIEM like things.
>
>
> On September 30, 2018 at 07:58:10, Julian Feinauer
> ([email protected]) wrote:
>
> Hi Nifi-team,
>
> I’m from the incubating plc4x project [1] and I am looking for a framework
> which is suitable for the management of IoT Datastreams and do some edge
> computing.
>
> As nifi is often times mentioned in relation with IoT I tried to find out
> what nifi realy does and how it would fit with our ideas (and also the MiNiFi
> Project seems to fit into this).
>
> From what I understood from the Docs and some Videos NiFi looks for me a bit
> like Apache Camel [2] as it is able to (dynamically) integrate different
> systems and manage the dataflow between them. So what I did not get exactly I
> how the payloads are managed between these Endpoints and how much of
> processing Nifi does itself and how much it delegates to other components
> (like e.g. Service Activater in EIP).
>
> What I am looking for is a framework which does some analysis of data streams
> coming from controllers that, e.g., control machines or robots. chrisdutz
> already prepared the first version of an NiFi Endpoint in th Plc4x Repo so we
> are already able to stream these datasets to NiFi. Whats unclear to me is how
> we could tackle some of the questions like “how long was this bit set” or
> “notify me when this signal is below a certain threshold for more than 30s”
> or so.
>
> Is this in the scope of NiFi or is NiFi more of an integration / data-flow
> layer which is absolutely agnostic of these processing blocks?
>
> I hope my questions are not too dumb or I’m not missing NiFis core too much
> with my current knowledge.
>
> I would be happy for some answers or some ideas about how to approach the
> questions stated above by some experienced users.
>
> Best
>
> Julian
>
> [1] http://plc4x.incubator.apache.org/
> [2] https://camel.apache.org/
> [3]
> https://github.com/apache/incubator-plc4x/tree/master/integrations/apache-nifi
