Julian,

It's a great question.  If you have not already read them, the
overview [1] and life of a flow file [2] docs will probably help
orient things.  One is really high level and the other is much lower
level, so on their own they might not provide enough clarity.

For your question about what happens to data while it is in NiFi's
control, the in-depth doc [2] is the key one to follow.

I'm going to lay out a sort of 'architecture stack' for IoT systems
from an end-to-end view and explain where NiFi fits in there in terms
of sweet spot and overlap, working from Edge through Core/Cloud.

Edge:
- Device/Sensor
- Data Processing (simple, complex *)
- Data Routing
- Gateway

* When I say simple/complex here I mean it in the classical sense of
simple event processing versus complex event processing.  These terms
have become problematic and seem to have fallen out of favor, but for
my purposes in this email the difference is really whether a discrete
object/event can be operated on in its own right, based on its data,
the context/configuration of the system, and the configuration of the
flow itself (think of a flow like a finite state machine, where being
at some state implies prior context).  Complex, by contrast, means a
given object/data/event is operated on against some sort of rolling
window of knowledge/state/observation to support cool things like
temporal/spatial correlation, etc.  There are probably better
definitions out there, and indeed the above is more of a continuum
than a yes/no thing.

In your examples you ask:
"how long was this bit set"
"notify me when this signal is below a certain threshold for more than 30s"
Both of those require retaining and tracking state apart from the
event itself.  If we consider a sensor reading to be an event, then to
know, across a series of events, how long a given reading was
consistent or when it changed, we need some sort of database where we
keep this information; and indeed, when the state changes we want to
fire/detect that as an event itself.  For the 30 second case we want
events generated based on that same sort of knowledge, but with an
added time-specific trigger.  They're both of the same variety.
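To make that concrete, here is a minimal sketch (plain Python, not a
NiFi API — the class and method names are mine) of the kind of state
you'd have to keep outside the events themselves for the "below a
threshold for more than 30s" case:

```python
class ThresholdMonitor:
    """Tracks how long the current condition has held for one signal and
    fires a notification once it has held for `min_duration` seconds.
    Illustrative only; in practice this state would live in a store the
    flow/processing system manages."""

    def __init__(self, threshold, min_duration=30.0):
        self.threshold = threshold
        self.min_duration = min_duration
        self.below_since = None   # timestamp when the signal first dipped below
        self.notified = False     # have we already fired for this excursion?

    def on_event(self, value, timestamp):
        """Process one sensor reading; return True when the
        'below threshold for more than min_duration' condition fires."""
        if value < self.threshold:
            if self.below_since is None:
                self.below_since = timestamp   # state change: remember when it started
            elif (not self.notified
                  and timestamp - self.below_since >= self.min_duration):
                self.notified = True           # fire exactly once per excursion
                return True
        else:
            self.below_since = None            # condition cleared; reset state
            self.notified = False
        return False
```

The "how long was this bit set" question is the same pattern: record
the timestamp at each state change and report the difference.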

Regional DataCenter (Cloud, On-Prem, etc..)
- Device Management
- Messaging
- Data Flow Integration

Core Data Center (Cloud, On-Prem, etc..)
- Messaging
- Data Flow Integration
- Global/Centralized Command and Control of DataFlows
- Data Processing (stream/batch/etc..)
- System Orchestration

NiFi as a project was fundamentally designed to tackle the end (where
data is created) to end (all points of consumption) flow management of
data.  The fundamental understanding is that systems which generate
data and systems which consume data are often not designed to talk to
each other in advance, and even if they were, there are important
separations of concerns to consider, plan for, and enforce to have a
healthy architecture.  I talk about this part quite a bit in [3].  In
short though, for two systems to talk reliably there are many things
which have to agree - always.  Some of them are format, schema,
relevance, priority, size, rate, etc.  A related set of problems is
how to manage all that in end-to-end systems where the components
come and go, get moved around, get upgraded, use different security,
and so on.  I talk about messaging versus data flow management in the
OSCON talk [3].

Now, having sort of established the point of dataflow management (this
is a much longer topic), let me distinguish NiFi from other systems
like Camel, etc., and then from processing systems.

NiFi differs from most systems in this dataflow management space in
several important ways.  One key way is that it takes responsibility
for the safety of data and is designed to handle tiny objects (a few
bytes) and large objects (several GB) at once.  It exposes a streaming
API to extension writers that never forces byte[] usage, which is
usually what makes other systems difficult to use and scale in this
space.  It provides built-in data provenance capabilities which make
tracking the origin and attribution of data in an end-to-end sense
almost a solvable problem :) (among other cool things).  It provides
the ability to see in real time what is flowing in the system and to
interactively modify a running flow, impacting only the parts of the
flow under change.  It supports complex directed graphs of processing.
Now, with the NiFi Registry, it supports a powerful SDLC model for
CI/CD-style work across environments, while still honoring the earlier
point: changes efficiently impact only the live flows/change-sets
being modified.

Also, NiFi strongly differs in that you don't put 'NiFi into your
application'.  People do that with Camel all the time, as it helps
give their applications dataflow capabilities.  NiFi, by contrast, is
a central data broker in and of itself.  You run NiFi as an
application/cluster/etc. and use it to capture/source data from
producers using the protocols those systems were built for, in either
a push or pull sourcing mode.  You route/transform data as needed,
either within NiFi or by making service calls.  And finally you use
NiFi to get data to consuming systems, again using their protocol of
choice and again in either a push or pull fashion.  NiFi handles the
safety of the data using its repositories and transactional behavior
at each exchange point.  Tons of power in that.

NiFi is used all the time for 'data processing', often referred to as
'Transformations'.  If you talk to stream processing people they'll
say processing is about windowing, and any system that doesn't have
that isn't a stream processing system.  Well, NiFi doesn't have that,
but it's used for all kinds of processing/transformations all the
time.  If you talk to people that do 'ETL' they'll say transformations
are all about relational transforms, and if NiFi doesn't have those or
make those easy then it isn't an ETL system.  Well, NiFi is used in
ETL all the time and it doesn't have those things either.  The point
I'm trying to make is that NiFi is designed for routing,
transformation, and mediation of data between systems.
Routing/mediation are obviously easy, so let's stay on transformation.
What types of things is NiFi often used for in this realm?  Things
like enrichment of events; format/schema transformation; filtering;
aggregation where the intent is to combine a series of events into a
larger event (not so much aggregation where the intent is to look at a
series of events and produce a single 'aggregate/summary', though on
occasion that too); and splitting, where we'd take a series of events
and break them apart.  That comment about aggregation is a key one:
for summary-style aggregation you want what the batch/streaming
processing systems do.  Why not have all that in a single system?
Because the actors, needs, and resulting APIs and user experiences
tend to be skewed toward data processing and data flow management as
different things.  I see projects/companies all the time trying to use
one for the other.  So my advice there is: be careful.  Make sure you
use the one you need, or pick a tool for each category, use them both
together, and let each do its part well.
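To illustrate that aggregation distinction, here is a tiny sketch
(plain Python, purely illustrative; the function names are mine, not
NiFi's).  The packaging style of aggregation bundles events together
without computing anything over them, which is the kind NiFi is good
at, alongside enrichment:

```python
def enrich(event, lookup):
    """Enrichment: attach reference data to an event based on one of
    its fields (here a hypothetical site_id -> site_name lookup)."""
    return {**event, "site_name": lookup.get(event["site_id"], "unknown")}

def merge_events(events, batch_size):
    """Packaging-style aggregation: bundle N small events into one
    larger event.  Note it does not compute any rolling summary over
    the events; that is what stream processing systems are for."""
    return [events[i:i + batch_size] for i in range(0, len(events), batch_size)]
```

Summary-style aggregation ("average over the last 5 minutes") would
instead need the windowed state discussed above.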

So, after all that rambling, let me answer your fundamental ask, which
was: "What I am looking for is a framework which does some analysis of
data streams coming from controllers."

NiFi is less about the 'analysis' of data streams and more about the
management (capture, transformation, mediation) of those data streams.
Have NiFi be what talks with the devices/gateways, messaging systems,
etc. and feeds data to/through the analytic/processing system,
whatever that may be.  You can also use NiFi for the analytic
execution itself, but it won't be 'as good' at that.

Some quick comments to round this out:
1) MiNiFi is for the first-mile edge collection problem.  It is
designed to live directly on a sensor/device or, failing that, on the
nearby gateway, acting as a data flow system.  It would relay data to
NiFi, over MQTT, or to a Kafka topic in some other location.  It's
about having a vast distribution of data flow agents, each operating
independently but under some centralized asynchronous command and
control model.
2) NiFi is for regional/core datacenter data flow management.  It
supports clustering, live/interactive flow management, etc..
3) The NiFi Registry is a centralized registry to store version
controlled flows which can be instantiated as many times as needed in
a single cluster, environment, etc..  It serves as a store for well
designed/tested/certified/compliance approved flows or as a tool to
migrate from dev to staging to prod or to replicate prod in another
environment for quick troubleshooting, etc..

NiFi can be used as a service activator.  If that is your primary
interest you might also want to look at Apache Airflow.  That project
seems to focus on system orchestration more so than data flow
management.  At first blush they appear similar, but in reality they
solve really different problems.

[1] https://nifi.apache.org/docs/nifi-docs/html/overview.html
[2] https://nifi.apache.org/docs/nifi-docs/html/nifi-in-depth.html
[3] https://www.youtube.com/watch?v=sQCgtCoZyFQ

Thanks
Joe
On Mon, Oct 1, 2018 at 9:06 AM Otto Fowler <[email protected]> wrote:
>
> I think you might want to look at Apache Metron, and it’s profile and 
> alerting capabilities if you are going to look for SIEM like things.
>
>
>
> On September 30, 2018 at 07:58:10, Julian Feinauer 
> ([email protected]) wrote:
>
> Hi Nifi-team,
>
>
>
> I’m from the incubating plc4x project [1] and I am looking for a framework 
> which is suitable for the management of IoT Datastreams and do some edge 
> computing.
>
> As nifi is often times mentioned in relation with IoT I tried to find out 
> what nifi realy does and how it would fit with our ideas (and also the MiNiFi 
> Project seems to fit into this).
>
>
>
> From what I understood from the Docs and some Videos NiFi looks for me a bit 
> like Apache Camel [2] as it is able to (dynamically) integrate different 
> systems and manage the dataflow between them. So what I did not get exactly I 
> how the payloads are managed between these Endpoints and how much of 
> processing Nifi does itself and how much it delegates to other components 
> (like e.g. Service Activater in EIP).
>
>
>
> What I am looking for is a framework which does some analysis of data streams 
> coming from controllers that, e.g., control machines or robots. chrisdutz 
> already prepared the first version of an NiFi Endpoint in th Plc4x Repo so we 
> are already able to stream these datasets to NiFi. Whats unclear to me is how 
> we could tackle some of the questions like “how long was this bit set” or 
> “notify me when this signal is below a certain threshold for more than 30s” 
> or so.
>
> Is this in the scope of NiFi or is NiFi more of an integration / data-flow 
> layer which is absolutely agnostic of these processing blocks?
>
>
>
> I hope my questions are not too dumb or I’m not missing NiFis core too much 
> with my current knowledge.
>
> I would be happy for some answers or some ideas about how to approach the 
> questions stated above by some experienced users.
>
>
>
> Best
>
> Julian
>
>
>
> [1] http://plc4x.incubator.apache.org/
>
> [2] https://camel.apache.org/
>
> [3] 
> https://github.com/apache/incubator-plc4x/tree/master/integrations/apache-nifi
