Hi there,

I'm wondering whether NiFi provides end-to-end exactly-once semantics in a scenario where this should, in general, be technically possible (speaking independently of NiFi).

My example: reading a file from the filesystem and writing the contents of that file to Kafka. Apache Flink, for instance, can provide this guarantee when checkpoints are enabled and the producer is set up in exactly-once mode (i.e. with transactions enabled).

With regard to NiFi, I have read several statements that do not give a clear picture:

- In the NiFi docs [1], I found nothing about exactly-once, only a statement from which I deduce at-least-once delivery: "A core philosophy of NiFi has been that even at very high scale, guaranteed delivery is a must. This is achieved through effective use of a purpose-built persistent write-ahead log and content repository."

- In a NiFi crash course from 2018 by HortonWorks [2], I found a slide listing challenges for dataflow systems [at time 08:43] on which the '"Exactly once" delivery' problem is mentioned twice. As this is a crash course about NiFi that also advertises it, I think I can/should deduce from this slide that NiFi addresses and solves the exactly-once challenge for me.

- The book "Practical Real-time Data Processing and Analytics" from 2017 by Shilpi Saxena and Saurabh Gupta says: "Let's start with NiFi. It is a guaranteed delivery processing engine (exactly once) by default, which maintains write-ahead logs and a content repository to achieve this."

- In [3], back in 2016 when Kafka had not yet released version 0.11 (which for the first time allowed exactly-once semantics when producing), Mark Payne, a lead developer of NiFi, wrote that "NiFi generally will guarantee At Least Once delivery of your data", which I'd say matches the current NiFi docs. But he also wrote that, in general, exactly-once semantics can be achieved in a distributed system if the source is replayable and the sink cooperates. He explicitly mentions deduplication, but from today's perspective I think a sink supporting a two-phase commit (like Kafka) should work as well, as Flink demonstrates.

- In [4], a community user wrote a knowledge article in 2017 in which he explicitly mentions two-phase commits in NiFi for Kafka and for NiFi Site-to-Site communication, which according to him still provide only at-least-once semantics (but very robust ones, "close" to exactly-once).

- A few days ago [5], Mark Payne created a JIRA issue for a new feature in the upcoming NiFi 1.15 which will allow "Exactly Once Semantics" (EOS) for stateless pipelines, e.g. from Kafka to Kafka in NiFi stateless mode. In the introduction of that issue, he writes: "While there are benefits to being able to do so [Implementing exactly once in standard NiFi], the requirements that Kafka puts forth don't really work well with NiFi's architecture." From that statement, I'd deduce that current NiFi doesn't have exactly-once semantics and that, sadly, they don't fit well into NiFi's architecture.

In summary, it seems that standard NiFi (not the upcoming 1.15 stateless mode) doesn't support end-to-end exactly-once semantics. Am I right?

Why doesn't it fit well with the architecture of NiFi? From my limited understanding of NiFi (I am currently evaluating it and playing around with it for the first time), it should be possible in general. Each processor works with transactions, which can be rolled back or committed and are persisted to a write-ahead log. So unless a hardware failure destroys the disk NiFi is running on, exactly-once semantics should be possible from each processor to the next, and hence, with transactions/two-phase commit in the Kafka producer, fully end-to-end?
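
To make the "replayable source plus cooperating sink" idea from [3] concrete, here is a minimal Python sketch of how I understand it. It is not NiFi or Kafka code: all names (`read_source`, `PlainSink`, `DedupSink`, `run_with_one_crash`) are hypothetical, the crash is simulated with an exception, and the source offset stands in for whatever unique record ID a real sink would deduplicate on.

```python
class SimulatedCrash(Exception):
    """Stands in for a process failure between 'write' and 'commit offset'."""


def read_source(lines, start):
    """Replayable source: yields (offset, record) beginning at `start`."""
    for offset in range(start, len(lines)):
        yield offset, lines[offset]


class PlainSink:
    """Sink that appends whatever it receives (at-least-once on replay)."""

    def __init__(self):
        self.records = []

    def write(self, offset, record):
        self.records.append(record)


class DedupSink:
    """Sink that ignores offsets it has already stored (idempotent writes)."""

    def __init__(self):
        self.records = []
        self.seen = set()

    def write(self, offset, record):
        if offset in self.seen:
            return  # duplicate caused by a replay, drop it
        self.seen.add(offset)
        self.records.append(record)


def run_with_one_crash(lines, sink, crash_at):
    """Deliver all lines to `sink`, crashing once right after the record at
    offset `crash_at` was written but before its offset was committed."""
    committed = 0      # last durably stored source position
    crashed = False
    while True:
        try:
            for offset, record in read_source(lines, committed):
                sink.write(offset, record)
                if not crashed and offset == crash_at:
                    crashed = True
                    raise SimulatedCrash()
                committed = offset + 1   # commit only after a successful write
            return sink.records
        except SimulatedCrash:
            continue   # restart and replay from the last committed offset


data = ["a", "b", "c"]
print(run_with_one_crash(data, PlainSink(), crash_at=1))  # ['a', 'b', 'b', 'c']
print(run_with_one_crash(data, DedupSink(), crash_at=1))  # ['a', 'b', 'c']
```

With the plain sink, the replay after the crash delivers "b" twice (at-least-once); with the deduplicating sink the result looks like exactly-once. As far as I understand, a transactional sink would reach the same outcome by a different route, making the write and the offset commit atomic, which this sketch does not model.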

Best regards
Theo

[1] https://nifi.apache.org/docs/nifi-docs/html/overview.html
[2] https://www.youtube.com/watch?v=fblkgr1PJ0o
[3] https://community.cloudera.com/t5/Support-Questions/Can-nifi-promise-each-of-the-flowfiles-can-be-processed/m-p/141655
[4] https://community.cloudera.com/t5/Community-Articles/At-least-once-delivery-vs-exactly-once-delivery-semantics-in/ta-p/244688
[5] https://issues.apache.org/jira/browse/NIFI-9239