If a processor uses the session to take a flow file from the incoming queues, and NiFi crashes before session.commit is called, that flow file will be back in the original queue when NiFi starts again, because the session never updated the repositories.
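To make that concrete, here is a toy model of the behaviour described above. It does not use the NiFi API: ToyQueue, ToySession and runScenario are hypothetical names made up for illustration, and the "crash" is simulated by abandoning a session without committing it. It only sketches why an uncommitted session leads to the same flow file being delivered twice.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Toy model only: these classes are hypothetical and are NOT the NiFi API.
class SessionSketch {

    /** Stand-in for the persisted queue in front of a processor. */
    static final class ToyQueue {
        private final Deque<String> persisted = new ArrayDeque<>();

        void add(String flowFile) { persisted.addLast(flowFile); }

        int size() { return persisted.size(); }
    }

    /** Stand-in for a session: removals reach the queue only on commit(). */
    static final class ToySession {
        private final ToyQueue queue;
        private final List<String> taken = new ArrayList<>();

        ToySession(ToyQueue queue) { this.queue = queue; }

        /** Hands out the next flow file this session has not taken yet. */
        String get() {
            int index = 0;
            for (String flowFile : queue.persisted) {
                if (index++ == taken.size()) {
                    taken.add(flowFile);
                    return flowFile;
                }
            }
            return null; // nothing left to take
        }

        /** Makes the removals durable; until then the queue is untouched. */
        void commit() {
            for (int i = 0; i < taken.size(); i++) {
                queue.persisted.removeFirst();
            }
            taken.clear();
        }
    }

    /** Returns everything the toy destination system received. */
    static List<String> runScenario() {
        ToyQueue queue = new ToyQueue();
        queue.add("ff-1");
        List<String> destination = new ArrayList<>();

        // First run: data is sent, then the process "crashes" before commit.
        ToySession beforeCrash = new ToySession(queue);
        destination.add(beforeCrash.get());
        // No commit here, so the queue still holds ff-1.

        // After restart: a new session sees the same flow file again.
        ToySession afterRestart = new ToySession(queue);
        destination.add(afterRestart.get());
        afterRestart.commit(); // only now does ff-1 leave the queue

        return destination;
    }

    public static void main(String[] args) {
        System.out.println(runScenario());
    }
}
```

In this sketch the destination ends up with two copies of ff-1, which is the duplication reported in the thread.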
So it is possible that a destination processor obtains a flow file, starts sending data to the destination system, and then NiFi crashes. The session was never committed, so the flow file is back in the incoming queue after the restart and gets sent again.

The ideal solution is for the destination system to offer some type of transaction, so that no data is made available in the destination system until that transaction is committed. The processor would commit the destination transaction and then immediately commit the NiFi session, which makes it very unlikely for NiFi to crash between those two lines of code.

PutParquet uses the HDFS client, which doesn't have a transaction concept spanning multiple files, and Kudu seems to have some transaction ability, but only in limited scenarios.

On Wed, Mar 17, 2021 at 8:55 AM <[email protected]> wrote:
>
> I’m just jumping in, we are seeing this issue as well when we are restarting
> the nifi process from time.
>
> We are aware of the nifi.properties
> “nifi.flowcontroller.graceful.shutdown.period=10 sec” parameter, but to be
> honest we didn’t try to raise it up yet. Maybe it takes more than 10s to
> fully execute the PutKudu, I really don’t know.
>
> Cheers Josef
>
> From: Vibhath Ileperuma <[email protected]>
> Reply to: "[email protected]" <[email protected]>
> Date: Wednesday, 17 March 2021 at 13:49
> To: "[email protected]" <[email protected]>
> Subject: Re: Data duplication When NIFI is restarted
>
> Hi Pierre,
>
> The NIFI flow I'm implementing can be run for a long time continuously(maybe
> a couple of weeks/months). During this time period it can be terminated due
> to memory issue or some other system issue, can't it be? In such a case, I
> may need to restart NIFi manually and run the flow from where it stopped.
>
> Thanks & Regards
>
> Vibhath Ileperuma
>
> On Wed, Mar 17, 2021 at 5:51 PM Pierre Villard <[email protected]>
> wrote:
>
> Hi Vibhath,
>
> How is NiFi terminated / restarted ?
>
> Thanks,
>
> Pierre
>
> On Wed, 17 Mar 2021 at 15:04, Vibhath Ileperuma <[email protected]>
> wrote:
>
> Hi all,
>
> I notice that, if the NIFI instance gets terminated while a processor is
> processing a flow file, that processor starts to process the flow file again
> from the beginning when NIFI is restarted.
>
> I'm using the PutKudu processor and the PutParquet processor to write data
> into kudu and parquet format. Due to the above behaviour,
>
> PutKudu shows primary key violation errors in a restart. I'm using INSERT
> operation and I can't use INSERT_IGNORE or UPSERT operations since I need to
> be notified if incoming data has duplicates.
> Since I need to write data in a single flow file into multiple parquet
> files(by specifying the row group size) It is possible for PutParquet
> processor to generate multiple parquet files with the same content in a
> restart (data can be duplicated)
>
> I would be grateful if you could suggest a way to overcome this problem.
>
> Thanks & Regards
>
> Vibhath Ileperuma
