If a processor uses the session to take a flow file from an incoming
queue, and NiFi crashes before session.commit() is called, then that
flow file will be back in its original queue when NiFi starts again,
since the session never updated the repositories.

So it is possible that a destination processor obtains a flow file,
starts sending data to the destination system, and then NiFi crashes.
The session never got committed, so the flow file will be back in the
incoming queue after the restart and will be processed again.
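
To make that concrete, here is a toy simulation of the behaviour
described above. It is only a sketch: Session, Crash and run_once are
made-up names for illustration and are not the NiFi API. The point is
that a take from the queue is provisional until commit, so a crash
after the data was sent leads to the same flow file being sent again.

```python
from collections import deque


class Crash(Exception):
    """Stands in for NiFi dying in the middle of processing."""


class Session:
    """Toy session: flow files taken from the queue stay provisional
    until commit() is called."""

    def __init__(self, queue):
        self.queue = queue
        self.taken = []

    def get(self):
        flow_file = self.queue.popleft()
        self.taken.append(flow_file)
        return flow_file

    def commit(self):
        # The take becomes permanent.
        self.taken.clear()

    def restore_after_crash(self):
        # Nothing was committed, so the flow files reappear in the queue.
        while self.taken:
            self.queue.appendleft(self.taken.pop())


def run_once(queue, destination, crash_before_commit):
    session = Session(queue)
    try:
        flow_file = session.get()
        destination.append(flow_file)  # data has already left NiFi
        if crash_before_commit:
            raise Crash()
        session.commit()
    except Crash:
        session.restore_after_crash()


queue = deque(["ff-1"])
destination = []
run_once(queue, destination, crash_before_commit=True)   # crash, then restart
run_once(queue, destination, crash_before_commit=False)  # normal run
print(destination)  # ['ff-1', 'ff-1']
```

The destination ends up with the same flow file twice, which is what
shows up as primary key violations or duplicate files.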

The ideal solution is for the destination system to offer some type of
transaction, such that no data is actually made available in the
destination system until that transaction is committed. The processor
would commit that transaction and then immediately commit the NiFi
session, which makes it very unlikely for NiFi to crash between those
exact two lines of code.
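
A rough sketch of that ordering, again as a toy model rather than real
NiFi or destination client code (DestinationTxn, deliver and the
crash_at values are hypothetical). Writes are staged in the destination
transaction, the transaction is committed, and only then is the flow
file removed from the queue, which stands in for the session commit.

```python
from collections import deque


class DestinationTxn:
    """Toy destination transaction: staged records are invisible
    until commit()."""

    def __init__(self, visible):
        self.visible = visible
        self.staged = []

    def write(self, record):
        self.staged.append(record)

    def commit(self):
        self.visible.extend(self.staged)
        self.staged = []


def deliver(queue, visible, crash_at=None):
    """crash_at is None, "while_writing" or "between_commits"."""
    flow_file = queue[0]  # taken, but only removed on session commit
    txn = DestinationTxn(visible)
    for record in flow_file:
        txn.write(record)
    if crash_at == "while_writing":
        return  # staged data is discarded, nothing became visible
    txn.commit()
    if crash_at == "between_commits":
        return  # the narrow window where a duplicate is still possible
    queue.popleft()  # stands in for the session commit


# Crash while writing: the retry after restart produces no duplicates.
queue = deque([["a", "b"]])
visible = []
deliver(queue, visible, crash_at="while_writing")
deliver(queue, visible)

# Crash between the two commits: the retry duplicates the data.
queue2 = deque([["a"]])
visible2 = []
deliver(queue2, visible2, crash_at="between_commits")
deliver(queue2, visible2)
```

In the first case the destination sees each record once. In the second
case the record is written twice, so this approach narrows the window
for duplicates but does not remove it completely.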

PutParquet uses the HDFS client, which doesn't really have a
transaction concept spanning multiple files, and Kudu seems to have
some transaction ability, but only in limited scenarios.

On Wed, Mar 17, 2021 at 8:55 AM <[email protected]> wrote:
>
> I’m just jumping in; we are seeing this issue as well when we restart the
> NiFi process from time to time.
>
>
>
> We are aware of the nifi.properties parameter
> “nifi.flowcontroller.graceful.shutdown.period=10 sec”, but to be honest we
> haven’t tried raising it yet. Maybe it takes more than 10s to fully execute
> the PutKudu, I really don’t know.
>
>
>
> Cheers Josef
>
> From: Vibhath Ileperuma <[email protected]>
> Reply to: "[email protected]" <[email protected]>
> Date: Wednesday, 17 March 2021 at 13:49
> To: "[email protected]" <[email protected]>
> Subject: Re: Data duplication When NIFI is restarted
>
>
>
> Hi Pierre,
>
>
>
> The NiFi flow I'm implementing can run continuously for a long time (maybe
> a couple of weeks or months). During this period it can be terminated due
> to a memory issue or some other system issue, can't it? In such a case, I
> may need to restart NiFi manually and run the flow from where it stopped.
>
> Thanks & Regards
>
> Vibhath Ileperuma
>
> On Wed, Mar 17, 2021 at 5:51 PM Pierre Villard <[email protected]> 
> wrote:
>
> Hi Vibhath,
>
>
>
> How is NiFi terminated / restarted ?
>
>
>
> Thanks,
>
> Pierre
>
>
>
> Le mer. 17 mars 2021 à 15:04, Vibhath Ileperuma <[email protected]> 
> a écrit :
>
> Hi all,
>
>
>
> I notice that if the NiFi instance gets terminated while a processor is
> processing a flow file, that processor starts to process the flow file again
> from the beginning when NiFi is restarted.
>
> I'm using the PutKudu and PutParquet processors to write data to Kudu and
> to Parquet files. Due to the above behaviour:
>
> 1. PutKudu shows primary key violation errors after a restart. I'm using the
>    INSERT operation and I can't use INSERT_IGNORE or UPSERT since I need to
>    be notified if incoming data has duplicates.
> 2. Since I need to write the data in a single flow file into multiple Parquet
>    files (by specifying the row group size), it is possible for the
>    PutParquet processor to generate multiple Parquet files with the same
>    content after a restart (data can be duplicated).
>
> I would be grateful if you could suggest a way to overcome this problem.
>
> Thanks & Regards
>
> Vibhath Ileperuma
