Re: NiFi (De-)"Compress Content" Processor causes content_repo to fill up insanely fast with corrupt GZIP files

2019-01-04 Thread Mark Payne
Josef,

OK, thanks for confirming. My suspicion is that the Load-Balancing bug is what 
is biting you, and that
when you tried to replicate with the CompressContent in a simple case, you may 
have just been experiencing
the "cleanup lag" related to the way that the repositories interact with one 
another.

Custom Processors should not be an issue. You should not be able to cause any 
FlowFile to stay in the Repository.
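
For reference, a minimal sketch of the usual ProcessSession contract, assuming a plain AbstractProcessor with a single success relationship (class and relationship names are illustrative): every FlowFile taken from the session is transferred or removed before the session commits, and that is what lets the framework release the underlying content claim.

import java.util.Collections;
import java.util.Set;

import org.apache.nifi.flowfile.FlowFile;
import org.apache.nifi.processor.AbstractProcessor;
import org.apache.nifi.processor.ProcessContext;
import org.apache.nifi.processor.ProcessSession;
import org.apache.nifi.processor.Relationship;
import org.apache.nifi.processor.exception.ProcessException;

public class PassThroughExample extends AbstractProcessor {

    static final Relationship REL_SUCCESS = new Relationship.Builder()
            .name("success")
            .description("Every FlowFile is routed here")
            .build();

    @Override
    public Set<Relationship> getRelationships() {
        return Collections.singleton(REL_SUCCESS);
    }

    @Override
    public void onTrigger(final ProcessContext context, final ProcessSession session) throws ProcessException {
        final FlowFile flowFile = session.get();
        if (flowFile == null) {
            return;
        }

        // Everything obtained from the session must be transferred or removed
        // (or the session rolled back) before AbstractProcessor commits it.
        // Once the commit happens, the claimant count on the FlowFile's content
        // claim drops, and the claim becomes eligible for cleanup after the next
        // FlowFile repository checkpoint.
        session.transfer(flowFile, REL_SUCCESS);
        // session.remove(flowFile) would be used instead if the data were no longer needed.
    }
}

If onTrigger returns without transferring or removing a FlowFile it obtained, the commit is rejected and the session is rolled back, which is why a custom processor should not be able to strand content on its own.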

Thanks
-Mark



Re: NiFi (De-)"Compress Content" Processor causes content_repo to fill up insanely fast with corrupt GZIP files

2019-01-04 Thread Josef.Zahner1
Mark,

Yes, we are using the Load Balancing capability, and we do that after the ListSFTP 
processor, so yes, we do load-balance 0-byte files. It seems we are probably 
hitting the bug you mentioned.

Thanks a lot for explaining in detail how the flowfile/content repositories 
interact in NiFi.

Additionally, we have several custom processors; could one of them be causing 
this as well? Can someone share a (Java) code snippet that ensures a custom 
processor doesn't keep FlowFiles in the content repo?

Cheers Josef

Re: NiFi (De-)"Compress Content" Processor causes content_repo to fill up insanely fast with corrupt GZIP files

2019-01-04 Thread Mark Payne
Josef,

Thanks for the info! There are a few things to consider here. Firstly, you said 
that you are using NiFi 1.8.0.
Are you using the new Load Balancing capability? I.e., do you have any 
Connections configured to balance
load across your cluster? And if so, are you load-balancing any 0-byte files? 
If so, then you may be getting
bitten by [1]. That can result in data staying in the Content Repo and not 
getting cleaned up until restart.

The second thing that is important to consider is the interaction between the 
FlowFile Repositories and Content
Repository. At a high level, the Content Repository stores the FlowFiles' 
content/payload. The FlowFile Repository
stores the FlowFiles' attributes, which queue it is in, and some other 
metadata. Once a FlowFile completes its processing
and is no longer part of the flow, we cannot simply delete the content claim 
from the Content Repository. If we did so,
we could have a condition where the node is restarted and the FlowFile 
Repository has not yet been fully flushed to disk
(NiFi may have already written to the file, but the Operating System may be 
caching that without having flushed/"fsync'ed"
to disk). In such a case, we want the transaction to be "rolled back" and 
reprocessed. So, if we deleted the Content Claim
from the Content Repository immediately when it is no longer needed, and then 
restarted, we could have a case where the
FlowFile repo wasn't flushed to disk and as a result points to a Content Claim 
that has been deleted, and this would result
in data loss.

So, to avoid the above scenario, what we do is instead keep track of how many 
"claims" there are for a Content Claim
and then, when the FlowFile repo performs a checkpoint (every 2 minutes by 
default), we go through and delete any Content
Claims that have a claim count of 0. So this means that any Content Claim that 
has been accessed in the past 2 minutes
(or however long the checkpoint time is) will be considered "active" and will 
not be cleaned up.
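
Purely as an illustration of that bookkeeping (hypothetical names, not NiFi's internal classes): claims are reference-counted, and only the sweep that runs after a FlowFile repository checkpoint actually deletes anything from disk.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

class ClaimTrackerSketch {

    private final Map<String, AtomicInteger> claimantCounts = new ConcurrentHashMap<>();

    // Called whenever a FlowFile starts referencing a content claim.
    void incrementClaimants(final String claimId) {
        claimantCounts.computeIfAbsent(claimId, id -> new AtomicInteger()).incrementAndGet();
    }

    // Called when a FlowFile no longer needs the claim, e.g. it has left the flow.
    void decrementClaimants(final String claimId) {
        final AtomicInteger count = claimantCounts.get(claimId);
        if (count != null) {
            count.decrementAndGet();
        }
    }

    // Runs only after the FlowFile repository has checkpointed (every 2 minutes by
    // default), so a claim touched within the last checkpoint interval still looks
    // "active" and survives until the next sweep.
    void sweepAfterCheckpoint(final ContentStore store) {
        claimantCounts.entrySet().removeIf(entry -> {
            if (entry.getValue().get() <= 0) {
                store.delete(entry.getKey());
                return true;
            }
            return false;
        });
    }

    // Stand-in for the content repository's delete path.
    interface ContentStore {
        void delete(String claimId);
    }
}

The practical effect is that disk usage in content_repo lags a couple of minutes behind what the flow has already finished with, because deletion is deferred until after the checkpoint rather than happening the moment the last claimant goes away.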

I hope this helps to explain some of the behavior, but if not, let's please 
investigate further!

Thanks
-Mark



[1] https://issues.apache.org/jira/browse/NIFI-5771


On Jan 4, 2019, at 7:41 AM, <josef.zahn...@swisscom.com> wrote:

Hi Joe

We use NiFi 1.8.0. Yes, we have different partitions for each repo; you can see them below.

[nifi@nifi-12 ~]$ df -h
Filesystem Size  Used Avail Use% Mounted on
/dev/mapper/disk1-root 100G  2.0G   99G   2% /
devtmpfs   126G 0  126G   0% /dev
tmpfs  126G 0  126G   0% /dev/shm
tmpfs  126G  3.1G  123G   3% /run
tmpfs  126G 0  126G   0% /sys/fs/cgroup
/dev/sda1 1014M  188M  827M  19% /boot
/dev/mapper/disk1-home  30G   34M   30G   1% /home
/dev/mapper/disk1-var  100G  1.1G   99G   2% /var
/dev/mapper/disk1-opt   50G  5.9G   45G  12% /opt
/dev/mapper/disk1-database_repo   1014M   35M  980M   4% /database_repo
/dev/mapper/disk1-provenance_repo  4.0G   33M  4.0G   1% /provenance_repo
/dev/mapper/disk1-flowfile_repo530G   34M  530G   1% /flowfile_repo
/dev/mapper/disk2-content_repo 850G   64G  786G   8% /content_repo
tmpfs   26G 0   26G   0% /run/user/2000


Cheers Josef


From: Joe Witt <joe.w...@gmail.com>
Reply-To: "users@nifi.apache.org" <users@nifi.apache.org>
Date: Friday, 4 January 2019 at 13:29
To: "users@nifi.apache.org" <users@nifi.apache.org>
Subject: Re: NiFi (De-)"Compress Content" Processor causes content_repo to fill up insanely fast with corrupt GZIP files

Josef

Not looping for that proc for sure makes sense. NiFi dying in the middle of a 
process/transaction is no problem; it will restart the transaction.

But we do need to find out what is filling the repo. You have flowfile, 
content, and prov in different disk volumes or partitions, right? What version 
of NiFi?

Let's definitely figure this out. You should see clean behavior from the repos, 
and you should never have to restart.

thanks

On Fri, Jan 4, 2019, 7:16 AM Mike Thomsen <mikerthom...@gmail.com> wrote:
I agree with Pierre's take on the failure relationship. Corrupted compressed 
files are also going to be nearly impossible to recover in most cases, so your 
best bet is to simply log the file name and other relevant attributes and 
establish a process to notify the source system that they sent you corrupt data.
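
As a sketch of that approach (illustrative class, relationships, and attribute handling; it assumes java.util.zip is good enough for the validity check and is not a drop-in replacement for CompressContent):

import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;
import java.util.zip.GZIPInputStream;

import org.apache.nifi.flowfile.FlowFile;
import org.apache.nifi.processor.AbstractProcessor;
import org.apache.nifi.processor.ProcessContext;
import org.apache.nifi.processor.ProcessSession;
import org.apache.nifi.processor.Relationship;
import org.apache.nifi.processor.exception.ProcessException;

public class ValidateGzipExample extends AbstractProcessor {

    static final Relationship REL_SUCCESS = new Relationship.Builder()
            .name("success").description("FlowFiles whose GZIP content reads cleanly").build();
    static final Relationship REL_FAILURE = new Relationship.Builder()
            .name("failure").description("FlowFiles whose GZIP content is corrupt").build();

    @Override
    public Set<Relationship> getRelationships() {
        return Collections.unmodifiableSet(new HashSet<>(Arrays.asList(REL_SUCCESS, REL_FAILURE)));
    }

    @Override
    public void onTrigger(final ProcessContext context, final ProcessSession session) throws ProcessException {
        final FlowFile flowFile = session.get();
        if (flowFile == null) {
            return;
        }

        boolean corrupt = false;
        // Read the content once, purely to validate the GZIP framing.
        try (InputStream in = session.read(flowFile);
             GZIPInputStream gzipIn = new GZIPInputStream(in)) {
            final byte[] buffer = new byte[8192];
            while (gzipIn.read(buffer) != -1) {
                // drain and discard
            }
        } catch (final IOException e) {
            corrupt = true;
            // Log the identifying attributes so the source system can be told about the bad file.
            getLogger().error("Corrupt GZIP content for filename={}",
                    new Object[]{flowFile.getAttribute("filename")}, e);
        }

        session.transfer(flowFile, corrupt ? REL_FAILURE : REL_SUCCESS);
    }
}

The failure relationship can then feed whatever notification path you prefer (PutEmail, a ticketing integration, etc.) rather than looping back into the decompress attempt.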

On Fri, Jan 4, 2019 at 6:48 AM <josef.zahn...@swisscom.com> wrote:
Hi Arpad

I’m doing it (hopefully) gracefully:

  *   /opt/nifi/bin/nifi.sh stop
  *   /opt/nifi/bin/nifi.sh restart


But what I se