Re: NiFi (De-)"Compress Content" Processor causes to fill up content_repo insanly fast by corrupt GZIP files
Josef,

OK, thanks for confirming. My suspicion is that the Load-Balancing bug is what is biting you, and that when you tried to replicate with CompressContent in a simple case, you may have just been experiencing the "cleanup lag" related to the way that the repositories interact with one another. Custom Processors should not be an issue: you should not be able to cause any FlowFile to stay in the repository.

Thanks
-Mark

On Jan 4, 2019, at 11:48 AM, josef.zahn...@swisscom.com wrote:

Mark,

Yes, we are using the Load Balancing capability, and we do that right after the ListSFTP processor, so yes, we load-balance 0-byte files. It seems that we are probably facing your bug here. Thanks a lot for explaining in detail what happens regarding the FlowFile/content repo in NiFi.

Additionally, we have several custom processors; could it be that one of them is causing it as well? Can someone share a (Java) code snippet which ensures that a custom processor doesn't keep FlowFiles in the content repo?

Cheers, Josef

From: Mark Payne <marka...@hotmail.com>
Reply-To: users@nifi.apache.org
Date: Friday, 4 January 2019 at 14:48
To: users@nifi.apache.org
Subject: Re: NiFi (De-)"Compress Content" Processor causes content_repo to fill up insanely fast with corrupt GZIP files

Josef,

Thanks for the info! There are a few things to consider here.

Firstly, you said that you are using NiFi 1.8.0. Are you using the new Load Balancing capability? I.e., do you have any Connections configured to balance load across your cluster? And if so, are you load-balancing any 0-byte files? If so, then you may be getting bitten by [1]. That can result in data staying in the Content Repo and not getting cleaned up until restart.

The second thing that is important to consider is the interaction between the FlowFile Repository and the Content Repository. At a high level, the Content Repository stores the FlowFiles' content/payload, while the FlowFile Repository stores the FlowFiles' attributes, which queue each FlowFile is in, and some other metadata. Once a FlowFile completes its processing and is no longer part of the flow, we cannot simply delete its content claim from the Content Repository. If we did so, we could have a condition where the node is restarted before the FlowFile Repository has been fully flushed to disk (NiFi may have already written to the file, but the Operating System may be caching that data without having flushed/fsync'ed it to disk). In such a case, we want the transaction to be "rolled back" and reprocessed. So, if we deleted the Content Claim from the Content Repository immediately when it is no longer needed and then restarted, we could have a case where the FlowFile repo wasn't flushed to disk and as a result points to a Content Claim that has been deleted, and this would result in data loss.

To avoid the above scenario, what we do instead is keep track of how many "claims" there are for each Content Claim and then, when the FlowFile repo performs a checkpoint (every 2 minutes by default), we go through and delete any Content Claims that have a claim count of 0. This means that any Content Claim that has been accessed in the past 2 minutes (or however long the checkpoint interval is) will be considered "active" and will not be cleaned up.

I hope this helps to explain some of the behavior, but if not, let's please investigate further!

Thanks
-Mark

[1] https://issues.apache.org/jira/browse/NIFI-5771
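[Editor's note: to illustrate the mechanism Mark describes, here is a conceptual sketch in Java. This is not NiFi's actual implementation; class and method names are invented for explanation only. It shows reference counting with deletion deferred until the FlowFile repository checkpoint.]

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Conceptual sketch (NOT NiFi's real code) of claim-count cleanup:
// each content claim is reference-counted, and claims whose count has
// dropped to 0 are deleted only when the FlowFile repo checkpoints.
class ClaimTracker {

    // claim identifier -> number of FlowFiles currently referencing it
    private final Map<String, AtomicInteger> claimCounts = new ConcurrentHashMap<>();

    void incrementClaim(String claimId) {
        claimCounts.computeIfAbsent(claimId, id -> new AtomicInteger()).incrementAndGet();
    }

    void decrementClaim(String claimId) {
        // The content is NOT deleted here, even when the count hits 0.
        // Deleting immediately could lose data if the node restarts before
        // the FlowFile repository has been fsync'ed to disk.
        claimCounts.computeIfPresent(claimId, (id, count) -> {
            count.decrementAndGet();
            return count;
        });
    }

    // Called after each FlowFile repository checkpoint (default: every 2 minutes).
    void onCheckpoint() {
        claimCounts.entrySet().removeIf(entry -> {
            if (entry.getValue().get() <= 0) {
                // Now safe: the FlowFile repo's durable state no longer
                // references this claim.
                deleteFromContentRepository(entry.getKey());
                return true;
            }
            return false;
        });
    }

    private void deleteFromContentRepository(String claimId) {
        // placeholder for the actual file deletion
    }
}

Deferring deletion to the checkpoint is what makes a crash safe: the FlowFile repository's on-disk state can never end up pointing at content that has already been removed, which is also why recently released claims linger in content_repo for up to one checkpoint interval.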
On Jan 4, 2019, at 7:41 AM, josef.zahn...@swisscom.com wrote:

Hi Joe

We use NiFi 1.8.0. Yes, we have different partitions for each repo; you can see the partitions below.

[nifi@nifi-12 ~]$ df -h
Filesystem                          Size  Used Avail Use% Mounted on
/dev/mapper/disk1-root              100G  2.0G   99G   2% /
devtmpfs                            126G     0  126G   0% /dev
tmpfs                               126G     0  126G   0% /dev/shm
tmpfs                               126G  3.1G  123G   3% /run
tmpfs                               126G     0  126G   0% /sys/fs/cgroup
/dev/sda1                          1014M  188M  827M  19% /boot
/dev/mapper/disk1-home               30G   34M   30G   1% /home
/dev/mapper/disk1-var               100G  1.1G   99G   2% /var
/dev/mapper/disk1-opt                50G  5.9G   45G  12% /opt
/dev/mapper/disk1-database_repo    1014M   35M  980M   4% /database_repo
/dev/mapper/disk1-provenance_repo   4.0G   33M  4.0G   1% /provenance_repo
/dev/mapper/disk1-flowfile_repo     530G   34M  530G   1% /flowfile_repo
/dev/mapper/disk2-content_repo      850G   64G  786G   8% /content_repo
tmpfs                                26G     0   26G   0% /run/user/2000

Cheers, Josef
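[Editor's note: as for Josef's question earlier in the thread about a (Java) snippet for custom processors, the framework itself enforces the accounting. Every FlowFile obtained from or created by a ProcessSession must be transferred or removed before the session commits, so a well-formed processor cannot silently keep a FlowFile alive in the content repo. Below is a minimal sketch of the standard pattern; the class and relationship names are illustrative, not from Josef's flow.]

import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

import org.apache.nifi.flowfile.FlowFile;
import org.apache.nifi.processor.AbstractProcessor;
import org.apache.nifi.processor.ProcessContext;
import org.apache.nifi.processor.ProcessSession;
import org.apache.nifi.processor.Relationship;
import org.apache.nifi.processor.exception.ProcessException;

public class MyPassThroughProcessor extends AbstractProcessor {

    static final Relationship REL_SUCCESS = new Relationship.Builder()
            .name("success")
            .description("FlowFiles processed successfully")
            .build();
    static final Relationship REL_FAILURE = new Relationship.Builder()
            .name("failure")
            .description("FlowFiles that could not be processed")
            .build();

    private static final Set<Relationship> RELATIONSHIPS = Collections.unmodifiableSet(
            new HashSet<>(Arrays.asList(REL_SUCCESS, REL_FAILURE)));

    @Override
    public Set<Relationship> getRelationships() {
        return RELATIONSHIPS;
    }

    @Override
    public void onTrigger(final ProcessContext context, final ProcessSession session) throws ProcessException {
        FlowFile flowFile = session.get();
        if (flowFile == null) {
            return;
        }

        try {
            // ... read or modify the FlowFile's content here ...

            // Every FlowFile obtained from (or created by) the session must be
            // transferred or removed before the session commits; otherwise the
            // framework throws an exception and rolls the session back.
            session.transfer(flowFile, REL_SUCCESS);
        } catch (final Exception e) {
            getLogger().error("Processing failed", e);
            // Routing to failure (or calling session.remove(flowFile)) lets the
            // FlowFile leave the flow, so its content claim's count can drop and
            // the claim be reclaimed at the next checkpoint.
            session.transfer(flowFile, REL_FAILURE);
        }
        // AbstractProcessor commits the session automatically when onTrigger
        // returns; an uncaught exception causes an automatic rollback instead.
    }
}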
From: Joe Witt <joe.w...@gmail.com>
Reply-To: users@nifi.apache.org
Date: Friday, 4 January 2019 at 13:29
To: users@nifi.apache.org
Subject: Re: NiFi (De-)"Compress Content" Processor causes content_repo to fill up insanely fast with corrupt GZIP files

Josef

Not looping for that proc for sure makes sense. NiFi dying in the middle of a process/transaction is no problem; it will restart the transaction. But we do need to find out what is filling the repo. You have flowfile, content, and prov in different disk volumes or partitions, right? What version of NiFi? Let's definitely figure this out. You should see clean behavior of the repos and you should never have to restart.

thanks
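[Editor's note: for reference, repository locations like those in Josef's df output are set in nifi.properties. The values below are inferred from the mount points shown above, not taken from Josef's actual configuration.]

# nifi.properties: point each repository at its own partition
nifi.database.directory=/database_repo
nifi.flowfile.repository.directory=/flowfile_repo
nifi.content.repository.directory.default=/content_repo
nifi.provenance.repository.directory.default=/provenance_repo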
On Fri, Jan 4, 2019, 7:16 AM Mike Thomsen <mikerthom...@gmail.com> wrote:

I agree with Pierre's take on the failure relationship. Corrupted compressed files are also going to be nearly impossible to recover in most cases, so your best bet is to simply log the file name and other relevant attributes and establish a process to notify the source system that they sent you corrupt data.

On Fri, Jan 4, 2019 at 6:48 AM josef.zahn...@swisscom.com wrote:

Hi Arpad

I'm doing it (hopefully) gracefully:

* /opt/nifi/bin/nifi.sh stop
* /opt/nifi/bin/nifi.sh restart

But what I se