Re: MergeRecord, queue & backpressure
Good afternoon,

Another thing that may help: you can also tweak this setting in nifi.properties:

nifi.queue.swap.threshold=20000

This setting controls the maximum FlowFile count on a connection; if it is exceeded, the excess FlowFiles are swapped out to disk. I am not sure, however, whether there is a distinction between "record" FlowFiles, since they live in memory, and the traditional way of thinking of FlowFiles.

On Fri, Apr 13, 2018 at 10:49 AM Mark Payne wrote:

> Aurélien,
>
> In that case you're looking to merge about 500,000 FlowFiles into a single FlowFile, so you'll definitely want to use a cascading approach. I'd shoot for about 1 MB for the first MergeRecord and then merge 128 of those together for the second MergeRecord.
>
> The provenance backpressure is occurring because of the large number of provenance events being generated. One event will be generated, more or less, each time that a Processor touches a FlowFile. So if you merge the FlowFiles together as early as possible, you'll reduce the load that you're putting on the Provenance Repository.
>
> Also, depending on how you're getting the data into your flow, if you're able, it is best to receive a larger "micro-batch" of records per FlowFile to begin with and not split them up. This would greatly alleviate the pressure on the Provenance Repository and avoid needing multiple MergeRecord processors as well.
>
> Also, of note, there is a newer version of the Provenance Repository that you can switch to by changing the "nifi.provenance.repository.implementation" property in nifi.properties from "org.apache.nifi.provenance.PersistentProvenanceRepository" to "org.apache.nifi.provenance.WriteAheadProvenanceRepository". The Write-Ahead version is quite a bit faster and behaves differently than the Persistent Provenance Repo, so you won't see those warnings about provenance backpressure.
>
> I hope this helps!
>
> -Mark
>
> > On Apr 13, 2018, at 10:30 AM, DEHAY Aurelien <aurelien.de...@faurecia.com> wrote:
> >
> > Hello.
> >
> > It's me again regarding my MergeRecord question.
> >
> > I still haven't managed to get what I want. I may have understood how bin-based processors work, so this is for clarification, plus a question regarding performance.
> >
> > I want to merge a huge number of 300-byte FlowFiles into a 128 MB Parquet file.
> >
> > My understanding is that, for MergeRecord to be able to create a bin with 128 MB of data, all of that data must be in the queue. We can't feed the bin "one FlowFile at a time", so when working with small FlowFiles I have to set the backpressure threshold to something really high, or remove the FlowFile-count backpressure limit completely.
> >
> > I understood by reading https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.6.0/org.apache.nifi.processors.standard.MergeRecord/additionalDetails.html that this is not the "good" way to do it, and that I should instead cascade multiple merges to "slowly" make the FlowFiles bigger?
> >
> > I've run some tests with a single level, but I hit the "provenance recording rate" warning. Will multiple levels help?
> >
> > Thanks for any help.
> >
> > Aurélien.
> >
> > This electronic transmission (and any attachments thereto) is intended solely for the use of the addressee(s). It may contain confidential or legally privileged information. If you are not the intended recipient of this message, you must delete it immediately and notify the sender. Any unauthorized use or disclosure of this message is strictly prohibited. Faurecia does not guarantee the integrity of this transmission and shall therefore never be liable if the message is altered or falsified nor for any virus, interception or damage to your system.
Re: MergeRecord, queue & backpressure
Aurélien,

In that case you're looking to merge about 500,000 FlowFiles into a single FlowFile, so you'll definitely want to use a cascading approach. I'd shoot for about 1 MB for the first MergeRecord and then merge 128 of those together for the second MergeRecord.

The provenance backpressure is occurring because of the large number of provenance events being generated. One event will be generated, more or less, each time that a Processor touches a FlowFile. So if you merge the FlowFiles together as early as possible, you'll reduce the load that you're putting on the Provenance Repository.

Also, depending on how you're getting the data into your flow, if you're able, it is best to receive a larger "micro-batch" of records per FlowFile to begin with and not split them up. This would greatly alleviate the pressure on the Provenance Repository and avoid needing multiple MergeRecord processors as well.

Also, of note, there is a newer version of the Provenance Repository that you can switch to by changing the "nifi.provenance.repository.implementation" property in nifi.properties from "org.apache.nifi.provenance.PersistentProvenanceRepository" to "org.apache.nifi.provenance.WriteAheadProvenanceRepository". The Write-Ahead version is quite a bit faster and behaves differently than the Persistent Provenance Repo, so you won't see those warnings about provenance backpressure.

I hope this helps!

-Mark

> On Apr 13, 2018, at 10:30 AM, DEHAY Aurelien wrote:
>
> Hello.
>
> It's me again regarding my MergeRecord question.
>
> I still haven't managed to get what I want. I may have understood how bin-based processors work, so this is for clarification, plus a question regarding performance.
>
> I want to merge a huge number of 300-byte FlowFiles into a 128 MB Parquet file.
>
> My understanding is that, for MergeRecord to be able to create a bin with 128 MB of data, all of that data must be in the queue. We can't feed the bin "one FlowFile at a time", so when working with small FlowFiles I have to set the backpressure threshold to something really high, or remove the FlowFile-count backpressure limit completely.
>
> I understood by reading https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.6.0/org.apache.nifi.processors.standard.MergeRecord/additionalDetails.html that this is not the "good" way to do it, and that I should instead cascade multiple merges to "slowly" make the FlowFiles bigger?
>
> I've run some tests with a single level, but I hit the "provenance recording rate" warning. Will multiple levels help?
>
> Thanks for any help.
>
> Aurélien.
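
As a rough check of the cascade sizing suggested in the reply above, here is a minimal Python sketch of the arithmetic. It is only back-of-the-envelope math, not NiFi code: the 300-byte FlowFile size, the ~1 MB first stage, and the 128 MB target come from the thread, while the use of binary megabytes and the name `cascade_plan` are assumptions made for illustration.

```python
import math

FLOWFILE_BYTES = 300               # per-FlowFile size from the original question
STAGE1_BYTES = 1 * 1024 * 1024     # ~1 MB first-stage bins, as suggested in the reply
TARGET_BYTES = 128 * 1024 * 1024   # 128 MB target; binary megabytes are an assumption


def cascade_plan(flowfile_bytes, stage1_bytes, target_bytes):
    """Return rough FlowFile counts for a one-level and a two-level merge."""
    return {
        # FlowFiles that must be binned together when merging in a single step
        "single_level_inputs": math.ceil(target_bytes / flowfile_bytes),
        # small FlowFiles that fit into one first-stage bin
        "stage1_inputs_per_bin": stage1_bytes // flowfile_bytes,
        # first-stage outputs needed to reach the target in the second stage
        "stage2_inputs_per_bin": math.ceil(target_bytes / stage1_bytes),
    }


plan = cascade_plan(FLOWFILE_BYTES, STAGE1_BYTES, TARGET_BYTES)
for name, count in plan.items():
    print(f"{name}: {count:,}")
```

Under these assumptions a single-level merge has to bin roughly 447,000 FlowFiles at once (the "about 500,000" in the reply), whereas the cascade needs about 3,500 per first-stage bin and 128 per second-stage bin.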
MergeRecord, queue & backpressure
Hello.

It's me again regarding my MergeRecord question.

I still haven't managed to get what I want. I may have understood how bin-based processors work, so this is for clarification, plus a question regarding performance.

I want to merge a huge number of 300-byte FlowFiles into a 128 MB Parquet file.

My understanding is that, for MergeRecord to be able to create a bin with 128 MB of data, all of that data must be in the queue. We can't feed the bin "one FlowFile at a time", so when working with small FlowFiles I have to set the backpressure threshold to something really high, or remove the FlowFile-count backpressure limit completely.

I understood by reading https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.6.0/org.apache.nifi.processors.standard.MergeRecord/additionalDetails.html that this is not the "good" way to do it, and that I should instead cascade multiple merges to "slowly" make the FlowFiles bigger?

I've run some tests with a single level, but I hit the "provenance recording rate" warning. Will multiple levels help?

Thanks for any help.

Aurélien.
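
To put numbers on the concern in this message, here is a small Python sketch. It takes the message's own premise that everything destined for a bin must be sitting in the incoming queue, and it assumes an object-count backpressure threshold of 10,000 FlowFiles purely for illustration (that value is not stated anywhere in this thread). The helper names are hypothetical.

```python
import math

MB = 1024 * 1024  # binary megabyte, an assumption


def flowfiles_needed(bin_bytes, flowfile_bytes):
    """FlowFiles of a given size required to reach the bin's minimum size."""
    return math.ceil(bin_bytes / flowfile_bytes)


def bin_can_fill(bin_bytes, flowfile_bytes, object_threshold):
    """True if the queue may hold enough FlowFiles before backpressure applies."""
    return flowfiles_needed(bin_bytes, flowfile_bytes) <= object_threshold


THRESHOLD = 10_000  # assumed object-count backpressure threshold

print(bin_can_fill(128 * MB, 300, THRESHOLD))  # single 128 MB bin
print(bin_can_fill(1 * MB, 300, THRESHOLD))    # 1 MB first-stage bin
```

With the assumed threshold, a 128 MB bin of 300-byte FlowFiles cannot fill before backpressure kicks in, while a 1 MB bin can, which is the motivation for cascading the merges.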