Re: MergeRecord, queue & backpressure

2018-04-13 Thread Juan Sequeiros
Good afternoon,

Another thing that may help you out:

You can also tweak the nifi.properties setting:

nifi.queue.swap.threshold=2
This setting controls the maximum number of FlowFiles that a connection's
queue keeps in memory. Once that count is exceeded, NiFi swaps the excess
FlowFiles out to disk. (The default value is 20000.)

I am not sure, however, whether there is a distinction between a "record
flowfile", since they live in memory, and the traditional way of thinking of
flowfiles.



Re: MergeRecord, queue & backpressure

2018-04-13 Thread Mark Payne
Aurélien,

In that case you're looking to merge about 500,000 FlowFiles into a single
FlowFile, so you'll definitely want to use a cascading approach. I'd shoot
for about 1 MB for the first MergeRecord and then merge 128 of those together
for the second MergeRecord.
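
The arithmetic behind that suggestion can be sketched in a few lines of
Python. This is only a rough estimate under stated assumptions: the 300-byte
FlowFile size comes from the original question, 1 MB is taken as 1024 * 1024
bytes, the function and variable names are illustrative, and any size change
from record serialization (for example when writing Parquet) is ignored.

```python
import math

# Figures from the thread; treating 1 MB as 1024 * 1024 bytes is an assumption.
FLOWFILE_BYTES = 300
FIRST_LEVEL_BYTES = 1 * 1024 * 1024
TARGET_BYTES = 128 * 1024 * 1024


def cascade_plan(flowfile_bytes, first_level_bytes, target_bytes):
    """Estimate bin counts for a two-level merge (hypothetical helper).

    Returns a tuple of:
      - input FlowFiles needed to fill one first-level bin
      - first-level outputs needed to fill the final file
      - input FlowFiles needed in total for one final file
    """
    per_first_bin = math.ceil(first_level_bytes / flowfile_bytes)
    first_bins_per_target = math.ceil(target_bytes / first_level_bytes)
    total_inputs = math.ceil(target_bytes / flowfile_bytes)
    return per_first_bin, first_bins_per_target, total_inputs


per_first_bin, bins_per_target, total_inputs = cascade_plan(
    FLOWFILE_BYTES, FIRST_LEVEL_BYTES, TARGET_BYTES
)
print(per_first_bin, bins_per_target, total_inputs)  # 3496 128 447393
```

Under these assumptions the first MergeRecord only needs roughly 3,500
FlowFiles queued to fill a bin, and the second needs 128, instead of a single
queue holding about 450,000 FlowFiles. Both counts fit under the default
10,000-object backpressure threshold on a connection.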

The provenance backpressure is occurring because of the large number of
provenance events being generated. One event will be generated, more or less,
each time that a Processor touches a FlowFile. So if you merge the FlowFiles
together as early as possible, you'll reduce the load that you're putting on
the Provenance Repository.

Also, depending on how you're getting the data into your flow, if you're
able, it is best to receive a larger "micro-batch" of records per flowfile to
begin with and not split them up. This would greatly alleviate the pressure
on the Provenance Repository and avoid needing multiple MergeRecord
processors as well.

Also, of note, there is a newer version of the Provenance Repository that you
can switch to, by changing the "nifi.provenance.repository.implementation"
property in nifi.properties from
"org.apache.nifi.provenance.PersistentProvenanceRepository"
to "org.apache.nifi.provenance.WriteAheadProvenanceRepository". The
Write-Ahead version is quite a bit faster and behaves differently than the
Persistent Provenance Repo, so you won't see those warnings about provenance
backpressure.
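
If nifi.properties is managed by a script, that change amounts to rewriting a
single line. The following Python sketch shows one way to do it, assuming the
file uses plain key=value lines. The helper name and the sample text are
hypothetical, and only the property name and class names come from the
message above.

```python
KEY = "nifi.provenance.repository.implementation"
OLD_IMPL = "org.apache.nifi.provenance.PersistentProvenanceRepository"
NEW_IMPL = "org.apache.nifi.provenance.WriteAheadProvenanceRepository"


def switch_provenance_repo(properties_text):
    """Return the text with the provenance implementation set to NEW_IMPL.

    Hypothetical helper: every other line, including comments, is left as-is.
    A trailing newline in the input is not preserved.
    """
    out = []
    for line in properties_text.splitlines():
        name, sep, _value = line.partition("=")
        # Only rewrite the exact key; commented-out lines start with '#'
        # and therefore do not match.
        if sep and name.strip() == KEY:
            line = f"{KEY}={NEW_IMPL}"
        out.append(line)
    return "\n".join(out)


# Illustrative sample, not a complete nifi.properties file.
sample = f"nifi.queue.swap.threshold=20000\n{KEY}={OLD_IMPL}"
print(switch_provenance_repo(sample))
```

NiFi reads nifi.properties at startup, so the new setting takes effect only
after a restart.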

I hope this helps!
-Mark






MergeRecord, queue & backpressure

2018-04-13 Thread DEHAY Aurelien
Hello.

It's me again regarding my MergeRecord question.

I still haven't managed to get what I want. I may have understood how
bin-based processors work, so this is a request for clarification plus a
question about performance.

I want to merge a huge number of 300-byte FlowFiles into a 128 MB Parquet
file.

My understanding is that, for MergeRecord to be able to create a bin with
128 MB of data, all of that data must be in the queue. We can't feed the bin
"one FlowFile at a time", so when working with small FlowFiles I have to set
the backpressure threshold to something really high, or remove the
FlowFile-count backpressure limit completely.

From reading
https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.6.0/org.apache.nifi.processors.standard.MergeRecord/additionalDetails.html
I understood that this is not the "good" way to do it, and that I should
instead cascade multiple merges to "slowly" make the FlowFile bigger?

I ran some tests with a single level, but I hit the "provenance recording
rate" warning. Will multiple levels help?

Thanks for any help.

Aurélien.
