Re: Performance of DistributeLoad - Batch Size

Mark Payne Wed, 16 Sep 2020 06:06:52 -0700

I wasn’t expecting a bug report either :) Re the record stuff: I agree that the 
schema handling can be a bit complicated when you’re getting started.  
Especially if you’re not familiar with Avro and the schema format that it uses. 
But typically once you create a couple of schemas and configure a couple of 
record readers/writers, it starts to make a lot more sense.
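
For anyone who hasn't seen the format, here is a minimal sketch of what an Avro schema looks like. It is plain JSON, so the example only uses Python's standard json module to parse it. The record name, namespace, and field names are made up for illustration and do not come from any flow discussed in this thread.

```python
import json

# A hypothetical Avro record schema. "LogEvent", "example", and the field
# names are illustrative assumptions, not taken from a real flow.
schema_text = """
{
  "type": "record",
  "name": "LogEvent",
  "namespace": "example",
  "fields": [
    {"name": "timestamp", "type": "long"},
    {"name": "host", "type": "string"},
    {"name": "message", "type": ["null", "string"], "default": null}
  ]
}
"""

# Avro schemas are ordinary JSON documents, so json.loads can read them.
schema = json.loads(schema_text)

# Each entry in "fields" names one column of the record and gives its type.
# A union such as ["null", "string"] marks the field as optional.
field_names = [field["name"] for field in schema["fields"]]
print(field_names)
```

A schema like this is the kind of text that would go into a reader's or writer's Schema Text property when the schema is supplied by hand rather than inferred.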

Also of note, it’s gotten a *LOT* easier to handle, with the introduction of 
schema inference. If you don’t plan to use a schema registry outside of NiFi, 
you can usually just use a Schema Access Strategy of “Infer Schema” for Record 
Readers and a Schema Access Strategy of “Inherit Record Schema” for Record 
Writers. Most of the other schema-related properties can be ignored.

And there’s a PR up for NIFI-1121 [1], which is in review. That should also 
help to make the readers/writers much easier to configure by automatically 
hiding properties that are not relevant when configuring components. For 
example, if you choose a Schema Access Strategy of Infer Schema, there should 
be no need to ask you for the Schema Name and Schema Text, as those don’t 
really apply.

So I do think it’s worth taking the time to learn the Record stuff now - 
the performance difference is amazing, and flows are usually much more 
straightforward. But there’s more we’re doing to make it easier.

Thanks
-Mark

[1] https://issues.apache.org/jira/browse/NIFI-1121

On Sep 15, 2020, at 9:48 PM, Ryan Hendrickson wrote:

Thanks Mark - I was not expecting a Bug report out of this!  I'll give the 0 
millis a try tomorrow and see what happens.  In fairness, your laptop is 
probably more powerful than the virtual CPUs I'm running on :-).

@Ryan I've got to learn the Record stuff better than I have now... It's the 
whole complicated schema thing that has kept me away for far too long...

Ryan

On Tue, Sep 15, 2020 at 7:04 PM Mark Payne wrote:
Hey Ryan,

I tried to replicate the behavior that you’re seeing. I wasn’t seeing behavior 
as slow as what you’re mentioning, but was definitely seeing significantly 
slower performance than I would have expected (reached about 1.5 million/5 mins 
on my laptop, would expect about 8-10 million/5 mins). Did some quick profiling 
and see that it’s due to the NiFi session not handling a large number of 
Provenance Route events well. I created a Jira for this [1]. Interestingly, in 
the interim, you may get better performance by using a Run Duration of 0 millis 
instead of 1 second. That would end up being more expensive in other ways but 
would avoid the issue found in NIFI-7812. Hard to know for sure if it would 
help without trying it out to see.
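
To put the figures quoted in this thread on a common footing, the short sketch below converts each "FlowFiles per 5 minutes" number into FlowFiles per second. It is plain arithmetic on the numbers already mentioned; the function name and labels are just illustrative.

```python
WINDOW_SECONDS = 5 * 60  # every figure in the thread is quoted per 5 minutes


def per_second(flowfiles_per_window, window_seconds=WINDOW_SECONDS):
    """Convert a FlowFile count over a time window into a per-second rate."""
    return flowfiles_per_window / window_seconds


# Figures as quoted in the thread (approximate where the thread says so).
rates = {
    "originally reported (~200,000 / 5 min)": per_second(200_000),
    "Round Robin with a yield of 0 (~300,000 / 5 min)": per_second(300_000),
    "hoped for (1,000,000 / 5 min)": per_second(1_000_000),
    "seen on the laptop (~1.5 million / 5 min)": per_second(1_500_000),
    "expected, low end (8 million / 5 min)": per_second(8_000_000),
    "expected, high end (10 million / 5 min)": per_second(10_000_000),
}

for label, rate in rates.items():
    print(f"{label}: {rate:,.0f} FlowFiles/sec")
```

In per-second terms, the laptop result works out to 5,000 FlowFiles/sec, against an expectation of roughly 27,000 to 33,000 FlowFiles/sec.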

Hope this helps!
-Mark

[1] https://issues.apache.org/jira/browse/NIFI-7812

On Sep 15, 2020, at 5:42 PM, Ryan Hendrickson wrote:

Hi Mark,
   I'm using Next Available, and the Destination Queues are set with Zero (0) 
for Back Pressure and Size threshold, so the destinations should not fill up.

   I did switch to using RoundRobin and set it to a yield of 0.  That got me up 
to about 300,000 ff's / 5 minutes.  I was hoping for something around 1,000,000 
ff / 5 minutes.

   The overall flow looks a bit like this: Large amount of flow files -> 
DistributeLoad -> PutElasticsearchHttp.

Ryan

On Tue, Sep 15, 2020 at 4:55 PM Mark Payne wrote:
Ryan,

I presume you’re using the Round Robin strategy? Looks like that strategy will 
yield the processor if any destination is full. And it sounds like that will be 
very common in your case. Would recommend configuring the Processor and in the 
Settings tab, set the Yield Duration to “0 secs”. I suspect that will give you 
dramatically better performance.

Thanks
-Mark

> On Sep 15, 2020, at 4:41 PM, Ryan Hendrickson wrote:
>
> Hello,
>    I've got 1 million plus FlowFiles (nothing I can do about the count) that 
> go to a DistributeLoad.  The DistributeLoad, with 2 threads and a run duration 
> of 1 sec, can only sustain ~200,000 FlowFiles / five minutes.
>
>    Is there a better design pattern or a processor that takes a Batch Size to 
> split a Relationship into two or more?
>
> Thanks,
> Ryan
