Re: Performance of DistributeLoad - Batch Size

2020-09-16 Thread Mark Payne
I wasn’t expecting a bug report either :) Re the record stuff: I agree that the 
schema handling can be a bit complicated when you’re getting started.  
Especially if you’re not familiar with Avro and the schema format that it uses. 
But typically once you create a couple of schemas and configure a couple of 
record readers/writers, it starts to make a lot more sense.

Also of note, it’s gotten a *LOT* easier to handle with the introduction of 
schema inference. If you don’t plan to use a schema registry outside of NiFi, 
you can usually just use a Schema Access Strategy of “Infer Schema” for Record 
Readers and a Schema Access Strategy of “Inherit Record Schema” for Record 
Writers. Most of the other schema-related properties can be ignored.

And there’s a PR up for NIFI-1121 [1], which is in review. That should also 
help to make the readers/writers much easier to configure by automatically 
hiding properties that are not relevant when configuring components. For 
example, if you choose a Schema Access Strategy of Infer Schema, there should 
be no need to ask you for the Schema Name and Schema Text, as those don’t 
really apply.

So I do think it’s worth taking the time to learn the Record stuff now - 
the performance difference is amazing, and flows are usually much more 
straightforward. But there’s more we’re doing to make it easier.

Thanks
-Mark

[1] https://issues.apache.org/jira/browse/NIFI-1121
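
To make the idea of schema inference concrete, here is a toy sketch of what "inferring" an Avro-style record schema from sample data can look like. It is not NiFi's implementation; the function name `infer_schema` and the type mapping are assumptions made purely for illustration, and it only handles flat records.

```python
import json

# Toy illustration only: this is NOT how NiFi infers schemas internally.
# The mapping below is an assumed, simplified Python-to-Avro type table.
AVRO_TYPES = {bool: "boolean", int: "long", float: "double", str: "string"}


def infer_schema(records, name="inferred"):
    """Build an Avro-style record schema from a list of flat dicts."""
    fields = {}
    for record in records:
        for key, value in record.items():
            if value is None:
                avro_type = "null"
            else:
                avro_type = AVRO_TYPES.get(type(value), "string")
            fields.setdefault(key, set()).add(avro_type)
    # A field that is missing from some records is treated as nullable.
    for key in fields:
        if any(key not in record for record in records):
            fields[key].add("null")
    return {
        "type": "record",
        "name": name,
        "fields": [
            {"name": key,
             "type": sorted(types) if len(types) > 1 else next(iter(types))}
            for key, types in fields.items()
        ],
    }


sample = [{"id": 1, "host": "a"}, {"id": 2, "host": "b", "latency": 0.5}]
schema = infer_schema(sample)
print(json.dumps(schema, indent=2))
```

The point is only that the reader can derive field names and types from the data itself, so nobody has to write the schema by hand.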






Re: Performance of DistributeLoad - Batch Size

2020-09-15 Thread Ryan Hendrickson
Thanks Mark - I was not expecting a Bug report out of this!  I'll give the
0 millis a try tomorrow and see what happens.  In fairness, your laptop is
probably more powerful than the virtual CPUs I'm running on :-).

@Ryan I've got to learn the Record stuff better than I have now... It's the
whole complicated schema thing that has kept me away for far too long...

Ryan



Re: Performance of DistributeLoad - Batch Size

2020-09-15 Thread Mark Payne
Hey Ryan,

I tried to replicate the behavior that you’re seeing. I wasn’t seeing behavior 
as slow as what you’re mentioning, but was definitely seeing significantly 
slower performance than I would have expected (reached about 1.5 million/5 mins 
on my laptop, would expect about 8-10 million/5 mins). Did some quick profiling 
and see that it’s due to the NiFi session not handling a large number of 
Provenance Route events well. I created a Jira for this [1]. Interestingly, in 
the interim, you may get better performance by using a Run Duration of 0 millis 
instead of 1 second. That would end up being more expensive in other ways but 
would avoid the issue found in NIFI-7812. Hard to know for sure if it would 
help without trying it out to see.

Hope this helps!
-Mark

[1] https://issues.apache.org/jira/browse/NIFI-7812
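
For comparison, the throughput figures quoted in this thread can be put on a common per-second scale. The snippet below is just arithmetic on the numbers mentioned in the messages; the helper name `per_second` and the labels are illustrative.

```python
# Convert the "FlowFiles per 5 minutes" figures from the thread
# into FlowFiles per second so they are easier to compare.
WINDOW_SECONDS = 5 * 60


def per_second(flowfiles_per_window, window_seconds=WINDOW_SECONDS):
    """Return the average rate in FlowFiles per second."""
    return flowfiles_per_window / window_seconds


figures = {
    "original report": 200_000,
    "Round Robin, yield 0": 300_000,
    "hoped-for rate": 1_000_000,
    "laptop test, observed": 1_500_000,
    "laptop test, expected (low)": 8_000_000,
    "laptop test, expected (high)": 10_000_000,
}

for label, count in figures.items():
    print(f"{label:30s} {per_second(count):>10,.0f} FlowFiles/sec")
```

Seen this way, the original report is under 700 FlowFiles/sec, while the expected laptop rate is in the tens of thousands per second.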







Re: Performance of DistributeLoad - Batch Size

2020-09-15 Thread Ryan Ward
Hi Ryan

I would merge the files into larger files before DistributeLoad and use
PutElasticsearchHttpRecord.
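
A rough sketch of why merging helps: many small documents can be sent as a few large requests instead of one unit of work each. The example below groups documents into batches and renders each batch in the newline-delimited JSON format that Elasticsearch's bulk API accepts. The helper names, the index name `example-index`, and the batch size of 1000 are assumptions for illustration; this does not show how the NiFi processor itself is implemented.

```python
import json


def batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def to_bulk_body(documents, index_name):
    """Render documents as an Elasticsearch bulk request body (NDJSON).

    Each document becomes two lines: an action line and the source line.
    The body must end with a trailing newline.
    """
    lines = []
    for doc in documents:
        lines.append(json.dumps({"index": {"_index": index_name}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"


# 2,500 tiny documents become 3 bulk bodies instead of 2,500 separate items.
docs = [{"id": i} for i in range(2500)]
bodies = [to_bulk_body(chunk, "example-index") for chunk in batches(docs, 1000)]
print(len(bodies))
```

With fewer, larger units flowing through the graph, the per-FlowFile overhead in front of the Elasticsearch step drops accordingly.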




Re: Performance of DistributeLoad - Batch Size

2020-09-15 Thread Ryan Hendrickson
Hi Mark,
   I'm using Next Available, and the Destination Queues are set with Zero
(0) for Back Pressure and Size threshold, so the destinations should not
fill up.

   I did switch to using RoundRobin and set it to a yield of 0.  That got
me up to about 300,000 ff's / 5 minutes.  I was hoping for something around
1,000,000 ff / 5 minutes.

   The overall flow looks a bit like this: Large amount of flow files ->
DistributeLoad -> PutElasticsearchHttp.

Ryan



Re: Performance of DistributeLoad - Batch Size

2020-09-15 Thread Mark Payne
Ryan,

I presume you’re using the Round Robin strategy? Looks like that strategy will 
yield the processor if any destination is full. And it sounds like that will be 
very common in your case. Would recommend configuring the Processor and in the 
Settings tab, set the Yield Duration to “0 secs”. I suspect that will give you 
dramatically better performance.

Thanks
-Mark
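
The difference between the two strategies discussed here can be illustrated with a toy model. It follows the behaviour as described in the message above (Round Robin stops and yields if any destination is full, Next Available keeps using whatever destination has room). The function names, the list-based queues, and the capacity parameter are all assumptions for illustration and have not been checked against the NiFi source.

```python
def distribute_round_robin(flowfiles, queues, capacity):
    """Strict rotation; stop (yield) as soon as any destination is full.

    Returns the number of FlowFiles transferred before yielding.
    """
    transferred = 0
    for i, flowfile in enumerate(flowfiles):
        if any(len(q) >= capacity for q in queues):
            break  # one full destination stalls the whole rotation
        queues[i % len(queues)].append(flowfile)
        transferred += 1
    return transferred


def distribute_next_available(flowfiles, queues, capacity):
    """Send each FlowFile to the first destination that still has room."""
    transferred = 0
    for flowfile in flowfiles:
        target = next((q for q in queues if len(q) < capacity), None)
        if target is None:
            break  # every destination is full
        target.append(flowfile)
        transferred += 1
    return transferred


# One destination is nearly full (9 of 10), the other is empty.
queues_rr = [[0] * 9, []]
queues_na = [[0] * 9, []]
incoming = list(range(10))

rr = distribute_round_robin(incoming, queues_rr, capacity=10)
na = distribute_next_available(incoming, queues_na, capacity=10)
print(rr, na)
```

In this model, Round Robin stalls almost immediately once one queue fills, which is the situation where a long yield duration would hurt most.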


> On Sep 15, 2020, at 4:41 PM, Ryan Hendrickson 
>  wrote:
> 
> Hello,
> I've got 1 million plus FlowFiles (nothing I can do about the count) that 
> go to a DistributeLoad.  The DistributeLoad, with 2 threads and a run duration of 
> 1 sec, can only sustain ~200,000 FlowFiles / five minutes.
> 
>Is there a better design pattern or a processor that takes a Batch Size to 
> split a Relationship into two or more?
> 
> Thanks,
> Ryan