Thinking more on ways to conquer this problem, I believe I may attack it
from another perspective. Not a very refined one, somewhat Neanderthal, and
doesn't really get to the heart of the matter. But I think it just may work.

The behavior we consider slow is at the HandleHttpRequest processor. I have
no compelling reason why one single file from a known, related subset of
files must be sent per POST request. I'm going to try zipping up N files,
and making my first processor step following the HandleHttpRequest be an
unzip, as sketched below.
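
Something along these lines on the sending side (a rough sketch only; the
endpoint URL and batch size below are placeholders, not my real values):

import io
import os
import zipfile
import requests

NIFI_URL = "http://nifi-host:8011/contentListener"  # placeholder endpoint
BATCH_SIZE = 50                                      # placeholder value for N

def post_batch(paths):
    # Build a zip archive in memory from one related subset of files
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for p in paths:
            zf.write(p, arcname=os.path.basename(p))
    # One POST now carries the whole batch instead of one file per request
    resp = requests.post(NIFI_URL, data=buf.getvalue(),
                         headers={"Content-Type": "application/zip"})
    resp.raise_for_status()

On the NiFi side the unzip would presumably be an UnpackContent processor
configured for zip, so the rest of the flow still sees individual files.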

Jim

On Thu, Apr 6, 2017 at 6:00 AM, James McMahon <[email protected]> wrote:

> Intriguing. I'm one of those who have employed the "single flowfile"
> approach. I'm certainly willing to test out this refinement.
> So to press your point, this is more efficient than setting the
> processor's "Concurrent tasks" to 10 because it incurs the ExecuteScript
> initialization burden once, rather than relying on the processor
> configuration parameter (which presumably incurs that initialization
> burden ten times)?
>
> I currently set "Concurrent tasks" to 50. The logjam I am seeing is not
> in my ExecuteScript processor. My delay shows up as a non-steady,
> "non-fast" stream of data at my HandleHttpRequest processor, the first
> processor in my workflow. Why that is the case is a mystery we've yet to
> resolve.
>
> One thing I'd welcome is some idea of what a reasonable expectation is for
> requests handled by HandleHttpRequest in an hour. Maybe 1500 in an hour is
> low, maybe it is high, or perhaps it is entirely reasonable. We really have
> little insight. Any empirical data from users' practical experience would
> be most welcome.
>
> Also, I added a second HandleHttpRequest fielding requests on a second
> port. I did not see any improvement in throughput. Why might that be? My
> expectation was that with two doors open rather than one, I'd see a greater
> influx of data.
>
> Thank you.
> - Jim
>
> On Wed, Apr 5, 2017 at 4:26 PM, Scott Wagner <[email protected]>
> wrote:
>
>> One of my experiences with ExecuteScript and Python is that having the
>> script work on an individual FlowFile when you have multiple FlowFiles in
>> the input queue is very inefficient, even when you set the processor to a
>> timer of 0 sec.
>>
>> Instead, I have the following in all of my Python scripts:
>>
>> # Pull up to 10 FlowFiles from the input queue in one onTrigger call
>> flowFiles = session.get(10)
>> for flowFile in flowFiles:
>>     if flowFile is None:
>>         continue
>>     # Do stuff here, then transfer (or remove) each FlowFile,
>>     # e.g. session.transfer(flowFile, REL_SUCCESS)
>>
>> That seems to improve the throughput of the ExecuteScript processor
>> dramatically.
>>
>> YMMV
>>
>> - Scott
>>
>> James McMahon <[email protected]>
>> Wednesday, April 5, 2017 12:48 PM
>> I am receiving POSTs from a Pentaho process, delivering files to my NiFi
>> 0.7.x workflow's HandleHttpRequest processor. That processor hands the
>> flowfile off to an ExecuteScript processor that runs a Python script. This
>> script is very, very simple: it takes the incoming JSON object, loads it
>> into a Python dictionary, and verifies the presence of required fields
>> using simple has_key checks on the dictionary. There are only eight fields
>> in the incoming JSON object.
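>>
>> For what it's worth, the script is roughly of this shape (the field names
>> here are placeholders for the real eight):
>>
>> import json
>> from org.apache.commons.io import IOUtils
>> from java.nio.charset import StandardCharsets
>> from org.apache.nifi.processor.io import InputStreamCallback
>>
>> # Read the FlowFile content into a string
>> class ReadJson(InputStreamCallback):
>>     def __init__(self):
>>         self.text = None
>>     def process(self, inputStream):
>>         self.text = IOUtils.toString(inputStream, StandardCharsets.UTF_8)
>>
>> flowFile = session.get()
>> if flowFile is not None:
>>     reader = ReadJson()
>>     session.read(flowFile, reader)
>>     record = json.loads(reader.text)
>>     required = ["field1", "field2"]  # placeholder field names
>>     if all(record.has_key(k) for k in required):
>>         session.transfer(flowFile, REL_SUCCESS)
>>     else:
>>         session.transfer(flowFile, REL_FAILURE)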
>>
>> The throughput for these two processes is not exceeding 100-150 files in
>> five minutes. It seems very slow in light of the minimal processing going
>> on in these two steps.
>>
>> I notice that there are configuration options seemingly related to
>> optimizing performance. "Concurrent tasks", for example, is only set by
>> default to 1 for each processor.
>>
>> What performance optimizations at the processor level do users recommend?
>> Is it advisable to crank up the concurrent tasks for a processor, and is
>> there an optimal performance point beyond which you should not crank up
>> that value? Are there trade-offs?
>>
>> I am particularly interested in optimizations for HandleHttpRequest and
>> ExecuteScript processors.
>>
>> Thanks in advance for your thoughts.
>>
>> cheers,
>>
>> Jim
>>
>>
>>
>
