Intriguing. I'm one of those who have employed the "single flowfile"
approach, and I'm certainly willing to test out this refinement.
So, to press your point: this is more efficient than setting the processor's
"Concurrent tasks" to 10 because it incurs the initialization burden for
ExecuteScript once per batch, rather than relying on the processor
configuration parameter (which presumably incurs that initialization burden
ten times, once per task)?
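
For reference, my current "single flowfile" pattern is the usual
one-at-a-time pull, roughly this (a sketch only, with the validation
elided):

# One FlowFile per scheduled run of the script; "session" and REL_SUCCESS
# are names that ExecuteScript binds for the script
flowFile = session.get()
if flowFile is not None:
    # validation happens here
    session.transfer(flowFile, REL_SUCCESS)

So the refinement amounts to trading that single get() for your
session.get(10) loop below.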

I currently set "Concurrent tasks" to 50. The logjam I am seeing is not in
my ExecuteScript processor. My delay is definitely an unsteady, far-from-fast
stream of data at my HandleHttpRequest processor, the first processor in my
workflow. Why that is the case is a mystery we have yet to resolve.

One thing I'd welcome is some idea of a reasonable expectation for the
number of requests HandleHttpRequest can handle in an hour. Is 1500 in an
hour low, high, or perhaps entirely reasonable? (That works out to well
under one request per second.) We really have little insight here. Any
empirical data from users' practical experience would be most welcome.

Also, I added a second HandleHttpRequest processor fielding requests on a
second port, but I did not see any improvement in throughput. Why might
that be? My expectation was that with two doors open rather than one, I'd
see a greater influx of data.

Thank you.
- Jim

On Wed, Apr 5, 2017 at 4:26 PM, Scott Wagner <[email protected]>
wrote:

> One of my experiences when using ExecuteScript with Python is that a
> script that works on an individual FlowFile while multiple FlowFiles sit
> in the input queue is very inefficient, even when you set the processor
> to a timer-driven run schedule of 0 sec.
>
> Instead, I have the following in all of my Python scripts:
>
> # "session" is bound by ExecuteScript; get(10) returns a list of up to
> # 10 FlowFiles from the input queue (the list may be empty)
> flowFiles = session.get(10)
> for flowFile in flowFiles:
>     if flowFile is None:
>         continue
>     # Do stuff here, then transfer: every FlowFile pulled from the
>     # session must be transferred or removed before the script ends
>     session.transfer(flowFile, REL_SUCCESS)
>
> That seems to improve the throughput of the ExecuteScript processor
> dramatically.
>
> YMMV
>
> - Scott
>
> James McMahon <[email protected]>
> Wednesday, April 5, 2017 12:48 PM
> I am receiving POSTs from a Pentaho process, delivering files to the
> HandleHttpRequest processor in my NiFi 0.7.x workflow. That processor
> hands the flowfile off to an ExecuteScript processor that runs a Python
> script. This script is very, very simple: it loads the incoming JSON
> object into a Python dictionary and verifies the presence of required
> fields using simple has_key checks on the dictionary. There are only
> eight fields in the incoming JSON object.
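>
> For reference, the whole script is roughly the following (the eight
> field names are placeholders here, so treat this as a sketch rather
> than the exact script):
>
> import json
> from org.apache.commons.io import IOUtils
> from org.apache.nifi.processor.io import InputStreamCallback
>
> # Callback that reads the FlowFile content and parses it as JSON
> class ReadJson(InputStreamCallback):
>     def __init__(self):
>         self.data = None
>     def process(self, inputStream):
>         self.data = json.loads(IOUtils.toString(inputStream, 'UTF-8'))
>
> REQUIRED = ['field1', 'field2', 'field3', 'field4',
>             'field5', 'field6', 'field7', 'field8']  # placeholder names
>
> flowFile = session.get()
> if flowFile is not None:
>     reader = ReadJson()
>     session.read(flowFile, reader)
>     # has_key works because ExecuteScript runs Jython (Python 2)
>     ok = all(reader.data.has_key(f) for f in REQUIRED)
>     session.transfer(flowFile, REL_SUCCESS if ok else REL_FAILURE)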
>
> The combined throughput of these two processors does not exceed 100-150
> files in five minutes. That seems very slow in light of the minimal
> processing going on in these two steps.
>
> I notice that there are configuration options seemingly related to
> performance tuning. "Concurrent tasks", for example, is set by default
> to only 1 for each processor.
>
> What performance optimizations at the processor level do users recommend?
> Is it advisable to crank up the concurrent tasks for a processor, and is
> there an optimal performance point beyond which you should not crank up
> that value? Are there trade-offs?
>
> I am particularly interested in optimizations for HandleHttpRequest and
> ExecuteScript processors.
>
> Thanks in advance for your thoughts.
>
> cheers,
>
> Jim
>
>
>
