In my experience, an ExecuteScript that processes a single FlowFile
per invocation is very inefficient when there are multiple FlowFiles
in the input queue, even with the run schedule set to a 0 sec timer.
Instead, I have the following in all of my Python scripts:
flowFiles = session.get(10)
for flowFile in flowFiles:
    if flowFile is None:
        continue
    # Do stuff here
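(session.get(10) returns a list of up to ten FlowFiles, or an empty
list when the queue is empty, so the loop simply does nothing when
there is no work. Each FlowFile you pull still has to be routed
somewhere, e.g. session.transfer(flowFile, REL_SUCCESS), as part of
the "do stuff" section.)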
That seems to improve the throughput of the ExecuteScript processor
dramatically.
YMMV
- Scott
James McMahon <[email protected]>
Wednesday, April 5, 2017 12:48 PM
I am receiving POSTs from a Pentaho process, delivering files to the
HandleHttpRequest processor in my NiFi 0.7.x workflow. That processor
hands the flowfile off to an ExecuteScript processor that runs a
Python script. The script is very, very simple: it takes the incoming
JSON object, loads it into a Python dictionary, and verifies the
presence of required fields using simple has_key checks on the
dictionary. There are only eight fields in the incoming JSON object.
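In outline the script does nothing more than the following (the field
names below are stand-ins for the real eight):

import json
from org.apache.commons.io import IOUtils
from java.nio.charset import StandardCharsets
from org.apache.nifi.processor.io import InputStreamCallback

REQUIRED = ['fieldA', 'fieldB']  # stand-ins; the real list has eight names

# Callback that reads the FlowFile content into a string
class ReadContent(InputStreamCallback):
    def __init__(self):
        self.text = None
    def process(self, inputStream):
        self.text = IOUtils.toString(inputStream, StandardCharsets.UTF_8)

flowFile = session.get()
if flowFile is not None:
    reader = ReadContent()
    session.read(flowFile, reader)
    record = json.loads(reader.text)
    if all(record.has_key(f) for f in REQUIRED):
        session.transfer(flowFile, REL_SUCCESS)
    else:
        session.transfer(flowFile, REL_FAILURE)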
The throughput across these two processors does not exceed 100-150
files in five minutes. That seems very slow in light of the minimal
processing going on in these two steps.
I notice that there are configuration options that seem related to
optimizing performance. "Concurrent Tasks", for example, defaults to
only 1 for each processor.
What performance optimizations at the processor level do users
recommend? Is it advisable to crank up Concurrent Tasks for a
processor, and is there a point beyond which raising that value no
longer helps? Are there trade-offs?
I am particularly interested in optimizations for HandleHttpRequest
and ExecuteScript processors.
Thanks in advance for your thoughts.
cheers,
Jim