This is a good data point for reference. When you say 50 threads on the processor, Sebastian, do you mean that you set "Concurrent tasks" to 50 for the processor? Thank you. - Jim
On Fri, Apr 7, 2017 at 4:20 AM, Sebastian Lagemann <[email protected]> wrote:

> Jim,
>
> we experienced 2k flowfiles per second on HandleHTTPRequest with 50 threads on the processor without issues; the issue was later in processors down the flow and was primarily related to slow disk I/O.
>
> Best,
> Seb
>
> On 06.04.2017 at 12:00, James McMahon <[email protected]> wrote:
>
> Intriguing. I'm one of those who have employed the "single flowfile" approach. I'm certainly willing to test out this refinement.
>
> So to press your point: this is more efficient than setting the processor's "Concurrent tasks" to 10 because it assumes the initialization burden for ExecuteScript once, rather than using the processor configuration parameter (which presumably assumes that initialization burden ten times)?
>
> I currently set "Concurrent tasks" to 50. The logjam I am seeing is not in my ExecuteScript processor. My delay is definitely a non-steady, "non-fast" stream of data at my HandleHttpRequest processor, the first processor in my workflow. Why that is the case is a mystery we've yet to resolve.
>
> One thing I'd welcome is some idea of what a reasonable expectation is for requests handled by HandleHttpRequest in an hour. Maybe 1500 in an hour is low, high, or perhaps entirely reasonable. We really have little insight. Any empirical data from users' practical experience would be most welcome.
>
> Also, I added a second HandleHttpRequest fielding requests on a second port. I did not see any improvement in throughput. Why might that be? My expectation was that with two doors open rather than one, I'd see some more influx of data.
>
> Thank you.
> - Jim
>
> On Wed, Apr 5, 2017 at 4:26 PM, Scott Wagner <[email protected]> wrote:
>
>> One of my experiences when using ExecuteScript with Python is that having an ExecuteScript that works on an individual FlowFile when you have multiple in the input queue is very inefficient, even when you set it to a timer of 0 sec.
>>
>> Instead, I have the following in all of my Python scripts:
>>
>> flowFiles = session.get(10)
>> for flowFile in flowFiles:
>>     if flowFile is None:
>>         continue
>>     # Do stuff here
>>
>> That seems to improve the throughput of the ExecuteScript processor dramatically.
>>
>> YMMV
>>
>> - Scott
>>
>> James McMahon <[email protected]>
>> Wednesday, April 5, 2017 12:48 PM
>>
>> I am receiving POSTs from a Pentaho process, delivering files to the HandleHttpRequest processor in my NiFi 0.7.x workflow. That processor hands the flowfile off to an ExecuteScript processor that runs a Python script. This script is very, very simple: it takes an incoming JSON object, loads it into a Python dictionary, and verifies the presence of required fields using simple has_key checks on the dictionary. There are only eight fields in the incoming JSON object.
>>
>> The throughput for these two processors is not exceeding 100-150 files in five minutes. That seems very slow in light of the minimal processing going on in these two steps.
>>
>> I notice that there are configuration options seemingly related to optimizing performance. "Concurrent tasks", for example, is set by default to only 1 for each processor.
>>
>> What performance optimizations at the processor level do users recommend? Is it advisable to crank up the concurrent tasks for a processor, and is there an optimal performance point beyond which you should not raise that value? Are there trade-offs?
>>
>> I am particularly interested in optimizations for HandleHttpRequest and ExecuteScript processors.
>>
>> Thanks in advance for your thoughts.
>> Cheers,
>>
>> Jim
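[Editor's note] To make Scott's batch pattern concrete alongside Jim's required-field check, here is a minimal standalone sketch. In a real ExecuteScript processor the `session` object is supplied by NiFi (and flowfile content is read via a stream callback and transferred to a relationship); here the session is stubbed so the sketch runs outside NiFi. The field names are hypothetical placeholders, since the thread only says there are eight required fields without naming them.

```python
import json

# Hypothetical placeholders -- the thread never names the eight real fields.
REQUIRED_FIELDS = ["field_a", "field_b"]

class StubSession(object):
    """Minimal stand-in for the NiFi process session used by ExecuteScript."""
    def __init__(self, payloads):
        self._queue = list(payloads)

    def get(self, max_results):
        # Like NiFi's session.get(int): return up to max_results flowfiles.
        batch = self._queue[:max_results]
        self._queue = self._queue[max_results:]
        return batch

def validate(payload):
    """Return True if every required field is present in the JSON object."""
    record = json.loads(payload)
    # `key in dict` works in both Jython 2.x and Python 3;
    # dict.has_key() exists only in Python 2.
    return all(field in record for field in REQUIRED_FIELDS)

# Simulated input queue: one complete record, one missing a field.
session = StubSession([
    '{"field_a": 1, "field_b": 2}',
    '{"field_a": 1}',
])

results = []
flowFiles = session.get(10)   # pull up to 10 flowfiles per invocation
for flowFile in flowFiles:
    if flowFile is None:
        continue
    results.append(validate(flowFile))

print(results)  # -> [True, False]
```

The key point of Scott's pattern is the single `session.get(10)` call: one script invocation amortizes the Jython engine startup cost over the whole batch instead of paying it once per flowfile.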
