Jim,
Here's the full script, with the business-specific logic removed:
flowFiles = session.get(10)
for flowFile in flowFiles:
    if flowFile is None:
        continue
    s3_bucket = flowFile.getAttribute('job.s3_bucket')
    s3_path = flowFile.getAttribute('job.s3_path')
    # More stuff here...
    errors = []
    # More stuff here... (the removed business logic populates
    # errors, matches, and total_size)
    if len(errors) > 0:
        flowFile = session.putAttribute(flowFile, 'job.error',
            ';'.join(errors))
        session.transfer(flowFile, REL_FAILURE)
    else:
        flowFile = session.putAttribute(flowFile,
            'job.number_csv_files', str(len(matches)))
        flowFile = session.putAttribute(flowFile,
            'job.total_file_size', str(total_size))
        session.transfer(flowFile, REL_SUCCESS)
I'm not calling session.commit anywhere; ExecuteScript commits the session itself after the script finishes.
Here's another script (this one is the full file - no business
secrets in here!) that creates N flowfiles from an input flowfile,
based on attributes defining a numeric range:
import sys
import traceback
from java.nio.charset import StandardCharsets
from org.apache.commons.io import IOUtils
from org.apache.nifi.processor.io import StreamCallback
from org.python.core.util import StringUtil

flowFiles = session.get(10)
for flowFile in flowFiles:
    if flowFile is None:
        continue
    start = int(flowFile.getAttribute('range.start'))
    stop = int(flowFile.getAttribute('range.stop'))
    increment = int(flowFile.getAttribute('range.increment'))
    # Emit one clone per value in the inclusive range, tagged with
    # its value, then drop the original flowfile.
    for x in range(start, stop + 1, increment):
        newFlowFile = session.clone(flowFile)
        newFlowFile = session.putAttribute(newFlowFile, 'current', str(x))
        session.transfer(newFlowFile, REL_SUCCESS)
    session.remove(flowFile)
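For example, an input flowfile with range.start=1, range.stop=9, and
range.increment=2 yields five new flowfiles with current set to 1, 3,
5, 7, and 9, and the original is removed.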
I hope these examples are helpful.
- Scott
James McMahon
Friday, April 7, 2017 11:22 AM
Scott, how did you refine your session.transfer and session.commit
when you introduced the for loop?
I am getting a "transfer relationship not specified" error when I
move my transfer and my commit into the "for flowFile" loop. Can you
show the bottom closure of your # Do stuff here? Thank you, sir.
Jim
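(For reference, a minimal sketch of the loop closure in question,
assuming ExecuteScript commits the session on its own after the
script returns: every flowfile taken from the session must be
transferred or removed before the script ends.)

flowFiles = session.get(10)
for flowFile in flowFiles:
    if flowFile is None:
        continue
    # ... per-flowfile work goes here ...
    # Route each flowfile before the script ends; a flowfile left
    # unrouted raises "transfer relationship not specified".
    session.transfer(flowFile, REL_SUCCESS)
# No session.commit() here; ExecuteScript commits for us.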
Scott Wagner
Wednesday, April 5, 2017 3:26 PM
In my experience, an ExecuteScript processor that handles a single
FlowFile per invocation is very inefficient when there are multiple
FlowFiles in the input queue, even when you set it to a timer of 0 sec.
Instead, I have the following in all of my Python scripts:
flowFiles = session.get(10)
for flowFile in flowFiles:
    if flowFile is None:
        continue
    # Do stuff here
That seems to improve the throughput of the ExecuteScript processor
dramatically.
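(Most likely because each invocation of the script engine carries
fixed overhead, so handling up to ten flowfiles per invocation
amortizes that cost.)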
YMMV
- Scott
James McMahon
Wednesday, April 5, 2017 12:48 PM
I am receiving POSTs from a Pentaho process, delivering files to the
HandleHttpRequest processor in my NiFi 0.7.x workflow. That processor
hands the flowfile off to an ExecuteScript processor that runs a
Python script. The script is very, very simple: it loads an incoming
JSON object into a Python dictionary and verifies the presence of
required fields using simple has_key checks on the dictionary. There
are only eight fields in the incoming JSON object.
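(For illustration, a minimal sketch of that kind of validation script
for NiFi's Jython engine; the field names and the missing.fields
attribute are hypothetical.)

import json
from java.nio.charset import StandardCharsets
from org.apache.commons.io import IOUtils
from org.apache.nifi.processor.io import InputStreamCallback

REQUIRED_FIELDS = ['field1', 'field2']  # hypothetical names; eight in practice

class ContentReader(InputStreamCallback):
    def __init__(self):
        self.text = None
    def process(self, inputStream):
        self.text = IOUtils.toString(inputStream, StandardCharsets.UTF_8)

flowFile = session.get()
if flowFile is not None:
    reader = ContentReader()
    session.read(flowFile, reader)
    try:
        data = json.loads(reader.text)
    except ValueError:
        data = {}  # not valid JSON, so every field counts as missing
    # has_key is available under Jython 2.x
    missing = [f for f in REQUIRED_FIELDS if not data.has_key(f)]
    if missing:
        flowFile = session.putAttribute(flowFile, 'missing.fields',
            ','.join(missing))
        session.transfer(flowFile, REL_FAILURE)
    else:
        session.transfer(flowFile, REL_SUCCESS)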
The throughput across these two processors does not exceed 100-150
files in five minutes. That seems very slow given the minimal
processing going on in these two steps.
I notice that there are configuration options seemingly related to
optimizing performance. "Concurrent Tasks", for example, defaults to
just 1 for each processor.
What performance optimizations at the processor level do users
recommend? Is it advisable to crank up the concurrent tasks for a
processor, and is there a point beyond which raising that value no
longer helps? Are there trade-offs?
I am particularly interested in optimizations for HandleHttpRequest
and ExecuteScript processors.
Thanks in advance for your thoughts.
cheers,
Jim