Re: Writing back through a python stream callback when the flowfile content is a mix of character and binary

James McMahon Thu, 02 Feb 2017 17:27:12 -0800

Thank you very much, Matt. I would be most interested in any insights you
gain if you are able to recreate the problem.


If you have a moment, can you offer up a line of code showing how one might
wrap a call around the byte stream to treat the bytes as a string that can
be matched against using, for instance, a compiled re pattern? I will
definitely look more closely at the Oracle docs link you provided. An
example would help me when I tackle this.  -Jim
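
A minimal sketch of the kind of example requested above, not a tested NiFi script. It assumes the flowfile content is already in hand as a plain byte string; inside ExecuteScript that would presumably come from something like IOUtils.toByteArray(inputStream), which is not exercised here. The <title> tag and both helper names are hypothetical. The point it illustrates is that a compiled re pattern built from a bytes literal matches directly against raw bytes, so no charset decoding is needed and the binary portion passes through untouched.

```python
import re

# Hypothetical tag format, chosen only for illustration.
TAG_PATTERN = re.compile(br'<title>(.*?)</title>', re.DOTALL)


def extract_titles(raw):
    """Return every <title> payload found in raw, as byte strings."""
    return TAG_PATTERN.findall(raw)


def uppercase_titles(raw):
    """Upper-case each <title> payload; all other bytes are left as they were."""
    def repl(match):
        return b'<title>' + match.group(1).upper() + b'</title>'
    return TAG_PATTERN.sub(repl, raw)


# Tagged text followed by bytes that are not valid UTF-8.
sample = b'<title>report</title>\x00\xff\xfe\x80binary tail'
titles = extract_titles(sample)
rewritten = uppercase_titles(sample)
```

Because the pattern and the data are both bytes, the result of sub() is also bytes and can be handed back to the output stream without an encode step.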

On Thu, Feb 2, 2017 at 6:56 PM, Matt Burgess <[email protected]> wrote:

> James,
>
> If you'd rather work with the inputStream as bytes, you don't need the
> IOUtils.toString() call, and I'm not sure what a UTF-8 charset would
> do to your mixed data.  You can wrap any of the *InputStream
> decorators around the inputStream object, such as DataInputStream [1]
> to read various data types from the underlying bytes in the stream.
> Alternatively you may want to read all the bytes into an array you can
> work with directly via Jython methods instead of using Java I/O.
>
> What's weird about the TypeError is that it looks like it is calling a
> different write() method than I would've expected. I wonder if the
> translation of Jython to Java objects is somehow making the processor
> unable to match up a method signature.  If the error is not
> occurring in the redacted code block above, I will give this script a
> try, to see if I can reproduce and/or fix the error.
>
> Regards,
> Matt
>
> [1] https://docs.oracle.com/javase/8/docs/api/java/io/DataInputStream.html
>
>
> On Thu, Feb 2, 2017 at 6:19 PM, James McMahon <[email protected]>
> wrote:
> > This is very helpful Russell, but in my case each file is a mix of data
> > types. So even if I determine that the flowfile is a mix, I'd still have
> > to be poised to tackle it in my ExecuteScript script. Good suggestion,
> > though, and one I can use in other ways in my workflows.
> >
> > I do hope someone can tell me what I can do in my callback's write-back
> > to handle all of these cases. I'd like to better understand this error
> > I'm getting, too. -Jim
> >
> > On Thu, Feb 2, 2017 at 6:02 PM, Russell Bateman <[email protected]>
> > wrote:
> >>
> >> Could you use RouteOnContent to determine what sort of content you're
> >> dealing with, then branch to different ExecuteScript processors rigged
> >> to different Python scripts?
> >>
> >> Hope this comment is helpful.
> >>
> >>
> >> On 02/02/2017 03:38 PM, James McMahon wrote:
> >>
> >> I have a flowfile that has tagged character information I need to get at
> >> throughout the first few sections of the file. I need to use regex in
> >> Python to select some of those values and to transform others. I am using
> >> an ExecuteScript processor to execute my Python code. Here is my approach:
> >>
> >>
> >>
> >> = = = = =
> >>
> >> class PyStreamCallback(StreamCallback):
> >>
> >>    def __init__(self):
> >>       pass
> >>
> >>    def process(self, inputStream, outputStream):
> >>       # What happens to my binary and extreme chars when they get
> >>       # passed through this step?
> >>       stuff = IOUtils.toString(inputStream, StandardCharsets.UTF_8)
> >>
> >>       .
> >>       . (transform and pick out select content)
> >>       .
> >>
> >>       # Am I using the wrong functions to put my text chars and my
> >>       # binary and my extreme chars back on the stream as a byte
> >>       # stream? What should I be doing to handle the variety of data?
> >>       outputStream.write(bytearray(stuff.encode('utf-8')))
> >>
> >>
> >>
> >> flowFile = session.get()
> >>
> >> if flowFile is not None:
> >>    incoming = flowFile.getAttribute('filename')
> >>    logging.info('about to process file: %s', incoming)
> >>    flowFile = session.write(flowFile, PyStreamCallback())   # line 155 in my code
> >>    session.transfer(flowFile, REL_SUCCESS)
> >>    session.commit()
> >>
> >>
> >>
> >> = = = = =
> >>
> >>
> >>
> >> When my incoming flowfile is all character content, such as tagged XML,
> >> my code works fine. Flowfiles that also contain some binary data and/or
> >> characters at the extremes, such as foreign language characters, don't
> >> work. They error out. I suspect it has to do with the way I am writing
> >> back to the flowfile stream.
> >>
> >>
> >>
> >> Here is the error I am getting:
> >>
> >> org.apache.nifi.processor.exception.ProcessException:
> >> javax.script.ScriptException: TypeError: write(): 1st arg can't be
> >> coerced to int, byte[] in <script> at line number 155
> >>
> >>
> >>
> >> How should I handle the write back to the flowfile in cases where I
> >> have a mix of character and binary?
> >>
> >>
> >>
> >> Note: I must do this programmatically. I tried using a combination of
> >> SplitContent and MergeContent, but I have no consistent, reliable
> >> demarcation between the regular text characters and the other more
> >> challenging characters that I can split on.
> >>
> >> All the examples I've found handle more pure circumstances than mine
> >> seems to be. For example, all text. Or all JSON. I've not yet been able
> >> to find an example that shows me how to write back to the stream for
> >> mixed data situations. Can you help?
> >>
> >>
> >
>
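
On the question raised in the original message about what a UTF-8 decode does to binary content, here is a small plain-Python illustration. It is a sketch only and runs outside NiFi; the sample bytes are made up. It shows that decoding arbitrary bytes as UTF-8 with replacement is lossy, while ISO-8859-1 maps every byte value to exactly one character and so survives a decode/encode round trip. Java's UTF-8 decoding is expected to substitute malformed input in a similar way, but that is an assumption here and is not tested against IOUtils.toString().

```python
# Bytes that mix ASCII text with values that are not valid UTF-8.
original = b'<tag>value</tag>\x80\xff\x00\x01'

# Decoding as UTF-8 with replacement swaps each invalid byte for U+FFFD,
# so re-encoding cannot reproduce the input.
via_utf8 = original.decode('utf-8', 'replace').encode('utf-8')

# ISO-8859-1 assigns a character to every byte value 0-255, so the
# decode/encode pair returns exactly the bytes that went in.
text = original.decode('iso-8859-1')
via_latin1 = text.encode('iso-8859-1')

utf8_is_lossless = (via_utf8 == original)
latin1_is_lossless = (via_latin1 == original)
```

If a text view of the content is needed for matching, a single-byte charset used consistently on both the read and the write side is one way to keep the binary portions intact; working on the raw bytes directly avoids the question altogether.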
