Re: Writing back through a python stream callback when the flowfile content is a mix of character and binary

Matt Burgess Thu, 02 Feb 2017 20:21:45 -0800

Can you share with us a little more information about the
schema/format of your incoming data?  Is there always a tag before a
data item, for example?
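
On the request quoted below for a line of code that treats the bytes as a
string for a compiled re pattern, here is one possible approach, offered as
a minimal sketch rather than a tested NiFi script. It assumes a single-byte
charset such as ISO-8859-1 (latin-1) is acceptable for the decode step:
latin-1 maps every byte value 0-255 to exactly one code point, so decoding
cannot fail and encoding with the same charset restores the original bytes.
The <title> tag, the TITLE pattern and the function names are hypothetical,
chosen only for illustration.

```python
import re

# Hypothetical tag name, assumed only for illustration.
TITLE = re.compile(u"<title>(.*?)</title>")


def extract_titles(raw_bytes):
    """Return the text found inside each <title> tag."""
    # latin-1 maps each byte 0-255 to one code point, so this never fails.
    text = raw_bytes.decode("latin-1")
    return TITLE.findall(text)


def retag(raw_bytes):
    """Rename <title> tags to <heading>, leaving all other bytes alone."""
    text = raw_bytes.decode("latin-1")
    text = TITLE.sub(u"<heading>\\1</heading>", text)
    # Encoding with the same charset restores every untouched byte exactly.
    return text.encode("latin-1")


# Mixed content: a tagged value followed by arbitrary binary bytes.
sample = b"<title>report</title>\x00\xff\xfe\x80 payload \xc3\xa9"
titles = extract_titles(sample)
rewritten = retag(sample)
```

Inside process(), the equivalent read would presumably be
IOUtils.toString(inputStream, StandardCharsets.ISO_8859_1), with
text.encode('latin-1') on the way out; I have not verified that inside
ExecuteScript. Any UTF-8 multi-byte characters will show up as two or more
latin-1 characters in the decoded text, so a pattern should match on the
ASCII tags around them rather than on the characters themselves.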


Thanks,
Matt

On Thu, Feb 2, 2017 at 8:26 PM, James McMahon <[email protected]> wrote:
> Thank you very much Matt. I would be most interested in any insights you
> gain if you are able to recreate the problem.
>
> If you have a moment, can you offer up a line of code showing how one might
> wrap a call around the byte stream to treat the bytes as a string that can
> be matched against using, for instance, a compiled re pattern? I will
> definitely look more closely at the Oracle docs link you provided. An
> example would help me when I tackle this.  -Jim
>
> On Thu, Feb 2, 2017 at 6:56 PM, Matt Burgess <[email protected]> wrote:
>>
>> James,
>>
>> If you'd rather work with the inputStream as bytes, you don't need the
>> IOUtils.toString() call, and I'm not sure what a UTF-8 charset would
>> do to your mixed data.  You can wrap any of the *InputStream
>> decorators around the inputStream object, such as DataInputStream [1]
>> to read various data types from the underlying bytes in the stream.
>> Alternatively you may want to read all the bytes into an array you can
>> work with directly via Jython methods instead of using Java I/O.
>>
>> What's weird about the TypeError is that it looks like it is calling a
>> different write() method than I would've expected. I wonder if the
>> translation of Jython to Java objects is somehow making the processor
>> unable to match up a method signature.  If the error is not
>> occurring in the redacted code block above, I will give this script a
>> try, to see if I can reproduce and/or fix the error.
>>
>> Regards,
>> Matt
>>
>> [1] https://docs.oracle.com/javase/8/docs/api/java/io/DataInputStream.html
>>
>>
>> On Thu, Feb 2, 2017 at 6:19 PM, James McMahon <[email protected]>
>> wrote:
>> > This is very helpful Russell, but in my case each file is a mix of data
>> > types. So even if I determine that the flowfile is a mix, I'd still have
>> > to be poised to tackle it in my ExecuteScript script. Good suggestion,
>> > though, and one I can use in other ways in my workflows.
>> >
>> > I do hope someone can tell me what I can do in my callback's write-back
>> > to handle all of it. I'd like to better understand this error I'm
>> > getting, too.
>> > -Jim
>> >
>> > On Thu, Feb 2, 2017 at 6:02 PM, Russell Bateman <[email protected]>
>> > wrote:
>> >>
>> >> Could you use RouteOnContent to determine what sort of content you're
>> >> dealing with, then branch to different ExecuteScript processors rigged
>> >> to different Python scripts?
>> >>
>> >> Hope this comment is helpful.
>> >>
>> >>
>> >> On 02/02/2017 03:38 PM, James McMahon wrote:
>> >>
>> >> I have a flowfile that has tagged character information I need to get
>> >> at throughout the first few sections of the file. I need to use regex
>> >> in Python to select some of those values and to transform others. I am
>> >> using an ExecuteScript processor to execute my Python code. Here is my
>> >> approach:
>> >>
>> >>
>> >>
>> >> = = = = =
>> >>
>> >> import logging
>> >> from org.apache.commons.io import IOUtils
>> >> from java.nio.charset import StandardCharsets
>> >> from org.apache.nifi.processor.io import StreamCallback
>> >>
>> >> class PyStreamCallback(StreamCallback):
>> >>
>> >>    def __init__(self):
>> >>       pass
>> >>
>> >>    def process(self, inputStream, outputStream):
>> >>       # what happens to my binary and extreme chars when they get
>> >>       # passed through this step?
>> >>       stuff = IOUtils.toString(inputStream, StandardCharsets.UTF_8)
>> >>
>> >>       # .
>> >>       # . (transform and pick out select content)
>> >>       # .
>> >>
>> >>       # am I using the wrong functions to put my text chars and my
>> >>       # binary and my extreme chars back on the stream as a byte
>> >>       # stream? What should I be doing to handle the variety of data?
>> >>       outputStream.write(bytearray(stuff.encode('utf-8')))
>> >>
>> >> flowFile = session.get()
>> >> if flowFile is not None:
>> >>    incoming = flowFile.getAttribute('filename')
>> >>    logging.info('about to process file: %s', incoming)
>> >>    flowFile = session.write(flowFile, PyStreamCallback())   # line 155 in my code
>> >>    session.transfer(flowFile, REL_SUCCESS)
>> >>    session.commit()
>> >>
>> >>
>> >>
>> >> = = = = =
>> >>
>> >>
>> >>
>> >> When my incoming flowfile is all character content - such as tagged XML
>> >> - my code works fine. All the flowfiles that also contain some binary
>> >> data and/or characters at the extremes such as foreign language
>> >> characters don’t work. They error out. I suspect it has to do with the
>> >> way I am writing back to the flowfile stream.
>> >>
>> >>
>> >>
>> >> Here is the error I am getting:
>> >>
>> >> org.apache.nifi.processor.exception.ProcessException:
>> >> javax.script.ScriptException: TypeError: write(): 1st arg can't be
>> >> coerced to int, byte[] in <script> at line number 155
>> >>
>> >>
>> >>
>> >> How should I handle the write back to the flowfile in cases where I
>> >> have a mix of character and binary?
>> >>
>> >>
>> >>
>> >> Note: I must do this programmatically. I tried using a combination of
>> >> SplitContent and MergeContent, but I have no consistent, reliable
>> >> demarcation between the regular text characters and the other more
>> >> challenging characters that I can split on.
>> >>
>> >> All the examples I've found handle more pure circumstances than mine
>> >> seems to be. For example, all text. Or all JSON. I've not yet been able
>> >> to find an example that shows me how to write back to the stream for
>> >> mixed data situations. Can you help?
>> >>
>> >>
>> >
>
>
