Re: queued files

Joe Percivall Tue, 24 Nov 2015 11:11:59 -0800

Not a problem, I'd be interested in any follow-up details.

I agree that it should be a separate processor since this is an almost atomic 
unit of work that can be used in many different work-flows. I created a jira 
for this new processor: https://issues.apache.org/jira/browse/NIFI-1217



Joe
- - - - - - 
Joseph Percivall
linkedin.com/in/Percivall
e: [email protected]




On Tuesday, November 24, 2015 1:49 PM, Charlie Frasure 
<[email protected]> wrote:



Interesting.  Thanks for the update and the template.  I use osx as a 
playground, but this will have to be implemented on RHEL.  I'll see about 
downloading or building this and testing.  Performance will be critical due to 
the volume of data; I've run into some python-based detection libraries that 
slowed the process way down.


A related project, jchardet[1] looks interesting as a possible start for a 
custom processor.

[1] http://jchardet.sourceforge.net/




On Tue, Nov 24, 2015 at 11:29 AM, Joe Percivall <[email protected]> wrote:

Hello Charlie,
>
>I was looking back through and saw this wasn't totally resolved yet.
>
>
>Couple questions. First, what system are you using? There are a couple of 
>options for the stream command depending on what you're using. Also are you 
>able to get new commands (using yum or brew)?
>
>The key thing I want to solve is to find the encoding of a file just based on 
>its contents and not relying on having access to the original file. 
>ExecuteStreamCommand should enable this. This is because you can just pass any 
>FlowFile into ExecuteStreamCommand then it can route the FlowFile contents to 
>STDIN for the command to execute on.
>
>Mac's (what I am using) default command for finding file encodings is "file 
>-bi filename.txt" but it doesn't allow you to pass in a file via STDIN. I 
>found a command called "uchardet"[1] which finds file encodings and allows you 
>to pass the file in via STDIN.
>
>I attached a template that takes in a file using GetFile (deletes the 
>original) and routes that FlowFile to ExecuteStreamCommand. 
>ExecuteStreamCommand then runs "uchardet" on the contents of the FlowFile and 
>outputs the encoding to the "encoding" attribute of the original FlowFile.
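
As a rough illustration of the STDIN approach described above (not the attached template itself), the following Python sketch pipes a file's bytes to a detection command and reads the guessed encoding from its output. It assumes "uchardet" is installed and on the PATH; the function name detect_encoding and its command parameter are made up for this example.

```python
import subprocess


def detect_encoding(content, command=("uchardet",)):
    """Pipe raw bytes to a detector command on STDIN and return its guess.

    Assumption: the command reads the data from STDIN and prints the
    encoding name on STDOUT, which is how uchardet is used in this thread.
    """
    result = subprocess.run(
        list(command),
        input=content,           # bytes delivered to the command's STDIN
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=True,              # raise if the command exits non-zero
    )
    return result.stdout.decode("ascii", errors="replace").strip()
```

Calling detect_encoding(open("somefile.txt", "rb").read()) would then return whatever name the detector prints; the original file on disk is never needed once its bytes are in hand.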
>
>[1] https://github.com/BYVoid/uchardet
>
>If this doesn't satisfy your needs just let me know!
>Joe
>
>- - - - - -
>Joseph Percivall
>linkedin.com/in/Percivall
>e: [email protected]
>
>
>
>
>
>On Friday, November 20, 2015 9:53 AM, Charlie Frasure 
><[email protected]> wrote:
>
>
>
>I'm definitely game for that.  Let me know what I can do to help.
>
>
>
>On Fri, Nov 20, 2015 at 9:35 AM, Joe Witt <[email protected]> wrote:
>
>Charlie
>>
>>Got ya.  I missed the 'encoding vs content type' thing.  I agree let's
>>find a way to avoid the extra copy.  We don't expose the storage
>>location of the underlying bytes.  So on the ListFile thing.  What I
>>was thinking was this (and honestly I've not tested this so maybe I'm
>>skipping something important)
>>
>>ListFile to get a listing of names/etc.. of interest
>>
>>Execute the 'file --mime-encoding ${filename}' to get more attributes
>>available to work with
>>
>>RouteOnAttribute to decide what to do with the file next.  You can
>>Fetch/delete what you don't want; you can Fetch/pass on what you do.
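
The three steps above could be approximated outside NiFi as follows. This is only a sketch of the idea, not how the ListFile, ExecuteStreamCommand, or RouteOnAttribute processors are implemented. The function names and the set of accepted encodings are assumptions for illustration, and it presumes a "file" command that supports --mime-encoding.

```python
import subprocess
from pathlib import Path

# Assumption: encodings this hypothetical flow wants to keep.
ACCEPTED = frozenset({"utf-8", "us-ascii"})


def mime_encoding(path):
    """Ask the `file` utility for its best guess at the character set."""
    result = subprocess.run(
        ["file", "-b", "--mime-encoding", str(path)],
        stdout=subprocess.PIPE,
        check=True,
    )
    return result.stdout.decode("ascii", errors="replace").strip()


def route(encoding, accepted=ACCEPTED):
    """Decide what to do next, much as a routing rule on an attribute would."""
    return "fetch" if encoding.lower() in accepted else "discard"


def plan(directory, detect=mime_encoding):
    """List files, then attach a guessed encoding and a routing decision."""
    decisions = {}
    for path in sorted(Path(directory).iterdir()):
        if path.is_file():
            encoding = detect(path)
            decisions[path.name] = (encoding, route(encoding))
    return decisions
```

The detect parameter exists so the listing and routing logic can be exercised without the external command.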
>>
>>I was looking for a way to check the mime-encoding while passing the
>>data to detect into an input stream, because that is actually how
>>ExecuteStreamCommand wants to work.
>>
>>This is a use case that should be pretty easy so if you're willing to
>>chat through it with us we'll figure out a path to make it work well.
>>
>>Thanks
>>Joe
>>
>>On Fri, Nov 20, 2015 at 9:17 AM, Charlie Frasure
>>
>><[email protected]> wrote:
>>> Thanks Joe,
>>>
>>> The use case is that I'm receiving data without knowing what character set
>>> it is coming in.  --mime-encoding is giving its best guess on character set
>>> rather than the content type.
>>>
>>> The ListFile sounds interesting, but I wonder if I really even need that.  I
>>> don't want to leave the files in place, I just want to run an external
>>> command on them as part of the data flow.  Is there a way I can run an
>>> external command against the physical file such as
>>> /opt/nifi/somedir/12345.uuid?  Would that info be in an attribute somewhere?
>>> It just seems wasteful to make an extra copy of the file, in order to run a
>>> read-only command on it, then delete it.  If ListFiles is still the right
>>> way to go, please let me know.
>>>
>>>
>>> On Fri, Nov 20, 2015 at 6:45 AM, Joe Witt <[email protected]> wrote:
>>>>
>>>> For identifying the mime type you may have sufficient results with the
>>>> existing processor 'IdentifyMimeType' which you can put into the flow.
>>>>
>>>> For better logic around identifying files to pull but first calling an
>>>> external command to learn more about them the upcoming
>>>> ListFile/FetchFile combo that comes from this JIRA [1] might give you
>>>> better flexibility.
>>>>
>>>> [1] https://issues.apache.org/jira/browse/NIFI-631
>>>>
>>>> Thanks
>>>> Joe
>>>>
>>>> On Fri, Nov 20, 2015 at 12:08 AM, Charlie Frasure
>>>> <[email protected]> wrote:
>>>> > Thanks everyone for the help.  The trouble started a few processors
>>>> > earlier
>>>> > in an ExecuteStreamCommand on ${filename} with the result of "file not
>>>> > found".  I had originally set my GetFile processor to not remove files,
>>>> > but
>>>> > recently changed that.  Now it seems that my ExecuteStreamCommand may
>>>> > not be
>>>> > the best way to accomplish this.
>>>> >
>>>> > The command that gets executed is: file -b --mime-encoding ${filename}
>>>> > in the working directory: ${absolute.path}
>>>> >
>>>> > Now that the file is no longer in the source directory when the
>>>> > processor
>>>> > fires, the command is broken.  I could PutFile somewhere temporarily; is
>>>> > there a better way?
>>>> >
>>>> > On Thu, Nov 19, 2015 at 10:33 PM, Joe Witt <[email protected]> wrote:
>>>> >>
>>>> >> Charlie,
>>>> >>
>>>> >> The fact that this is confusing is something we agree should be more
>>>> >> clear and we will improve.  We're tackling it based on what is
>>>> >> mentioned here [1].
>>>> >>
>>>> >> [1]
>>>> >>
>>>> >> https://cwiki.apache.org/confluence/display/NIFI/Interactive+Queue+Management
>>>> >>
>>>> >> Thanks
>>>> >> Joe
>>>> >>
>>>> >> On Thu, Nov 19, 2015 at 10:30 PM, Corey Flowers
>>>> >> <[email protected]>
>>>> >> wrote:
>>>> >> > These guys are right. The file to look in for the uuid is the
>>>> >> > nifi-app.log.
>>>> >> > Also if you wanted to see what the processor itself was doing, you
>>>> >> > could
>>>> >> > right click on the processor, get its uuid and while it is running,
>>>> >> > run
>>>> >> > (assuming it is on Linux):
>>>> >> >
>>>> >> > tail -F nifi-app.log | grep uuid
>>>> >> >
>>>> >> > This will just scroll the logs for that specific processor and will
>>>> >> > show
>>>> >> > you
>>>> >> > what it is doing. It should also tell you specific file names and
>>>> >> > uuids
>>>> >> > of
>>>> >> > the failing files.
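
The same filtering can be sketched in Python when tail and grep are not convenient. This is a minimal illustration, assuming the log is plain text with one event per line; the helper names are invented for the example, and unlike "tail -F" it reads the file once rather than following it.

```python
def lines_mentioning(lines, uuid):
    """Yield only the log lines that contain the given identifier."""
    for line in lines:
        if uuid in line:
            yield line.rstrip("\n")


def scan_log(path, uuid):
    """Read an existing log file once and return the matching lines.

    Unlike `tail -F`, this does not follow the file as it grows.
    """
    with open(path, encoding="utf-8", errors="replace") as handle:
        return list(lines_mentioning(handle, uuid))
```

For example, scan_log("nifi-app.log", processor_uuid) would return every line that mentions that processor's uuid, where processor_uuid is the value copied from the processor in the UI.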
>>>> >> >
>>>> >> > Hope that helps! Have a great night and good luck!
>>>> >> >
>>>> >> > Sent from my iPhone
>>>> >> >
>>>> >> > On Nov 19, 2015, at 9:27 PM, Juan Sequeiros <[email protected]>
>>>> >> > wrote:
>>>> >> >
>>>> >> > You can also check the NiFi logs for a searchable id or for what the
>>>> >> > previous processor ID produced to help search provenance.
>>>> >> >
>>>> >> > On Nov 19, 2015 21:22, "Bryan Bende" <[email protected]> wrote:
>>>> >> >>
>>>> >> >> Charlie,
>>>> >> >>
>>>> >> >> The behavior you described usually means that the processor
>>>> >> >> encountered
>>>> >> >> an
>>>> >> >> unexpected error which was thrown back to the framework which rolls
>>>> >> >> back the
>>>> >> >> processing of that flow file and leaves it in the queue, as opposed
>>>> >> >> to
>>>> >> >> an
>>>> >> >> error it expected where it would usually route to a failure
>>>> >> >> relationship.
>>>> >> >>
>>>> >> >> Is the id that you see in the bulletin a uuid?
>>>> >> >>
>>>> >> >> There should still be some provenance events for this FlowFile from
>>>> >> >> the
>>>> >> >> previous points in the flow. If it looks like the uuid of the
>>>> >> >> FlowFile,
>>>> >> >> that
>>>> >> >> should be searchable from provenance using the search button on the
>>>> >> >> right.
>>>> >> >> Let us know if we can help more.
>>>> >> >>
>>>> >> >> -Bryan
>>>> >> >>
>>>> >> >> On Thu, Nov 19, 2015 at 9:10 PM, Charlie Frasure
>>>> >> >> <[email protected]> wrote:
>>>> >> >>>
>>>> >> >>> I have a question on troubleshooting a flow.  I've built a flow
>>>> >> >>> with
>>>> >> >>> no
>>>> >> >>> exception routing, just trying to process the expected values
>>>> >> >>> first.
>>>> >> >>> When a
>>>> >> >>> file exposes a problem with the logic in my flow, it queues up
>>>> >> >>> prior
>>>> >> >>> to the
>>>> >> >>> flow that is raising the bulletin.
>>>> >> >>>
>>>> >> >>> In the bulletin, I can see an id, but can't tell which file it is.
>>>> >> >>> Data
>>>> >> >>> provenance doesn't seem to help as it passed the flow on the last
>>>> >> >>> processor,
>>>> >> >>> but hasn't been logged (to my knowledge) on the next one.
>>>> >> >>>
>>>> >> >>> Is there a way to match the bulletin back to a file without
>>>> >> >>> creating a
>>>> >> >>> route for failed files?
>>>> >> >>
>>>> >> >>
>>>> >> >
>>>> >
>>>> >
>>>
>>>
>>
