Interesting.  Thanks for the update and the template.  I use OS X as a
playground, but this will have to be implemented on RHEL.  I'll see about
downloading or building uchardet and testing it.  Performance will be
critical due to the volume of data; I've run into some Python-based
detection libraries that slowed the process way down.
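
For a rough check outside NiFi, something like the sketch below could pipe
file bytes to a detector's STDIN (the approach described for
ExecuteStreamCommand further down the thread) and time a batch. It is only a
sketch under assumptions: it assumes "uchardet" is on the PATH and prints the
guessed charset on STDOUT when given data on STDIN, and the function names
detect_encoding and time_detection are made up for illustration.

```python
import subprocess
import time
from typing import List, Sequence, Tuple


def detect_encoding(content: bytes,
                    command: Sequence[str] = ("uchardet",),
                    timeout: float = 10.0) -> str:
    """Pipe raw bytes to a detector's STDIN and return what it prints.

    The default command is an assumption: uchardet installed and on PATH.
    """
    result = subprocess.run(
        list(command),
        input=content,            # sent to the child's STDIN
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        timeout=timeout,
        check=True,               # raise if the detector exits non-zero
    )
    return result.stdout.decode("ascii", errors="replace").strip()


def time_detection(samples: Sequence[bytes],
                   command: Sequence[str] = ("uchardet",)
                   ) -> Tuple[List[str], float]:
    """Run the detector once per sample; return the guesses and total seconds."""
    start = time.perf_counter()
    encodings = [detect_encoding(sample, command) for sample in samples]
    return encodings, time.perf_counter() - start
```

Starting one process per file has its own cost, so the timing here would
include process startup as well as the detection itself.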

A related project, jchardet[1], looks interesting as a possible start for a
custom processor.

[1] http://jchardet.sourceforge.net/



On Tue, Nov 24, 2015 at 11:29 AM, Joe Percivall <joeperciv...@yahoo.com>
wrote:

> Hello Charlie,
>
> I was looking back through and saw this wasn't totally resolved yet.
>
>
> Couple questions. First, what system are you using? There are a couple of
> options for the stream command depending on what you're using. Also are you
> able to get new commands (using yum or brew)?
>
> The key thing I want to solve is to find the encoding of a file just based
> on its contents and not relying on having access to the original file.
> ExecuteStreamCommand should enable this, because you can pass any FlowFile
> into ExecuteStreamCommand and it will route the FlowFile contents to STDIN
> for the command to execute on.
>
> Mac's (what I am using) default command for finding file encodings is
> "file -bi filename.txt" but it doesn't allow you to pass in a file via
> STDIN. I found a command called "uchardet"[1] which finds file encodings
> and allows you to pass the file in via STDIN.
>
> I attached a template that takes in a file using GetFile (deletes the
> original) and routes that FlowFile to ExecuteStreamCommand.
> ExecuteStreamCommand then runs "uchardet" on the contents of the FlowFile
> and outputs the encoding to the "encoding" attribute of the original
> FlowFile.
>
> [1] https://github.com/BYVoid/uchardet
>
> If this doesn't satisfy your needs just let me know!
> Joe
>
> - - - - - -
> Joseph Percivall
> linkedin.com/in/Percivall
> e: joeperciv...@yahoo.com
>
>
>
>
> On Friday, November 20, 2015 9:53 AM, Charlie Frasure <
> charliefras...@gmail.com> wrote:
>
>
>
> I'm definitely game for that.  Let me know what I can do to help.
>
>
>
> On Fri, Nov 20, 2015 at 9:35 AM, Joe Witt <joe.w...@gmail.com> wrote:
>
> Charlie
> >
> >Got ya.  I missed the 'encoding vs content type' thing.  I agree, let's
> >find a way to avoid the extra copy.  We don't expose the storage
> >location of the underlying bytes.  So on the ListFile thing, what I
> >was thinking was this (and honestly I've not tested this, so maybe I'm
> >skipping something important):
> >
> >ListFile to get a listing of names/etc.. of interest
> >
> >Execute the 'file --mime-encoding ${filename}' to get more attributes
> >available to work with
> >
> >RouteOnAttribute to decide what to do with the file next.  You can
> >Fetch/delete what you don't want, and you can Fetch/pass on what you do.
> >
> >I was looking for a way to check the mime-encoding while passing the
> >data to detect into an input stream, because that is actually how
> >ExecuteStreamCommand wants to work.
> >
> >This is a use case that should be pretty easy so if you're willing to
> >chat through it with us we'll figure out a path to make it work well.
> >
> >Thanks
> >Joe
> >
> >On Fri, Nov 20, 2015 at 9:17 AM, Charlie Frasure
> >
> ><charliefras...@gmail.com> wrote:
> >> Thanks Joe,
> >>
> >> The use case is that I'm receiving data without knowing what character
> >> set it is coming in.  --mime-encoding is giving its best guess on
> >> character set rather than the content type.
> >>
> >> The ListFile sounds interesting, but I wonder if I really even need
> >> that.  I don't want to leave the files in place, I just want to run an
> >> external command on them as part of the data flow.  Is there a way I can
> >> run an external command against the physical file such as
> >> /opt/nifi/somedir/12345.uuid?  Would that info be in an attribute
> >> somewhere?  It just seems wasteful to make an extra copy of the file in
> >> order to run a read-only command on it, then delete it.  If ListFile is
> >> still the right way to go, please let me know.
> >>
> >>
> >> On Fri, Nov 20, 2015 at 6:45 AM, Joe Witt <joe.w...@gmail.com> wrote:
> >>>
> >>> For identifying the mime type you may have sufficient results with the
> >>> existing processor 'IdentifyMimeType' which you can put into the flow.
> >>>
> >>> For better logic around identifying files to pull but first calling an
> >>> external command to learn more about them the upcoming
> >>> ListFile/FetchFile combo that comes from this JIRA [1] might give you
> >>> better flexibility.
> >>>
> >>> [1] https://issues.apache.org/jira/browse/NIFI-631
> >>>
> >>> Thanks
> >>> Joe
> >>>
> >>> On Fri, Nov 20, 2015 at 12:08 AM, Charlie Frasure
> >>> <charliefras...@gmail.com> wrote:
> >>> > Thanks everyone for the help.  The trouble started a few processors
> >>> > earlier in an ExecuteStreamCommand on ${filename} with the result of
> >>> > "file not found".  I had originally set my GetFile processor to not
> >>> > remove files, but recently changed that.  Now it seems that my
> >>> > ExecuteStreamCommand may not be the best way to accomplish this.
> >>> >
> >>> > The command that gets executed is: file -b --mime-encoding ${filename}
> >>> > in the working directory: ${absolute.path}
> >>> >
> >>> > Now that the file is no longer in the source directory when the
> >>> > processor fires, the command is broken.  I could PutFile somewhere
> >>> > temporarily; is there a better way?
> >>> >
> >>> > On Thu, Nov 19, 2015 at 10:33 PM, Joe Witt <joe.w...@gmail.com>
> wrote:
> >>> >>
> >>> >> Charlie,
> >>> >>
> >>> >> The fact that this is confusing is something we agree should be more
> >>> >> clear and we will improve.  We're tackling it based on what is
> >>> >> mentioned here [1].
> >>> >>
> >>> >> [1]
> >>> >> https://cwiki.apache.org/confluence/display/NIFI/Interactive+Queue+Management
> >>> >>
> >>> >> Thanks
> >>> >> Joe
> >>> >>
> >>> >> On Thu, Nov 19, 2015 at 10:30 PM, Corey Flowers
> >>> >> <cflow...@onyxpoint.com>
> >>> >> wrote:
> >>> >> > These guys are right. The file to look in for the uuid is
> >>> >> > nifi-app.log. Also, if you want to see what the processor itself is
> >>> >> > doing, you can right-click on the processor, get its uuid, and
> >>> >> > while it is running, run (assuming it is on Linux):
> >>> >> >
> >>> >> > tail -F nifi-app.log | grep uuid
> >>> >> >
> >>> >> > This will scroll the logs for that specific processor and show you
> >>> >> > what it is doing. It should also tell you the specific file names
> >>> >> > and uuids of the failing files.
> >>> >> >
> >>> >> > Hope that helps! Have a great night and good luck!
> >>> >> >
> >>> >> > Sent from my iPhone
> >>> >> >
> >>> >> > On Nov 19, 2015, at 9:27 PM, Juan Sequeiros <helloj...@gmail.com>
> >>> >> > wrote:
> >>> >> >
> >>> >> > You can also check the NiFi logs for a searchable id or for what
> >>> >> > the previous processor ID produced to help search provenance.
> >>> >> >
> >>> >> > On Nov 19, 2015 21:22, "Bryan Bende" <bbe...@gmail.com> wrote:
> >>> >> >>
> >>> >> >> Charlie,
> >>> >> >>
> >>> >> >> The behavior you described usually means that the processor
> >>> >> >> encountered an unexpected error, which was thrown back to the
> >>> >> >> framework; the framework then rolls back the processing of that
> >>> >> >> flow file and leaves it in the queue.  That is as opposed to an
> >>> >> >> error it expected, where it would usually route to a failure
> >>> >> >> relationship.
> >>> >> >>
> >>> >> >> Is the id that you see in the bulletin a uuid?
> >>> >> >>
> >>> >> >> There should still be some provenance events for this FlowFile
> >>> >> >> from the previous points in the flow. If it looks like the uuid
> >>> >> >> of the FlowFile, that should be searchable from provenance using
> >>> >> >> the search button on the right.
> >>> >> >> Let us know if we can help more.
> >>> >> >>
> >>> >> >> -Bryan
> >>> >> >>
> >>> >> >> On Thu, Nov 19, 2015 at 9:10 PM, Charlie Frasure
> >>> >> >> <charliefras...@gmail.com> wrote:
> >>> >> >>>
> >>> >> >>> I have a question on troubleshooting a flow.  I've built a flow
> >>> >> >>> with no exception routing, just trying to process the expected
> >>> >> >>> values first.  When a file exposes a problem with the logic in
> >>> >> >>> my flow, it queues up prior to the flow that is raising the
> >>> >> >>> bulletin.
> >>> >> >>>
> >>> >> >>> In the bulletin, I can see an id, but can't tell which file it
> >>> >> >>> is.  Data provenance doesn't seem to help as it passed the flow
> >>> >> >>> on the last processor, but hasn't been logged (to my knowledge)
> >>> >> >>> on the next one.
> >>> >> >>>
> >>> >> >>> Is there a way to match the bulletin back to a file without
> >>> >> >>> creating a route for failed files?
> >>> >> >>
> >>> >> >>
> >>> >> >
> >>> >
> >>> >
> >>
> >>
> >
>
