Interesting. Thanks for the update and the template. I use OS X as a playground, but this will have to be implemented on RHEL. I'll see about downloading or building this and testing. Performance will be critical due to the volume of data; I've run into some Python-based detection libraries that slowed the process way down.
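
As a rough illustration of content-only detection (no filename, just bytes read from a stream, as in the ExecuteStreamCommand approach quoted below), here is a minimal Python sketch. It is not uchardet or jchardet and does no statistical detection: it only checks for a byte-order mark, then pure ASCII, then strict UTF-8 validity. The function name `guess_encoding`, the 64 KiB sample size, and the returned labels are assumptions chosen for the example.

```python
import codecs
import io

# BOMs are checked longest-first so UTF-32 LE is not mistaken for UTF-16 LE.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def guess_encoding(stream, sample_size=64 * 1024):
    """Guess an encoding from the first bytes of a binary stream.

    Only a bounded sample is read, so the cost does not grow with file size.
    """
    sample = stream.read(sample_size)
    for bom, name in _BOMS:
        if sample.startswith(bom):
            return name
    if sample.isascii():
        return "us-ascii"
    # If the sample filled the buffer, a multi-byte character may be cut off
    # at the edge; final=False tells the decoder to tolerate that.
    truncated = len(sample) == sample_size
    decoder = codecs.getincrementaldecoder("utf-8")("strict")
    try:
        decoder.decode(sample, final=not truncated)
        return "utf-8"
    except UnicodeDecodeError:
        return "unknown-8bit"


# Demo on in-memory bytes instead of a real file.
print(guess_encoding(io.BytesIO("naïve".encode("utf-8"))))
```

To run this against piped input, the same function could be called with `sys.stdin.buffer`. Anything that is neither ASCII nor valid UTF-8 is reported as unknown, so a real detector would still be needed for legacy 8-bit character sets.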
A related project, jchardet[1], looks interesting as a possible start for a custom processor.

[1] http://jchardet.sourceforge.net/

On Tue, Nov 24, 2015 at 11:29 AM, Joe Percivall <joeperciv...@yahoo.com> wrote:

> Hello Charlie,
>
> I was looking back through and saw this wasn't totally resolved yet.
>
> Couple questions. First, what system are you using? There are a couple of
> options for the stream command depending on what you're using. Also are you
> able to get new commands (using yum or brew)?
>
> The key thing I want to solve is to find the encoding of a file just based
> on its contents and not relying on having access to the original file.
> ExecuteStreamCommand should enable this. This is because you can just pass
> any FlowFile into ExecuteStreamCommand then it can route the FlowFile
> contents to STDIN for the command to execute on.
>
> Mac's (what I am using) default command for finding file encodings is
> "file -bi filename.txt" but it doesn't allow you to pass in a file via
> STDIN. I found a command called "uchardet"[1] which finds file encodings
> and allows you to pass the file in via STDIN.
>
> I attached a template that takes in a file using GetFile (deletes the
> original) and routes that FlowFile to ExecuteStreamCommand.
> ExecuteStreamCommand then runs "uchardet" on the contents of the FlowFile
> and outputs the encoding to the "encoding" attribute of the original
> FlowFile.
>
> [1] https://github.com/BYVoid/uchardet
>
> If this doesn't satisfy your needs just let me know!
> Joe
>
> - - - - - -
> Joseph Percivall
> linkedin.com/in/Percivall
> e: joeperciv...@yahoo.com
>
> On Friday, November 20, 2015 9:53 AM, Charlie Frasure <charliefras...@gmail.com> wrote:
>
> > I'm definitely game for that. Let me know what I can do to help.
> >
> > On Fri, Nov 20, 2015 at 9:35 AM, Joe Witt <joe.w...@gmail.com> wrote:
> > Charlie
> >
> >Got ya. I missed the 'encoding vs content type' thing. I agree let's
> >find a way to avoid the extra copy.
> >We don't expose the storage location of the underlying bytes. So on the
> >ListFile thing. What I was thinking was this (and honestly I've not
> >tested this so maybe I'm skipping something important)
> >
> >ListFile to get a listing of names/etc. of interest
> >
> >Execute the 'file --mime-encoding ${filename}' to get more attributes
> >available to work with
> >
> >RouteOnAttribute to decide what to do with the file next. You can
> >Fetch/delete what you don't want; you can Fetch/pass on what you do
> >
> >I was looking for a way to check the mime-encoding while passing the
> >data to detect into an input stream, because that is actually how
> >execute stream command wants to work.
> >
> >This is a use case that should be pretty easy so if you're willing to
> >chat through it with us we'll figure out a path to make it work well.
> >
> >Thanks
> >Joe
> >
> >On Fri, Nov 20, 2015 at 9:17 AM, Charlie Frasure <charliefras...@gmail.com> wrote:
> >> Thanks Joe,
> >>
> >> The use case is that I'm receiving data without knowing what character set
> >> it is coming in. --mime-encoding is giving its best guess on character set
> >> rather than the content type.
> >>
> >> The ListFile sounds interesting, but I wonder if I really even need that. I
> >> don't want to leave the files in place, I just want to run an external
> >> command on them as part of the data flow. Is there a way I can run an
> >> external command against the physical file such as
> >> /opt/nifi/somedir/12345.uuid? Would that info be in an attribute somewhere?
> >> It just seems wasteful to make an extra copy of the file, in order to run a
> >> read-only command on it, then delete it. If ListFile is still the right
> >> way to go, please let me know.
> >>
> >> On Fri, Nov 20, 2015 at 6:45 AM, Joe Witt <joe.w...@gmail.com> wrote:
> >>>
> >>> For identifying the mime type you may have sufficient results with the
> >>> existing processor 'IdentifyMimeType' which you can put into the flow.
> >>>
> >>> For better logic around identifying files to pull but first calling an
> >>> external command to learn more about them the upcoming
> >>> ListFile/FetchFile combo that comes from this JIRA [1] might give you
> >>> better flexibility.
> >>>
> >>> [1] https://issues.apache.org/jira/browse/NIFI-631
> >>>
> >>> Thanks
> >>> Joe
> >>>
> >>> On Fri, Nov 20, 2015 at 12:08 AM, Charlie Frasure <charliefras...@gmail.com> wrote:
> >>> > Thanks everyone for the help. The trouble started a few processors earlier
> >>> > in an ExecuteStreamCommand on ${filename} with the result of "file not
> >>> > found". I had originally set my GetFile processor to not remove files, but
> >>> > recently changed that. Now it seems that my ExecuteStreamCommand may not be
> >>> > the best way to accomplish this.
> >>> >
> >>> > The command that gets executed is: file -b --mime-encoding ${filename}
> >>> > in the working directory: ${absolute.path}
> >>> >
> >>> > Now that the file is no longer in the source directory when the processor
> >>> > fires, the command is broken. I could PutFile somewhere temporarily; is
> >>> > there a better way?
> >>> >
> >>> > On Thu, Nov 19, 2015 at 10:33 PM, Joe Witt <joe.w...@gmail.com> wrote:
> >>> >>
> >>> >> Charlie,
> >>> >>
> >>> >> The fact that this is confusing is something we agree should be more
> >>> >> clear and we will improve. We're tackling it based on what is
> >>> >> mentioned here [1].
> >>> >>
> >>> >> [1] https://cwiki.apache.org/confluence/display/NIFI/Interactive+Queue+Management
> >>> >>
> >>> >> Thanks
> >>> >> Joe
> >>> >>
> >>> >> On Thu, Nov 19, 2015 at 10:30 PM, Corey Flowers <cflow...@onyxpoint.com> wrote:
> >>> >> > These guys are right. The file to look in for the uuid is the nifi-app.log.
> >>> >> > Also if you wanted to see what the processor itself was doing, you could
> >>> >> > right click on the processor, get its uuid and while it is running, run
> >>> >> > (assuming it is on Linux):
> >>> >> >
> >>> >> > tail -F nifi-app.log | grep uuid
> >>> >> >
> >>> >> > This will just scroll the logs for that specific processor and will show you
> >>> >> > what it is doing. It should also tell you specific file names and uuids of
> >>> >> > the failing files.
> >>> >> >
> >>> >> > Hope that helps! Have a great night and good luck!
> >>> >> >
> >>> >> > Sent from my iPhone
> >>> >> >
> >>> >> > On Nov 19, 2015, at 9:27 PM, Juan Sequeiros <helloj...@gmail.com> wrote:
> >>> >> >
> >>> >> > You can also check the NiFi logs for a searchable id or for what the
> >>> >> > previous processor ID produced to help search provenance.
> >>> >> >
> >>> >> > On Nov 19, 2015 21:22, "Bryan Bende" <bbe...@gmail.com> wrote:
> >>> >> >>
> >>> >> >> Charlie,
> >>> >> >>
> >>> >> >> The behavior you described usually means that the processor encountered an
> >>> >> >> unexpected error which was thrown back to the framework which rolls back the
> >>> >> >> processing of that flow file and leaves it in the queue, as opposed to an
> >>> >> >> error it expected where it would usually route to a failure relationship.
> >>> >> >>
> >>> >> >> Is the id that you see in the bulletin a uuid?
> >>> >> >>
> >>> >> >> There should still be some provenance events for this FlowFile from the
> >>> >> >> previous points in the flow. If it looks like the uuid of the FlowFile, that
> >>> >> >> should be searchable from provenance using the search button on the right.
> >>> >> >> Let us know if we can help more.
> >>> >> >>
> >>> >> >> -Bryan
> >>> >> >>
> >>> >> >> On Thu, Nov 19, 2015 at 9:10 PM, Charlie Frasure <charliefras...@gmail.com> wrote:
> >>> >> >>>
> >>> >> >>> I have a question on troubleshooting a flow. I've built a flow with no
> >>> >> >>> exception routing, just trying to process the expected values first. When a
> >>> >> >>> file exposes a problem with the logic in my flow, it queues up prior to the
> >>> >> >>> flow that is raising the bulletin.
> >>> >> >>>
> >>> >> >>> In the bulletin, I can see an id, but can't tell which file it is. Data
> >>> >> >>> provenance doesn't seem to help as it passed the flow on the last processor,
> >>> >> >>> but hasn't been logged (to my knowledge) on the next one.
> >>> >> >>>
> >>> >> >>> Is there a way to match the bulletin back to a file without creating a
> >>> >> >>> route for failed files?
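
The `tail -F nifi-app.log | grep uuid` suggestion quoted in this thread boils down to keeping only the log lines that mention a given id. A minimal Python sketch of that filtering step follows; the helper name `lines_mentioning`, the ids, and the sample log lines are all hypothetical, and real nifi-app.log entries will look different.

```python
import io


def lines_mentioning(log_lines, component_id):
    """Yield log lines that contain the given id, similar to `grep <id>`."""
    for line in log_lines:
        if component_id in line:
            yield line.rstrip("\n")


# Hypothetical log excerpt, made up for illustration only.
sample_log = io.StringIO(
    "2015-11-19 21:10:01 INFO ProcessorA[id=aaaa-1111] started\n"
    "2015-11-19 21:10:02 ERROR ProcessorB[id=bbbb-2222] failed on file-x\n"
    "2015-11-19 21:10:03 INFO ProcessorA[id=aaaa-1111] finished\n"
)

matches = list(lines_mentioning(sample_log, "bbbb-2222"))
print(matches)
```

Unlike `tail -F`, this reads a finished stream once and does not follow a growing file; it only shows the matching step.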