Not a problem, I'd be interested in any follow-up details. I agree that it should be a separate processor, since this is an almost atomic unit of work that can be used in many different workflows. I created a JIRA for this new processor: https://issues.apache.org/jira/browse/NIFI-1217
Joe

- - - - - -
Joseph Percivall
linkedin.com/in/Percivall
e: [email protected]

On Tuesday, November 24, 2015 1:49 PM, Charlie Frasure <[email protected]> wrote:

Interesting. Thanks for the update and the template. I use OS X as a
playground, but this will have to be implemented on RHEL. I'll see about
downloading or building this and testing. Performance will be critical due
to the volume of data; I've run into some Python-based detection libraries
that slowed the process way down. A related project, jchardet [1], looks
interesting as a possible start for a custom processor.

[1] http://jchardet.sourceforge.net/

On Tue, Nov 24, 2015 at 11:29 AM, Joe Percivall <[email protected]> wrote:

>Hello Charlie,
>
>I was looking back through and saw this wasn't totally resolved yet.
>
>A couple of questions. First, what system are you using? There are a couple
>of options for the stream command depending on what you're using. Also, are
>you able to get new commands (using yum or brew)?
>
>The key thing I want to solve is finding the encoding of a file based only
>on its contents, without relying on access to the original file.
>ExecuteStreamCommand should enable this, because you can pass any FlowFile
>into ExecuteStreamCommand and it routes the FlowFile contents to STDIN for
>the command to execute on.
>
>Mac's (what I am using) default command for finding file encodings is
>"file -bi filename.txt", but it doesn't allow you to pass in a file via
>STDIN. I found a command called "uchardet" [1], which finds file encodings
>and allows you to pass the file in via STDIN.
>
>I attached a template that takes in a file using GetFile (deletes the
>original) and routes that FlowFile to ExecuteStreamCommand.
>ExecuteStreamCommand then runs "uchardet" on the contents of the FlowFile
>and outputs the encoding to the "encoding" attribute of the original
>FlowFile.
>
>[1] https://github.com/BYVoid/uchardet
>
>If this doesn't satisfy your needs just let me know!
>Joe
>
>- - - - - -
>Joseph Percivall
>linkedin.com/in/Percivall
>e: [email protected]
>
>On Friday, November 20, 2015 9:53 AM, Charlie Frasure
><[email protected]> wrote:
>
>I'm definitely game for that. Let me know what I can do to help.
>
>On Fri, Nov 20, 2015 at 9:35 AM, Joe Witt <[email protected]> wrote:
>
>>Charlie
>>
>>Got ya. I missed the 'encoding vs content type' thing. I agree, let's
>>find a way to avoid the extra copy. We don't expose the storage
>>location of the underlying bytes. So, on the ListFile thing, what I
>>was thinking was this (and honestly I've not tested this, so maybe I'm
>>skipping something important):
>>
>>ListFile to get a listing of names, etc. of interest
>>
>>Execute the 'file --mime-encoding ${filename}' to get more attributes
>>available to work with
>>
>>RouteOnAttribute to decide what to do with the file next. You can
>>Fetch/delete what you don't want and Fetch/pass on what you do.
>>
>>I was looking for a way to check the mime-encoding while passing the
>>data to detect into an input stream, because that is actually how
>>ExecuteStreamCommand wants to work.
>>
>>This is a use case that should be pretty easy, so if you're willing to
>>chat through it with us we'll figure out a path to make it work well.
>>
>>Thanks
>>Joe
>>
>>On Fri, Nov 20, 2015 at 9:17 AM, Charlie Frasure
>><[email protected]> wrote:
>>> Thanks Joe,
>>>
>>> The use case is that I'm receiving data without knowing what character
>>> set it is coming in. --mime-encoding is giving its best guess on
>>> character set rather than the content type.
>>>
>>> The ListFile sounds interesting, but I wonder if I really even need
>>> that. I don't want to leave the files in place; I just want to run an
>>> external command on them as part of the data flow. Is there a way I can
>>> run an external command against the physical file such as
>>> /opt/nifi/somedir/12345.uuid? Would that info be in an attribute somewhere?
>>> It just seems wasteful to make an extra copy of the file in order to
>>> run a read-only command on it, then delete it. If ListFile is still the
>>> right way to go, please let me know.
>>>
>>> On Fri, Nov 20, 2015 at 6:45 AM, Joe Witt <[email protected]> wrote:
>>>>
>>>> For identifying the mime type you may have sufficient results with the
>>>> existing processor 'IdentifyMimeType' which you can put into the flow.
>>>>
>>>> For better logic around identifying files to pull, but first calling an
>>>> external command to learn more about them, the upcoming
>>>> ListFile/FetchFile combo that comes from this JIRA [1] might give you
>>>> better flexibility.
>>>>
>>>> [1] https://issues.apache.org/jira/browse/NIFI-631
>>>>
>>>> Thanks
>>>> Joe
>>>>
>>>> On Fri, Nov 20, 2015 at 12:08 AM, Charlie Frasure
>>>> <[email protected]> wrote:
>>>> > Thanks everyone for the help. The trouble started a few processors
>>>> > earlier in an ExecuteStreamCommand on ${filename} with the result of
>>>> > "file not found". I had originally set my GetFile processor to not
>>>> > remove files, but recently changed that. Now it seems that my
>>>> > ExecuteStreamCommand may not be the best way to accomplish this.
>>>> >
>>>> > The command that gets executed is: file -b --mime-encoding ${filename}
>>>> > in the working directory: ${absolute.path}
>>>> >
>>>> > Now that the file is no longer in the source directory when the
>>>> > processor fires, the command is broken. I could PutFile somewhere
>>>> > temporarily; is there a better way?
>>>> >
>>>> > On Thu, Nov 19, 2015 at 10:33 PM, Joe Witt <[email protected]> wrote:
>>>> >>
>>>> >> Charlie,
>>>> >>
>>>> >> The fact that this is confusing is something we agree should be more
>>>> >> clear and we will improve. We're tackling it based on what is
>>>> >> mentioned here [1].
>>>> >>
>>>> >> [1] https://cwiki.apache.org/confluence/display/NIFI/Interactive+Queue+Management
>>>> >>
>>>> >> Thanks
>>>> >> Joe
>>>> >>
>>>> >> On Thu, Nov 19, 2015 at 10:30 PM, Corey Flowers
>>>> >> <[email protected]> wrote:
>>>> >> > These guys are right. The file to look in for the uuid is the
>>>> >> > nifi-app.log.
>>>> >> > Also if you wanted to see what the processor itself was doing, you
>>>> >> > could right click on the processor, get its uuid and while it is
>>>> >> > running, run (assuming it is on Linux):
>>>> >> >
>>>> >> > tail -F nifi-app.log | grep uuid
>>>> >> >
>>>> >> > This will just scroll the logs for that specific processor and will
>>>> >> > show you what it is doing. It should also tell you specific file
>>>> >> > names and uuids of the failing files.
>>>> >> >
>>>> >> > Hope that helps! Have a great night and good luck!
>>>> >> >
>>>> >> > Sent from my iPhone
>>>> >> >
>>>> >> > On Nov 19, 2015, at 9:27 PM, Juan Sequeiros <[email protected]>
>>>> >> > wrote:
>>>> >> >
>>>> >> > You can also check the NiFi logs for a searchable id or for what the
>>>> >> > previous processor ID produced to help search provenance.
>>>> >> >
>>>> >> > On Nov 19, 2015 21:22, "Bryan Bende" <[email protected]> wrote:
>>>> >> >>
>>>> >> >> Charlie,
>>>> >> >>
>>>> >> >> The behavior you described usually means that the processor
>>>> >> >> encountered an unexpected error which was thrown back to the
>>>> >> >> framework which rolls back the processing of that flow file and
>>>> >> >> leaves it in the queue, as opposed to an error it expected where
>>>> >> >> it would usually route to a failure relationship.
>>>> >> >>
>>>> >> >> Is the id that you see in the bulletin a uuid?
>>>> >> >>
>>>> >> >> There should still be some provenance events for this FlowFile
>>>> >> >> from the previous points in the flow.
>>>> >> >> If it looks like the uuid of the FlowFile, that should be
>>>> >> >> searchable from provenance using the search button on the right.
>>>> >> >> Let us know if we can help more.
>>>> >> >>
>>>> >> >> -Bryan
>>>> >> >>
>>>> >> >> On Thu, Nov 19, 2015 at 9:10 PM, Charlie Frasure
>>>> >> >> <[email protected]> wrote:
>>>> >> >>>
>>>> >> >>> I have a question on troubleshooting a flow. I've built a flow
>>>> >> >>> with no exception routing, just trying to process the expected
>>>> >> >>> values first. When a file exposes a problem with the logic in
>>>> >> >>> my flow, it queues up prior to the flow that is raising the
>>>> >> >>> bulletin.
>>>> >> >>>
>>>> >> >>> In the bulletin, I can see an id, but can't tell which file it
>>>> >> >>> is. Data provenance doesn't seem to help as it passed the flow
>>>> >> >>> on the last processor, but hasn't been logged (to my knowledge)
>>>> >> >>> on the next one.
>>>> >> >>>
>>>> >> >>> Is there a way to match the bulletin back to a file without
>>>> >> >>> creating a route for failed files?
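
For anyone who wants to try the STDIN-based detection discussed in this thread outside of NiFi first, here is a minimal Python sketch of the same idea: hand the raw bytes to an external detector on STDIN and read the guess back from STDOUT. It is not the attached template and not NiFi code. The function name detect_encoding and the STAND_IN_DETECTOR command are hypothetical, made up for illustration; the only thing assumed about "uchardet" is what the thread says, namely that it accepts content on STDIN.

```python
import subprocess
import sys
from typing import Sequence

# Hypothetical stand-in so the sketch can be tried without uchardet
# installed: it reads STDIN and always answers "UTF-8" for non-empty input.
STAND_IN_DETECTOR = [
    sys.executable,
    "-c",
    "import sys; d = sys.stdin.buffer.read(); print('UTF-8' if d else 'unknown')",
]


def detect_encoding(content: bytes, command: Sequence[str],
                    timeout: float = 30.0) -> str:
    """Pipe content to the command's STDIN and return its trimmed STDOUT.

    No file on disk is needed, which is the point made in the thread:
    the detector only ever sees the bytes themselves.
    """
    result = subprocess.run(
        list(command),
        input=content,            # bytes go to the child's STDIN
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        timeout=timeout,
        check=True,               # raise if the detector exits non-zero
    )
    return result.stdout.decode("ascii", errors="replace").strip()
```

With uchardet installed and on the PATH (an assumption), the call would be detect_encoding(data, ["uchardet"]); the stand-in command can be passed instead to check the plumbing. Each call starts a new process, so for the data volumes mentioned above this is only a way to experiment, not a statement about performance.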
