Re: queued files
Interesting. Thanks for the update and the template.

I use OS X as a playground, but this will have to be implemented on RHEL. I'll see about downloading or building this and testing. Performance will be critical due to the volume of data; I've run into some Python-based detection libraries that slowed the process way down.

A related project, jchardet [1], looks interesting as a possible start for a custom processor.

[1] http://jchardet.sourceforge.net/

On Tue, Nov 24, 2015 at 11:29 AM, Joe Percivall wrote:
> Hello Charlie,
>
> I was looking back through and saw this wasn't totally resolved yet.
>
> Couple questions. First, what system are you using? There are a couple of
> options for the stream command depending on what you're using. Also are you
> able to get new commands (using yum or brew)?
>
> The key thing I want to solve is to find the encoding of a file just based
> on its contents and not relying on having access to the original file.
> ExecuteStreamCommand should enable this. This is because you can just pass
> any FlowFile into ExecuteStreamCommand then it can route the FlowFile
> contents to STDIN for the command to execute on.
>
> Mac's (what I am using) default command for finding file encodings is
> "file -bi filename.txt" but it doesn't allow you to pass in a file via
> STDIN. I found a command called "uchardet"[1] which finds file encodings
> and allows you to pass the file in via STDIN.
>
> I attached a template that takes in a file using GetFile (deletes the
> original) and routes that FlowFile to ExecuteStreamCommand.
> ExecuteStreamCommand then runs "uchardet" on the contents of the FlowFile
> and outputs the encoding to the "encoding" attribute of the original
> FlowFile.
>
> [1] https://github.com/BYVoid/uchardet
>
> If this doesn't satisfy your needs just let me know!
> Joe
Re: queued files
Hello Charlie,

I was looking back through and saw this wasn't totally resolved yet.

Couple questions. First, what system are you using? There are a couple of options for the stream command depending on what you're using. Also, are you able to get new commands (using yum or brew)?

The key thing I want to solve is to find the encoding of a file just based on its contents and not relying on having access to the original file. ExecuteStreamCommand should enable this, because you can pass any FlowFile into ExecuteStreamCommand and it will route the FlowFile contents to STDIN for the command to execute on.

Mac's (what I am using) default command for finding file encodings is "file -bi filename.txt" but it doesn't allow you to pass in a file via STDIN. I found a command called "uchardet" [1] which finds file encodings and allows you to pass the file in via STDIN.

I attached a template that takes in a file using GetFile (deletes the original) and routes that FlowFile to ExecuteStreamCommand. ExecuteStreamCommand then runs "uchardet" on the contents of the FlowFile and outputs the encoding to the "encoding" attribute of the original FlowFile.

[1] https://github.com/BYVoid/uchardet

If this doesn't satisfy your needs just let me know!

Joe
- - - - - -
Joseph Percivall
linkedin.com/in/Percivall
e: joeperciv...@yahoo.com

On Friday, November 20, 2015 9:53 AM, Charlie Frasure wrote:

I'm definitely game for that. Let me know what I can do to help.

On Fri, Nov 20, 2015 at 9:35 AM, Joe Witt wrote:
>Charlie
>
>Got ya. I missed the 'encoding vs content type' thing. I agree let's
>find a way to avoid the extra copy. We don't expose the storage
>location of the underlying bytes. So on the ListFile thing. What I
>was thinking was this (and honestly I've not tested this so maybe I'm
>skipping something important)
>
>ListFile to get a listing of names/etc.. of interest
>
>Execute the 'file --mime-encoding ${filename}' to get more attributes
>available to work with
>
>RouteOnAttribute to decide what to do with the file next. You can
>Fetch/delete what you don't want and you can Fetch/pass on what you do
>
>I was looking for a way to check the mime-encoding while passing the
>data to detect into an input stream, because that is actually how
>execute stream command wants to work.
>
>This is a use case that should be pretty easy so if you're willing to
>chat through it with us we'll figure out a path to make it work well.
>
>Thanks
>Joe
>
>On Fri, Nov 20, 2015 at 9:17 AM, Charlie Frasure wrote:
>> Thanks Joe,
>>
>> The use case is that I'm receiving data without knowing what character set
>> it is coming in. --mime-encoding is giving its best guess on character set
>> rather than the content type.
>>
>> The ListFile sounds interesting, but I wonder if I really even need that. I
>> don't want to leave the files in place, I just want to run an external
>> command on them as part of the data flow. Is there a way I can run an
>> external command against the physical file such as
>> /opt/nifi/somedir/12345.uuid? Would that info be in an attribute somewhere?
>> It just seems wasteful to make an extra copy of the file, in order to run a
>> read-only command on it, then delete it. If ListFile is still the right
>> way to go, please let me know.
>>
>>
>> On Fri, Nov 20, 2015 at 6:45 AM, Joe Witt wrote:
>>>
>>> For identifying the mime type you may have sufficient results with the
>>> existing processor 'IdentifyMimeType' which you can put into the flow.
>>>
>>> For better logic around identifying files to pull but first calling an
>>> external command to learn more about them the upcoming
>>> ListFile/FetchFile combo that comes from this JIRA [1] might give you
>>> better flexibility.
>>>
>>> [1] https://issues.apache.org/jira/browse/NIFI-631
>>>
>>> Thanks
>>> Joe
>>>
>>> On Fri, Nov 20, 2015 at 12:08 AM, Charlie Frasure wrote:
>>> > Thanks everyone for the help. The trouble started a few processors
>>> > earlier in an ExecuteStreamCommand on ${filename} with the result of
>>> > "file not found". I had originally set my GetFile processor to not
>>> > remove files, but recently changed that. Now it seems that my
>>> > ExecuteStreamCommand may not be the best way to accomplish this.
>>> >
>>> > The command that gets executed is: file -b --mime-encoding ${filename}
>>> > in the working directory: ${absolute.path}
>>> >
>>> > Now that the file is no longer in the source directory when the
>>> > processor fires, the command is broken. I could PutFile somewhere
>>> > temporarily; is there a better way?
>>> >
>>> > On Thu, Nov 19, 2015 at 10:33 PM, Joe Witt wrote:
>>> >>
>>> >> Charlie,
>>> >>
>>> >> The fact that this is confusing is something we agree should be more
>>> >> clear and we will improve. We're tackling
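A minimal sketch of the STDIN approach described in the message above, for trying it outside NiFi first. It assumes uchardet is installed and reads from STDIN when given no file argument, as the message describes. The DETECTOR variable and the detect_encoding function are hypothetical names used only for illustration; they are not part of NiFi or uchardet.

```shell
#!/bin/sh
# Sketch: detect an encoding from content arriving on STDIN, so the original
# file does not need to exist on disk when the command runs.

# DETECTOR is a hypothetical variable naming the command to run. It defaults
# to uchardet, which is assumed to read STDIN when no file is given.
DETECTOR="${DETECTOR:-uchardet}"

detect_encoding() {
    # Whatever is piped into this function is handed to the detector on STDIN,
    # similar to how ExecuteStreamCommand passes FlowFile content to a command.
    # $DETECTOR is left unquoted on purpose so it may carry arguments.
    $DETECTOR
}

# Usage (requires uchardet on the PATH):
#   printf 'some text\n' | detect_encoding
```

Because the content arrives on STDIN, this does not depend on ${filename} still being present in the source directory, which was what broke the earlier "file -b --mime-encoding ${filename}" command.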
Re: queued files
Charlie,

The behavior you described usually means that the processor encountered an unexpected error, which was thrown back to the framework. The framework then rolls back the processing of that FlowFile and leaves it in the queue. This is as opposed to an error the processor expected, where it would usually route the FlowFile to a failure relationship.

Is the id that you see in the bulletin a uuid?

There should still be some provenance events for this FlowFile from the previous points in the flow. If it looks like the uuid of the FlowFile, that should be searchable from provenance using the search button on the right. Let us know if we can help more.

-Bryan

On Thu, Nov 19, 2015 at 9:10 PM, Charlie Frasure wrote:
> I have a question on troubleshooting a flow. I've built a flow with no
> exception routing, just trying to process the expected values first. When
> a file exposes a problem with the logic in my flow, it queues up prior to
> the flow that is raising the bulletin.
>
> In the bulletin, I can see an id, but can't tell which file it is. Data
> provenance doesn't seem to help as it passed the flow on the last
> processor, but hasn't been logged (to my knowledge) on the next one.
>
> Is there a way to match the bulletin back to a file without creating a
> route for failed files?
Re: queued files
Charlie,

The fact that this is confusing is something we agree should be clearer, and we will improve it. We're tackling it based on what is mentioned here [1].

[1] https://cwiki.apache.org/confluence/display/NIFI/Interactive+Queue+Management

Thanks
Joe

On Thu, Nov 19, 2015 at 10:30 PM, Corey Flowers wrote:
> These guys are right. The file to look in for the uuid is the nifi-app.log.
> Also if you wanted to see what the processor itself was doing, you could
> right click on the processor, get its uuid and while it is running, run
> (assuming it is on Linux):
>
> tail -F nifi-app.log | grep uuid
>
> This will just scroll the logs for that specific processor and will show you
> what it is doing. It should also tell you specific file names and uuids of
> the failing files.
>
> Hope that helps! Have a great night and good luck!
>
> Sent from my iPhone
>
> On Nov 19, 2015, at 9:27 PM, Juan Sequeiros wrote:
>
> You can also check the NiFi logs for a searchable id or for what the
> previous processor ID produced to help search provenance.
>
> On Nov 19, 2015 21:22, "Bryan Bende" wrote:
>>
>> Charlie,
>>
>> The behavior you described usually means that the processor encountered an
>> unexpected error which was thrown back to the framework which rolls back the
>> processing of that flow file and leaves it in the queue, as opposed to an
>> error it expected where it would usually route to a failure relationship.
>>
>> Is the id that you see in the bulletin a uuid?
>>
>> There should still be some provenance events for this FlowFile from the
>> previous points in the flow. If it looks like the uuid of the FlowFile, that
>> should be searchable from provenance using the search button on the right.
>> Let us know if we can help more.
>>
>> -Bryan
>>
>> On Thu, Nov 19, 2015 at 9:10 PM, Charlie Frasure wrote:
>>>
>>> I have a question on troubleshooting a flow. I've built a flow with no
>>> exception routing, just trying to process the expected values first. When a
>>> file exposes a problem with the logic in my flow, it queues up prior to the
>>> flow that is raising the bulletin.
>>>
>>> In the bulletin, I can see an id, but can't tell which file it is. Data
>>> provenance doesn't seem to help as it passed the flow on the last processor,
>>> but hasn't been logged (to my knowledge) on the next one.
>>>
>>> Is there a way to match the bulletin back to a file without creating a
>>> route for failed files?
queued files
I have a question on troubleshooting a flow. I've built a flow with no exception routing, just trying to process the expected values first. When a file exposes a problem with the logic in my flow, it queues up prior to the flow that is raising the bulletin.

In the bulletin, I can see an id, but can't tell which file it is. Data provenance doesn't seem to help as it passed the flow on the last processor, but hasn't been logged (to my knowledge) on the next one.

Is there a way to match the bulletin back to a file without creating a route for failed files?
Re: queued files
You can also check the NiFi logs for a searchable id or for what the previous processor ID produced to help search provenance.

On Nov 19, 2015 21:22, "Bryan Bende" wrote:

> Charlie,
>
> The behavior you described usually means that the processor encountered an
> unexpected error which was thrown back to the framework which rolls back
> the processing of that flow file and leaves it in the queue, as opposed to
> an error it expected where it would usually route to a failure relationship.
>
> Is the id that you see in the bulletin a uuid?
>
> There should still be some provenance events for this FlowFile from the
> previous points in the flow. If it looks like the uuid of the FlowFile,
> that should be searchable from provenance using the search button on the
> right. Let us know if we can help more.
>
> -Bryan
>
> On Thu, Nov 19, 2015 at 9:10 PM, Charlie Frasure wrote:
>
>> I have a question on troubleshooting a flow. I've built a flow with no
>> exception routing, just trying to process the expected values first. When
>> a file exposes a problem with the logic in my flow, it queues up prior to
>> the flow that is raising the bulletin.
>>
>> In the bulletin, I can see an id, but can't tell which file it is. Data
>> provenance doesn't seem to help as it passed the flow on the last
>> processor, but hasn't been logged (to my knowledge) on the next one.
>>
>> Is there a way to match the bulletin back to a file without creating a
>> route for failed files?
Re: queued files
These guys are right. The file to look in for the uuid is nifi-app.log. Also, if you wanted to see what the processor itself was doing, you could right click on the processor, get its uuid and, while it is running, run (assuming it is on Linux):

    tail -F nifi-app.log | grep uuid

where uuid is the uuid you copied from the processor. This will just scroll the logs for that specific processor and will show you what it is doing. It should also tell you specific file names and uuids of the failing files.

Hope that helps! Have a great night and good luck!

Sent from my iPhone

On Nov 19, 2015, at 9:27 PM, Juan Sequeiros wrote:

You can also check the NiFi logs for a searchable id or for what the previous processor ID produced to help search provenance.

On Nov 19, 2015 21:22, "Bryan Bende" wrote:

> Charlie,
>
> The behavior you described usually means that the processor encountered an
> unexpected error which was thrown back to the framework which rolls back
> the processing of that flow file and leaves it in the queue, as opposed to
> an error it expected where it would usually route to a failure relationship.
>
> Is the id that you see in the bulletin a uuid?
>
> There should still be some provenance events for this FlowFile from the
> previous points in the flow. If it looks like the uuid of the FlowFile,
> that should be searchable from provenance using the search button on the
> right. Let us know if we can help more.
>
> -Bryan
>
> On Thu, Nov 19, 2015 at 9:10 PM, Charlie Frasure wrote:
>
>> I have a question on troubleshooting a flow. I've built a flow with no
>> exception routing, just trying to process the expected values first. When
>> a file exposes a problem with the logic in my flow, it queues up prior to
>> the flow that is raising the bulletin.
>>
>> In the bulletin, I can see an id, but can't tell which file it is. Data
>> provenance doesn't seem to help as it passed the flow on the last
>> processor, but hasn't been logged (to my knowledge) on the next one.
>>
>> Is there a way to match the bulletin back to a file without creating a
>> route for failed files?
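The log-following tip above can be wrapped in two small helpers. This is only a sketch: LOG_FILE, find_uuid and follow_uuid are hypothetical names, and the path to nifi-app.log depends on where NiFi writes its logs on your system.

```shell
#!/bin/sh
# Sketch: search or follow the NiFi application log for a single uuid.

# LOG_FILE is a hypothetical variable; point it at your nifi-app.log.
LOG_FILE="${LOG_FILE:-nifi-app.log}"

# Print lines already in the log that mention the given uuid.
# -F treats the uuid as a fixed string rather than a regular expression.
find_uuid() {
    grep -F -- "$1" "$LOG_FILE"
}

# Keep following the log as it grows and show only lines mentioning the uuid.
# tail -F keeps following across log rotation.
follow_uuid() {
    tail -F "$LOG_FILE" | grep -F -- "$1"
}

# Usage, where the argument is the uuid copied from the processor or bulletin:
#   find_uuid <uuid>
#   follow_uuid <uuid>
```

find_uuid is useful after the fact, for example to match an id from a bulletin against earlier log lines, while follow_uuid mirrors the live "tail -F ... | grep" command from the message.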