Re: queued files

2015-11-24 Thread Charlie Frasure
Interesting.  Thanks for the update and the template.  I use OS X as a
playground, but this will have to be implemented on RHEL.  I'll see about
downloading or building this and testing.  Performance will be critical due
to the volume of data; I've run into some Python-based detection libraries
that slowed the process way down.

A related project, jchardet [1], looks interesting as a possible start for a
custom processor.

[1] http://jchardet.sourceforge.net/



On Tue, Nov 24, 2015 at 11:29 AM, Joe Percivall wrote:

> Hello Charlie,
>
> I was looking back through and saw this wasn't totally resolved yet.
>
>
> A couple of questions. First, what system are you using? There are a couple
> of options for the stream command depending on what you're using. Also, are
> you able to install new commands (using yum or brew)?
>
> The key thing I want to solve is to find the encoding of a file just based
> on its contents, without relying on access to the original file.
> ExecuteStreamCommand should enable this, because you can pass any FlowFile
> into it and it routes the FlowFile contents to STDIN for the command to
> execute on.
>
> On Mac (what I am using), the default command for finding file encodings is
> "file -bi filename.txt" but it doesn't allow you to pass in a file via
> STDIN. I found a command called "uchardet"[1] which finds file encodings
> and allows you to pass the file in via STDIN.
>
> I attached a template that takes in a file using GetFile (deletes the
> original) and routes that FlowFile to ExecuteStreamCommand.
> ExecuteStreamCommand then runs "uchardet" on the contents of the FlowFile
> and outputs the encoding to the "encoding" attribute of the original
> FlowFile.
>
> [1] https://github.com/BYVoid/uchardet
>
> If this doesn't satisfy your needs just let me know!
> Joe
>
> - - - - - -
> Joseph Percivall
> linkedin.com/in/Percivall
> e: joeperciv...@yahoo.com
>
>
>
>
> On Friday, November 20, 2015 9:53 AM, Charlie Frasure <charliefras...@gmail.com> wrote:
>
>
>
> I'm definitely game for that.  Let me know what I can do to help.
>
>
>
> On Fri, Nov 20, 2015 at 9:35 AM, Joe Witt wrote:
>
> >Charlie
> >
> >Got ya.  I missed the 'encoding vs content type' thing.  I agree let's
> >find a way to avoid the extra copy.  We don't expose the storage
> >location of the underlying bytes.  So on the ListFile thing.  What I
> >was thinking was this (and honestly I've not tested this so maybe I'm
> >skipping something important)
> >
> >ListFile to get a listing of names/etc.. of interest
> >
> >Execute the 'file --mime-encoding ${filename}' to get more attributes
> >available to work with
> >
> >RouteOnAttribute to decide what to do with the file next.  You can
> >Fetch/delete what you don't want; you can Fetch/pass on what you do.
> >
> >I was looking for a way to check the mime-encoding while passing the
> >data to detect into an input stream, because that is actually how
> >ExecuteStreamCommand wants to work.
> >
> >This is a use case that should be pretty easy so if you're willing to
> >chat through it with us we'll figure out a path to make it work well.
> >
> >Thanks
> >Joe
> >
> >On Fri, Nov 20, 2015 at 9:17 AM, Charlie Frasure wrote:
> >> Thanks Joe,
> >>
> >> The use case is that I'm receiving data without knowing what character set
> >> it is coming in.  --mime-encoding is giving its best guess on character set
> >> rather than the content type.
> >>
> >> The ListFile sounds interesting, but I wonder if I really even need that.  I
> >> don't want to leave the files in place, I just want to run an external
> >> command on them as part of the data flow.  Is there a way I can run an
> >> external command against the physical file such as
> >> /opt/nifi/somedir/12345.uuid?  Would that info be in an attribute somewhere?
> >> It just seems wasteful to make an extra copy of the file, in order to run a
> >> read-only command on it, then delete it.  If ListFile is still the right
> >> way to go, please let me know.
> >>
> >>
> >> On Fri, Nov 20, 2015 at 6:45 AM, Joe Witt  wrote:
> >>>
> >>> For identifying the mime type you may have sufficient results with the
> >>> existing processor 'IdentifyMimeType' which you can put into the flow.
> >>>
> >>> For better logic around identifying files to pull but first calling an
> >>> external command to learn more about them the upcoming
> >>> ListFile/FetchFile combo that comes from this JIRA [1] might give you
> >>> better flexibility.
> >>>
> >>> [1] https://issues.apache.org/jira/browse/NIFI-631
> >>>
> >>> Thanks
> >>> Joe
> >>>
> >>> On Fri, Nov 20, 2015 at 12:08 AM, Charlie Frasure wrote:
> >>> > Thanks everyone for the help.  The trouble started a few processors earlier
> >>> > in an ExecuteStreamCommand on ${filename} with the result of "file not
> >>> > found".  I had 

Re: queued files

2015-11-24 Thread Joe Percivall
Hello Charlie,

I was looking back through and saw this wasn't totally resolved yet. 


A couple of questions. First, what system are you using? There are a couple of
options for the stream command depending on what you're using. Also, are you
able to install new commands (using yum or brew)?

The key thing I want to solve is to find the encoding of a file just based on
its contents, without relying on access to the original file.
ExecuteStreamCommand should enable this, because you can pass any FlowFile
into it and it routes the FlowFile contents to STDIN for the command to
execute on.

On Mac (what I am using), the default command for finding file encodings is
"file -bi filename.txt" but it doesn't allow you to pass in a file via STDIN.
I found a command called "uchardet"[1] which finds file encodings and allows
you to pass the file in via STDIN.

I attached a template that takes in a file using GetFile (deletes the original) 
and routes that FlowFile to ExecuteStreamCommand. ExecuteStreamCommand then 
runs "uchardet" on the contents of the FlowFile and outputs the encoding to the 
"encoding" attribute of the original FlowFile.
 
[1] https://github.com/BYVoid/uchardet
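
As a rough sketch of the STDIN approach described above (this is not the
attached template): the function below hands whatever arrives on STDIN to a
detector command, which is roughly what ExecuteStreamCommand does with
FlowFile content. The name detect_encoding and the DETECTOR variable are made
up for illustration, and uchardet is assumed to be installed and on the PATH.

```shell
#!/bin/sh
# Hypothetical helper: pass STDIN straight to an encoding detector.
# DETECTOR is an illustrative override; it defaults to uchardet, which is
# assumed to be installed. Any command that reads STDIN and prints a
# charset name could be swapped in.
detect_encoding() {
    detector=${DETECTOR:-uchardet}
    # No filename argument is given, so the detector has to read STDIN.
    $detector
}

# Usage (sample.txt is a placeholder):
#   detect_encoding < sample.txt
```

Called as shown in the usage comment, this should print the charset name
that uchardet guesses. Many builds of "file" also accept "-" as the filename
to read STDIN (for example "file -b --mime-encoding -"), which may be worth
checking on the target system.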

If this doesn't satisfy your needs just let me know!
Joe

- - - - - - 
Joseph Percivall
linkedin.com/in/Percivall
e: joeperciv...@yahoo.com




On Friday, November 20, 2015 9:53 AM, Charlie Frasure wrote:



I'm definitely game for that.  Let me know what I can do to help.



On Fri, Nov 20, 2015 at 9:35 AM, Joe Witt wrote:

>Charlie
>
>Got ya.  I missed the 'encoding vs content type' thing.  I agree let's
>find a way to avoid the extra copy.  We don't expose the storage
>location of the underlying bytes.  So on the ListFile thing.  What I
>was thinking was this (and honestly I've not tested this so maybe I'm
>skipping something important)
>
>ListFile to get a listing of names/etc.. of interest
>
>Execute the 'file --mime-encoding ${filename}' to get more attributes
>available to work with
>
>RouteOnAttribute to decide what to do with the file next.  You can
>Fetch/delete what you don't want; you can Fetch/pass on what you do.
>
>I was looking for a way to check the mime-encoding while passing the
>data to detect into an input stream, because that is actually how
>ExecuteStreamCommand wants to work.
>
>This is a use case that should be pretty easy so if you're willing to
>chat through it with us we'll figure out a path to make it work well.
>
>Thanks
>Joe
>
>On Fri, Nov 20, 2015 at 9:17 AM, Charlie Frasure wrote:
>> Thanks Joe,
>>
>> The use case is that I'm receiving data without knowing what character set
>> it is coming in.  --mime-encoding is giving its best guess on character set
>> rather than the content type.
>>
>> The ListFile sounds interesting, but I wonder if I really even need that.  I
>> don't want to leave the files in place, I just want to run an external
>> command on them as part of the data flow.  Is there a way I can run an
>> external command against the physical file such as
>> /opt/nifi/somedir/12345.uuid?  Would that info be in an attribute somewhere?
>> It just seems wasteful to make an extra copy of the file, in order to run a
>> read-only command on it, then delete it.  If ListFile is still the right
>> way to go, please let me know.
>>
>>
>> On Fri, Nov 20, 2015 at 6:45 AM, Joe Witt  wrote:
>>>
>>> For identifying the mime type you may have sufficient results with the
>>> existing processor 'IdentifyMimeType' which you can put into the flow.
>>>
>>> For better logic around identifying files to pull but first calling an
>>> external command to learn more about them the upcoming
>>> ListFile/FetchFile combo that comes from this JIRA [1] might give you
>>> better flexibility.
>>>
>>> [1] https://issues.apache.org/jira/browse/NIFI-631
>>>
>>> Thanks
>>> Joe
>>>
>>> On Fri, Nov 20, 2015 at 12:08 AM, Charlie Frasure wrote:
>>> > Thanks everyone for the help.  The trouble started a few processors earlier
>>> > in an ExecuteStreamCommand on ${filename} with the result of "file not
>>> > found".  I had originally set my GetFile processor to not remove files, but
>>> > recently changed that.  Now it seems that my ExecuteStreamCommand may not be
>>> > the best way to accomplish this.
>>> >
>>> > The command that gets executed is: file -b --mime-encoding ${filename}
>>> > in the working directory: ${absolute.path}
>>> >
>>> > Now that the file is no longer in the source directory when the processor
>>> > fires, the command is broken.  I could PutFile somewhere temporarily; is
>>> > there a better way?
>>> >
>>> > On Thu, Nov 19, 2015 at 10:33 PM, Joe Witt  wrote:
>>> >>
>>> >> Charlie,
>>> >>
>>> >> The fact that this is confusing is something we agree should be more
>>> >> clear and we will improve.  We're tackling 

Re: queued files

2015-11-19 Thread Bryan Bende
Charlie,

The behavior you described usually means that the processor encountered an
unexpected error, which was thrown back to the framework. The framework then
rolls back the processing of that FlowFile and leaves it in the queue. An
error the processor expected would instead usually be routed to a failure
relationship.

Is the id that you see in the bulletin a uuid?

There should still be some provenance events for this FlowFile from the
previous points in the flow. If the id looks like the uuid of the FlowFile,
it should be searchable from provenance using the search button on the
right. Let us know if we can help more.

-Bryan

On Thu, Nov 19, 2015 at 9:10 PM, Charlie Frasure wrote:

> I have a question on troubleshooting a flow.  I've built a flow with no
> exception routing, just trying to process the expected values first.  When
> a file exposes a problem with the logic in my flow, it queues up prior to
> the flow that is raising the bulletin.
>
> In the bulletin, I can see an id, but can't tell which file it is.  Data
> provenance doesn't seem to help as it passed the flow on the last
> processor, but hasn't been logged (to my knowledge) on the next one.
>
> Is there a way to match the bulletin back to a file without creating a
> route for failed files?
>


Re: queued files

2015-11-19 Thread Joe Witt
Charlie,

We agree this is confusing and should be clearer, and we will improve
it.  We're tackling it based on what is mentioned here [1].

[1] https://cwiki.apache.org/confluence/display/NIFI/Interactive+Queue+Management

Thanks
Joe

On Thu, Nov 19, 2015 at 10:30 PM, Corey Flowers  wrote:
> These guys are right. The file to look in for the uuid is the nifi-app.log.
> Also if you wanted to see what the processor itself was doing, you could
> right click on the processor, get its uuid and while it is running, run
> (assuming it is on Linux):
>
> tail -F nifi-app.log | grep uuid
>
> This will just scroll the logs for that specific processor and will show you
> what it is doing. It should also tell you specific file names and uuids of
> the failing files.
>
> Hope that helps! Have a great night and good luck!
>
> On Nov 19, 2015, at 9:27 PM, Juan Sequeiros  wrote:
>
> You can also check the NiFi logs for a searchable id or for what the
> previous processor ID produced to help search provenance.
>
> On Nov 19, 2015 21:22, "Bryan Bende"  wrote:
>>
>> Charlie,
>>
>> The behavior you described usually means that the processor encountered an
>> unexpected error which was thrown back to the framework which rolls back the
>> processing of that flow file and leaves it in the queue, as opposed to an
>> error it expected where it would usually route to a failure relationship.
>>
>> Is the id that you see in the bulletin a uuid?
>>
>> There should still be some provenance events for this FlowFile from the
>> previous points in the flow. If it looks like the uuid of the FlowFile, that
>> should be searchable from provenance using the search button on the right.
>> Let us know if we can help more.
>>
>> -Bryan
>>
>> On Thu, Nov 19, 2015 at 9:10 PM, Charlie Frasure wrote:
>>>
>>> I have a question on troubleshooting a flow.  I've built a flow with no
>>> exception routing, just trying to process the expected values first.  When a
>>> file exposes a problem with the logic in my flow, it queues up prior to the
>>> flow that is raising the bulletin.
>>>
>>> In the bulletin, I can see an id, but can't tell which file it is.  Data
>>> provenance doesn't seem to help as it passed the flow on the last processor,
>>> but hasn't been logged (to my knowledge) on the next one.
>>>
>>> Is there a way to match the bulletin back to a file without creating a
>>> route for failed files?
>>
>>
>


queued files

2015-11-19 Thread Charlie Frasure
I have a question on troubleshooting a flow.  I've built a flow with no
exception routing, just trying to process the expected values first.  When
a file exposes a problem with the logic in my flow, it queues up prior to
the processor that is raising the bulletin.

In the bulletin, I can see an id, but can't tell which file it is.  Data
provenance doesn't seem to help, as the file passed through the last
processor but hasn't been logged (to my knowledge) on the next one.

Is there a way to match the bulletin back to a file without creating a
route for failed files?


Re: queued files

2015-11-19 Thread Juan Sequeiros
You can also check the NiFi logs for a searchable id or for what the
previous processor ID produced to help search provenance.

On Nov 19, 2015 21:22, "Bryan Bende" wrote:

> Charlie,
>
> The behavior you described usually means that the processor encountered an
> unexpected error which was thrown back to the framework which rolls back
> the processing of that flow file and leaves it in the queue, as opposed to
> an error it expected where it would usually route to a failure relationship.
>
> Is the id that you see in the bulletin a uuid?
>
> There should still be some provenance events for this FlowFile from the
> previous points in the flow. If it looks like the uuid of the FlowFile,
> that should be searchable from provenance using the search button on the
> right. Let us know if we can help more.
>
> -Bryan
>
> On Thu, Nov 19, 2015 at 9:10 PM, Charlie Frasure wrote:
>
>> I have a question on troubleshooting a flow.  I've built a flow with no
>> exception routing, just trying to process the expected values first.  When
>> a file exposes a problem with the logic in my flow, it queues up prior to
>> the flow that is raising the bulletin.
>>
>> In the bulletin, I can see an id, but can't tell which file it is.  Data
>> provenance doesn't seem to help as it passed the flow on the last
>> processor, but hasn't been logged (to my knowledge) on the next one.
>>
>> Is there a way to match the bulletin back to a file without creating a
>> route for failed files?
>>
>
>


Re: queued files

2015-11-19 Thread Corey Flowers
These guys are right. The file to look in for the uuid is nifi-app.log.
Also, if you want to see what the processor itself is doing, you can
right-click on the processor, get its uuid and, while it is running, run
(assuming it is on Linux):

tail -F nifi-app.log | grep uuid

This will just scroll the logs for that specific processor and will show
you what it is doing. It should also tell you specific file names and uuids
of the failing files.
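
A slightly more defensive variant of the command above, as a sketch: grep -F
matches the uuid as a literal string rather than a regular expression. The
helper name find_uuid_lines and the log path in the comments are
placeholders, not values from this thread.

```shell
#!/bin/sh
# Hypothetical helper: print every line of a log file that mentions a uuid.
#   $1 = path to the log file (for example logs/nifi-app.log; path assumed)
#   $2 = uuid of the processor or FlowFile, copied from the UI or a bulletin
find_uuid_lines() {
    # -F: treat the uuid as a fixed string; --: stop option parsing.
    grep -F -- "$2" "$1"
}

# To watch a live log instead (runs until interrupted with Ctrl-C):
#   tail -F logs/nifi-app.log | grep -F -- "$uuid"
```

The function form is handy for searching a log after the fact; the
commented tail pipeline is the live equivalent.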

Hope that helps! Have a great night and good luck!

On Nov 19, 2015, at 9:27 PM, Juan Sequeiros  wrote:

You can also check the NiFi logs for a searchable id or for what the
previous processor ID produced to help search provenance.

On Nov 19, 2015 21:22, "Bryan Bende" wrote:

> Charlie,
>
> The behavior you described usually means that the processor encountered an
> unexpected error which was thrown back to the framework which rolls back
> the processing of that flow file and leaves it in the queue, as opposed to
> an error it expected where it would usually route to a failure relationship.
>
> Is the id that you see in the bulletin a uuid?
>
> There should still be some provenance events for this FlowFile from the
> previous points in the flow. If it looks like the uuid of the FlowFile,
> that should be searchable from provenance using the search button on the
> right. Let us know if we can help more.
>
> -Bryan
>
> On Thu, Nov 19, 2015 at 9:10 PM, Charlie Frasure wrote:
>
>> I have a question on troubleshooting a flow.  I've built a flow with no
>> exception routing, just trying to process the expected values first.  When
>> a file exposes a problem with the logic in my flow, it queues up prior to
>> the flow that is raising the bulletin.
>>
>> In the bulletin, I can see an id, but can't tell which file it is.  Data
>> provenance doesn't seem to help as it passed the flow on the last
>> processor, but hasn't been logged (to my knowledge) on the next one.
>>
>> Is there a way to match the bulletin back to a file without creating a
>> route for failed files?
>>
>
>