Joe,

Ultimately, we couldn't change the behavior without breaking backward
compatibility. We do have a ticket [1] to add an "Argument Delimiter"
property that is completed and will be included in 0.4.0. It will default
to semi-colon in order to maintain backward compatibility, but it can be
changed to a space. It will at least make it more obvious that there's a
funky delimiter being used.

Thanks
-Mark

[1] https://issues.apache.org/jira/browse/NIFI-604

> On Oct 26, 2015, at 10:14 AM, Joe Witt <joe.w...@gmail.com> wrote:
>
> Mark
>
> Ok, understood. I think ultimately in the case of ZIP the IO is
> happening anyway, but if we can avoid writing these items to our
> repositories at all when they're uninteresting, then great. Do you mind
> filing a JIRA for that?
>
> And yes, you are absolutely right that you should be able to expect/get
> consistent behavior between the ExecuteCommand/script processors. We
> have discussed this before. I didn't find a JIRA. Anyone else know
> the status of this?
>
> Thanks
> Joe
>
> On Mon, Oct 26, 2015 at 1:23 AM, Mark Petronic <markpetro...@gmail.com> wrote:
>> Joe, yes, I wanted to be able to selectively unzip a specific file
>> from a zip archive. For example, I have this zip archive and want to
>> pull just the files that match *LMTD* from it to standard out as a
>> stream to feed into HDFS as a file put. Since there are a bunch of
>> big files in there, it is really wasteful of network I/O to have to
>> stream the whole file just to throw away most of the bits in a later
>> filter stage and end up with only some part of the bits. I like
>> efficiency where it makes sense, and there is already a lot of I/O
>> from Hadoop - no need to add more unnecessary stuff that could be
>> easily avoided. :)
>>
>> unzip -l /import/nms/prod/stats/Terminal/GW12/ConsolidatedTermStats_20151022021503.zip
>> Archive:  /import/nms/prod/stats/Terminal/GW12/ConsolidatedTermStats_20151022021503.zip
>>    Length       Date    Time    Name
>> ---------  ---------- -----    ----
>>  73166261  10-22-2015 02:17    Consolidated_LMTD_001_20151022021503.csv
>>  80864628  10-22-2015 02:17    Consolidated_MODC_001_20151022021503.csv
>>  14033836  10-22-2015 02:17    Consolidated_SYMC_001_20151022021503.csv
>>    120463  10-22-2015 02:17    Consolidated_XPRT_001_20151022021503.csv
>> ---------                      -------
>> 168185188                      4 files
>>
>> On Sun, Oct 25, 2015 at 11:56 AM, Joe Witt <joe.w...@gmail.com> wrote:
>>> Hello
>>>
>>> For the unpacking portion, are you saying you have a single archive
>>> (let's say in zip format) that contains multiple objects, and you'd
>>> like to be able to use UnpackContent but tell it to skip or include
>>> specific items based on a regex or something against the names?
>>>
>>> That seems reasonable to do, but I just wanted to make sure I
>>> understood. For now you can put a RouteOnAttribute processor after
>>> UnpackContent and route away the unbundled items you don't care
>>> about. You can create a property on that processor called
>>> 'stuff-i-dont-want' and the value would be something like
>>> ${filename:matches('.*stuff-i-dont-want.*')}.
>>>
>>> Thanks
>>> Joe
>>>
>>> On Sun, Oct 25, 2015 at 1:12 AM, Adam Lamar <adamond...@gmail.com> wrote:
>>>> Mark,
>>>>
>>>>> If I configured the command arguments as "-n +2" (without the
>>>>> quotes and a space between the two parts), the command would
>>>>> result in a "tail -n2" behavior.
>>>>
>>>> If you look at the tooltip for the Command Arguments property in
>>>> ExecuteStreamCommand, you'll see that the arguments need to be
>>>> delimited by a semicolon. Maybe try "-n;+2" instead? I'm not sure of
>>>> the exact rules in NiFi, but I've seen similar behavior with regard
>>>> to spaces in libraries that execute processes with command line
>>>> arguments.
>>>>
>>>> There probably is a better way to process the CSV, but I'm afraid
>>>> someone else will need to comment on that.
>>>>
>>>>> Seems like it will only unzip the whole zip file and provide me
>>>>> index numbers for each file unpacked.
>>>>
>>>> A quick look at the UnpackContent source [1] suggests that there is
>>>> no way to filter the filenames inside the zipfile prior to
>>>> extraction. I agree that would be a useful feature. Maybe one of the
>>>> NiFi devs will comment on the possibility of including it as a
>>>> feature in the future.
>>>>
>>>> Cheers,
>>>> Adam
>>>>
>>>> [1] https://github.com/apache/nifi/blob/master/nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/UnpackContent.java#L304
>>>>
>>>> On 10/24/15 9:08 PM, Mark Petronic wrote:
>>>>>
>>>>> Just starting to use NiFi and built a flow that implements the following:
>>>>>
>>>>> unzip -p my.zip *LMTD* | tail -n +2 | gzip --fast | hdfs dfs -put - /some/hdfs/file
>>>>>
>>>>> I used the following processor flow:
>>>>>
>>>>> ExecuteProcess(unzip -p) -> ExecuteStreamCommand(tail -n +2) ->
>>>>> CompressContent(gzip) -> PutHDFS
>>>>>
>>>>> A couple of questions/observations:
>>>>>
>>>>> 1. I got hung up for a while on the ExecuteStreamCommand(tail -n +2)
>>>>> part. I need that to strip the header line off of CSV files. I did
>>>>> not see a simple way, using a specific processor, to strip off the
>>>>> first line of a flow file. Is there a better way? But I did notice
>>>>> a very odd behavior of this command. If I configured the command
>>>>> arguments as "-n +2" (without the quotes and with a space between
>>>>> the two parts), the command would result in a "tail -n2" behavior.
>>>>> So, instead of giving me everything EXCEPT the first line, I only
>>>>> got the last 2 lines. However, using "-n+2" (without the quotes and
>>>>> REMOVING the space) it worked as expected. I believe this is
>>>>> confusing to the user. Both forms work perfectly from the bash
>>>>> command line, but only one works in NiFi? Anyone care to comment on
>>>>> this? Should there be an enhancement to remove this sort of
>>>>> inconsistent behavior?
>>>>>
>>>>> 2. Regarding my need to unzip ONLY one specific file from the zip
>>>>> files (the one that matches *LMTD*), I did not see a way to do that
>>>>> using the UnpackContent processor. It seems like it will only unzip
>>>>> the whole zip file and provide me index numbers for each file
>>>>> unpacked. This would be quite inefficient in my case because there
>>>>> are a number of large files inside the zip file and I only need
>>>>> one. So, it seems like I am doing this the preferred way but, being
>>>>> new to NiFi, I just wanted to see if there are any other ideas on
>>>>> how to do this?
>>>>>
>>>>> Thanks in advance for thoughts on this
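[Editor's note] To make the delimiter discussion above concrete, here is a minimal bash sketch. It is illustrative only, not NiFi's actual implementation: it assumes only what the thread states, that ExecuteStreamCommand splits its Command Arguments property on semicolons, so a space-separated string like "-n +2" reaches the child process as a single argv token rather than two.

```shell
#!/usr/bin/env bash
# Sketch: split an argument string on semicolons (the processor's default
# delimiter per the thread) and print one resulting token per line.
split_args() {
  local IFS=';'       # split only on semicolons, not on spaces
  local -a parts
  read -ra parts <<< "$1"
  printf '%s\n' "${parts[@]}"
}

split_args '-n;+2'    # two tokens: "-n" and "+2" (what tail expects)
split_args '-n +2'    # ONE token: "-n +2" (a single argument, hence the surprise)
```

This is why "-n;+2" (or the space-free "-n+2", which tail also accepts as a single option-with-value token) works, while "-n +2" does not.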