Joe,

Ultimately, we couldn't change the behavior without breaking backward
compatibility. We do have a ticket [1] to add an "Argument Delimiter"
property that is completed and will be included in 0.4.0. It will default
to semi-colon in order to maintain backward compatibility, but it can be
changed to a space. It will at least make it more obvious that there's a
funky delimiter being used.

Thanks
-Mark

[1] https://issues.apache.org/jira/browse/NIFI-604

> On Oct 26, 2015, at 10:14 AM, Joe Witt <joe.w...@gmail.com> wrote:
>
> Mark
>
> Ok, understood. I think ultimately in the case of ZIP the IO is
> happening anyway, but if we can avoid writing these items to our
> repositories at all when they're uninteresting, then great. Do you mind
> filing a JIRA for that?
>
> And yes, you are absolutely right that you should be able to expect/get
> consistent behavior between the ExecuteCommand/script processors. We
> have discussed this before. I didn't find a JIRA. Anyone else know
> the status of this?
>
> Thanks
> Joe
>
> On Mon, Oct 26, 2015 at 1:23 AM, Mark Petronic <markpetro...@gmail.com> wrote:
>> Joe, yes, I wanted to be able to selectively unzip a specific file
>> from a zip archive. For example, I have this zip archive and want to
>> pull just the files that match *LMTD* from it to standard out as a
>> stream to feed into HDFS as a file put. Since there are a bunch of
>> big files in there, it is really wasteful of network I/O to have to
>> stream the whole file just to throw away most of the bits in a later
>> filter stage and end up with only some part of the bits. I like
>> efficiency where it makes sense, and there is already a lot of I/O
>> from Hadoop - no need to add more unnecessary stuff that could be
>> easily avoided. :)
>>
>> unzip -l /import/nms/prod/stats/Terminal/GW12/ConsolidatedTermStats_20151022021503.zip
>> Archive:  /import/nms/prod/stats/Terminal/GW12/ConsolidatedTermStats_20151022021503.zip
>>    Length       Date    Time    Name
>> ---------  ---------- -----    ----
>>  73166261  10-22-2015 02:17    Consolidated_LMTD_001_20151022021503.csv
>>  80864628  10-22-2015 02:17    Consolidated_MODC_001_20151022021503.csv
>>  14033836  10-22-2015 02:17    Consolidated_SYMC_001_20151022021503.csv
>>    120463  10-22-2015 02:17    Consolidated_XPRT_001_20151022021503.csv
>> ---------                      -------
>> 168185188                      4 files
>>
>> On Sun, Oct 25, 2015 at 11:56 AM, Joe Witt <joe.w...@gmail.com> wrote:
>>> Hello
>>>
>>> For the unpacking portion, are you saying you have a single archive
>>> (let's say in zip format) that contains multiple objects, and you'd
>>> like to be able to use UnpackContent but tell it to skip or include
>>> specific items based on a regex or something against the names?
>>>
>>> That seems reasonable to do, but I just wanted to make sure I
>>> understood. For now you can put a RouteOnAttribute processor after
>>> UnpackContent and route away the unbundled items you don't care
>>> about. You can create a property on that processor called
>>> 'stuff-i-dont-want' and the value would be something like
>>> ${filename:matches('.*stuff-i-dont-want.*')}.
>>>
>>> Thanks
>>> Joe
>>>
>>> On Sun, Oct 25, 2015 at 1:12 AM, Adam Lamar <adamond...@gmail.com> wrote:
>>>> Mark,
>>>>
>>>>> If I configured the command arguments as "-n +2" (without the
>>>>> quotes and a space between the two parts), the command would
>>>>> result in a "tail -n2" behavior.
>>>>
>>>> If you look at the tooltip for the Command Arguments property in
>>>> ExecuteStreamCommand, you'll see that the arguments need to be
>>>> delimited by a semicolon. Maybe try "-n;+2" instead? I'm not sure of
>>>> the exact rules in NiFi, but I've seen similar behavior with regard
>>>> to spaces in libraries that execute processes with command line
>>>> arguments.
>>>>
>>>> There probably is a better way to process the CSV, but I'm afraid
>>>> someone else will need to comment on that.
>>>>
>>>>> Seems like it will only unzip the whole zip file and provide me
>>>>> index numbers for each file unpacked.
>>>>
>>>> A quick look at the UnpackContent source [1] suggests that there is
>>>> no way to filter the filenames inside the zipfile prior to
>>>> extraction. I agree that would be a useful feature. Maybe one of the
>>>> NiFi devs will comment on the possibility of including it as a
>>>> feature in the future.
>>>>
>>>> Cheers,
>>>> Adam
>>>>
>>>> [1] https://github.com/apache/nifi/blob/master/nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/UnpackContent.java#L304
>>>>
>>>> On 10/24/15 9:08 PM, Mark Petronic wrote:
>>>>>
>>>>> Just starting to use NiFi and built a flow that implements the following:
>>>>>
>>>>> unzip -p my.zip *LMTD* | tail -n +2 | gzip --fast | hdfs dfs -put - /some/hdfs/file
>>>>>
>>>>> I used the following processor flow:
>>>>>
>>>>> ExecuteProcess(unzip -p) -> ExecuteStreamCommand(tail -n +2) ->
>>>>> CompressContent(gzip) -> PutHDFS
>>>>>
>>>>> A couple of questions/observations:
>>>>>
>>>>> 1. I got hung up for a while on the ExecuteStreamCommand(tail -n +2)
>>>>> part. I need that to strip the header line off of CSV files. I did
>>>>> not see a simple way, using a specific processor, to strip off the
>>>>> first line of a flow file. Is there a better way? But I did notice
>>>>> a very odd behavior of this command. If I configured the command
>>>>> arguments as "-n +2" (without the quotes and with a space between
>>>>> the two parts), the command would result in a "tail -n2" behavior.
>>>>> So, instead of giving me everything EXCEPT the first line, I only
>>>>> got the last 2 lines. However, using "-n+2" (without the quotes and
>>>>> REMOVING the space) it worked as expected. I believe this is
>>>>> confusing to the user. Both forms work perfectly from the bash
>>>>> command line, but only one works in NiFi? Anyone care to comment on
>>>>> this? Should there be an enhancement to remove this sort of
>>>>> inconsistent behavior?
>>>>>
>>>>> 2. Regarding my need to unzip ONLY one specific file from the zip
>>>>> files (the one that matches *LMTD*), I did not see a way to do that
>>>>> using the UnpackContent processor. It seems like it will only unzip
>>>>> the whole zip file and provide me index numbers for each file
>>>>> unpacked. This would be quite inefficient in my case because there
>>>>> are a number of large files inside the zip file and I only need
>>>>> one. So, it seems like I am doing this the preferred way but, being
>>>>> new to NiFi, I just wanted to see if there are any other ideas on
>>>>> how to do this?
>>>>>
>>>>> Thanks in advance for thoughts on this
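[Editor's note] To make the delimiter discussion above concrete, here is a minimal bash sketch. It is illustrative only, not NiFi's actual implementation: it assumes only what the thread states, that ExecuteStreamCommand splits its Command Arguments property on semicolons, so a space-separated string like "-n +2" reaches the child process as a single argv token rather than two.

```shell
#!/usr/bin/env bash
# Sketch: split an argument string on semicolons (the processor's default
# delimiter per the thread) and print one resulting token per line.
split_args() {
  local IFS=';'       # split only on semicolons, not on spaces
  local -a parts
  read -ra parts <<< "$1"
  printf '%s\n' "${parts[@]}"
}

split_args '-n;+2'    # two tokens: "-n" and "+2" (what tail expects)
split_args '-n +2'    # ONE token: "-n +2" (a single argument, hence the surprise)
```

This is why "-n;+2" (or the space-free "-n+2", which tail also accepts as a single option-with-value token) works, while "-n +2" does not.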