Joe, yes, I want to be able to selectively extract specific files from a zip archive. For example, from the archive listed below I want to pull only the files matching *LMTD* and stream them to standard out, to feed an HDFS put. Since the archive contains several big files, it wastes network I/O to stream the whole archive only to throw most of the bits away in a later filter stage. I like efficiency where it makes sense, and there is already a lot of I/O from Hadoop, so there is no need to add more that could easily be avoided. :)
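To make the idea concrete, here is a rough Python sketch of the kind of selective extraction I have in mind. It is not NiFi code and says nothing about how UnpackContent is implemented. The function names (stream_matching_members, build_demo_archive) and the shortened demo file names are made up for illustration. It also assumes the archive is a seekable file, so that Python's zipfile module can read the central directory and open only the members whose names match.

```python
import fnmatch
import io
import zipfile


def stream_matching_members(archive, pattern, out, chunk_size=64 * 1024):
    """Copy only the members whose names match `pattern` to `out`.

    `archive` is a path or a seekable binary file object (assumption).
    Returns the list of member names that were written.
    """
    written = []
    with zipfile.ZipFile(archive) as zf:
        for info in zf.infolist():
            # Case-sensitive glob match against the member name.
            if not fnmatch.fnmatchcase(info.filename, pattern):
                continue  # non-matching members are never opened
            with zf.open(info) as member:
                # Stream in chunks so a large CSV is not held in memory.
                while True:
                    chunk = member.read(chunk_size)
                    if not chunk:
                        break
                    out.write(chunk)
            written.append(info.filename)
    return written


def build_demo_archive():
    """Build a tiny in-memory archive with hypothetical member names."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("Consolidated_LMTD_001.csv", "header\nlmtd-row\n")
        zf.writestr("Consolidated_MODC_001.csv", "header\nmodc-row\n")
    buf.seek(0)
    return buf
```

Calling stream_matching_members(build_demo_archive(), "*LMTD*", sys.stdout.buffer) (after importing sys) would write only the LMTD member to standard out, which is roughly what "unzip -p my.zip *LMTD*" does for me today.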

unzip -l /import/nms/prod/stats/Terminal/GW12/ConsolidatedTermStats_20151022021503.zip
Archive:  /import/nms/prod/stats/Terminal/GW12/ConsolidatedTermStats_20151022021503.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
 73166261  10-22-2015 02:17   Consolidated_LMTD_001_20151022021503.csv
 80864628  10-22-2015 02:17   Consolidated_MODC_001_20151022021503.csv
 14033836  10-22-2015 02:17   Consolidated_SYMC_001_20151022021503.csv
   120463  10-22-2015 02:17   Consolidated_XPRT_001_20151022021503.csv
---------                     -------
168185188                     4 files

On Sun, Oct 25, 2015 at 11:56 AM, Joe Witt <[email protected]> wrote:
> Hello
>
> For the unpacking portion are you saying you have a single archive
> (let's say in zip format) and it contains multiple objects within.
> You'd like to be able to use UnpackContent but tell it you'd like to
> skip or include specific items based on a regex or something against
> the names?
>
> That seems reasonable to do but just wanted to make sure I understood.
> For now you can put a RouteOnAttribute processor after Unpack and just
> route to throw away unbundled items you don't care about. You can
> create a property on that processor called 'stuff-i-dont-want' and the
> value would be something like
> ${filename:matches('*stuff-i-dont-want*')}.
>
> Thanks
> Joe
>
> On Sun, Oct 25, 2015 at 1:12 AM, Adam Lamar <[email protected]> wrote:
>> Mark,
>>
>>> If I configured the command arguments as
>>> "-n +2" (without the quotes and space between the two parts), the
>>> command would result in a "tail -n2" behavior.
>>
>> If you look at the tooltip for the Command Arguments property in
>> ExecuteStreamCommand, you'll see that the arguments need to be delimited by
>> a semicolon. Maybe try "-n;+2" instead? I'm not sure the exact rules in
>> NiFi, but I've seen similar behavior with regard to spaces in libraries that
>> execute processes with command line arguments.
>>
>> There probably is a better way to process the CSV, but I'm afraid someone
>> else will need to comment on that.
>>
>>> Seems like it will only unzip the
>>> whole zip file and provide me index numbers for each file unpacked.
>>
>> A quick look at the UnpackContent source [1] suggests that there is no way
>> to filter the filenames inside the zipfile prior to extraction. I agree that
>> would be a useful feature. Maybe one of the NiFi devs will comment on the
>> possibility of including it as a feature in the future.
>>
>> Cheers,
>> Adam
>>
>>
>> [1]
>> https://github.com/apache/nifi/blob/master/nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/UnpackContent.java#L304
>>
>>
>>
>> On 10/24/15 9:08 PM, Mark Petronic wrote:
>>>
>>> Just starting to use Nifi and built a flow that implements the following:
>>>
>>> unzip -p my.zip *LMTD* | tail -n +2 | gzip --fast | hdfs dfs -put -
>>> /some/hdfs/file
>>>
>>> I used the following processor flow:
>>>
>>> ExecuteProcess(unzip -p) -> ExecuteStreamCommand(tail -n +2) ->
>>> CompressContent(gzip) -> PutHDFS
>>>
>>> Couple questions/observations:
>>>
>>> 1. I got hung up for awhile on the ExecuteStreamCommand(tail -n +2)
>>> part. I need that to strip the header line off of CSV files. I did not
>>> see a simple way using a specific processor to strip off the first
>>> line of a flow file. Is there a better way? But, I did notice a very
>>> odd behavior of this command. If I configured the command arguments as
>>> "-n +2" (without the quotes and space between the two parts), the
>>> command would result in a "tail -n2" behavior. So, instead of giving
>>> me all EXCEPT the first line, I only got the last 2 lines. However,
>>> using "-n+2" (without the quotes and REMOVING the space) it worked as
>>> expected. I believe with is confusing to the user. Both forms work
>>> perfectly from the bash command line but only one works in Nifi?
>>> Anyone care to comment on this? Should there be an enhancement to
>>> remove this sort of inconsistent behavior?
>>>
>>> 2. Regarding my need to unzip ONLY one specific file from the zip
>>> files (the one that matches *LMTD*), I did not see a way to do that
>>> using the UnpackContent processor. Seems like it will only unzip the
>>> whole zip file and provide me index numbers for each file unpacked.
>>> This would be quite inefficient in my case because there are a number
>>> of large files inside the zip file and I only need one. So, seems like
>>> I am doing this the preferred way but, being new to Nifi, just wanted
>>> to see if there are any other ideas on how to do this?
>>>
>>> Thanks in advance for thoughts on this
>>
>>
