Reading some other posts, I stumbled on this JIRA [1], which seems to relate directly to my question in this post.
[1] https://issues.apache.org/jira/browse/NIFI-631

On Sat, Oct 24, 2015 at 11:44 PM, Mark Petronic <[email protected]> wrote:
> So, I stumbled onto NiFi at a Laurel, MD Spark meetup and was pretty
> excited about using it. I'm running HDP and need to construct an
> ETL-like flow, and as a new user to NiFi I would like to start with a
> "best practice" approach. I'm wondering if some of you more seasoned
> users might share your thoughts on my problem?
>
> 1. 160 zip files/day show up on an NFS share in various
> subdirectories, and their filenames contain the yyyymmddHHMMSS of when
> the stats were generated.
> 2. Each zip file contains 4 or more large CSV files.
> 3. I need just one of those CSVs from each zip file each day, and they
> all add up to about 10 GB uncompressed.
> 4. I need to extract that one file from each zip, strip off the first
> line (the headers), and store it in HDFS, compressed again using gzip
> or snappy.
> 5. I cannot delete the NFS file after the copy to HDFS because others
> need access to it for some time.
>
> Where I am having a hard time visualizing doing this in NiFi is the
> first step. I need to scan the NFS files after 8 AM every day (when I
> know all files for the previous 24 hours will be present), find that
> day's set of files using the yyyymmdd part of the file names, then
> extract the one file I need from each and process it into HDFS.
>
> I can imagine a processor that runs once every 24 hours on a cron
> schedule, perhaps an ExecuteProcess processor running a bash script to
> get the list of all the files that match the yyyymmdd. Then I get
> stuck: how do I take this list of 160 file paths and kick off
> processing each of them in parallel through the ETL flow?
>
> Thanks in advance for any ideas
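For what it's worth, the extract-and-recompress step (item 4 in the quoted mail) is easy to sketch outside NiFi, e.g. as a script ExecuteProcess could run until the ListFile/FetchFile work in NIFI-631 lands. A minimal Python sketch, assuming the target CSV's name inside each zip is known; the `/nfs/stats` path, `stats.csv` member name, and output naming are placeholders for illustration only:

```python
import glob
import gzip
import shutil
import zipfile

def extract_csv_strip_header(zip_path, member_name, out_path):
    """Extract one CSV member from a zip, drop its first (header) line,
    and rewrite it gzip-compressed. Streams the data so a multi-GB CSV
    never has to fit in memory."""
    with zipfile.ZipFile(zip_path) as zf:
        with zf.open(member_name) as src, gzip.open(out_path, "wb") as dst:
            src.readline()                 # discard the header row
            shutil.copyfileobj(src, dst)   # stream the rest, compressed

# Find yesterday's files by the yyyymmdd portion of their names and
# process each one (hypothetical share layout and member name):
day = "20151024"
for zp in glob.glob("/nfs/stats/**/*" + day + "*.zip", recursive=True):
    extract_csv_strip_header(zp, "stats.csv", zp + ".csv.gz")
```

Landing the result in HDFS would then just be an `hdfs dfs -put` of the `.csv.gz` (or a PutHDFS processor downstream), and since nothing here deletes the source zip, the NFS copy stays available for other consumers.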
