Reading some other posts, I stumbled on this JIRA [1], which seems to
directly relate to my question in this post.

[1] https://issues.apache.org/jira/browse/NIFI-631

On Sat, Oct 24, 2015 at 11:44 PM, Mark Petronic <[email protected]> wrote:
> So, I stumbled onto NiFi at a Laurel, MD Spark meetup and was pretty
> excited about using it. I'm running HDP and need to construct an
> ETL-like flow, and I would like to start, as a new NiFi user, with a
> "best practice" approach. I'm wondering if some of you more seasoned
> users might offer some thoughts on my problem?
>
> 1. 160 zip files/day show up on an NFS share in various
> subdirectories, and their filenames contain the yyyymmddHHMMSS
> timestamp of when the stats were generated.
> 2. Each zip file contains 4 or more large CSV files
> 3. I need just one of those CSVs from each zip file each day, and
> together they add up to about 10 GB uncompressed
> 4. I need to extract that one file from each zip, strip off the first
> line (the header row), and store it in HDFS, compressed again using
> gzip or snappy
> 5. I cannot delete the NFS file after the copy to HDFS because others
> need access to it for some time
>
> So, where I am having a hard time visualizing doing this in NiFi is
> with the first step. I need to scan the NFS files after 8 AM every day
> (when I know all files for the previous 24 hours will be present),
> find that day's set of files using the yyyymmdd part of the
> filenames, then extract the one file I need from each zip and process
> it into HDFS.
>
> I could imagine a processor that runs once every 24 hours on a cron
> schedule. I could imagine running an ExecuteProcess processor against
> a bash script to get the list of all the files that match the
> yyyymmdd. Then I get stuck: how do I take this list of 160 file paths
> and start processing each one of them in parallel through the ETL
> flow?
>
> Thanks in advance for any ideas
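For what it's worth, the per-file work in steps 2-4 above (open a day's zip, pull out the one CSV, drop the header line, re-compress with gzip) can be sketched outside NiFi as a small script, e.g. something an ExecuteProcess or ExecuteStreamCommand step could invoke per file. This is only a Python sketch under assumptions: the NFS mount point, staging directory, and the target CSV name (`stats.csv`) are all placeholders, not details from the post, and the final HDFS put is left as a comment.

```python
#!/usr/bin/env python3
"""Sketch of the per-day extraction step described in the post: find the
zips carrying yesterday's yyyymmdd stamp, extract the one target CSV from
each, strip the header row, and re-compress with gzip. All paths and the
TARGET_CSV name are hypothetical placeholders."""
import glob
import gzip
import os
import zipfile
from datetime import date, timedelta

NFS_ROOT = "/mnt/nfs/stats"   # assumed NFS mount point
TARGET_CSV = "stats.csv"      # assumed name of the one CSV needed per zip
OUT_DIR = "/tmp/staging"      # local staging dir before the HDFS put

def zips_for_day(day_stamp: str):
    """All zips in any subdirectory whose filename contains yyyymmdd."""
    pattern = os.path.join(NFS_ROOT, "**", f"*{day_stamp}*.zip")
    return glob.glob(pattern, recursive=True)

def extract_one(zip_path: str, out_dir: str) -> str:
    """Extract TARGET_CSV from the zip, skip its header line, and write
    the rest gzip-compressed; returns the output path."""
    base = os.path.splitext(os.path.basename(zip_path))[0]
    out_path = os.path.join(out_dir, base + ".csv.gz")
    with zipfile.ZipFile(zip_path) as zf:
        # pick the archive member whose name ends with the CSV we want
        member = next(n for n in zf.namelist() if n.endswith(TARGET_CSV))
        with zf.open(member) as src, gzip.open(out_path, "wb") as dst:
            src.readline()            # drop the header row
            for line in src:
                dst.write(line)
    return out_path

if __name__ == "__main__":
    # "after 8 AM" run picks up the previous day's files
    stamp = (date.today() - timedelta(days=1)).strftime("%Y%m%d")
    os.makedirs(OUT_DIR, exist_ok=True)
    for zp in zips_for_day(stamp):
        out = extract_one(zp, OUT_DIR)
        print(out)
        # a real flow would then do e.g.: hdfs dfs -put <out> <hdfs dir>
        # and leave the NFS original in place (requirement 5)
```

Note this deliberately does not delete anything on the NFS share (requirement 5); in a NiFi-native flow the same steps map roughly to a list/fetch front end, UnpackContent, a header-stripping step, CompressContent, and PutHDFS.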
