Thanks for jumping in, Lee!

Mark,

This is a great writeup. We should turn this into a blog with a full
explanation and template. Great use case, and you just gave us a perfect user
perspective/explanation of how you're thinking of it. We will make that
happen quickly.

https://issues.apache.org/jira/browse/NIFI-1064

Thanks
Joe

On Sun, Oct 25, 2015 at 9:45 AM, Mark Payne <[email protected]> wrote:

> Hey Mark,
>
> Thanks for sharing your use case with us in pretty good detail so that we
> can understand what you're trying to do here.
>
> There are actually a few processors coming in the next release that I think
> should help here. First, there's the FetchFile processor that you noticed
> in NIFI-631. Hopefully the ListFile will make its way in there as well,
> because it's much easier that way :) In either case, you can right-click on
> the Processor and click Configure. If you go to the Scheduling tab, you can
> change the Scheduling Strategy to CRON-Driven and set the schedule to run
> whenever you'd like.
>
> As-is, the GetFile is expected to remove the file from its current
> location, as the idea was that NiFi would sort of assume ownership of the
> file. It turns out that in the Open Source world, that's often not
> desirable, so we are moving more toward the List/Fetch pattern described in
> that ticket.
>
> Once you pull the files into NiFi, though, UnpackContent should unzip the
> files, each into its own FlowFile. You could then use RouteOnAttribute to
> pull out just the file that you care about, based on its filename. You can
> then allow the others to be routed to Unmatched and auto-terminate them
> from the flow.
>
> Stripping off the first line could probably be done using ReplaceText, but
> in the next version of NiFi we will have a RouteText processor that should
> make working with CSVs far easier. You could, for instance, route any line
> that begins with # to one relationship and the rest to a second
> relationship.
> This effectively allows you to filter out the header line.
>
> Finally, you can use PutHDFS and set the Compression Codec to whatever you
> prefer: GZIP, Snappy, etc. Prior to that, if you need to, you could also
> add a MergeContent processor to concatenate these CSV files together and
> make them larger.
>
> Thanks
> -Mark
>
>
>> On Oct 25, 2015, at 12:25 AM, Mark Petronic <[email protected]> wrote:
>>
>> Reading some other posts, I stumbled on this JIRA [1], which seems to
>> directly relate to my question in this post.
>>
>> [1] https://issues.apache.org/jira/browse/NIFI-631
>>
>> On Sat, Oct 24, 2015 at 11:44 PM, Mark Petronic <[email protected]>
>> wrote:
>>> So, I stumbled onto NiFi at a Laurel, MD Spark meetup and was pretty
>>> excited about using it. I'm running HDP and need to construct an
>>> ETL-like flow, and I would like to start, as a new NiFi user, with a
>>> "best practice" approach. Wondering if some of you more seasoned users
>>> might provide some thoughts on my problem?
>>>
>>> 1. 160 zip files/day show up on an NFS share in various subdirectories,
>>> and their filenames contain the yyyymmddHHMMSS of when the stats were
>>> generated.
>>> 2. Each zip file contains 4 or more large CSV files.
>>> 3. I need just one of those CSVs from each zip file each day, and they
>>> all add up to about 10 GB uncompressed.
>>> 4. I need to extract that one file from each zip, strip off the first
>>> line (the headers), and store it in HDFS, compressed again using gzip
>>> or snappy.
>>> 5. I cannot delete the NFS file after the copy to HDFS because others
>>> need access to it for some time.
>>>
>>> So, where I am having a hard time visualizing doing this in NiFi is
>>> with the first step.
>>> I need to scan the NFS files after 8 AM every
>>> day (when I know all files for the previous 24 hours will be present),
>>> find that set of files for that day using the yyyymmdd part of the file
>>> names, then extract the one file I need and process it into HDFS.
>>>
>>> I could imagine a processor that runs once every 24 hours on a cron
>>> schedule. I could imagine running an ExecuteProcess processor against a
>>> bash script to get the list of all the files that match the yyyymmdd.
>>> Then I get stuck. How do I take this list of 160 file paths and start
>>> the job of processing each one of them in parallel to run the ETL flow?
>>>
>>> Thanks in advance for any ideas
>
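For anyone following along, the pipeline discussed in this thread can be
sketched outside NiFi as a short stand-alone script. This is a minimal
illustration only; the directory layout, the date value, and the name of the
wanted CSV inside the zip are all made-up assumptions, not details from the
original posts:

```python
# Hypothetical sketch of the flow described above, outside NiFi.
# All paths and filenames here are invented for illustration.
import gzip
import os
import zipfile

base = "/tmp/nfs_demo"          # stand-in for the NFS share
staging = "/tmp/hdfs_staging"   # stand-in for the HDFS target
os.makedirs(os.path.join(base, "sub1"), exist_ok=True)
os.makedirs(staging, exist_ok=True)

# Fabricate one sample zip so the sketch runs end-to-end.
zip_path = os.path.join(base, "sub1", "stats_20151024120000.zip")
with zipfile.ZipFile(zip_path, "w") as zf:
    zf.writestr("wanted.csv", "#colA,colB\n1,2\n3,4\n")
    zf.writestr("other.csv", "x,y\n")

day = "20151024"  # in a real run, yesterday's date

# 1. List step: find zips whose names carry the target yyyymmdd.
matches = []
for root, _dirs, files in os.walk(base):
    for name in files:
        if day in name and name.endswith(".zip"):
            matches.append(os.path.join(root, name))

for path in matches:
    # 2. Unpack + route step: pull out just the one CSV we care about.
    with zipfile.ZipFile(path) as zf:
        text = zf.read("wanted.csv").decode()
    # 3. Header-strip step: drop lines that begin with '#'.
    body = "".join(line + "\n" for line in text.splitlines()
                   if not line.startswith("#"))
    # 4. Compress-and-land step: gzip before writing to the target.
    out_name = os.path.basename(path).replace(".zip", ".csv.gz")
    with gzip.open(os.path.join(staging, out_name), "wt") as out:
        out.write(body)
```

In NiFi itself, per Mark's reply, these steps would map roughly onto
ListFile/FetchFile, UnpackContent plus RouteOnAttribute, RouteText or
ReplaceText, and PutHDFS with a Compression Codec set (optionally with
MergeContent before it).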
