I'm trying to understand the best way to set up repeated processing of continuously generated data, like logs.
I can manually copy files from the local filesystem to HDFS and kick off Pig scripts, but ideally I want something automatic, preferably every hour, possibly more often. I also want to be able to process a day's or a month's worth of data rather than just the most recent file.

Is there a best-practice way of doing this documented anywhere? I believe I should be looking at Flume for transferring files into HDFS and Oozie for some kind of workflow of Pig jobs. Is that right? Any example setups? To make it concrete, I've sketched below what I do by hand today and what I imagine the Flume/Oozie versions would look like.
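What I do now is roughly this (the paths and script name are just illustrative):

    # copy the latest batch of logs into a dated HDFS directory
    hadoop fs -mkdir /data/logs/2012/05/14
    hadoop fs -put /var/log/myapp/*.log /data/logs/2012/05/14/

    # run the processing script over that directory
    pig -param INPUT=/data/logs/2012/05/14 process_logs.pig

where process_logs.pig starts with a LOAD '$INPUT' so the same script can be pointed at an hour, a day, or a month of data.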
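For the ingestion side, this is the sort of Flume (NG) agent config I think would replace the manual copy. The agent name, file paths, and tail command are all placeholders, and I haven't verified this end to end:

    # one agent: tail a local log, buffer through a file channel,
    # write to hourly-bucketed HDFS directories
    agent.sources = tail1
    agent.channels = ch1
    agent.sinks = hdfs1

    agent.sources.tail1.type = exec
    agent.sources.tail1.command = tail -F /var/log/myapp/app.log
    agent.sources.tail1.channels = ch1
    # stamp events so the %Y/%m/%d/%H escapes below resolve
    agent.sources.tail1.interceptors = ts
    agent.sources.tail1.interceptors.ts.type = timestamp

    agent.channels.ch1.type = file

    agent.sinks.hdfs1.type = hdfs
    agent.sinks.hdfs1.channel = ch1
    agent.sinks.hdfs1.hdfs.path = hdfs://namenode/data/logs/%Y/%m/%d/%H
    agent.sinks.hdfs1.hdfs.fileType = DataStream
    agent.sinks.hdfs1.hdfs.rollInterval = 3600

started with flume-ng agent -n agent -c conf -f flume.conf, so the hourly directories would appear in HDFS on their own.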
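And for the scheduling side, my guess is an Oozie coordinator that triggers a Pig workflow once an hour over the directory that has just landed, something like this (app path, dates, and the inputDir property name invented for the example):

    <coordinator-app name="hourly-log-processing"
                     frequency="${coord:hours(1)}"
                     start="2012-05-14T00:00Z" end="2013-05-14T00:00Z"
                     timezone="UTC" xmlns="uri:oozie:coordinator:0.2">
      <action>
        <workflow>
          <app-path>hdfs://namenode/apps/log-workflow</app-path>
          <configuration>
            <property>
              <name>inputDir</name>
              <!-- hand the workflow the hour that just landed -->
              <value>/data/logs/${coord:formatTime(coord:nominalTime(), 'yyyy/MM/dd/HH')}</value>
            </property>
          </configuration>
        </workflow>
      </action>
    </coordinator-app>

with the referenced workflow.xml containing a pig action that runs the script against ${inputDir}. Presumably a daily or monthly rollup would just be a second coordinator with a different frequency and a wider input path, but that's exactly the kind of thing I'm hoping someone has a worked example of.

Cheers,
Alex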