Hi all, I'm using Flume to collect tweets, and I want to process the files it generates as soon as possible after they arrive.
What is the best way to achieve this? The best explanation of the different approaches that I have seen so far is: https://github.com/yahoo/oozie/wiki/Oozie-Coord-Use-Cases

Flume can generate data directories (based on e.g. hour, minute, etc.), but my reading is that Oozie will try to process a directory the moment it appears, which could be while Flume is still writing to it. I'm not sure that basing it on the files appearing would work any better, either (unless it's possible to use wildcards in the file name?).

It's also quite possible that more data will arrive while the workflow is executing, so that needs to be handled appropriately, without skipping or re-processing data.

Any advice or links to tutorials/blogs/guides would be appreciated!

Thanks,
Charles
