Hi all,

I'm using Flume to collect tweets, and I want to process the files it
generates as soon as possible after they arrive.

What is the best way to achieve this?

The best explanation of the different approaches that I have seen so far is:
https://github.com/yahoo/oozie/wiki/Oozie-Coord-Use-Cases

Flume can write data into time-based directories (e.g. per hour or per
minute), but my reading is that Oozie will try to process a directory the
moment it appears, before Flume has finished writing to it. I'm not sure
triggering on the files themselves would work any better (unless it's
possible to use wildcards in the file name?).

It's also quite possible that more data will arrive while the workflow is
executing, so that needs to be handled without skipping or re-processing
any data.
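
To make that requirement concrete, here is a rough Python sketch of the
bookkeeping I have in mind. It is not Oozie or Flume code: the function
names and the in-memory `processed` set are made up for illustration, and
it assumes files still being written carry a `.tmp` suffix, which may not
match every Flume setup. A real job would list the directory on HDFS and
persist the set somewhere durable.

```python
from typing import Iterable, List, Set


def select_unprocessed(
    listing: Iterable[str],
    processed: Set[str],
    in_use_suffix: str = ".tmp",  # assumption: marker for files still being written
) -> List[str]:
    """Return completed files that no earlier run has handled."""
    batch = []
    for name in sorted(listing):
        if name.endswith(in_use_suffix):
            continue  # still being written; leave it for a later run
        if name in processed:
            continue  # already handled; do not re-process
        batch.append(name)
    return batch


def run_once(listing: Iterable[str], processed: Set[str]) -> List[str]:
    """Process one snapshot of the directory listing (hypothetical driver)."""
    batch = select_unprocessed(listing, processed)
    # ... the real workflow would process `batch` here ...
    processed.update(batch)  # record files only after they were processed
    return batch
```

Because each run works from a snapshot of the listing, anything that
arrives or is completed while the run is executing is simply picked up by
the next run.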

Any advice or links to tutorials/blogs/guides would be appreciated!

Thanks,
Charles
