Is there a "best practice" way of monitoring the tasks started by Oozie, but in a data-centric way?
For example, if I am Sqooping data every day, using Oozie coordinators to specify the day, then I might want to check that I fetched a similar number of records as yesterday. If I am populating a Hive partition every day using Oozie, then I might want to check that the new partition exists and has a sensible-looking number of records.

The best I can come up with so far is:

a) Shell scripts which are kicked off by Oozie at the end of my current workflow.
b) Possibly a conditional which emails me if there was an error condition.

Other ideas include doing my data monitoring in a Pig script and writing its results either to a file somewhere or directly to some other monitoring tool using a custom SerDe.

But then I wonder what that monitoring tool should be. Typically we have Ganglia and Nagios, so those are good starts, but they seem very much geared towards hardware and network monitoring rather than application monitoring.

Is there a really obvious tool or setup I should be following to make it clear that my batch Oozie bundles really have worked?

Alex
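
P.S. To make the "similar number of records as yesterday" check concrete, here is a rough sketch of the kind of script I have in mind for option a). It is only an illustration: the function names, the 20% tolerance, and the idea of passing both counts in as arguments are my own assumptions, and the counts themselves would have to come from somewhere else (a Hive query, the Sqoop job output, etc.).

```python
import sys


def count_looks_sane(today, yesterday, tolerance=0.2):
    """Return True if today's count is within `tolerance` (a fraction) of yesterday's.

    The 20% default is an arbitrary placeholder, not a recommendation.
    """
    if yesterday <= 0:
        # No usable baseline: accept any non-empty load, reject an empty one.
        return today > 0
    return abs(today - yesterday) <= tolerance * yesterday


def main(args):
    """Hypothetical entry point: args = [today_count, yesterday_count].

    Returns 0 when the counts look similar and 1 otherwise, so the result
    can be used directly as a process exit code.
    """
    today, yesterday = int(args[0]), int(args[1])
    if count_looks_sane(today, yesterday):
        print(f"OK: {today} records today, {yesterday} yesterday")
        return 0
    print(f"SUSPICIOUS: {today} records today, {yesterday} yesterday",
          file=sys.stderr)
    return 1
```

A wrapper would call `sys.exit(main(sys.argv[1:]))`. As far as I understand, an Oozie shell action that exits non-zero is treated as failed, so the workflow could route that to an error path that sends the email from option b).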
