One of the really useful things about the Hadoop Summit and HBase meetup was hearing about what people are doing to manage, monitor and maintain the health of their systems.
So, I was just reading the Facebook SIGMOD 2011 paper <http://borthakur.com/ftp/RealtimeHadoopSigmod2011.pdf>, and I came across this line: “Nowadays we run HBCK almost continuously against our production clusters to catch problems as early as possible.”

Does anyone else do this? Can anyone from Facebook comment on what sort of problems you've caught this way?

My naïve thought is that hbck would be recommended after some kind of notable failure, but I wouldn't have expected it to turn up problems when run routinely. Maybe my perspective will change when I get a real cluster going instead of a virtual one. It would be nice to know what to watch out for.

Thanks.
joe
