So you have 20 nodes for the StumbleUpon link redirection service? Are there any blog posts that go over the setup and what sort of read/write traffic it gets? Is there a memcached layer that sits in front?

On Tue, Nov 23, 2010 at 4:44 PM, Jean-Daniel Cryans <[email protected]> wrote:

> I wish I could do a dump of my memory into an ops guide to HBase, but
> currently I don't think there's such a writeup.
>
> What can go wrong... again it depends on your type of usage. With a
> MR-heavy cluster, it's usually very easy to drive the IO wait through
> the roof and then you'll end up with GC pauses >60 secs caused by CPU
> starvation. Here's a recent example we got when a big Mahout job was
> running:
>
> 2010-11-19T18:25:31.173-0800: [GC [ParNew: 114456K->13056K(118016K),
> 103.8190010 secs] 4624541K->4535473K(7154944K), 104.7165690 secs]
> [Times: user=4.45 sys=2.02, real=104.72 secs]
>
> The trained eye will quickly see that something very bad happened on
> that cluster. Indeed, during post-mortem we saw that somehow that
> machine started swapping, which is the Worst Thing Ever (tm) that can
> happen to a machine that runs Java processes. Make sure that your
> memory usage always stays under your total memory, even when all the
> mappers and reducers are using their heap at the fullest. And then
> double check that (which it seems we didn't do).
>
> On a cluster that serves web traffic, and thus must not be MRed
> against, you get the "usual" stuff like bad disks and operator errors.
>
> J-D
>
> On Tue, Nov 23, 2010 at 1:31 PM, S Ahmed <[email protected]> wrote:
> > Are there any writeups on what things to look for?
> >
> > What are some of the things that usually go wrong? Or is that an unfair
> > question :)
> >
> > On Tue, Nov 23, 2010 at 4:22 PM, Jean-Daniel Cryans <[email protected]> wrote:
> >
> >> Constant hand holding no, constant monitoring yes. Do set up Ganglia
> >> and preferably Nagios. Then it depends what you're planning to do with
> >> your cluster... here we have 2x 20 machines in production, the one
> >> that serves live traffic is pretty much doing its own thing by itself
> >> (although I keep a Ganglia tab open on a second monitor) and the
> >> other one is used strictly for MapReduce, for which our internal users
> >> have developed a habit of running very destructive jobs on. But to be
> >> fair, it's probably the users that need support the most ;)
> >>
> >> J-D
> >>
> >> On Tue, Nov 23, 2010 at 1:14 PM, S Ahmed <[email protected]> wrote:
> >> > Hi,
> >> >
> >> > How much of a guru do you have to be to keep say 5-10 servers humming?
> >> >
> >> > I'm a 1-man shop, and I dream of developing a web application, and
> >> > scaling will be a core part of the application.
> >> >
> >> > Is it feasible for a 1-man operation to manage a 5-10 server HBase
> >> > cluster? Is it something that requires hand holding and constant
> >> > monitoring, or does it tend to be hands off?
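
For anyone who wants to spot pauses like the one in the quoted GC log without eyeballing it, here is a minimal sketch in Python. It assumes the log records end with the `[Times: user=... sys=... real=... secs]` suffix shown in the thread. The function name `long_pauses` and the 60-second default threshold are made up for illustration and are not part of HBase or the JVM.

```python
import re

# Matches the timing suffix of a GC log record such as the one quoted in the thread.
TIMES = re.compile(
    r"user=(\d+(?:\.\d+)?) sys=(\d+(?:\.\d+)?), real=(\d+(?:\.\d+)?) secs"
)

def long_pauses(lines, threshold_secs=60.0):
    """Yield (real_secs, cpu_secs, line) for records whose wall-clock time exceeds the threshold.

    cpu_secs is user + sys, i.e. the CPU time the collector actually used.
    """
    for line in lines:
        match = TIMES.search(line)
        if not match:
            continue  # not a GC record with a [Times: ...] suffix
        user, sys_time, real = (float(g) for g in match.groups())
        if real > threshold_secs:
            yield real, user + sys_time, line

# The record from the thread, joined back onto one line.
sample = (
    "2010-11-19T18:25:31.173-0800: [GC [ParNew: 114456K->13056K(118016K), "
    "103.8190010 secs] 4624541K->4535473K(7154944K), 104.7165690 secs] "
    "[Times: user=4.45 sys=2.02, real=104.72 secs]"
)

for real, cpu, line in long_pauses([sample]):
    # line[:28] is the timestamp at the start of the record.
    print(f"{line[:28]}: real={real:.2f}s but only {cpu:.2f}s of CPU time")
```

When `real` is far larger than `user + sys`, as it is here, the collector spent most of the pause waiting rather than working, which fits the swapping and CPU starvation described in the thread.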
