Hi Placido,

Sounds like it might be related to HDFS-2379. Try upgrading to Hadoop 1.0.1 or CDH3u3 and you'll get a fix for that.

You can verify by grepping for "BlockReport" in your DN logs: if the pauses on the HBase side correlate with long block reports on the DNs, the upgrade should fix it.
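In case it's useful, here is a rough sketch of that check in Python. The 5-second threshold and the assumption that the DataNode's BlockReport line carries one or more "<N> msec(s)" figures are mine, not from this thread; the exact message format varies between Hadoop versions, so adjust the regex (and the log paths you pass in) to match your setup.

#!/usr/bin/env python
# Rough helper for the check above: scan DataNode logs for "BlockReport"
# lines and print the slow ones, so their timestamps can be lined up
# against the "took ... ms appending an edit to hlog" warnings on the
# regionserver side.
#
# Assumptions (mine, not from this thread): the threshold, and that the
# BlockReport line carries one or more "<N> msec"/"msecs" figures. The
# exact message format differs between Hadoop versions, so adjust the
# regex to match your logs.
import re
import sys

THRESHOLD_MS = 5000  # flag block reports slower than this (arbitrary)
MSEC_RE = re.compile(r'(\d+)\s*msecs?')

def long_block_reports(path, threshold_ms=THRESHOLD_MS):
    with open(path) as f:
        for line in f:
            if 'BlockReport' not in line:
                continue
            durations = [int(m) for m in MSEC_RE.findall(line)]
            if durations and max(durations) >= threshold_ms:
                yield max(durations), line.rstrip()

if __name__ == '__main__':
    # e.g. python check_block_reports.py /var/log/hadoop/*datanode*.log
    for path in sys.argv[1:]:
        for ms, line in long_block_reports(path):
            print('%s: %d ms: %s' % (path, ms, line))

If the timestamps it flags line up with the HLog append warnings quoted below, that's a good sign the upgrade will take care of it.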
-Todd

On Wed, Apr 4, 2012 at 2:30 AM, Placido Revilla <[email protected]> wrote:
> Hi,
>
> I'm having a problem with sync latency on our HBase cluster. Our cluster is
> composed of 2 NN (and HBase master) machines and 12 DN (and HBase
> regionservers and thrift servers). We are having several issues a day where
> the cluster seems to halt all processing for several seconds, and these
> times line up with WARN logs like the following:
>
> 13:23:55,285 WARN [IPC Server handler 62 on 60020] wal.HLog IPC Server handler 62 on 60020 took 10713 ms appending an edit to hlog; editcount=150694, len~=58.0
> 13:23:55,286 WARN [IPC Server handler 64 on 60020] wal.HLog IPC Server handler 64 on 60020 took 10726 ms appending an edit to hlog; editcount=319217, len~=47.0
> 13:23:55,286 WARN [IPC Server handler 118 on 60020] wal.HLog IPC Server handler 118 on 60020 took 10741 ms appending an edit to hlog; editcount=373337, len~=49.0
> 13:23:55,286 WARN [IPC Server handler 113 on 60020] wal.HLog IPC Server handler 113 on 60020 took 10746 ms appending an edit to hlog; editcount=57912, len~=45.0
> 15:39:38,193 WARN [IPC Server handler 94 on 60020] wal.HLog IPC Server handler 94 on 60020 took 21787 ms appending an edit to hlog; editcount=2901, len~=45.0
> 15:39:38,194 WARN [IPC Server handler 82 on 60020] wal.HLog IPC Server handler 82 on 60020 took 21784 ms appending an edit to hlog; editcount=29944, len~=46.0
> 16:09:38,201 WARN [IPC Server handler 78 on 60020] wal.HLog IPC Server handler 78 on 60020 took 10321 ms appending an edit to hlog; editcount=163998, len~=104.0
> 16:09:38,203 WARN [IPC Server handler 97 on 60020] wal.HLog IPC Server handler 97 on 60020 took 10205 ms appending an edit to hlog; editcount=149497, len~=60.0
> 16:09:38,203 WARN [IPC Server handler 68 on 60020] wal.HLog IPC Server handler 68 on 60020 took 10199 ms appending an edit to hlog; editcount=318268, len~=63.0
> 16:09:38,203 WARN [IPC Server handler 120 on 60020] wal.HLog IPC Server handler 120 on 60020 took 10211 ms appending an edit to hlog; editcount=88001, len~=45.0
> 16:09:38,204 WARN [IPC Server handler 88 on 60020] wal.HLog IPC Server handler 88 on 60020 took 10235 ms appending an edit to hlog; editcount=141516, len~=100.0
>
> The machines in the cluster are pretty powerful (8 HT cores, 48 GB RAM, 6
> SATA 7200 RPM disks), so we are not saturating any hardware: for example, CPU
> is never over 20% used (avg 5%), network BW is never over 100 Mbps (with 1
> Gbps links) and 10k packets/s in each RS, RAM is 50% free (used for disk
> cache), and random IOPS is well under 120 per sec (we should be able to
> sustain over 600). We have also monitored the GC for pauses (we have 16 GB of
> heap for the region servers) and we don't see pauses of more than a couple
> of tens of milliseconds (or concurrent sweeps of more than 5 or 6 seconds).
>
> The cluster runs Hadoop 1.0.0 and HBase 0.90.5 almost vanilla (just a
> couple of functionality patches) and the input channel is Thrift, with
> autocommit disabled and periodic client flushes every 5 seconds. Our tables
> have the deferred log flush feature enabled with an optional interval of
> 1000 ms.
> We are not doing any MapReduce on this cluster, just HBase
> concurrent writes and reads on three tables, two of them with 64 regions
> and one with 1024 (a little bit high, I know), and an almost perfectly
> random spread of keys across the whole cluster. With this setup we see
> a peak of 250k requests/s (about 22k requests per RS), with an average of
> 100k requests/s over the day and an average of 76 bytes per request (so,
> about 150 Mbps in throughput), because our payload is pretty small (less
> than 20 bytes).
>
> The thrift servers are doing about 250 multi operations per second on each
> RS, sustained (at peak they just pack more requests into each operation).
>
> So, the cluster is performing well on average, but these pauses are killing
> us (they last up to 30 seconds, in bursts of up to 3 minutes). Does anyone
> have a clue as to what could be happening?
>
> Thanks.

--
Todd Lipcon
Software Engineer, Cloudera
