Hi Placido,

Sounds like it might be related to HDFS-2379. Try upgrading to Hadoop 1.0.1 or CDH3u3 and you'll get a fix for that.

You can verify by grepping for "BlockReport" in your DN logs: if the pauses on the HBase side correlate with long block reports on the DNs, the upgrade should fix it.
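In case it's useful, here is a rough sketch of that check in Python. The 5-second threshold and the assumption that the DataNode's BlockReport line carries one or more "<N> msec(s)" figures are mine, not from this thread; the exact message format varies between Hadoop versions, so adjust the regex (and the log paths you pass in) to match your setup.

#!/usr/bin/env python
# Rough helper for the check above: scan DataNode logs for "BlockReport"
# lines and print the slow ones, so their timestamps can be lined up
# against the "took ... ms appending an edit to hlog" warnings on the
# regionserver side.
#
# Assumptions (mine, not from this thread): the threshold, and that the
# BlockReport line carries one or more "<N> msec"/"msecs" figures. The
# exact message format differs between Hadoop versions, so adjust the
# regex to match your logs.
import re
import sys

THRESHOLD_MS = 5000  # flag block reports slower than this (arbitrary)
MSEC_RE = re.compile(r'(\d+)\s*msecs?')

def long_block_reports(path, threshold_ms=THRESHOLD_MS):
    with open(path) as f:
        for line in f:
            if 'BlockReport' not in line:
                continue
            durations = [int(m) for m in MSEC_RE.findall(line)]
            if durations and max(durations) >= threshold_ms:
                yield max(durations), line.rstrip()

if __name__ == '__main__':
    # e.g. python check_block_reports.py /var/log/hadoop/*datanode*.log
    for path in sys.argv[1:]:
        for ms, line in long_block_reports(path):
            print('%s: %d ms: %s' % (path, ms, line))

If the timestamps it flags line up with the HLog append warnings quoted below, that's a good sign the upgrade will take care of it.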
-Todd

On Wed, Apr 4, 2012 at 2:30 AM, Placido Revilla <[email protected]> wrote:
> Hi,
>
> I'm having a problem with sync latency on our HBase cluster. Our cluster is
> composed of 2 NN (and HBase master) machines and 12 DN (and HBase
> regionservers and thrift servers). We are having several issues a day where
> the cluster seems to halt all processing for several seconds, and these
> times line up with WARN logs like the following:
>
> 13:23:55,285 WARN [IPC Server handler 62 on 60020] wal.HLog IPC Server handler 62 on 60020 took 10713 ms appending an edit to hlog; editcount=150694, len~=58.0
> 13:23:55,286 WARN [IPC Server handler 64 on 60020] wal.HLog IPC Server handler 64 on 60020 took 10726 ms appending an edit to hlog; editcount=319217, len~=47.0
> 13:23:55,286 WARN [IPC Server handler 118 on 60020] wal.HLog IPC Server handler 118 on 60020 took 10741 ms appending an edit to hlog; editcount=373337, len~=49.0
> 13:23:55,286 WARN [IPC Server handler 113 on 60020] wal.HLog IPC Server handler 113 on 60020 took 10746 ms appending an edit to hlog; editcount=57912, len~=45.0
> 15:39:38,193 WARN [IPC Server handler 94 on 60020] wal.HLog IPC Server handler 94 on 60020 took 21787 ms appending an edit to hlog; editcount=2901, len~=45.0
> 15:39:38,194 WARN [IPC Server handler 82 on 60020] wal.HLog IPC Server handler 82 on 60020 took 21784 ms appending an edit to hlog; editcount=29944, len~=46.0
> 16:09:38,201 WARN [IPC Server handler 78 on 60020] wal.HLog IPC Server handler 78 on 60020 took 10321 ms appending an edit to hlog; editcount=163998, len~=104.0
> 16:09:38,203 WARN [IPC Server handler 97 on 60020] wal.HLog IPC Server handler 97 on 60020 took 10205 ms appending an edit to hlog; editcount=149497, len~=60.0
> 16:09:38,203 WARN [IPC Server handler 68 on 60020] wal.HLog IPC Server handler 68 on 60020 took 10199 ms appending an edit to hlog; editcount=318268, len~=63.0
> 16:09:38,203 WARN [IPC Server handler 120 on 60020] wal.HLog IPC Server handler 120 on 60020 took 10211 ms appending an edit to hlog; editcount=88001, len~=45.0
> 16:09:38,204 WARN [IPC Server handler 88 on 60020] wal.HLog IPC Server handler 88 on 60020 took 10235 ms appending an edit to hlog; editcount=141516, len~=100.0
>
> The machines in the cluster are pretty powerful (8 HT cores, 48 GB RAM, 6
> SATA 7200 RPM disks), so we are not saturating any hardware: for example, CPU
> is never over 20% used (avg 5%), network BW is never over 100 Mbps (with 1
> Gbps links) and 10k packets/s in each RS, RAM is 50% free (used for disk
> cache), and random IOPS is well under 120 per sec (we should be able to
> sustain over 600). We have also monitored the GC for pauses (we have 16 GB of
> heap for the region servers) and we don't see pauses of more than a couple
> of tens of milliseconds (or concurrent sweeps of more than 5 or 6 seconds).
>
> The cluster runs Hadoop 1.0.0 and HBase 0.90.5 almost vanilla (just a
> couple of functionality patches) and the input channel is Thrift, with
> autocommit disabled and periodic client flushes every 5 seconds. Our tables
> have the deferred log flush feature enabled with an optional interval of
> 1000 ms.
> We are not doing any MapReduce on this cluster, just HBase
> concurrent writes and reads on three tables, two of them with 64 regions
> and one with 1024 (a little bit high, I know), and an almost perfectly
> random spread of keys across the whole cluster. With this setup we see
> a peak of 250k requests/s (about 22k requests per RS), with an average of
> 100k requests/s over the day and an average of 76 bytes per request (so,
> about 150 Mbps in throughput), because our payload is pretty small (less
> than 20 bytes).
>
> The thrift servers are doing about 250 multi operations per second on each
> RS, sustained (at peak they just pack more requests into each operation).
>
> So, the cluster is performing well on average, but these pauses are killing
> us (they last up to 30 seconds, in bursts of up to 3 minutes). Does anyone
> have a clue as to what could be happening?
>
> Thanks.

--
Todd Lipcon
Software Engineer, Cloudera
