What do the MR logs say? Do they point to a specific row or region for the failed task? Can you trace the life of that region by grepping for it in the master logs? Then grep the regionserver logs around the time the failing task ran. What is going on? Can you do a Get against the bad region now, after the MR job has gone away? Are your tasktrackers running beside your datanodes and regionservers? Is there swapping? I/O stress? If so, try running with fewer concurrent tasks at a time and see if that helps.
St.Ack

On Wed, May 19, 2010 at 9:23 AM, Geoff Hendrey <[email protected]> wrote:
> Another update...actually even after the flush and major_compact the
> Puts still stopped. I checked my job this morning and it had progressed
> farther, but ultimately still was killed on the 10 minute timeouts.
>
> -geoff
>
> ________________________________
>
> From: Geoff Hendrey
> Sent: Wednesday, May 19, 2010 12:26 AM
> To: '[email protected]'
> Subject: RE: Put slows down and eventually blocks
>
> Following up on my last post, I ran "flush" and "major_compact" from the
> shell, and it seems to have jolted HBase into resuming writes. The
> blocked Put method returned, and writes have now resumed normally. Any
> ideas why? Here are a few other relevant details:
>
> hbase(main):015:0> zk_dump
>
> HBase tree in ZooKeeper is rooted at /hbase
> Cluster up? true
> In safe mode? false
> Master address: 10.241.6.82:60000
> Region server holding ROOT: 10.241.6.83:60020
> Region servers:
>  - 10.241.6.83:60020
>  - 10.241.6.81:60020
>  - 10.241.6.82:60020
> Quorum Server Statistics:
>  - dt5:2181
>    Zookeeper version: 3.2.2-888565, built on 12/08/2009 21:51 GMT
>    Clients:
>     /10.241.6.81:52081[1](queued=0,recved=35496,sent=0)
>     /10.241.6.82:38365[1](queued=0,recved=32798,sent=0)
>     /10.241.6.82:60720[1](queued=0,recved=0,sent=0)
>     /10.241.6.82:40457[1](queued=0,recved=114,sent=0)
>
>    Latency min/avg/max: 0/15/669
>    Received: 73534
>    Sent: 0
>    Outstanding: 0
>    Zxid: 0x500033498
>    Mode: leader
>    Node count: 13
>  - dt4:2181
>    Zookeeper version: 3.2.2-888565, built on 12/08/2009 21:51 GMT
>    Clients:
>     /10.241.0.18:39273[1](queued=0,recved=34,sent=0)
>     /10.241.6.82:43315[1](queued=0,recved=0,sent=0)
>     /10.241.6.81:41762[1](queued=0,recved=169,sent=0)
>     /10.241.6.83:47803[1](queued=0,recved=35438,sent=0)
>
>    Latency min/avg/max: 0/2/2249
>    Received: 1432019
>    Sent: 0
>    Outstanding: 0
>    Zxid: 0x500033498
>    Mode: follower
>    Node count: 13
>  - dt3:2181
>    Zookeeper version: 3.2.2-888565, built on 12/08/2009 21:51 GMT
>    Clients:
>     /10.241.6.82:59048[1](queued=0,recved=0,sent=0)
>     /10.241.6.82:50822[1](queued=0,recved=36260,sent=0)
>     /10.241.6.81:45696[1](queued=0,recved=30691,sent=0)
>     /10.241.6.83:50027[1](queued=0,recved=36261,sent=0)
>     /10.241.6.82:50823[1](queued=0,recved=36270,sent=0)
>
>    Latency min/avg/max: 0/3/40
>    Received: 140600
>    Sent: 0
>    Outstanding: 0
>    Zxid: 0x500033498
>    Mode: follower
>    Node count: 13
> hbase(main):016:0> status
> 3 servers, 0 dead, 227.3333 average load
> hbase(main):017:0> flush "SEARCH_KEYS"
> 0 row(s) in 0.7600 seconds
> hbase(main):018:0> status
> 3 servers, 0 dead, 228.3333 average load
>
> ________________________________
>
> From: Geoff Hendrey
> Sent: Tuesday, May 18, 2010 11:56 PM
> To: [email protected]
> Subject: Put slows down and eventually blocks
>
> I am experiencing a problem in which Put operations transition from
> working just fine, to blocking forever. I am doing Put from a reducer. I
> have tried the following, but none of them prevents the Puts from
> eventually blocking totally in all the reducers, until the task tracker
> kills the task due to 10 minute timeout.
>
> 1) try using just one reducer (didn't help)
> 2) try Put.setWriteToWAL both true and false (didn't help)
> 3) try autoflush true and false. When true, experiment with different
> flush buffer sizes (didn't help)
>
> I've been watching the HDFS namenode and datanode logs, and also the
> HBase master and region servers. I am running a 3-node HDFS cluster
> (20.2) sharing same 3 nodes with HBase 20.3. I see no problems in any
> logs, except that the datanode logs eventually stop showing WRITE
> operations (corresponding to the Put operations eventually coming to a
> halt). The HBase shell remains snappy and I can do list and status
> operations and scans without any issue from the shell.
>
> Anyone ever seen anything like this?
>
> -geoff
>
> <http://www.decarta.com>
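
The log tracing suggested in the reply at the top of this thread (grep the master log for the region, then look at the regionserver logs around the time the task failed) could be sketched roughly as below. This is only an illustration, not anything from the thread: the function name `lines_near` and its parameters are made up here, and it assumes the log lines start with a log4j-style `YYYY-MM-DD HH:MM:SS,mmm` timestamp, which may not match your log4j configuration.

```python
import re
from datetime import datetime, timedelta

# Assumption: each log line starts with "YYYY-MM-DD HH:MM:SS,mmm".
# Adjust the pattern if your log4j layout differs.
TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3}")


def lines_near(lines, needle, center, window_minutes=5):
    """Return lines that contain `needle` and are timestamped within
    `window_minutes` of `center`.

    `lines` is any iterable of strings (for example an open log file).
    Lines without a leading timestamp (stack trace continuations and
    the like) are skipped. Pass needle="" to filter on time only.
    """
    window = timedelta(minutes=window_minutes)
    hits = []
    for line in lines:
        match = TS_RE.match(line)
        if not match:
            continue
        stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
        if abs(stamp - center) <= window and needle in line:
            hits.append(line.rstrip("\n"))
    return hits
```

To use it, open a master or regionserver log file, pass the file object as `lines`, the region name reported by the failed task as `needle`, and the task's failure time as `center`. Running it once per regionserver log with `needle=""` gives a view of everything each server was doing in that window.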
