Answers are prefixed with "Geoff:":
>What do the MR logs say?  Do they point to an explicit row or region for 
>failed task? 

Geoff: MR syslog shows nothing abnormal that I can see. I am running just a
single reducer, with autoflushing puts. I have pasted the reducer logs at the
end of this email. As context for what I've pasted: I instantiate the HTable
in the reducer's setup method and close it in cleanup.
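
As an aside on autoflush: my understanding is that with autoflush off, the
client batches puts and ships them each time the write buffer fills, so the
buffer size fixes the number of round trips. A rough sketch (the 2 MB figure
is my recollection of the 0.20 default, and the byte count is this reducer's
shuffle input from the syslog below):

```java
// Back-of-envelope for HBase client write buffering (illustrative only):
// with autoflush off, puts accumulate client-side and are sent out once
// per write-buffer fill, so bigger buffers mean fewer, larger RPCs.
public class WriteBufferMath {

    // Number of buffer flushes (client round trips) needed to push totalBytes
    static long flushes(long totalBytes, long bufferBytes) {
        return (totalBytes + bufferBytes - 1) / bufferBytes; // ceiling division
    }

    public static void main(String[] args) {
        long reduceInput = 842278936L; // bytes of reduce input, per the merge log below
        System.out.println(flushes(reduceInput, 2L * 1024 * 1024));  // ~2 MB buffer -> 402 flushes
        System.out.println(flushes(reduceInput, 12L * 1024 * 1024)); // ~12 MB buffer -> 67 flushes
    }
}
```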

>What do the MR logs say?  Do they point to an explicit row or region for 
>failed task?  Can you trace life of that region by grepping it in master logs? 
> 

Geoff: I was writing the row key to stdout so I could collect debugging
output. So I know the row key, and I grepped all Hadoop and HBase logs on all
three servers that participate in the little HBase cluster. Nothing in the
logs greps out for the particular row key that the put hangs on.

>At time that the failing task runs, grep the regionservers logs for that time. 
> What is going on? 

Geoff: Ahhh, glad you asked. I see the Put connection crapping out (of course,
because the task is killed), but ten minutes earlier a global memstore flush
started, and from then until the Put is killed it's wall-to-wall compaction and
memstore flushing. It seems clear that this is what caused the server to block
the Put. The compaction and flushing went on for 10 minutes, at which point the
task tracker killed the task, and the flushing and compactions continued for a
total of about 12 minutes. This also explains why the tasks are killed so
reliably at the same point each time I run this job: once the reducer has
written a certain amount of data, this flushing kicks in. So, should I try
decreasing the size of the global memstore? It seems like I have to trade
performance against these long pauses: the bigger the memstore, the longer the
pauses. Does that make sense? Any recommendations for global memstore size on a
machine with 32 GB RAM, a 64-bit JVM, and 8 GB of RAM per HBase process?
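
For reference, these are the hbase-site.xml knobs I believe govern this
blocking behavior in 0.20.x (property names from the default config as I
understand it; the values here are illustrative starting points, not
recommendations):

```xml
<!-- hbase-site.xml sketch: memstore settings that govern write pauses.
     Values are illustrative; check hbase-default.xml for your version. -->
<configuration>
  <!-- Global memstore ceiling as a fraction of regionserver heap;
       updates block once total memstore usage crosses this line. -->
  <property>
    <name>hbase.regionserver.global.memstore.upperLimit</name>
    <value>0.35</value>
  </property>
  <!-- Flushing continues until usage drops back below this fraction. -->
  <property>
    <name>hbase.regionserver.global.memstore.lowerLimit</name>
    <value>0.30</value>
  </property>
  <!-- Per-region memstore size that triggers an ordinary flush. -->
  <property>
    <name>hbase.hregion.memstore.flush.size</name>
    <value>67108864</value>
  </property>
</configuration>
```

If I read it right, lowering the upper limit would make flushes smaller and
more frequent: more flush overhead, but shorter pauses.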

Thanks -- we're getting somewhere. (MR syslog follows)

-geoff




2010-05-19 01:00:52,720 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: 
Initializing JVM Metrics with processName=SHUFFLE, sessionId=
2010-05-19 01:00:52,894 INFO org.apache.hadoop.mapred.ReduceTask: 
ShuffleRamManager: MemoryLimit=361909440, MaxSingleShuffleLimit=90477360
2010-05-19 01:00:52,901 INFO org.apache.hadoop.mapred.ReduceTask: 
attempt_201005141621_0203_r_000000_3 Thread started: Thread for merging on-disk 
files
2010-05-19 01:00:52,902 INFO org.apache.hadoop.mapred.ReduceTask: 
attempt_201005141621_0203_r_000000_3 Thread waiting: Thread for merging on-disk 
files
2010-05-19 01:00:52,903 INFO org.apache.hadoop.mapred.ReduceTask: 
attempt_201005141621_0203_r_000000_3 Thread started: Thread for merging in 
memory files
2010-05-19 01:00:52,904 INFO org.apache.hadoop.mapred.ReduceTask: 
attempt_201005141621_0203_r_000000_3 Need another 4 map output(s) where 0 is 
already in progress
2010-05-19 01:00:52,904 INFO org.apache.hadoop.mapred.ReduceTask: 
attempt_201005141621_0203_r_000000_3 Scheduled 0 outputs (0 slow hosts and0 dup 
hosts)
2010-05-19 01:00:52,904 INFO org.apache.hadoop.mapred.ReduceTask: 
attempt_201005141621_0203_r_000000_3 Thread started: Thread for polling Map 
Completion Events
2010-05-19 01:00:52,911 INFO org.apache.hadoop.mapred.ReduceTask: Ignoring 
obsolete output of KILLED map-task: 'attempt_201005141621_0203_m_000001_1'
2010-05-19 01:00:52,911 INFO org.apache.hadoop.mapred.ReduceTask: Ignoring 
obsolete output of KILLED map-task: 'attempt_201005141621_0203_m_000002_1'
2010-05-19 01:00:52,911 INFO org.apache.hadoop.mapred.ReduceTask: Ignoring 
obsolete output of KILLED map-task: 'attempt_201005141621_0203_m_000003_1'
2010-05-19 01:00:52,911 INFO org.apache.hadoop.mapred.ReduceTask: 
attempt_201005141621_0203_r_000000_3: Got 4 new map-outputs
2010-05-19 01:00:57,906 INFO org.apache.hadoop.mapred.ReduceTask: 
attempt_201005141621_0203_r_000000_3 Scheduled 4 outputs (0 slow hosts and0 dup 
hosts)
2010-05-19 01:00:57,960 INFO org.apache.hadoop.mapred.ReduceTask: header: 
attempt_201005141621_0203_m_000000_0, compressed len: 216313347, decompressed 
len: 216313343
2010-05-19 01:00:57,960 INFO org.apache.hadoop.mapred.ReduceTask: header: 
attempt_201005141621_0203_m_000001_0, compressed len: 207291891, decompressed 
len: 207291887
2010-05-19 01:00:57,960 INFO org.apache.hadoop.mapred.ReduceTask: Shuffling 
216313343 bytes (216313347 raw bytes) into Local-FS from 
attempt_201005141621_0203_m_000000_0
2010-05-19 01:00:57,960 INFO org.apache.hadoop.mapred.ReduceTask: Shuffling 
207291887 bytes (207291891 raw bytes) into Local-FS from 
attempt_201005141621_0203_m_000001_0
2010-05-19 01:00:57,961 INFO org.apache.hadoop.mapred.ReduceTask: header: 
attempt_201005141621_0203_m_000003_0, compressed len: 222666366, decompressed 
len: 222666362
2010-05-19 01:00:57,961 INFO org.apache.hadoop.mapred.ReduceTask: Shuffling 
222666362 bytes (222666366 raw bytes) into Local-FS from 
attempt_201005141621_0203_m_000003_0
2010-05-19 01:00:57,996 INFO org.apache.hadoop.mapred.ReduceTask: header: 
attempt_201005141621_0203_m_000002_0, compressed len: 196007332, decompressed 
len: 196007328
2010-05-19 01:00:57,997 INFO org.apache.hadoop.mapred.ReduceTask: Shuffling 
196007328 bytes (196007332 raw bytes) into Local-FS from 
attempt_201005141621_0203_m_000002_0
2010-05-19 01:01:05,103 INFO org.apache.hadoop.mapred.ReduceTask: Read 
207291891 bytes from map-output for attempt_201005141621_0203_m_000001_0
2010-05-19 01:01:05,106 INFO org.apache.hadoop.mapred.ReduceTask: 
attempt_201005141621_0203_r_000000_3 Thread waiting: Thread for merging on-disk 
files
2010-05-19 01:01:05,155 INFO org.apache.hadoop.mapred.ReduceTask: Read 
222666366 bytes from map-output for attempt_201005141621_0203_m_000003_0
2010-05-19 01:01:05,161 INFO org.apache.hadoop.mapred.ReduceTask: 
attempt_201005141621_0203_r_000000_3 Thread waiting: Thread for merging on-disk 
files
2010-05-19 01:01:05,307 INFO org.apache.hadoop.mapred.ReduceTask: Read 
216313347 bytes from map-output for attempt_201005141621_0203_m_000000_0
2010-05-19 01:01:05,308 INFO org.apache.hadoop.mapred.ReduceTask: 
attempt_201005141621_0203_r_000000_3 Thread waiting: Thread for merging on-disk 
files
2010-05-19 01:01:05,379 INFO org.apache.hadoop.mapred.ReduceTask: Read 
196007332 bytes from map-output for attempt_201005141621_0203_m_000002_0
2010-05-19 01:01:05,380 INFO org.apache.hadoop.mapred.ReduceTask: 
attempt_201005141621_0203_r_000000_3 Thread waiting: Thread for merging on-disk 
files
2010-05-19 01:01:05,949 INFO org.apache.hadoop.mapred.ReduceTask: 
GetMapEventsThread exiting
2010-05-19 01:01:05,950 INFO org.apache.hadoop.mapred.ReduceTask: 
getMapsEventsThread joined.
2010-05-19 01:01:05,951 INFO org.apache.hadoop.mapred.ReduceTask: Closed ram 
manager
2010-05-19 01:01:05,951 INFO org.apache.hadoop.mapred.ReduceTask: Interleaved 
on-disk merge complete: 4 files left.
2010-05-19 01:01:05,951 INFO org.apache.hadoop.mapred.ReduceTask: In-memory 
merge complete: 0 files left.
2010-05-19 01:01:05,956 INFO org.apache.hadoop.mapred.ReduceTask: Merging 4 
files, 842278936 bytes from disk
2010-05-19 01:01:05,957 INFO org.apache.hadoop.mapred.ReduceTask: Merging 0 
segments, 0 bytes from memory into reduce
2010-05-19 01:01:05,960 INFO org.apache.hadoop.mapred.Merger: Merging 4 sorted 
segments
2010-05-19 01:01:05,972 INFO org.apache.hadoop.mapred.Merger: Down to the last 
merge-pass, with 4 segments left of total size: 842278920 bytes
2010-05-19 01:01:06,116 INFO org.apache.zookeeper.ZooKeeper: Client 
environment:zookeeper.version=3.2.2-888565, built on 12/08/2009 21:51 GMT
2010-05-19 01:01:06,116 INFO org.apache.zookeeper.ZooKeeper: Client 
environment:host.name=ddbs-host7-dm0
2010-05-19 01:01:06,116 INFO org.apache.zookeeper.ZooKeeper: Client 
environment:java.version=1.6.0_17
2010-05-19 01:01:06,116 INFO org.apache.zookeeper.ZooKeeper: Client 
environment:java.vendor=Sun Microsystems Inc.
2010-05-19 01:01:06,116 INFO org.apache.zookeeper.ZooKeeper: Client 
environment:java.home=/usr/java/jdk1.6.0_17/jre
2010-05-19 01:01:06,116 INFO org.apache.zookeeper.ZooKeeper: Client 
environment:java.class.path=/home/dmc2/hadoop/bin/../conf:/usr/java/default/lib/tools.jar:/home/dmc2/hadoop/bin/..:/home/dmc2/hadoop/bin/../hadoop-0.20.2-core.jar:/home/dmc2/hadoop/bin/../lib/commons-cli-1.2.jar:/home/dmc2/hadoop/bin/../lib/commons-codec-1.3.jar:/home/dmc2/hadoop/bin/../lib/commons-el-1.0.jar:/home/dmc2/hadoop/bin/../lib/commons-httpclient-3.0.1.jar:/home/dmc2/hadoop/bin/../lib/commons-logging-1.0.4.jar:/home/dmc2/hadoop/bin/../lib/commons-logging-api-1.0.4.jar:/home/dmc2/hadoop/bin/../lib/commons-net-1.4.1.jar:/home/dmc2/hadoop/bin/../lib/core-3.1.1.jar:/home/dmc2/hadoop/bin/../lib/hsqldb-1.8.0.10.jar:/home/dmc2/hadoop/bin/../lib/jasper-compiler-5.5.12.jar:/home/dmc2/hadoop/bin/../lib/jasper-runtime-5.5.12.jar:/home/dmc2/hadoop/bin/../lib/jets3t-0.6.1.jar:/home/dmc2/hadoop/bin/../lib/jetty-6.1.14.jar:/home/dmc2/hadoop/bin/../lib/jetty-util-6.1.14.jar:/home/dmc2/hadoop/bin/../lib/junit-3.8.1.jar:/home/dmc2/hadoop/bin/../lib/kfs-0.2.2.jar:/home/dmc2/hadoop/bin/../lib/log4j-1.2.15.jar:/home/dmc2/hadoop/bin/../lib/mockito-all-1.8.0.jar:/home/dmc2/hadoop/bin/../lib/oro-2.0.8.jar:/home/dmc2/hadoop/bin/../lib/servlet-api-2.5-6.1.14.jar:/home/dmc2/hadoop/bin/../lib/slf4j-api-1.4.3.jar:/home/dmc2/hadoop/bin/../lib/slf4j-log4j12-1.4.3.jar:/home/dmc2/hadoop/bin/../lib/xmlenc-0.52.jar:/home/dmc2/hadoop/bin/../lib/jsp-2.1/jsp-2.1.jar:/home/dmc2/hadoop/bin/../lib/jsp-2.1/jsp-api-2.1.jar::/hive3/mapred/local/taskTracker/jobcache/job_201005141621_0203/jars/lib/xsltc.jar:/hive3/mapred/local/taskTracker/jobcache/job_201005141621_0203/jars/lib/jh.jar:/hive3/mapred/local/taskTracker/jobcache/job_201005141621_0203/jars/lib/opencsv-1.8.jar:/hive3/mapred/local/taskTracker/jobcache/job_201005141621_0203/jars/lib/xalan.jar:/hive3/mapred/local/taskTracker/jobcache/job_201005141621_0203/jars/lib/jsr311-api-1.1.jar:/hive3/mapred/local/taskTracker/jobcache/job_201005141621_0203/jars/lib/hadoop-0.20.1-core.jar:/hive3/mapred/local/taskTracker/jobcache/job_201005141621_0203/jars/
lib/Echo.jar:/hive3/mapred/local/taskTracker/jobcache/job_201005141621_0203/jars/lib/xsdlib.jar:/hive3/mapred/local/taskTracker/jobcache/job_201005141621_0203/jars/lib/saaj-api.jar:/hive3/mapred/local/taskTracker/jobcache/job_201005141621_0203/jars/lib/ant-launcher.jar:/hive3/mapred/local/taskTracker/jobcache/job_201005141621_0203/jars/lib/junit.jar:/hive3/mapred/local/taskTracker/jobcache/job_201005141621_0203/jars/lib/Echo_FileTransfer.jar:/hive3/mapred/local/taskTracker/jobcache/job_201005141621_0203/jars/lib/jta-spec1_0_1.jar:/hive3/mapred/local/taskTracker/jobcache/job_201005141621_0203/jars/lib/hadoop-0.20.1-tools.jar:/hive3/mapred/local/taskTracker/jobcache/job_201005141621_0203/jars/lib/commons-io-1.4.jar:/hive3/mapred/local/taskTracker/jobcache/job_201005141621_0203/jars/lib/servletapi-2.3.jar:/hive3/mapred/local/taskTracker/jobcache/job_201005141621_0203/jars/lib/jaxb-impl.jar:/hive3/mapred/local/taskTracker/jobcache/job_201005141621_0203/jars/lib/lucene-queries-2.4.0.jar:/hive3/mapred/local/taskTracker/jobcache/job_201005141621_0203/jars/lib/json.jar:/hive3/mapred/local/taskTracker/jobcache/job_201005141621_0203/jars/lib/lucene-spellchecker-2.4.0.jar:/hive3/mapred/local/taskTracker/jobcache/job_201005141621_0203/jars/lib/relaxngDatatype.jar:/hive3/mapred/local/taskTracker/jobcache/job_201005141621_0203/jars/lib/xmlsec.jar:/hive3/mapred/local/taskTracker/jobcache/job_201005141621_0203/jars/lib/saaj-impl.jar:/hive3/mapred/local/taskTracker/jobcache/job_201005141621_0203/jars/lib/commons-logging-1.1.jar:/hive3/mapred/local/taskTracker/jobcache/job_201005141621_0203/jars/lib/xws-security.jar:/hive3/mapred/local/taskTracker/jobcache/job_201005141621_0203/jars/lib/commons-codec-1.3.jar:/hive3/mapred/local/taskTracker/jobcache/job_201005141621_0203/jars/lib/commons-cli-1.2.jar:/hive3/mapred/local/taskTracker/jobcache/job_201005141621_0203/jars/lib/workflow.jar:/hive3/mapred/local/taskTracker/jobcache/job_201005141621_0203/jars/lib/lucene-core-2.4.0.jar:/hive3/ma
pred/local/taskTracker/jobcache/job_201005141621_0203/jars/lib/hadoop-0.20.1-streaming.jar:/hive3/mapred/local/taskTracker/jobcache/job_201005141621_0203/jars/lib/jaxb-libs.jar:/hive3/mapred/local/taskTracker/jobcache/job_201005141621_0203/jars/lib/hadoop-0.20.1-index.jar:/hive3/mapred/local/taskTracker/jobcache/job_201005141621_0203/jars/lib/mail.jar:/hive3/mapred/local/taskTracker/jobcache/job_201005141621_0203/jars/lib/jaxb-xjc.jar:/hive3/mapred/local/taskTracker/jobcache/job_201005141621_0203/jars/lib/jcommon-0.9.1.jar:/hive3/mapred/local/taskTracker/jobcache/job_201005141621_0203/jars/lib/log4j-1.2.16.jar:/hive3/mapred/local/taskTracker/jobcache/job_201005141621_0203/jars/lib/ant.jar:/hive3/mapred/local/taskTracker/jobcache/job_201005141621_0203/jars/lib/xmldsig.jar:/hive3/mapred/local/taskTracker/jobcache/job_201005141621_0203/jars/lib/decarta-util.jar:/hive3/mapred/local/taskTracker/jobcache/job_201005141621_0203/jars/lib/marsh.jar:/hive3/mapred/local/taskTracker/jobcache/job_201005141621_0203/jars/lib/hadoopxmlstreamer.jar:/hive3/mapred/local/taskTracker/jobcache/job_201005141621_0203/jars/lib/xercesImpl.jar:/hive3/mapred/local/taskTracker/jobcache/job_201005141621_0203/jars/lib/jax-qname.jar:/hive3/mapred/local/taskTracker/jobcache/job_201005141621_0203/jars/lib/sax.jar:/hive3/mapred/local/taskTracker/jobcache/job_201005141621_0203/jars/lib/jaxp-api.jar:/hive3/mapred/local/taskTracker/jobcache/job_201005141621_0203/jars/lib/hadoop-0.20.1-ant.jar:/hive3/mapred/local/taskTracker/jobcache/job_201005141621_0203/jars/lib/commons-collections-3.2.1.jar:/hive3/mapred/local/taskTracker/jobcache/job_201005141621_0203/jars/lib/jxl.jar:/hive3/mapred/local/taskTracker/jobcache/job_201005141621_0203/jars/lib/lucene-snowball-2.4.0.jar:/hive3/mapred/local/taskTracker/jobcache/job_201005141621_0203/jars/lib/commons-dbcp-1.1.jar:/hive3/mapred/local/taskTracker/jobcache/job_201005141621_0203/jars/lib/dom4j-full.jar:/hive3/mapred/local/taskTracker/jobcache/job_201005141621_020
3/jars/lib/lucene-analyzers-2.4.0.jar:/hive3/mapred/local/taskTracker/jobcache/job_201005141621_0203/jars/lib/commons-pool-1.1.jar:/hive3/mapred/local/taskTracker/jobcache/job_201005141621_0203/jars/lib/EchoServer.jar:/hive3/mapred/local/taskTracker/jobcache/job_201005141621_0203/jars/lib/commons-httpclient-3.1.jar:/hive3/mapred/local/taskTracker/jobcache/job_201005141621_0203/jars/lib/jaxb-api.jar:/hive3/mapred/local/taskTracker/jobcache/job_201005141621_0203/jars/lib/gnujaxp.jar:/hive3/mapred/local/taskTracker/jobcache/job_201005141621_0203/jars/lib/echopoint-0.9.jar:/hive3/mapred/local/taskTracker/jobcache/job_201005141621_0203/jars/lib/zookeeper-3.2.2.jar:/hive3/mapred/local/taskTracker/jobcache/job_201005141621_0203/jars/lib/signpost-core-1.1-SNAPSHOT.jar:/hive3/mapred/local/taskTracker/jobcache/job_201005141621_0203/jars/lib/mysql-connector-java-5.0.8-bin.jar:/hive3/mapred/local/taskTracker/jobcache/job_201005141621_0203/jars/lib/shades-0.0.5.jar:/hive3/mapred/local/taskTracker/jobcache/job_201005141621_0203/jars/lib/commons-fileupload-1.2.1.jar:/hive3/mapred/local/taskTracker/jobcache/job_201005141621_0203/jars/lib/dds-head.jar:/hive3/mapred/local/taskTracker/jobcache/job_201005141621_0203/jars/lib/namespace.jar:/hive3/mapred/local/taskTracker/jobcache/job_201005141621_0203/jars/lib/jersey-bundle-1.1.5.1.jar:/hive3/mapred/local/taskTracker/jobcache/job_201005141621_0203/jars/lib/xml4j.jar:/hive3/mapred/local/taskTracker/jobcache/job_201005141621_0203/jars/lib/cobertura.jar:/hive3/mapred/local/taskTracker/jobcache/job_201005141621_0203/jars/lib/hbase-0.20.3.jar:/hive3/mapred/local/taskTracker/jobcache/job_201005141621_0203/jars/lib/jfreechart-0.9.16.jar:/hive3/mapred/local/taskTracker/jobcache/job_201005141621_0203/jars/lib/hsqldb-2.0.0-rc8.jar:/hive3/mapred/local/taskTracker/jobcache/job_201005141621_0203/jars/classes:/hive3/mapred/local/taskTracker/jobcache/job_201005141621_0203/jars:/hive4/mapred/local/taskTracker/jobcache/job_201005141621_0203/attempt_2010
05141621_0203_r_000000_3/work
2010-05-19 01:01:06,116 INFO org.apache.zookeeper.ZooKeeper: Client 
environment:java.library.path=/home/dmc2/hadoop/bin/../lib/native/Linux-i386-32:/hive4/mapred/local/taskTracker/jobcache/job_201005141621_0203/attempt_201005141621_0203_r_000000_3/work
2010-05-19 01:01:06,117 INFO org.apache.zookeeper.ZooKeeper: Client 
environment:java.io.tmpdir=/hive4/mapred/local/taskTracker/jobcache/job_201005141621_0203/attempt_201005141621_0203_r_000000_3/work/tmp
2010-05-19 01:01:06,117 INFO org.apache.zookeeper.ZooKeeper: Client 
environment:java.compiler=<NA>
2010-05-19 01:01:06,117 INFO org.apache.zookeeper.ZooKeeper: Client 
environment:os.name=Linux
2010-05-19 01:01:06,117 INFO org.apache.zookeeper.ZooKeeper: Client 
environment:os.arch=i386
2010-05-19 01:01:06,117 INFO org.apache.zookeeper.ZooKeeper: Client 
environment:os.version=2.6.18-128.el5PAE
2010-05-19 01:01:06,117 INFO org.apache.zookeeper.ZooKeeper: Client 
environment:user.name=dmc2
2010-05-19 01:01:06,117 INFO org.apache.zookeeper.ZooKeeper: Client 
environment:user.home=/home/dmc2
2010-05-19 01:01:06,117 INFO org.apache.zookeeper.ZooKeeper: Client 
environment:user.dir=/disk4/hive4/mapred/local/taskTracker/jobcache/job_201005141621_0203/attempt_201005141621_0203_r_000000_3/work
2010-05-19 01:01:06,118 INFO org.apache.zookeeper.ZooKeeper: Initiating client 
connection, connectString=dt4.dt.sv4.decarta.com:2181 sessionTimeout=60000 
watcher=org.apache.hadoop.hbase.client.hconnectionmanager$clientzkwatc...@11f23e5
2010-05-19 01:01:06,120 INFO org.apache.zookeeper.ClientCnxn: 
zookeeper.disableAutoWatchReset is false
2010-05-19 01:01:06,136 INFO org.apache.zookeeper.ClientCnxn: Attempting 
connection to server dt4.dt.sv4.decarta.com/10.241.6.82:2181
2010-05-19 01:01:06,141 INFO org.apache.zookeeper.ClientCnxn: Priming 
connection to java.nio.channels.SocketChannel[connected 
local=/10.241.0.69:34497 remote=dt4.dt.sv4.decarta.com/10.241.6.82:2181]
2010-05-19 01:01:06,155 INFO org.apache.zookeeper.ClientCnxn: Server connection 
successful
2010-05-19 01:11:24,084 INFO org.apache.zookeeper.ZooKeeper: Closing session: 
0x1288430447a0449
2010-05-19 01:11:24,084 INFO org.apache.zookeeper.ClientCnxn: Closing 
ClientCnxn for session: 0x1288430447a0449


-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Stack
Sent: Wednesday, May 19, 2010 3:29 PM
To: [email protected]
Subject: Re: Put slows down and eventually blocks

What do the MR logs say?  Do they point to an explicit row or region for failed 
task?  Can you trace life of that region by grepping it in master logs?  

At time that the failing task runs, grep the regionservers logs for that time.  
What is going on?  Can you get from the bad region now, after MR has gone away? 
 Are your tasktrackers running beside your datanodes and regionservers?  
Swapping?  i/o stress?  If so, try running w/ less concurrent tasks at a time.  
See if that helps?

St.Ack

On Wed, May 19, 2010 at 9:23 AM, Geoff Hendrey <[email protected]> wrote:
> Another update: actually, even after the flush and major_compact, the
> Puts still stopped. I checked my job this morning and it had
> progressed farther, but was ultimately still killed on the 10-minute timeouts.
>
> -geoff
>
> ________________________________
>
> From: Geoff Hendrey
> Sent: Wednesday, May 19, 2010 12:26 AM
> To: '[email protected]'
> Subject: RE: Put slows down and eventually blocks
>
>
> Following up on my last post, I ran "flush" and "major_compact" from 
> the shell, and it seems to have jolted HBase into resuming writes. The 
> blocked Put method returned, and writes have now resumed normally. Any 
> ideas why? Here are a few other relevant details:
>
>
> hbase(main):015:0> zk_dump
>
> HBase tree in ZooKeeper is rooted at /hbase
>  Cluster up? true
>  In safe mode? false
>  Master address: 10.241.6.82:60000
>  Region server holding ROOT: 10.241.6.83:60020
>  Region servers:
>    - 10.241.6.83:60020
>    - 10.241.6.81:60020
>    - 10.241.6.82:60020
>  Quorum Server Statistics:
>    - dt5:2181
>        Zookeeper version: 3.2.2-888565, built on 12/08/2009 21:51 GMT
>        Clients:
>         /10.241.6.81:52081[1](queued=0,recved=35496,sent=0)
>         /10.241.6.82:38365[1](queued=0,recved=32798,sent=0)
>         /10.241.6.82:60720[1](queued=0,recved=0,sent=0)
>         /10.241.6.82:40457[1](queued=0,recved=114,sent=0)
>
>        Latency min/avg/max: 0/15/669
>        Received: 73534
>        Sent: 0
>        Outstanding: 0
>        Zxid: 0x500033498
>        Mode: leader
>        Node count: 13
>    - dt4:2181
>        Zookeeper version: 3.2.2-888565, built on 12/08/2009 21:51 GMT
>        Clients:
>         /10.241.0.18:39273[1](queued=0,recved=34,sent=0)
>         /10.241.6.82:43315[1](queued=0,recved=0,sent=0)
>         /10.241.6.81:41762[1](queued=0,recved=169,sent=0)
>         /10.241.6.83:47803[1](queued=0,recved=35438,sent=0)
>
>        Latency min/avg/max: 0/2/2249
>        Received: 1432019
>        Sent: 0
>        Outstanding: 0
>        Zxid: 0x500033498
>        Mode: follower
>        Node count: 13
>    - dt3:2181
>        Zookeeper version: 3.2.2-888565, built on 12/08/2009 21:51 GMT
>        Clients:
>         /10.241.6.82:59048[1](queued=0,recved=0,sent=0)
>         /10.241.6.82:50822[1](queued=0,recved=36260,sent=0)
>         /10.241.6.81:45696[1](queued=0,recved=30691,sent=0)
>         /10.241.6.83:50027[1](queued=0,recved=36261,sent=0)
>         /10.241.6.82:50823[1](queued=0,recved=36270,sent=0)
>
>        Latency min/avg/max: 0/3/40
>        Received: 140600
>        Sent: 0
>        Outstanding: 0
>        Zxid: 0x500033498
>        Mode: follower
>        Node count: 13
> hbase(main):016:0> status
> 3 servers, 0 dead, 227.3333 average load
> hbase(main):017:0> flush "SEARCH_KEYS"
> 0 row(s) in 0.7600 seconds
> hbase(main):018:0> status
> 3 servers, 0 dead, 228.3333 average load
>
>
> ________________________________
>
> From: Geoff Hendrey
> Sent: Tuesday, May 18, 2010 11:56 PM
> To: [email protected]
> Subject: Put slows down and eventually blocks
>
>
> I am experiencing a problem in which Put operations transition from
> working just fine to blocking forever. I am doing Puts from a reducer.
> I have tried the following, but none of it prevents the Puts from
> eventually blocking totally in all the reducers, until the task
> tracker kills the task due to the 10 minute timeout.
>
> 1) using just one reducer (didn't help)
> 2) Put.setWriteToWAL both true and false (didn't help)
> 3) autoflush true and false; when autoflush is off, experimenting with
> different write buffer sizes (didn't help)
>
> I've been watching the HDFS namenode and datanode logs, and also the
> HBase master and region servers. I am running a 3-node HDFS cluster
> (0.20.2) sharing the same 3 nodes with HBase 0.20.3. I see no problems in
> any logs, except that the datanode logs eventually stop showing WRITE
> operations (corresponding to the Put operations coming to a halt). The
> HBase shell remains snappy, and I can do list and status operations and
> scans without any issue from the shell.
>
> Anyone ever seen anything like this?
>
> -geoff
>
>
>
>
>
>
>
