Thanks, everyone, for responding.

No, I don't have the GC logs; I don't even know how to get them. But it seems that the regionserver did recover from that and then got into trouble here:

2013-04-22 16:47:56,830 INFO org.apache.hadoop.hbase.regionserver.HRegion: compaction interrupted by user: java.io.InterruptedIOException: Aborting compaction of store f in region t1_webpage,com.pandora.www:http/shaggy,1366670139658.9f565d5da3468c0725e590dc232abc23. because user requested stop.

The part that I don't understand is what it means when it says "compaction interrupted by user"!

And to answer your question, Ted: I am using 0.90.6 over Hadoop 1.1.1 (I can't upgrade, since Gora so far only works with 0.90.x). And no, everything was normal as far as I could tell. The map jobs were struggling because, I assume, HBase had become unresponsive (the web interface started showing exceptions, and that is how I figured out that the regionserver was down). While I was restarting this one (through the status command in the shell), I noticed that two more regionservers had gone down with an identical error (the second one, not the one about the GC pause). Once I restarted the regionservers (using hbase-daemon.sh), everything went back to normal. But this keeps happening, and as a result I can't leave my jobs unsupervised.

thanks,

On 04/22/2013 07:35 PM, Ted Yu wrote:
Kaveh:
What version of HBase are you using ?
Around 2013-04-22 16:47:56, did you observe anything else happening in your
cluster ? See below:

2013-04-22 16:47:56,830 INFO org.apache.hadoop.hbase.regionserver.HRegion:
compaction interrupted by user:
java.io.InterruptedIOException: Aborting compaction of store f in region
t1_webpage,com.pandora.www:http/shaggy,1366670139658.9f565d5da3468c0725e590dc232abc23.
because user requested stop.
         at org.apache.hadoop.hbase.regionserver.Store.compact(Store.java:998)
         at org.apache.hadoop.hbase.regionserver.Store.compact(Store.java:779)
         at org.apache.hadoop.hbase.regionserver.HRegion.compactStores(HRegion.java:776)

On Mon, Apr 22, 2013 at 6:46 PM, Jean-Marc Spaggiari <
[email protected]> wrote:

Hi Kaveh,

The answer may already be in the logs you sent ;)

"This disconnect could have been caused by a network partition or a
long-running GC pause, either way it's recommended that you verify
your environment."

Do you have GC logs? Have you tried anything to solve that?
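
One more thing worth checking next to GC activity is the ZooKeeper session timeout that is actually in effect. In the log, the client asks for sessionTimeout=180000, but the server answers with negotiated timeout = 40000. Below is a minimal sketch that pulls both values out of a log excerpt; the function name `session_timeouts` is made up for illustration, and the sample text is copied from the log in this thread. As far as I know, ZooKeeper servers cap the session timeout at maxSessionTimeout, which defaults to 20 x tickTime, so 40000 is what a default tickTime of 2000 would give. Whether that is the setting on this cluster is an assumption.

```python
import re

# Patterns for the two values as they appear in the ZooKeeper client log lines.
REQUESTED = re.compile(r"sessionTimeout=(\d+)")
NEGOTIATED = re.compile(r"negotiated timeout = (\d+)")


def session_timeouts(log_text):
    """Return (requested_ms, negotiated_ms) lists found in the log text."""
    requested = [int(m) for m in REQUESTED.findall(log_text)]
    negotiated = [int(m) for m in NEGOTIATED.findall(log_text)]
    return requested, negotiated


# Excerpt copied from the regionserver log quoted in this thread.
sample = """\
2013-04-22 16:47:22,341 INFO org.apache.zookeeper.ZooKeeper: Initiating
client connection, connectString=zk1:2181 sessionTimeout=180000
watcher=hconnection
2013-04-22 16:47:22,397 INFO org.apache.zookeeper.ClientCnxn: Session
establishment complete on server d1r2n2.prod.plutoz.com/10.0.0.66:2181,
sessionid = 0x13dd980d2abbf93, negotiated timeout = 40000
"""

requested, negotiated = session_timeouts(sample)
for req, neg in zip(requested, negotiated):
    if neg < req:
        # The server granted a shorter session than the client asked for.
        print(f"requested {req} ms but server granted {neg} ms")
```

If the granted value is much smaller than the requested one, a pause shorter than the configured timeout can still expire the session.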

JM

2013/4/22 kaveh minooie <[email protected]>:
Hi

After a few MapReduce jobs, my regionservers shut themselves down. This is
the latest time that this has happened:

2013-04-22 16:47:21,843 INFO
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation:
This client just lost it's session with ZooKeeper, trying to reconnect.
2013-04-22 16:47:21,843 FATAL
org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server
serverName=d1r1n17.prod.plutoz.com,60020,1366657358443, load=(requests=5392,
regions=196, usedHeap=1063, maxHeap=3966):
regionserver:60020-0x13dd980d2ab8661-0x13dd980d2ab8661
regionserver:60020-0x13dd980d2ab8661-0x13dd980d2ab8661 received expired
from ZooKeeper, aborting
org.apache.zookeeper.KeeperException$SessionExpiredException:
KeeperErrorCode = Session expired
         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:352)
         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:270)
         at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:523)
         at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:499)
2013-04-22 16:47:21,843 INFO

org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation:
Trying to reconnect to zookeeper.
2013-04-22 16:47:21,844 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Dump of metrics:
requests=1794, regions=196, stores=1561, storefiles=1585,
storefileIndexSize=104, memstoreSize=306, compactionQueueSize=10,
flushQueueSize=0, usedHeap=1073, maxHeap=3966, blockCacheSize=661986032,
blockCacheFree=169901776, blockCacheCount=7242,
blockCacheHitCount=910925,
blockCacheMissCount=1558134, blockCacheEvictedCount=1344753,
blockCacheHitRatio=36, blockCacheHitCachingRatio=40
2013-04-22 16:47:21,844 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: STOPPED:
regionserver:60020-0x13dd980d2ab8661-0x13dd980d2ab8661
regionserver:60020-0x13dd980d2ab8661-0x13dd980d2ab8661 received expired
from ZooKeeper, aborting
2013-04-22 16:47:21,844 INFO org.apache.zookeeper.ClientCnxn: EventThread
shut down
2013-04-22 16:47:21,900 WARN
org.apache.hadoop.hbase.regionserver.wal.HLog:
Too many consecutive RollWriter requests, it's a sign of the total number of
live datanodes is lower than the tolerable replicas.
2013-04-22 16:47:22,341 INFO org.apache.zookeeper.ZooKeeper: Initiating
client connection, connectString=zk1:2181 sessionTimeout=180000
watcher=hconnection
2013-04-22 16:47:22,357 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Waiting on 1 regions to
close
2013-04-22 16:47:22,394 INFO org.apache.zookeeper.ClientCnxn: Opening socket
connection to server d1r2n2.prod.plutoz.com/10.0.0.66:2181. Will not attempt
to authenticate using SASL (unknown error)
2013-04-22 16:47:22,395 INFO org.apache.zookeeper.ClientCnxn: Socket
connection established to d1r2n2.prod.plutoz.com/10.0.0.66:2181, initiating
session
2013-04-22 16:47:22,397 INFO org.apache.zookeeper.ClientCnxn: Session
establishment complete on server d1r2n2.prod.plutoz.com/10.0.0.66:2181,
sessionid = 0x13dd980d2abbf93, negotiated timeout = 40000
2013-04-22 16:47:22,400 INFO
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation:
Reconnected successfully. This disconnect could have been caused by a
network partition or a long-running GC pause, either way it's recommended
that you verify your environment.
2013-04-22 16:47:22,400 INFO org.apache.zookeeper.ClientCnxn: EventThread
shut down
2013-04-22 16:47:56,830 INFO
org.apache.hadoop.hbase.regionserver.HRegion:
compaction interrupted by user:
java.io.InterruptedIOException: Aborting compaction of store f in region
t1_webpage,com.pandora.www:http/shaggy,1366670139658.9f565d5da3468c0725e590dc232abc23.
because user requested stop.
         at org.apache.hadoop.hbase.regionserver.Store.compact(Store.java:998)
         at org.apache.hadoop.hbase.regionserver.Store.compact(Store.java:779)
         at org.apache.hadoop.hbase.regionserver.HRegion.compactStores(HRegion.java:776)
         at org.apache.hadoop.hbase.regionserver.HRegion.compactStores(HRegion.java:721)
         at org.apache.hadoop.hbase.regionserver.CompactSplitThread.run(CompactSplitThread.java:81)
2013-04-22 16:47:56,830 INFO
org.apache.hadoop.hbase.regionserver.HRegion:
aborted compaction on region
t1_webpage,com.pandora.www:http/shaggy,1366670139658.9f565d5da3468c0725e590dc232abc23.
after 5mins, 58sec
2013-04-22 16:47:56,830 INFO
org.apache.hadoop.hbase.regionserver.CompactSplitThread:
regionserver60020.compactor exiting
2013-04-22 16:47:56,832 INFO
org.apache.hadoop.hbase.regionserver.HRegion:
Closed
t1_webpage,com.pandora.www:http/shaggy,1366670139658.9f565d5da3468c0725e590dc232abc23.
2013-04-22 16:47:57,363 INFO
org.apache.hadoop.hbase.regionserver.wal.HLog:
regionserver60020.logSyncer exiting
2013-04-22 16:47:57,366 INFO org.apache.hadoop.hbase.regionserver.Leases:
regionserver60020 closing leases
2013-04-22 16:47:57,366 INFO org.apache.hadoop.hbase.regionserver.Leases:
regionserver60020 closed leases
2013-04-22 16:47:57,366 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver60020
exiting
2013-04-22 16:47:57,497 INFO
org.apache.hadoop.hbase.regionserver.ShutdownHook: Shutdown hook starting;
hbase.shutdown.hook=true; fsShutdownHook=Thread[Thread-15,5,main]
2013-04-22 16:47:57,497 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: STOPPED: Shutdown hook
2013-04-22 16:47:57,497 INFO
org.apache.hadoop.hbase.regionserver.ShutdownHook: Starting fs shutdown hook
thread.
2013-04-22 16:47:57,504 INFO org.apache.hadoop.hbase.regionserver.Leases:
regionserver60020.leaseChecker closing leases
2013-04-22 16:47:57,504 INFO org.apache.hadoop.hbase.regionserver.Leases:
regionserver60020.leaseChecker closed leases
2013-04-22 16:47:57,598 INFO
org.apache.hadoop.hbase.regionserver.ShutdownHook: Shutdown hook
finished.
I would appreciate it very much if someone could explain to me what just
happened here.

thanks,
