Hi, Stack,
Thank you very much. We didn't do any GC tuning and just use the
default config. We may try to upgrade our hbase version later.
I extracted more log before and after the failure point:
12-02-15 21:38:11,399 INFO
org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Flush of region
test_table,alexander hawk,1329369807272.26c8f0b19f1b475e526b12d639dc0241.
due to global heap pressure
2012-02-15 21:38:12,616 INFO org.apache.hadoop.hbase.regionserver.Store:
Renaming flushed file at
hdfs://tianweis:8020/hbase/test_table/5e1b41c1ea5e87d75fbac2e5fb26e68b/.tmp/2688884176779434164
to
hdfs://tianweis:8020/hbase/test_table/5e1b41c1ea5e87d75fbac2e5fb26e68b/attributes/6670946670903024661
2012-02-15 21:38:16,592 INFO org.apache.hadoop.hbase.regionserver.Store:
Added
hdfs://tianweis:8020/hbase/test_table/5e1b41c1ea5e87d75fbac2e5fb26e68b/attributes/6670946670903024661,
entries=2644, sequenceid=1305864181, memsize=5.6m, filesize=3.4m
2012-02-15 21:38:16,594 INFO org.apache.hadoop.hbase.regionserver.HRegion:
Closed test_table,ali ramos,1329364709394.5e1b41c1ea5e87d75fbac2e5fb26e68b.
2012-02-15 21:38:30,379 INFO org.apache.hadoop.hbase.regionserver.Store:
Renaming flushed file at
hdfs://tianweis:8020/hbase/test_table/26c8f0b19f1b475e526b12d639dc0241/.tmp/1543489781580457584
to
hdfs://tianweis:8020/hbase/test_table/26c8f0b19f1b475e526b12d639dc0241/attributes/7028169816865652663
2012-02-15 21:38:44,073 INFO org.apache.hadoop.hbase.regionserver.Store:
Added
hdfs://tianweis:8020/hbase/test_table/26c8f0b19f1b475e526b12d639dc0241/attributes/7028169816865652663,
entries=20658, sequenceid=1305864185, memsize=44.1m, filesize=23.5m
2012-02-15 21:38:44,074 INFO
org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Flush thread woke up
with memory above low water.
2012-02-15 21:38:44,075 INFO
org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Under global heap
pressure: Region test_table,alex
yost,1329368982672.f55282926b9e8652f7bb0b98616a8216. has too many store
files, but is 88.2m vs best flushable region's 40.4m. Choosing the bigger.
2012-02-15 21:38:44,075 INFO
org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Flush of region
test_table,alex yost,1329368982672.f55282926b9e8652f7bb0b98616a8216. due to
global heap pressure
2012-02-15 21:38:51,349 INFO org.apache.hadoop.hdfs.DFSClient: Could not
complete file
/hbase/test_table/5e1b41c1ea5e87d75fbac2e5fb26e68b/splits/3873b831c5d2e7ca743e97e77ff6ddc9/attributes/1099303928625969539.5e1b41c1ea5e87d75fbac2e5fb26e68b
retrying...
2012-02-15 21:38:51,349 INFO org.apache.hadoop.hdfs.DFSClient: Could not
complete file
/hbase/test_table/5e1b41c1ea5e87d75fbac2e5fb26e68b/splits/3873b831c5d2e7ca743e97e77ff6ddc9/attributes/6670946670903024661.5e1b41c1ea5e87d75fbac2e5fb26e68b
retrying...
2012-02-15 21:38:51,349 INFO org.apache.hadoop.hdfs.DFSClient: Could not
complete file
/hbase/test_table/5e1b41c1ea5e87d75fbac2e5fb26e68b/splits/3873b831c5d2e7ca743e97e77ff6ddc9/attributes/8453068235826691778.5e1b41c1ea5e87d75fbac2e5fb26e68b
retrying...
2012-02-15 21:38:56,164 INFO
org.apache.hadoop.hbase.regionserver.CompactSplitThread: Running rollback
of failed split of test_table,ali
ramos,1329364709394.5e1b41c1ea5e87d75fbac2e5fb26e68b.; Took too long to
split the files and create the references, aborting split
2012-02-15 21:39:58,011 FATAL
org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server
serverName=tianweis,62222,1329359487662, load=(requests=0, regions=1350,
usedHeap=8560, maxHeap=11979): Replay of HLog required. Forcing server
shutdown
org.apache.hadoop.hbase.DroppedSnapshotException: region: test_table,alex
yost,1329368982672.f55282926b9e8652f7bb0b98616a8216.
at
org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:990)
at
org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:895)
at
org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:847)
at
org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:386)
at
org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushOneForGlobalPressure(MemStoreFlusher.java:194)
at
org.apache.hadoop.hbase.regionserver.MemStoreFlusher.run(MemStoreFlusher.java:214)
Caused by: java.io.IOException: Call to
tianweis/137.110.161.244:8020failed on local exception:
java.nio.channels.ClosedByInterruptException
at org.apache.hadoop.ipc.Client.wrapException(Client.java:775)
at org.apache.hadoop.ipc.Client.call(Client.java:743)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
at $Proxy6.complete(Unknown Source)
at sun.reflect.GeneratedMethodAccessor12.invoke(Unknown Source)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:616)
at
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
at
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
at $Proxy6.complete(Unknown Source)
at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.closeInternal(DFSClient.java:3457)
at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.close(DFSClient.java:3381)
at
org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:61)
at
org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:86)
at org.apache.hadoop.hbase.io.hfile.HFile$Writer.close(HFile.java:655)
at
org.apache.hadoop.hbase.regionserver.StoreFile$Writer.close(StoreFile.java:877)
at
org.apache.hadoop.hbase.regionserver.Store.internalFlushCache(Store.java:488)
at org.apache.hadoop.hbase.regionserver.Store.flushCache(Store.java:448)
at org.apache.hadoop.hbase.regionserver.Store.access$100(Store.java:81)
at
org.apache.hadoop.hbase.regionserver.Store$StoreFlusherImpl.flushCache(Store.java:1509)
at
org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:968)
... 5 more
Caused by: java.nio.channels.ClosedByInterruptException
at
java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:201)
at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:359)
at
org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:55)
at
org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
at
org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146)
at
org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
2012-02-15 21:39:58,014 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Dump of metrics:
request=0.0, regions=1350, stores=2147, storefiles=3595,
storefileIndexSize=2286, memstoreSize=4792, compactionQueueSize=119,
flushQueueSize=66, usedHeap=8689, maxHeap=11979, blockCacheSize=944895704,
blockCacheFree=1567335208, blockCacheCount=7739, blockCacheHitCount=17170,
blockCacheMissCount=5362757, blockCacheEvictedCount=0,
blockCacheHitRatio=0, blockCacheHitCachingRatio=68
2012-02-15 21:39:58,015 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: STOPPED: Replay of HLog
required. Forcing server shutdown
2012-02-15 21:39:58,034 INFO
org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Excluding unflushable
region test_table,alex yost,1329368982672.f55282926b9e8652f7bb0b98616a8216.
- trying to find a different region to flush.
2012-02-15 21:39:58,034 INFO org.apache.hadoop.ipc.HBaseServer: Stopping
server on 62222
Thanks.
Tianwei
On Thu, Feb 16, 2012 at 9:02 AM, Stack <[email protected]> wrote:
> On Wed, Feb 15, 2012 at 10:44 PM, Tianwei <[email protected]> wrote:
> > limit, it will die. From the log, it seems that the compact/split will
> fail
> > due to the memory problem:
> > 2012-02-15 21:39:58,013 ERROR
> > org.apache.hadoop.hbase.regionserver.CompactSplitThread: Compaction/Split
> > failed for region test_table,ali
> > ramos,1329364709394.5e1b41c1ea5e87d75fbac2e5fb26e68b.
> >
> > I know very little about the internal implementation of hbase, could you
> > guys give me some suggestions for the following questions:
> > 1. Why the memory usage of the region server keep increasing? Is it
> > simply because I am writing big data into the hbase table? Which parts
> of
> > hbase will use more memory for increasing table size? Are there any
> > configuration options for me to alleviate this problem?
> >
> > 2. Why the region server die? is it because the GC is not quick enough
> > to free memory for hbase? I assume that writing data, compacting/spliting
> > all need to allocate new memory. And if the GC is not quick enough, these
> > function will simply get exception and cause the region server to die. Is
> > that right?
> >
>
> Need more logs from regionserver and you should enable GC logging.
>
> Heap increases as you put more data in (more regions, more memstores).
> Its natural!
>
> My guess is you are losing your session with zookeeper because of a
> big GC pause. Have you done any GC tuning? Using default configs?
>
> >
> > We used hbase-0.90.1 version and this problem really bothers us a lot.
> Hope
> > you can give us some suggestions.
> >
>
> Update your hbase. That'll probably help too.
>
> St.Ack
>