> HBase and HDFS are not made to span datacenters.
Yes, I know. I think the correct term is 3 availability zones
within the same region (ping between them is about 1 ms).
> You could paste the thread dump
This is the full thread dump:
Full thread dump Java HotSpot(TM) 64-Bit Server VM (20.1-b02 mixed mode):
"LruBlockCache.EvictionThread" daemon prio=10 tid=0x0000000052cc8000 nid=0x2859 in Object.wait() [0x0000000044510000]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x0000000781b79cf0> (a org.apache.hadoop.hbase.io.hfile.LruBlockCache$EvictionThread)
at java.lang.Object.wait(Object.java:485)
at org.apache.hadoop.hbase.io.hfile.LruBlockCache$EvictionThread.run(LruBlockCache.java:519)
- locked <0x0000000781b79cf0> (a org.apache.hadoop.hbase.io.hfile.LruBlockCache$EvictionThread)
"1692369893@qtp-1688716382-1 - Acceptor0 [email protected]:60030" prio=10 tid=0x00002aaab064e000 nid=0x2841 runnable [0x0000000042cf8000]
java.lang.Thread.State: RUNNABLE
at org.mortbay.jetty.AbstractConnector$Acceptor.run(AbstractConnector.java:724)
at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
"DestroyJavaVM" prio=10 tid=0x00000000520e2000 nid=0x2808 waiting on condition [0x0000000000000000]
java.lang.Thread.State: RUNNABLE
"Low Memory Detector" daemon prio=10 tid=0x000000005216f800 nid=0x2811 runnable [0x0000000000000000]
java.lang.Thread.State: RUNNABLE
"C2 CompilerThread1" daemon prio=10 tid=0x000000005216d800 nid=0x2810 waiting on condition [0x0000000000000000]
java.lang.Thread.State: RUNNABLE
"C2 CompilerThread0" daemon prio=10 tid=0x0000000052167800 nid=0x280f waiting on condition [0x0000000000000000]
java.lang.Thread.State: RUNNABLE
"Signal Dispatcher" daemon prio=10 tid=0x0000000052165800 nid=0x280e waiting on condition [0x0000000000000000]
java.lang.Thread.State: RUNNABLE
"Finalizer" daemon prio=10 tid=0x0000000052145000 nid=0x280d in Object.wait() [0x0000000041de9000]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x00000007800a43e8> (a java.lang.ref.ReferenceQueue$Lock)
at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:118)
- locked <0x00000007800a43e8> (a java.lang.ref.ReferenceQueue$Lock)
at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:134)
at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159)
"Reference Handler" daemon prio=10 tid=0x0000000052143000 nid=0x280c in Object.wait() [0x0000000041ce8000]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x00000007800a3618> (a java.lang.ref.Reference$Lock)
at java.lang.Object.wait(Object.java:485)
at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116)
- locked <0x00000007800a3618> (a java.lang.ref.Reference$Lock)
"VM Thread" prio=10 tid=0x000000005213c000 nid=0x280b runnable
"GC task thread#0 (ParallelGC)" prio=10 tid=0x00000000520f5000 nid=0x2809 runnable
"GC task thread#1 (ParallelGC)" prio=10 tid=0x00000000520f7000 nid=0x280a runnable
"VM Periodic Task Thread" prio=10 tid=0x000000005217a000 nid=0x2812 waiting on condition
JNI global references: 974
Heap
PSYoungGen total 560640K, used 480899K [0x00000007d5560000, 0x0000000800000000, 0x0000000800000000)
eden space 490368K, 83% used [0x00000007d5560000,0x00000007ee66ca20,0x00000007f3440000)
from space 70272K, 99% used [0x00000007f3440000,0x00000007f78d4590,0x00000007f78e0000)
to space 110464K, 0% used [0x00000007f9420000,0x00000007f9420000,0x0000000800000000)
PSOldGen total 266368K, used 155632K [0x0000000780000000, 0x0000000790420000, 0x00000007d5560000)
object space 266368K, 58% used [0x0000000780000000,0x00000007897fc3c0,0x0000000790420000)
PSPermGen total 33216K, used 18585K [0x000000077ae00000, 0x000000077ce70000, 0x0000000780000000)
object space 33216K, 55% used [0x000000077ae00000,0x000000077c0267c0,0x000000077ce70000)
Regards,
Bogdan
On Wed, Sep 21, 2011 at 7:06 PM, Stack <[email protected]> wrote:
> On Wed, Sep 21, 2011 at 1:07 AM, Bogdan Ghidireac <[email protected]> wrote:
>> I have an HBase 0.90.3 fleet with more than 100 hosts distributed
>> across 3 datacenters.
>> Yesterday, we experienced some network issues
>> for several minutes and the region servers from one data center lost
>> the connectivity with the namenode. They started the shutdown sequence
>> but about 20 hosts were unable to complete it successfully. This is
>> bad for us because we have to restart them manually.
>>
>
> HBase and HDFS are not made to span datacenters.
>
>> Is my assumption correct? Should I open a JIRA?
>>
>
> You could paste the thread dump but before doing anything, I'd suggest
> you change your layout so cluster runs inside a single datacenter.
>
> St.Ack
>