Hmm.
Was the profiler enabled when you saw this?
Something is way off with the last of the threads showing in your thread dump:
"regionserver60020.cacheFlusher":
        at java.util.ResourceBundle.endLoading(ResourceBundle.java:1506)
        - waiting to lock <0x00007fe7cbacbd48> (a org.apache.hadoop.hbase.regionserver.MemStoreFlusher)
        at java.util.ResourceBundle.findBundle(ResourceBundle.java:1379)
        at java.util.ResourceBundle.findBundle(ResourceBundle.java:1292)
        at java.util.ResourceBundle.getBundleImpl(ResourceBundle.java:1234)
        at java.util.ResourceBundle.getBundle(ResourceBundle.java:832)
        ....
How is it that we are trying to get into a synchronized method of an HBase
class, MemStoreFlusher, from the depths of an i18n call? All that call is
doing is appending a locale-appropriate date to a String.
Something is way off?
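
One way to cross-check a dump like this from inside the JVM is the standard java.lang.management API, which reports, for each blocked thread, the lock it waits for and the thread that owns it. The following is only a minimal sketch: the class name LockOwnerReport and the demo thread are made up for illustration and are not part of HBase.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class LockOwnerReport {

    /** One line per thread that waits for a lock owned by another thread. */
    static String describeBlockedThreads() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        StringBuilder sb = new StringBuilder();
        // Only ask for monitor/synchronizer details if this JVM supports them.
        ThreadInfo[] infos = mx.dumpAllThreads(
                mx.isObjectMonitorUsageSupported(),
                mx.isSynchronizerUsageSupported());
        for (ThreadInfo info : infos) {
            if (info.getLockName() != null && info.getLockOwnerName() != null) {
                sb.append('"').append(info.getThreadName()).append("\" waits for ")
                  .append(info.getLockName()).append(" held by \"")
                  .append(info.getLockOwnerName()).append("\"\n");
            }
        }
        return sb.toString();
    }

    /** Hypothetical demo: block one thread on a monitor we hold, then report. */
    static String demo() throws InterruptedException {
        final Object monitor = new Object();
        Thread blocked = new Thread(() -> {
            synchronized (monitor) {
                // nothing to do, we only want to contend for the monitor
            }
        }, "demo-blocked");
        String report;
        synchronized (monitor) {
            blocked.start();
            // Wait until the demo thread is really stuck on the monitor.
            while (blocked.getState() != Thread.State.BLOCKED) {
                Thread.sleep(10);
            }
            report = describeBlockedThreads();
        }
        blocked.join();
        return report;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.print(demo());
    }
}
```

Run against the regionserver (for example from a debug servlet or a test), describeBlockedThreads() would name the thread that owns the MemStoreFlusher monitor, which is the piece missing from the excerpt above.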
St.Ack
2011/4/28 Zhoushuaifeng <[email protected]>:
> I rechecked this, and maybe it's not so bad.
> I read the documentation for the lock() method, which says:
> Acquires the lock if it is not held by another thread and returns
> immediately, setting the lock hold count to one.
>
> If the current thread already holds the lock then the hold count is
> incremented by one and the method returns immediately.
>
> If the lock is held by another thread then the current thread becomes
> disabled for thread scheduling purposes and lies dormant until the lock has
> been acquired, at which time the lock hold count is set to one.
>
> Specified by: lock() in Lock
>
> The put op reaches lock() through this call:
> this.cacheFlusher.reclaimMemStoreMemory();
> So the lock is still taken through cacheFlusher. If that means it is held by
> the same thread (cacheFlusher), then flushRegion can lock it at the same time.
> I'm not so familiar with ReentrantLock, so please check whether I'm right. If
> not, this is a critical priority issue.
>
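
On the ReentrantLock question: the hold count in the documentation quoted above is kept per thread, not per object, so it matters which thread runs the call, not which object the call goes through. A small sketch (the class name ReentrancyDemo is made up for illustration): the thread that already holds the lock can take it again, while a second thread cannot.

```java
import java.util.concurrent.locks.ReentrantLock;

public class ReentrancyDemo {

    /** Returns {hold count after the second lock(), 1 if the other thread got the lock else 0}. */
    static int[] run() throws InterruptedException {
        final ReentrantLock lock = new ReentrantLock();
        final boolean[] otherGotIt = new boolean[1];
        int holdCount;
        lock.lock();
        try {
            lock.lock(); // same thread again: hold count goes from 1 to 2, no blocking
            try {
                holdCount = lock.getHoldCount();
                Thread other = new Thread(() -> {
                    // A different thread: tryLock() fails while we hold the lock.
                    boolean got = lock.tryLock();
                    otherGotIt[0] = got;
                    if (got) {
                        lock.unlock();
                    }
                }, "other");
                other.start();
                other.join();
            } finally {
                lock.unlock();
            }
        } finally {
            lock.unlock();
        }
        return new int[] { holdCount, otherGotIt[0] ? 1 : 0 };
    }

    public static void main(String[] args) throws InterruptedException {
        int[] r = run();
        System.out.println("hold count on owning thread: " + r[0]);
        System.out.println("other thread acquired lock: " + (r[1] == 1));
    }
}
```

So if the put runs on an IPC handler thread and flushRegion runs on the cacheFlusher thread, reentrancy does not let both hold the lock at once, even though both go through the same MemStoreFlusher instance.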
>
> Zhou Shuaifeng(Frank)
>
>
> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On Behalf Of Stack
> Sent: April 29, 2011 11:48
> To: [email protected]
> Cc: Yanlijun; Chenjian
> Subject: Re: found one deadlock on hbase?
>
> Yes. The below looks viable (though it is strange we have not seen it up to
> now). The profiler may have slowed things down enough to bring on the
> deadlock -- or the run up to the high water mark -- but it's still a deadlock.
> Please file a critical priority issue.
>
> If you have a patch, that'd be excellent.
>
> Thanks for digging in on this,
> St.Ack
>
>
> 2011/4/28 Zhoushuaifeng <[email protected]>:
>> Thanks, I will do more testing.
>> Maybe the deadlock happened like this? Please point it out if it's wrong.
>>
>> 1, One handler is handling a put op and calls reclaimMemStoreMemory, but the
>> memory isAboveHighWaterMark, so this handler locks the memstoreflusher until
>> global memory drops:
>>
>>   public synchronized void reclaimMemStoreMemory() {
>>     if (isAboveHighWaterMark()) {
>>       lock.lock();
>>       try {
>>         while (isAboveHighWaterMark() && !server.isStopped()) {
>>           wakeupFlushThread();
>>           try {
>>             // we should be able to wait forever, but we've seen a bug where
>>             // we miss a notify, so put a 5 second bound on it at least.
>>             flushOccurred.await(5, TimeUnit.SECONDS);
>>           } catch (InterruptedException ie) {
>>             Thread.currentThread().interrupt();
>>           }
>>         }
>>       } finally {
>>         lock.unlock();
>>
>> 2, flushforGlobalPressure is triggered, but to flush the memstore, it needs
>> to lock the memstoreflusher:
>>
>>   private boolean flushRegion(final HRegion region, final boolean emergencyFlush) {
>>     synchronized (this.regionsInQueue) {
>>       FlushRegionEntry fqe = this.regionsInQueue.remove(region);
>>       if (fqe != null && emergencyFlush) {
>>         // Need to remove from region from delay queue. When NOT an
>>         // emergencyFlush, then item was removed via a flushQueue.poll.
>>         flushQueue.remove(fqe);
>>       }
>>       lock.lock();
>>     }
>>
>> 3, because the lock is held by the IPC handler of the put op, flushRegion
>> will never get the lock and the flush will never happen.
>> 4, with no flush, memory stays in the AboveHighWaterMark state and the lock
>> is never released, so a deadlock happens.
>>
>> Is it right?
>>
>> Zhou Shuaifeng(Frank)
>>
>
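
A note on steps 1 to 4 quoted above: Condition.await() releases the ReentrantLock it belongs to for the duration of the wait, but it does not release a monitor taken by a surrounding synchronized method. The sketch below has the same shape as the quoted reclaimMemStoreMemory() (synchronized method, then lock(), then a timed await()). The class and method names are made up for illustration, and it makes no claim about which synchronized method the real cacheFlusher thread was trying to enter.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

public class AwaitInsideSynchronized {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition flushOccurred = lock.newCondition();
    private boolean waiting = false;  // guarded by lock
    private boolean flushed = false;  // guarded by lock

    // Same shape as the quoted method: monitor first, then lock, then await.
    synchronized void reclaim() throws InterruptedException {
        lock.lock();
        try {
            waiting = true;
            while (!flushed) {
                flushOccurred.await(5, TimeUnit.SECONDS); // releases lock, keeps monitor
            }
        } finally {
            lock.unlock();
        }
    }

    // Hypothetical stand-in for any synchronized method the flusher might need.
    synchronized void needsMonitor() {
    }

    /** Returns {lock acquired during await, flusher blocked on monitor during await}. */
    static boolean[] run() throws InterruptedException {
        final AwaitInsideSynchronized f = new AwaitInsideSynchronized();
        Thread handler = new Thread(() -> {
            try {
                f.reclaim();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "handler");
        Thread flusher = new Thread(f::needsMonitor, "flusher");
        handler.start();

        // Retry until we hold the lock while the handler is inside its await loop.
        boolean lockAcquired = false;
        while (!lockAcquired) {
            if (f.lock.tryLock()) {
                if (f.waiting) {
                    lockAcquired = true;
                } else {
                    f.lock.unlock();
                }
            }
            if (!lockAcquired) {
                Thread.sleep(10);
            }
        }

        boolean monitorBlocked;
        try {
            flusher.start();
            long deadline = System.nanoTime() + TimeUnit.SECONDS.toNanos(2);
            while (flusher.getState() != Thread.State.BLOCKED
                    && System.nanoTime() < deadline) {
                Thread.sleep(10);
            }
            monitorBlocked = flusher.getState() == Thread.State.BLOCKED;
            f.flushed = true;            // let the handler leave its loop
            f.flushOccurred.signalAll();
        } finally {
            f.lock.unlock();
        }
        handler.join();
        flusher.join();
        return new boolean[] { lockAcquired, monitorBlocked };
    }

    public static void main(String[] args) throws InterruptedException {
        boolean[] r = run();
        System.out.println("lock acquired during await: " + r[0]);
        System.out.println("monitor blocked during await: " + r[1]);
    }
}
```

If the real code behaves like this sketch, lock.lock() in flushRegion would not be stuck while the handler sits in await(), but any synchronized method on the same MemStoreFlusher would be, which would fit the "waiting to lock ... (a MemStoreFlusher)" frame in the dump at the top of this mail.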