Hi,

We are running HBase 1.4.0 on Amazon EMR. We sometimes see a particular
region get into a state where many write requests to any row in that region
time out, reporting that they failed to obtain a lock on a row in the
region, and eventually hit an IPC timeout. This causes the IPC queue to grow
as requests back up, and that region server sees a much higher than normal
timeout rate for all requests, not just those that failed to obtain the
row lock.

The strange thing is that the rows are always different but the region is
always the same. So the question is: is there a region-level component to
how long a row write lock is held? I looked at the debug dump, and the
RowLocks section shows a long list of held write row locks, all from the
same region but for different rows.

Will an attempt to obtain a write row lock be delayed if no one else holds
a lock on the same row but the region itself is experiencing read delays?
We do have an incremental compaction tool running that major compacts one
region per region server at a time, which evicts pages from the bucket
cache. For most regions the impact is transient, lasting only until the
bucket cache is repopulated with pages from the new HFile. For this one
region, however, we start timing out trying to obtain write locks on rows
in that region.
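
To make the question concrete, here is a toy Java sketch of a purely
per-row lock table with a bounded wait. It is not HBase's code: the class
name RowLockModel, its methods, and the 100 ms wait are all hypothetical
and chosen only for illustration. In a model like this, a timeout should
only happen when another thread holds the lock on the same row, which is
why timeouts spread across many different rows of one region look
surprising.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Toy model only (hypothetical names); NOT the HBase implementation.
public class RowLockModel {
    // One lock object per row key, created lazily on first use.
    private final ConcurrentHashMap<String, ReentrantReadWriteLock> lockedRows =
            new ConcurrentHashMap<>();
    private final long waitMillis;

    public RowLockModel(long waitMillis) {
        this.waitMillis = waitMillis;
    }

    // Try to take the exclusive lock for one row; returns false on timeout.
    public boolean tryWriteLock(String row) throws InterruptedException {
        ReentrantReadWriteLock lock =
                lockedRows.computeIfAbsent(row, r -> new ReentrantReadWriteLock());
        return lock.writeLock().tryLock(waitMillis, TimeUnit.MILLISECONDS);
    }

    // Release the exclusive lock; must be called by the thread that holds it.
    public void unlockWrite(String row) {
        lockedRows.get(row).writeLock().unlock();
    }

    // Returns {acquired lock on the busy row, acquired lock on another row}.
    public static boolean[] demo() throws InterruptedException {
        RowLockModel region = new RowLockModel(100);
        CountDownLatch held = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);

        // A slow writer that holds the lock on "row-a" until told to release.
        Thread slowWriter = new Thread(() -> {
            try {
                region.tryWriteLock("row-a");
                held.countDown();
                release.await();
                region.unlockWrite("row-a");
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        slowWriter.start();
        held.await();

        boolean sameRow = region.tryWriteLock("row-a");  // contended row
        boolean otherRow = region.tryWriteLock("row-b"); // uncontended row

        release.countDown();
        slowWriter.join();
        return new boolean[] {sameRow, otherRow};
    }

    public static void main(String[] args) throws InterruptedException {
        boolean[] result = demo();
        System.out.println("same row acquired: " + result[0]);
        System.out.println("other row acquired: " + result[1]);
    }
}
```

In this sketch only the request for the row held by the slow writer times
out; the request for a different row succeeds immediately. What we observe
is closer to every row in one region behaving like the contended row.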

Any insight anyone can provide would be most welcome.

Cheers.

----
Saad
