bq. timing out trying to obtain write locks on rows in that region.

Can you confirm that the region under contention was the one being major compacted?
Can you pastebin a thread dump so that we can have a better idea of the scenario?

For the region being compacted, how long would the compaction take? (I just want to see whether there is a correlation between this duration and the timeouts.)

Cheers

On Wed, Feb 28, 2018 at 6:31 PM, Saad Mufti <saad.mu...@gmail.com> wrote:

> Hi,
>
> We are running on Amazon EMR based HBase 1.4.0. We are currently seeing a
> situation where sometimes a particular region gets into a situation where a
> lot of write requests to any row in that region timeout saying they failed
> to obtain a lock on a row in a region and eventually they experience an IPC
> timeout. This causes the IPC queue to blow up in size as requests get
> backed up, and that region server experiences a much higher than normal
> timeout rate for all requests, not just those timing out for failing to
> obtain the row lock.
>
> The strange thing is the rows are always different but the region is always
> the same. So the question is, is there a region component to how long a row
> write lock would be held? I looked at the debug dump and the RowLocks
> section shows a long list of write row locks held, all of them are from the
> same region but different rows.
>
> Will trying to obtain a write row lock experience delays if no one else
> holds a lock on the same row but the region itself is experiencing read
> delays? We do have an incremental compaction tool running that major
> compacts one region per region server at a time, so that will drive out
> pages from the bucket cache. But for most regions the impact is
> transitional until the bucket cache gets populated by pages from the new
> HFile. But for this one region we start timing out trying to obtain write
> locks on rows in that region.
>
> Any insight anyone can provide would be most welcome.
>
> Cheers.
>
> ----
> Saad
>