While running with dfs.client.read.shortcircuit set to true I ran into an OOM on a region server that subsequently died.

This was probably due to too little direct memory being configured.

However, after bringing the cluster up again, one region of a table got stuck in transition. More specifically, the master says:

---
6400e1626085724ae20b2a6fa1914db8 tt_locks,,1461919149434.6400e1626085724ae20b2a6fa1914db8. state=FAILED_CLOSE, ts=Tue May 10 17:58:29 CEST 2016 (0s ago), server=hb-desktop,16201,1462895637261
---

Running hbase hbck, I get:

---
ERROR: Region { meta => tt_locks,,1461919149434.6400e1626085724ae20b2a6fa1914db8., hdfs => hdfs://localhost:9000/hbase/data/default/tt_locks/6400e1626085724ae20b2a6fa1914db8, deployed => , replicaId => 0 } not deployed on any region server.
ERROR: There is a hole in the region chain between and . You need to create a new .regioninfo and region dir in hdfs to plug the hole.
---

But all the tables are listed as "ok".

Any attempt to repair seems to have no effect. Worse, the region server is trying like crazy to get that region opened and runs into an OOM after a few minutes.

(It keeps saying "Started memstore flush for..." but never seems to get anywhere).

There is really very little load: 76 regions and 212 store files, and I allowed for 1.5G heap and 1.5G direct memory.

After disabling dfs.client.read.shortcircuit at least there is no OOM anymore.
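
For reference, a sketch of the setting in question. Only the property name is taken from the setup described above; that it sits in hbase-site.xml on the region server is an assumption (in other setups it may be in hdfs-site.xml instead):

---
<!-- Assumed location: hbase-site.xml on the region server -->
<property>
  <name>dfs.client.read.shortcircuit</name>
  <!-- false = reads go through the DataNode instead of directly to local block files -->
  <value>false</value>
</property>
---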

I have the vague suspicion that that stupid region should be simply dropped, but I have no idea how to fix this.

As we will go into production with this system shortly, any help would be great!!

Thanks,
Henning

