(Running HBase 0.90.0 on 700+ nodes.)
You may have seen many (or nearly all) of the following issues already:
1. HConnection.isTableAvailable: This doesn't seem to work reliably.
In particular, I had this code after creating a table asynchronously:
do {
  LOG.info("Table " + tableName + " not yet available... Sleeping for " +
      sleepTime + " milliseconds...");
  Thread.sleep(sleepTime);
} while (!conn.isTableAvailable(table.getTableName()));
LOG.info("Table is available!! : " + tableName + " Available? " +
    conn.isTableAvailable(table.getTableName()));
It comes out of the loop but then I see this:
Table is available!! : <TABLE> Available? false
And then I see that not all of the regions are available yet.
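
A possible guard against a single transient "true" from the availability check
is to require several consecutive positive answers, with an overall timeout.
The sketch below is only an illustration and not a fix for the underlying
behavior: the class TableWait, the method waitUntilStable and its parameters
are hypothetical names, and the check is passed in as a Callable so the
sketch does not depend on any HBase class.

```java
import java.util.concurrent.Callable;

// Hypothetical helper, not part of HBase.
final class TableWait {
    // Returns true once `check` has answered true `requiredHits` times in a
    // row, or false if `timeoutMillis` elapses first.
    static boolean waitUntilStable(Callable<Boolean> check, int requiredHits,
                                   long sleepMillis, long timeoutMillis)
            throws Exception {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        int hits = 0;
        while (true) {
            // Any negative answer resets the streak.
            hits = Boolean.TRUE.equals(check.call()) ? hits + 1 : 0;
            if (hits >= requiredHits) {
                return true;
            }
            if (System.currentTimeMillis() >= deadline) {
                return false;
            }
            Thread.sleep(sleepMillis);
        }
    }
}
```

With the snippet above, the check would presumably be something like
() -> conn.isTableAvailable(table.getTableName()), with requiredHits set to a
small number such as 3.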
2. The master getting stuck, unable to delete a WAL (I have seen this reported
on this forum before, along with a related JIRA): We had worked around it by
manually deleting the WAL. But at times when the master crashed during table
creation (with split key boundaries), the node that took over next as the
master (failover) started getting stuck for around 25% of the cluster. I had
to wipe out all the logs so that the master could start up properly.
But even then, the regionservers that had suffered the log issue couldn't
recognize the failed-over master. (Is this something that has been observed
before?)
3. createTableAsync with incorrect split keys: By mistake, I had some
duplicate keys in the split key byte array when calling createTableAsync.
The master crashed, throwing a KeeperException (because of the duplicate
keys, I guess?).
Also, can you let me know why createTableAsync blocks for some time and throws
a socket timeout exception when I try creating a table with a large number of
regions?
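
On the duplicate split keys in item 3, a client-side check before calling
createTableAsync might catch the mistake early. The following is a minimal
sketch, assuming the split keys are held in a byte[][]; SplitKeyCheck and
hasDuplicates are hypothetical names, and the comparator is written out by
hand (unsigned, byte-by-byte) so the sketch stays free of HBase dependencies.

```java
import java.util.Comparator;
import java.util.TreeSet;

// Hypothetical helper, not part of HBase.
final class SplitKeyCheck {
    // Unsigned lexicographic ordering of byte arrays.
    static final Comparator<byte[]> UNSIGNED_LEX = (a, b) -> {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    };

    // Returns true if any two split keys have identical contents.
    static boolean hasDuplicates(byte[][] splitKeys) {
        TreeSet<byte[]> seen = new TreeSet<>(UNSIGNED_LEX);
        for (byte[] key : splitKeys) {
            if (!seen.add(key)) {
                return true;
            }
        }
        return false;
    }
}
```

If hasDuplicates returns true, the client can refuse to submit the request
instead of sending the bad key array to the master.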
Thank you
Vidhya