Hello all,

Tonight, in an AWS outage, we lost 11 of our 51 regionservers. All HMasters were unaffected, but the active master continually spammed messages like this:

12/06/30 00:07:22 INFO ipc.Client: Retrying connect to server: / 10.125.18.129:50020. Already tried 14 time(s).

It was not following through with splitting the HLog files, and it didn't appear to be moving regions off the failed hosts. After giving it about 20 minutes to try to right itself, I tried restarting the service. The restart script just hung for a while printing dots, and nothing apparent was happening in the logs at the time. Finally I ran kill -9 on the process so that another master could take over.

The new master seemed to start splitting logs, but eventually got into the same state of printing the above message. Eventually it all worked out, but it took WAY too long (almost an hour, all told).

Is this something that is tunable? The failed servers should have been removed from the list right away instead of being retried so many times; each one was retried upwards of 30-40 times.

I am running cdh3u2 (0.90.4).

Thanks,
Bryan
