Hi,

We are starting to investigate an issue where one tserver stayed up but became slow/unresponsive for several hours, and during that time all writes to our 20+ tablet servers began to fail. Leading up to the failure, we could see that writes were distributed across all of the tablet servers, so it wasn't a hotspot. Whenever we receive a MutationsRejectedException, we recreate the BatchWriter (ACCUMULO-2990).

I'm digging into the TabletServerBatchWriter code, but does anyone have ideas about what could cause this? Does the client do some sort of initialization or health checking through which one server could impact all of the others?
Thanks.
-Mike

Caused by: org.apache.accumulo.core.client.TimedOutException: Servers timed out [pnj-bvlt-r4n03.abc.com:31113]
    at org.apache.accumulo.core.client.impl.TabletServerBatchWriter$TimeoutTracker.wroteNothing(TabletServerBatchWriter.java:177) ~[stormjar.jar:1.0]
    at org.apache.accumulo.core.client.impl.TabletServerBatchWriter$TimeoutTracker.errorOccured(TabletServerBatchWriter.java:182) ~[stormjar.jar:1.0]
    at org.apache.accumulo.core.client.impl.TabletServerBatchWriter$MutationWriter.sendMutationsToTabletServer(TabletServerBatchWriter.java:933) ~[stormjar.jar:1.0]
    at
