In short, yes. This is mitigated by the fact that the metadata table can
be split into many tablets, so a single unreachable metadata tablet
would not affect every table (Dave's solution helps here).
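
To make the splitting point concrete, here is a rough sketch in plain Java. It is not Accumulo code: the class and method names are made up for illustration. It assumes only that a table's metadata rows sort between `<tableId>;` and `<tableId><`, and it models each metadata tablet by its end row and hosting server.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative model only: none of these names are Accumulo classes.
final class MetadataSplitSketch {
    // End row (inclusive) of each metadata tablet -> server hosting it.
    private final TreeMap<String, String> endRowToServer = new TreeMap<>();
    // The last metadata tablet has no end row, so it is tracked separately.
    private final String lastTabletServer;

    MetadataSplitSketch(Map<String, String> endRowToServer, String lastTabletServer) {
        this.endRowToServer.putAll(endRowToServer);
        this.lastTabletServer = lastTabletServer;
    }

    // True if any metadata tablet covering this table's rows is on downServer.
    boolean isAffected(String tableId, String downServer) {
        String first = tableId + ";"; // lowest possible metadata row for the table
        String last = tableId + "<";  // metadata row of the table's last tablet
        for (Map.Entry<String, String> e : endRowToServer.tailMap(first, true).entrySet()) {
            if (e.getValue().equals(downServer)) {
                return true;
            }
            if (e.getKey().compareTo(last) >= 0) {
                return false; // this tablet covers the rest of the table's rows
            }
        }
        // The remaining rows fall into the last metadata tablet.
        return lastTabletServer.equals(downServer);
    }

    // Tables that cannot be located while downServer is unreachable.
    List<String> affectedTables(List<String> tableIds, String downServer) {
        List<String> affected = new ArrayList<>();
        for (String id : tableIds) {
            if (isAffected(id, downServer)) {
                affected.add(id);
            }
        }
        return affected;
    }
}
```

With metadata tablets ending at `2<` (on ts1) and `5<` (on ts2) and the last tablet on ts3, an unreachable ts2 affects only tables 3 and 5 out of tables 1, 2, 3, 5 and 7. With a single unsplit metadata tablet, every table would be affected.
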

One possible solution worth investigating is what HBase calls
"Timeline-Consistent High Available Reads"[1]. Essentially, in
addition to the read-write Tablet (as is currently the case), there are
one or more read-only copies of a Tablet. This helps mitigate the case
where some data is unreachable due to TabletServer problems.
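
As a very rough sketch of the idea (plain Java, loosely modeled on the description above rather than on HBase's or Accumulo's actual code; every name here is hypothetical): a read goes to the read-write copy first and falls back to a read-only copy, flagging the result as possibly stale.

```java
import java.util.List;
import java.util.function.Supplier;

// Illustrative only: these are not HBase or Accumulo classes.
final class TimelineReadSketch {
    // A value plus a flag saying whether it came from a read-only copy.
    static final class ReadResult {
        final String value;
        final boolean stale;

        ReadResult(String value, boolean stale) {
            this.value = value;
            this.stale = stale;
        }
    }

    // Each Supplier stands in for a read against one copy of the tablet and
    // is assumed to throw a RuntimeException when that copy is unreachable.
    static ReadResult read(Supplier<String> primary, List<Supplier<String>> replicas) {
        try {
            return new ReadResult(primary.get(), false);
        } catch (RuntimeException primaryFailure) {
            for (Supplier<String> replica : replicas) {
                try {
                    // Read-only copies may lag the primary, so mark as stale.
                    return new ReadResult(replica.get(), true);
                } catch (RuntimeException ignored) {
                    // This copy is unreachable too; try the next one.
                }
            }
            throw primaryFailure; // no copy could serve the read
        }
    }
}
```

The caller has to decide whether a possibly stale answer is acceptable, which is the trade-off in question for something like tablet locations.
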

However, this idea does make me a little wary for use with the metadata
table.

Trying to figure out what happened on that node and get you a solution
would be my preferred path forward :)

[1] http://hbase.apache.org/book.html#arch.timelineconsistent.reads

Michael Moss wrote:
1.7.2 (client still 1.6.2).
I think it's an overall design issue, no? Serving metadata is a SPOF?
On Fri, Sep 9, 2016 at 10:41 AM, Christopher <[email protected]> wrote:
What version of Accumulo? That could narrow down the search for
potential known issues.
On Fri, Sep 9, 2016 at 10:36 AM Michael Moss <[email protected]> wrote:
Upon further internal discussion, it looks like the
metadata/root tables are served from the tservers (not an HA
master for example) and the one in question was serving it. It
was unable to run MajC (compaction) for many hours leading up to
the time where it couldn't service requests any longer, but it
was still up, hosting tablets, just very slow or unable to
respond. So all writes ended up timing out.
If this condition is possible and there is a SPOF here, it'd be
good to see what's on the roadmap to address it.
On Fri, Sep 9, 2016 at 10:24 AM, <[email protected]> wrote:
What was happening on that 1 tserver? Was it in garbage
collection? Was it having network or O/S issues?
------------------------------------------------------------------------
*From: *"Michael Moss (BLOOMBERG/ 731 LEX)" <[email protected]>
*To: *[email protected]
*Sent: *Friday, September 9, 2016 9:40:42 AM
*Subject: *1 of 20 TServers unresponsive/slow, all writes fail?
Hi,
We are starting to investigate an issue where 1 tserver was
up, but became slow/unresponsive for several hours, yet all
writes to our 20+ servers began to fail. We could see
leading up to the failure that the writes were distributed
among all of the tablet servers, so it wasn't a hotspot.
Whenever we receive a MutationsRejectedException, we
recreate the BatchWriter (ACCUMULO-2990). I'm digging into
the TabletServerBatchWriter code, but any ideas what could
cause this issue? Is there some sort of initialization or
healthchecking that the client does where 1 server could
impact all?
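
A minimal sketch of the recreate-on-failure pattern described above. It uses stand-in types rather than the real Accumulo BatchWriter and MutationsRejectedException so that it compiles on its own; all names and the attempt limit are assumptions made for illustration.

```java
import java.util.List;
import java.util.function.Supplier;

// Illustrative only: Writer and RejectedException are stand-ins for
// Accumulo's BatchWriter and MutationsRejectedException.
final class RecreateWriterSketch {
    static final class RejectedException extends Exception {
        RejectedException(String message) {
            super(message);
        }
    }

    // Buffers mutations on add() and sends them on flush().
    interface Writer extends AutoCloseable {
        void add(String mutation) throws RejectedException;

        void flush() throws RejectedException;

        @Override
        void close();
    }

    // Writes the batch with a fresh writer per attempt and returns the
    // number of attempts used. Rethrows the last failure when exhausted.
    static int writeWithRecreate(Supplier<Writer> factory, List<String> batch, int maxAttempts)
            throws RejectedException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        RejectedException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            Writer writer = factory.get(); // a failed writer is never reused
            try {
                for (String mutation : batch) {
                    writer.add(mutation);
                }
                writer.flush();
                return attempt;
            } catch (RejectedException e) {
                last = e; // retry the whole batch with a new writer
            } finally {
                writer.close();
            }
        }
        throw last;
    }
}
```

Recreating the writer only resets client-side state. If the new writer still has to talk to the same slow server, it would presumably time out in the same way.
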
Thanks.
-Mike
Caused by: org.apache.accumulo.core.client.TimedOutException: Servers timed out [pnj-bvlt-r4n03.abc.com:31113]
    at org.apache.accumulo.core.client.impl.TabletServerBatchWriter$TimeoutTracker.wroteNothing(TabletServerBatchWriter.java:177) ~[stormjar.jar:1.0]
    at org.apache.accumulo.core.client.impl.TabletServerBatchWriter$TimeoutTracker.errorOccured(TabletServerBatchWriter.java:182) ~[stormjar.jar:1.0]
    at org.apache.accumulo.core.client.impl.TabletServerBatchWriter$MutationWriter.sendMutationsToTabletServer(TabletServerBatchWriter.java:933) ~[stormjar.jar:1.0]
    at