Did the errors in the logs for 10.2.130.1 occur while it was being rebooted, or before? Was the tserver on 10.2.130.1 hung? I think we need more information from the logs to determine what was happening during that time. It might be best to open an issue in JIRA for this.
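In case it helps line the errors up against the reboot time, here is a minimal sketch, assuming each entry sits on a single line in the real log file (the quoted copies below are wrapped by the mail client) and that the Writer errors all follow the shape quoted in this thread. The names `summarize_update_errors`, `PATTERN` and `SAMPLE` are hypothetical, made up for illustration; nothing here is part of Accumulo itself.

```python
import re
from collections import defaultdict
from datetime import datetime

# Matches Writer errors in the shape quoted in this thread. Assumes one log
# entry per line; the pattern is a guess based only on the quoted samples.
PATTERN = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3} ERROR .*?"
    r"error sending update to (\d+\.\d+\.\d+\.\d+:\d+)"
)

# Common tail of the quoted errors, unwrapped onto one line.
_TAIL = (
    ": org.apache.thrift.transport.TTransportException: "
    "java.net.SocketTimeoutException: 120000 millis timeout while waiting "
    "for channel to be ready for read. ch : "
    "java.nio.channels.SocketChannel[connected "
)

# The three entries quoted in this thread, rejoined into single lines.
SAMPLE = [
    "2015-10-22 03:28:00,599 ERROR org.apache.accumulo.core.client.impl.Writer: "
    "error sending update to 10.2.130.1:9997" + _TAIL
    + "local=/10.2.142.1:36148 remote=/10.2.130.1:9997]",
    "2015-10-22 03:28:04,283 ERROR org.apache.accumulo.core.client.impl.Writer: "
    "error sending update to 10.2.130.1:9997" + _TAIL
    + "local=/10.2.142.1:37047 remote=/10.2.130.1:9997]",
    "2015-10-22 03:28:06,116 ERROR org.apache.accumulo.core.client.impl.Writer: "
    "error sending update to 10.2.130.1:9997" + _TAIL
    + "local=/10.2.142.1:37167 remote=/10.2.130.1:9997]",
]


def summarize_update_errors(lines):
    """Return {remote: (count, first_seen, last_seen)} for Writer update errors."""
    seen = defaultdict(list)
    for line in lines:
        match = PATTERN.search(line)
        if match:
            stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
            seen[match.group(2)].append(stamp)
    return {remote: (len(ts), min(ts), max(ts)) for remote, ts in seen.items()}


if __name__ == "__main__":
    # Print one summary row per remote tserver endpoint.
    for remote, (count, first, last) in summarize_update_errors(SAMPLE).items():
        print(remote, count, first.isoformat(), last.isoformat())
```

Running the same function over the full log from each client (passing an open file instead of `SAMPLE`) would show whether the timeouts against 10.2.130.1 began before the reboot or only during it.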
> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On
> Behalf Of Denis
> Sent: Thursday, October 22, 2015 7:29 PM
> To: [email protected]
> Subject: Re: Tserver's strange state.
>
> The server 10.2.130.1 has been rebooted.
> Yes, it is a production system with a lot of reads and writes.
>
> On 10/22/15, dlmarion <[email protected]> wrote:
> >
> > Are you trying to shut the whole system down, or just a couple of
> > tablet servers? Is your application reading and writing from/to
> > Accumulo during this time?
> >
> > -------- Original message --------
> > From: Denis <[email protected]>
> > Date: 10/22/2015 6:03 PM (GMT-05:00)
> > To: [email protected]
> > Subject: Re: Tserver's strange state.
> >
> > Both servers have errors in the logs like these:
> >
> > ========
> > 2015-10-22 03:28:00,599 ERROR
> > org.apache.accumulo.core.client.impl.Writer: error sending update to
> > 10.2.130.1:9997: org.apache.thrift.transport.TTransportException:
> > java.net.SocketTimeoutException: 120000 millis timeout while waiting
> > for channel to be ready for read. ch :
> > java.nio.channels.SocketChannel[connected
> > local=/10.2.142.1:36148 remote=/10.2.130.1:9997]
> > 2015-10-22 03:28:04,283 ERROR
> > org.apache.accumulo.core.client.impl.Writer: error sending update to
> > 10.2.130.1:9997: org.apache.thrift.transport.TTransportException:
> > java.net.SocketTimeoutException: 120000 millis timeout while waiting
> > for channel to be ready for read. ch :
> > java.nio.channels.SocketChannel[connected
> > local=/10.2.142.1:37047 remote=/10.2.130.1:9997]
> > 2015-10-22 03:28:06,116 ERROR
> > org.apache.accumulo.core.client.impl.Writer: error sending update to
> > 10.2.130.1:9997: org.apache.thrift.transport.TTransportException:
> > java.net.SocketTimeoutException: 120000 millis timeout while waiting
> > for channel to be ready for read. ch :
> > java.nio.channels.SocketChannel[connected
> > local=/10.2.142.1:37167 remote=/10.2.130.1:9997]
> > ========
> >
> > On 10/22/15, Denis <[email protected]> wrote:
> >> Hi
> >>
> >> Sometimes my Tablet Servers go into a strange state: they have some
> >> very old scans (see picture: http://i.imgur.com/2sOUM99.png) and
> >> being in this state they cannot be decommissioned gracefully using
> >> "accumulo stop" - the number of their tablets decreases to some
> >> fixed number (say from 6K tablets to 2K), not to zero.
> >> It is difficult to reproduce.
> >> Now I have a live system with 2 tablet servers in this state.
> >> Any suggestions on how to catch the bug?
> >>
