I have not managed to investigate or fix it yet. Today the same problem hit again: one machine running an HDFS datanode was rebooted, and on 3 Accumulo TServers the oldest scan correlates with the time of the datanode restart.

On 10/23/15, [email protected] <[email protected]> wrote:
> The errors in the logs regarding 10.2.130.1, was this during the time it was
> rebooted, or before? Was the tserver on 10.2.130.1 hung? I think we would
> likely need more information from the logs to determine what is occurring
> during this time. It might be best to open an issue in JIRA for this.
>
>> -----Original Message-----
>> From: [email protected] [mailto:[email protected]] On
>> Behalf Of Denis
>> Sent: Thursday, October 22, 2015 7:29 PM
>> To: [email protected]
>> Subject: Re: Tserver's strange state.
>>
>> The server 10.2.130.1 has been rebooted.
>> Yes, it is a production system with a lot of reads and writes.
>>
>> On 10/22/15, dlmarion <[email protected]> wrote:
>> >
>> > Are you trying to shut the whole system down, or just a couple of
>> > tablet servers? Is your application reading and writing from/to
>> > Accumulo during this time?
>> >
>> > -------- Original message --------
>> > From: Denis <[email protected]>
>> > Date: 10/22/2015 6:03 PM (GMT-05:00)
>> > To: [email protected]
>> > Subject: Re: Tserver's strange state.
>> >
>> > Both servers has the errors in the logs like these:
>> >
>> > ========
>> > 2015-10-22 03:28:00,599 ERROR
>> > org.apache.accumulo.core.client.impl.Writer: error sending update to
>> > 10.2.130.1:9997: org.apache.thrift.transport.TTransportException:
>> > java.net.SocketTimeoutException: 120000 millis timeout while waiting
>> > for channel to be ready for read. ch :
>> > java.nio.channels.SocketChannel[connected
>> > local=/10.2.142.1:36148 remote=/10.2.130.1:9997]
>> > 2015-10-22 03:28:04,283 ERROR
>> > org.apache.accumulo.core.client.impl.Writer: error sending update to
>> > 10.2.130.1:9997: org.apache.thrift.transport.TTransportException:
>> > java.net.SocketTimeoutException: 120000 millis timeout while waiting
>> > for channel to be ready for read. ch :
>> > java.nio.channels.SocketChannel[connected
>> > local=/10.2.142.1:37047 remote=/10.2.130.1:9997]
>> > 2015-10-22 03:28:06,116 ERROR
>> > org.apache.accumulo.core.client.impl.Writer: error sending update to
>> > 10.2.130.1:9997: org.apache.thrift.transport.TTransportException:
>> > java.net.SocketTimeoutException: 120000 millis timeout while waiting
>> > for channel to be ready for read. ch :
>> > java.nio.channels.SocketChannel[connected
>> > local=/10.2.142.1:37167 remote=/10.2.130.1:9997]
>> > ========
>> >
>> > On 10/22/15, Denis <[email protected]> wrote:
>> >> Hi
>> >>
>> >> Sometimes my Tablet Servers go into a strange state: they have some
>> >> very old scans (see picture: http://i.imgur.com/2sOUM99.png) and
>> >> being in this state they cannot be decomissioned gracefully using
>> >> "accumulo stop" - number of their tablets decreases down to some
>> >> fixed number (say from 6K tablets to 2K), not to zero.
>> >> It is diffucult to reproduce.
>> >> Now I have a live system with 2 tabletservers in this state.
>> >> Any suggestions how to catch the bug?
>> >>
>> >
>
