Josef,

If these are intermittent failures, you might consider turning on the
watcher [1] to automatically restart your processes. This should keep your
cluster from atrophying over time. You'll still have to take administrative
action to fix the DNS problem, but your availability should be better.
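
To help pin down when name resolution drops out, something like the following could be run from the master host. This is only a rough sketch, not part of Accumulo: the function names `check_hosts` and `poll` are made up for illustration, and the host names you pass in would be whatever your tablet servers are actually called.

```python
import socket
import time


def check_hosts(hosts, resolve=socket.getaddrinfo):
    """Try to resolve each host name; return {host: error text} for failures."""
    failures = {}
    for host in hosts:
        try:
            resolve(host, None)
        except socket.gaierror as exc:
            # Raised when the resolver cannot map the name to an address.
            failures[host] = str(exc)
    return failures


def poll(hosts, interval_s=60.0, rounds=1,
         resolve=socket.getaddrinfo, sleep=time.sleep):
    """Run check_hosts `rounds` times and return a list of
    (timestamp, failures) tuples for the rounds that had failures."""
    history = []
    for i in range(rounds):
        failures = check_hosts(hosts, resolve)
        if failures:
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            history.append((stamp, failures))
        if i < rounds - 1:
            sleep(interval_s)
    return history
```

For example, `poll(["tserver1", "tserver2", "tserver3"], interval_s=30, rounds=120)` would check the three (hypothetical) names every 30 seconds for about an hour and return the timestamps at which any of them failed to resolve, which you could then line up against the errors in the master log.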

Cheers,
Adam

[1] http://accumulo.apache.org/1.7/accumulo_user_manual.html#watcher

On Fri, Nov 13, 2015 at 6:57 AM, Josef Roehrl - PHEMI <jroe...@phemi.com>
wrote:

> Hi Everyone,
>
> It turns out that it was indeed a DNS server issue. I had to get this
> confirmed by the data centre, though.
>
> Thanks!
>
> On Fri, Nov 13, 2015 at 12:25 PM, Josef Roehrl - PHEMI <jroe...@phemi.com>
> wrote:
>
>> Hi All,
>>
>> Three times in the past few weeks (twice on one system, once on another),
>> the master has received UnknownHostExceptions, one by one, for each of the
>> tablet servers. It then tries to stop them, and eventually all the tablet
>> servers quit.
>>
>> It goes like this for all the tablet servers:
>>
>> 12 08:14:01,0498tserver:620
>> ERROR
>>
>> error sending update to tserver3:9997: 
>> org.apache.thrift.transport.TTransportException: 
>> java.net.UnknownHostException
>>
>> 12 09:01:53,0352master:12
>> ERROR
>>
>> org.apache.thrift.transport.TTransportException: 
>> java.net.UnknownHostException
>>
>> 12 16:35:50,0672master:110
>> ERROR
>>
>> unable to get tablet server status tserver3:9997[250e6cd2c500012] 
>> org.apache.thrift.transport.TTransportException: 
>> java.net.UnknownHostException
>>
>>
>>
>> I've redacted the real host names, of course.
>>
>> This could be a DNS problem, though the system was running fine for days
>> before this happened (same scenario on the 2 systems with really quite
>> different DNS servers).
>>
>> If anyone has a hint or has seen something like this, I would appreciate
>> any pointers.
>>
>> I have looked at the JIRA issues regarding DNS outages, but nothing seems
>> to fit this pattern.
>>
>> Thanks
>>
>> --
>>
>>
>> Josef Roehrl
>> Senior Software Developer
>> *PHEMI Systems*
>> 180-887 Great Northern Way
>> Vancouver, BC V5T 4T5
>> 604-336-1119
>> Website <http://www.phemi.com/> Twitter <https://twitter.com/PHEMISystems>
>> Linkedin
>> <http://www.linkedin.com/company/3561810?trk=tyah&trkInfo=tarId%3A1403279580554%2Ctas%3Aphemi%20hea%2Cidx%3A1-1-1>
>>
>>
>>
>
>
