Yup, I am indeed on HDP - thanks for the link. The services do log GSS
exceptions every ten hours, but seem to recover on their own. Having
turned up logging on my client, this is what I see:

1) On client start, I see hadoop login messages
2) After 8 hours (0.8*10 hours) when the renewal is expected to take
place, I don't see any hadoop login messages
3) After 10 hours, I see GSS exceptions
4) After each GSS exception, I see a renewal attempt, but using the
ticket cache rather than the keytab.
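
As a sanity check on the timings in 2) and 3), here is a rough sketch of
the arithmetic. It assumes the relogin is attempted at 80% of the ticket
lifetime (the 0.8 factor above); the class name, method name and start
time are made up for illustration and are not taken from Hadoop or
Accumulo.

```java
import java.time.Duration;
import java.time.Instant;

final class RenewalWindow {
    // Assumed percentage of the ticket lifetime after which a relogin is
    // attempted (the 0.8 factor mentioned above).
    static final long RENEW_PERCENT = 80;

    // Returns the instant at which a relogin would be expected for a ticket
    // valid from 'start' to 'end'. Integer maths avoids rounding surprises.
    static Instant renewalTime(Instant start, Instant end) {
        long lifetimeMillis = Duration.between(start, end).toMillis();
        return start.plusMillis(lifetimeMillis * RENEW_PERCENT / 100);
    }

    public static void main(String[] args) {
        // Hypothetical client start time with a 10 hour ticket lifetime.
        Instant start = Instant.parse("2017-07-13T08:00:00Z");
        Instant end = start.plus(Duration.ofHours(10));
        System.out.println("ticket expires at   " + end);
        System.out.println("relogin expected at " + renewalTime(start, end));
    }
}
```

With a 10 hour lifetime this puts the expected relogin 8 hours after the
client starts, which is the point at which I see no login messages.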

Currently working on shortening the 10 hour expiry time so I can catch
it in a debugger!
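
For anyone following along, the periodic relogin thread I mentioned
earlier looks roughly like the sketch below. This is a simplified,
standalone version: the Relogin interface and class name are
placeholders of mine, and in the real client the hook would wrap
UserGroupInformation.getLoginUser().checkTGTAndReloginFromKeytab(). The
period is a parameter so it can be shortened for a quick reproduction.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

final class PeriodicRelogin {
    // Placeholder hook (not a Hadoop type). In a real client this would wrap
    // UserGroupInformation.getLoginUser().checkTGTAndReloginFromKeytab().
    interface Relogin {
        void run() throws Exception;
    }

    // Starts a daemon thread that invokes the hook immediately and then at a
    // fixed rate. The caller is responsible for shutting the executor down.
    static ScheduledExecutorService start(Relogin relogin, long period, TimeUnit unit) {
        ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "kerberos-relogin");
                t.setDaemon(true);
                return t;
            });
        scheduler.scheduleAtFixedRate(() -> {
            try {
                relogin.run();
            } catch (Exception e) {
                // An uncaught exception would cancel all future runs, so log
                // the failure and let the next scheduled attempt go ahead.
                System.err.println("relogin attempt failed: " + e);
            }
        }, 0, period, unit);
        return scheduler;
    }
}
```

Catching the exception inside the task matters here, since
scheduleAtFixedRate suppresses subsequent executions once a run throws.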

Thanks,

James


On 13 July 2017 at 15:20, Josh Elser <els...@apache.org> wrote:
> If you're using Hortonworks' HDP, you would probably benefit from
> https://github.com/hortonworks/accumulo
>
> There is likely a git-tag for the exact version that you're running. The
> line numbers would match there.
>
> To be clear, if your services (e.g. TabletServers) aren't failing after
> 10hrs, you're not running into ACCUMULO-4069. Given my (limited)
> understanding, your problem is purely client-side. It's possible that the
> client-side RPC implementation isn't correctly handling the ticket re-login,
> but I know there is specifically code in there to handle the re-login case.
>
> The next step would be getting some debug logging from your application
> around UserGroupInformation or the JDK itself, or just spin up a trivial
> example with a small relogin window to reproduce the problem.
>
> On 7/12/17 3:48 PM, James Srinivasan wrote:
>>
>> Yup, I'm going to spin up a vanilla 1.7.0 (maybe newer) install too to
>> see if it behaves any differently. There is at least one patch
>> included in their distro that isn't in the formal documentation, plus
>> it makes matching line numbers in logs to src code rather difficult.
>>
>> Thanks,
>>
>> James
>>
>> On 12 July 2017 at 20:37, Sean Busbey <bus...@cloudera.com> wrote:
>>>
>>> Hi James!
>>>
>>> It sounds like you may need to chase things down with your vendor,
>>> since the precise combination of patches included will make looking at
>>> things hard for the community.
>>>
>>> On Wed, Jul 12, 2017 at 11:01 AM, James Srinivasan
>>> <james.sriniva...@gmail.com> wrote:
>>>>
>>>> Hi,
>>>>
>>>> So I've fired off a thread to perform the periodic
>>>> checkTGTAndReloginFromKeytab call which seems to be running, but the
>>>> connection still fails with GSS errors after precisely 10 hours.
>>>>
>>>> While I am running 1.7.0, it seems the vendor included the
>>>> ACCUMULO-4069 patch, and immediately after the exception is thrown I
>>>> see a log entry "Performing ticket-cache-based Kerberos re-login".
>>>> However, it should be using a keytab - have turned up the logging to
>>>> 11 and will leave running overnight...
>>>>
>>>> James
>>>>
>>>> On 11 July 2017 at 16:17, Josh Elser <josh.el...@gmail.com> wrote:
>>>>>
>>>>> Nope, you've got it exactly right! That's the code I would've pointed
>>>>> you at to copy :)
>>>>>
>>>>> If/when you do get to long-running MR jobs, see the
>>>>> "general.delegation.token.*" configuration properties in this table[1].
>>>>> I think the docs say that one delegation token is valid for 7 days,
>>>>> but it's been a long time since writing/testing that code.
>>>>>
>>>>> - Josh
>>>>>
>>>>> [1] https://accumulo.apache.org/1.8/accumulo_user_manual.html#_server_configuration_2
>>>>>
>>>>> On 7/11/17 1:25 AM, James Srinivasan wrote:
>>>>>>
>>>>>>
>>>>>> Thanks both. I can't (easily) upgrade beyond 1.7.0, but have raised a
>>>>>> support case with our Hadoop distribution vendor.
>>>>>>
>>>>>> I'm not (yet) worried about expiration with MapReduce - for now I'll
>>>>>> try to keep such jobs to under 24h! Outside MR, sounds like I just
>>>>>> need to periodically call
>>>>>> UserGroupInformation.checkTGTAndReloginFromKeytab like
>>>>>>
>>>>>> https://github.com/apache/accumulo/blob/master/server/base/src/main/java/org/apache/accumulo/server/security/SecurityUtil.java#L121
>>>>>>
>>>>>> Or is the TGT associated with an Accumulo KerberosToken separate?
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> James
>>>>>>
>>>>>> On 11 July 2017 at 02:59, Josh Elser <josh.el...@gmail.com> wrote:
>>>>>>>
>>>>>>>
>>>>>>> No, you are (likely) not running into ACCUMULO-4069. What you've
>>>>>>> described sounds like your client's ticket expired. Accumulo does not
>>>>>>> spawn any ticket renewal on the behalf of clients.
>>>>>>>
>>>>>>> Hadoop's UGI code will automatically spawn a renewal thread when you
>>>>>>> log in using a ticket cache. This does not happen automatically when
>>>>>>> you use a keytab (I have no explanation as to why this is). This is
>>>>>>> the most likely cause of your error and something you need to correct
>>>>>>> in your application (spawn a thread to renew your application's
>>>>>>> ticket).
>>>>>>>
>>>>>>> If you are using MapReduce, you have yet another layer of indirection
>>>>>>> with DelegationTokens, but that's probably not what you're seeing (as
>>>>>>> DelegationTokens don't actually have a Kerberos TGT).
>>>>>>>
>>>>>>> On Mon, Jul 10, 2017 at 5:42 PM, Christopher <ctubb...@apache.org>
>>>>>>> wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> It certainly sounds like the same issue. I'd recommend upgrading to
>>>>>>>> 1.7.3 (currently the latest 1.7 version) to pick up the fixes for
>>>>>>>> all the bugs we've found in that release line.
>>>>>>>>
>>>>>>>> On Mon, Jul 10, 2017 at 5:50 AM James Srinivasan
>>>>>>>> <james.sriniva...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I'm using Accumulo 1.7.0 and finding that after some period of time
>>>>>>>>> (>8 hours, <3 days - happened over the weekend) my ingest fails
>>>>>>>>> with errors regarding "Failed to find any Kerberos tgt". My guess
>>>>>>>>> is that the ticket from the keytab has expired, and needs to be
>>>>>>>>> renewed - from memory, I had seen a Kerberos tgt renewer thread
>>>>>>>>> running in my client, so assumed it happened automagically. Is
>>>>>>>>> that the case? Perhaps I am hitting this bug?
>>>>>>>>> https://issues.apache.org/jira/browse/ACCUMULO-4069
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> James
>>>
>>>
>>>
>>>
>>> --
>>> busbey
