Hmm, so it seems updating the Hadoop version used by my processor from
2.6.0 to 2.7.3 has fixed the problem. Testing a little more just to
make sure...
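
For the archive: the periodic renewal discussed further down the thread
can be sketched roughly as below. This is a minimal sketch, not the code
I'm actually running. KeytabRenewer and its Relogin callback are made-up
names, and the callback exists only so the block compiles without Hadoop
on the classpath.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class KeytabRenewer {

    /** Hypothetical callback type; stands in for the real Hadoop re-login call. */
    @FunctionalInterface
    public interface Relogin {
        void attempt() throws Exception;
    }

    /** Runs the re-login callback on a daemon thread at a fixed period. */
    public static ScheduledExecutorService start(Relogin relogin, long period, TimeUnit unit) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "keytab-renewer");
            t.setDaemon(true); // do not keep the JVM alive just for renewal
            return t;
        });
        scheduler.scheduleAtFixedRate(() -> {
            try {
                relogin.attempt();
            } catch (Exception e) {
                // Catch everything: an exception escaping the task would
                // silently cancel all future runs of the schedule.
                System.err.println("Kerberos re-login attempt failed: " + e);
            }
        }, period, period, unit);
        return scheduler;
    }
}
```

In a real client the callback would be something like
`() -> UserGroupInformation.getLoginUser().checkTGTAndReloginFromKeytab()`,
run every minute or so (the period is my assumption, not something from
the Hadoop docs). As found below, that call only helps while the UGI still
believes it was logged in from a keytab.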

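On the 8-hour figure quoted further down (0.8 * 10 hours): the point at
which a renewal is expected can be worked out as below. A minimal sketch,
assuming the renewal window is 80% of the ticket lifetime as discussed in
this thread; RenewWindow and refreshTime are names I've made up here, not
Hadoop API.

```java
import java.time.Duration;
import java.time.Instant;

public class RenewWindow {

    /**
     * Returns the instant at which 80% of the ticket lifetime has elapsed.
     * Integer arithmetic (times 8, divided by 10) avoids floating-point rounding.
     */
    public static Instant refreshTime(Instant start, Instant end) {
        Duration lifetime = Duration.between(start, end);
        return start.plus(lifetime.multipliedBy(8).dividedBy(10));
    }
}
```

For a ticket issued at a given start time with a 10 hour lifetime, this
gives start + 8 hours, which is when I expected to see login messages.
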
On 14 July 2017 at 13:39, James Srinivasan <[email protected]> wrote:
> So when my code runs in a NiFi processor, the initial keytab
> authentication works fine but following that it seems to think keytabs
> aren't in use (UserGroupInformation.getCurrentUser.isFromKeytab is
> false), which explains why the renewal code never actually runs and
> why re-login is attempted using the ticket cache after the GSS
> exception. Over to the NiFi list I think...
>
> Making some progress!
>
> On 13 July 2017 at 18:28, Josh Elser <[email protected]> wrote:
>> Aha! That's an interesting wrinkle :)
>>
>> I have more experience with NiFi's use of Kerberos than I care to admit (due
>> to some folks who work in the physical office I do); I'm not aware of
>> anything that NiFi does which would cause problems, but that may be a
>> relevant detail.
>>
>> After I thought about it some more (to your #2 point): there's a little
>> failsafe in the Accumulo client implementation that, upon a SASL
>> authentication failure, attempts a relogin via Kerberos. This should
>> "catch" the cases where your client application is using a ticket cache
>> (because convention on the ticket cache location lets the jGSS client
>> library in Java itself do the relogin, whereas Java doesn't know which
>> keytab to use). Still though -- a thread as you describe in #1 should have
>> an equivalent net effect.
>>
>> On 7/13/17 11:45 AM, James Srinivasan wrote:
>>>
>>> Thanks, just checked that and it does seem renewable (tested using
>>> kinit -R). I'm running my code in two separate scenarios:
>>>
>>> 1) As part of a NiFi processor, which currently makes multiple
>>> Accumulo connections using the same keytab, each of which currently
>>> has a separate renewer thread
>>> 2) As part of a simple command line application - this seems to have
>>> no problem running for > 10 hours (even before I added the periodic
>>> renewal code)
>>>
>>> Will add extra logging to #2 and try to shorten the expiry from 10
>>> hours to 1 so I can see any difference in output.
>>>
>>> James
>>>
>>> On 13 July 2017 at 16:05, Josh Elser <[email protected]> wrote:
>>>>
>>>> It also may be worth mentioning to check the principal's configuration
>>>> that you're using in your client. Depending on which you're using and
>>>> how it was created, it may not actually support renewals.
>>>>
>>>> A quick test is to just `kinit` and then `kinit -R`. You can view the
>>>> explicit "configuration" for a principal using the `kadmin` console and
>>>> the `getprinc <principal>` command. Be sure to check the krbtgt/<REALM>
>>>> principal as well:
>>>>
>>>> e.g.
>>>>
>>>> kadmin.local:  getprinc jelser
>>>> Principal: [email protected]
>>>> Maximum ticket life: 1 day 00:00:00
>>>> Maximum renewable life: 7 days 00:00:00
>>>>
>>>> kadmin.local:  getprinc krbtgt/EXAMPLE.COM
>>>> Principal: krbtgt/[email protected]
>>>> Maximum ticket life: 1 day 00:00:00
>>>> Maximum renewable life: 7 days 00:00:00
>>>>
>>>> If the krbtgt/$REALM principal does not have a non-zero renewable
>>>> lifetime, any other principals created in that realm would also not be
>>>> allowed to be renewed. Since you have the working "service" principals,
>>>> you can cross-check those.
>>>>
>>>> On 7/13/17 10:56 AM, James Srinivasan wrote:
>>>>>
>>>>>
>>>>> Yup, I am indeed on HDP - thanks for the link. The services do log GSS
>>>>> exceptions every ten hours, but seem to sufficiently recover
>>>>> themselves. Having turned up logging on my client:
>>>>>
>>>>> 1) On client start, I see hadoop login messages
>>>>> 2) After 8 hours (0.8*10 hours) when the renewal is expected to take
>>>>> place, I don't see any hadoop login messages
>>>>> 3) After 10 hours, I see GSS exceptions
>>>>> 4) After each GSS exception, I see an attempt to renew but using
>>>>> ticket cache, rather than keytab.
>>>>>
>>>>> Currently working on shortening the 10 hour expiry time so I can catch
>>>>> it in a debugger!
>>>>>
>>>>> Thanks,
>>>>>
>>>>> James
>>>>>
>>>>>
>>>>> On 13 July 2017 at 15:20, Josh Elser <[email protected]> wrote:
>>>>>>
>>>>>>
>>>>>> If you're using Hortonworks' HDP, you would probably benefit from
>>>>>> https://github.com/hortonworks/accumulo
>>>>>>
>>>>>> There is likely a git-tag for the exact version that you're running.
>>>>>> The line numbers would match there.
>>>>>>
>>>>>> To be clear, if your services (e.g. TabletServers) aren't failing after
>>>>>> 10hrs, you're not running into ACCUMULO-4069. Given my (limited)
>>>>>> understanding, your problem is purely client-side. It's possible that
>>>>>> the client-side RPC implementation isn't correctly handling the ticket
>>>>>> re-login, but I know there is specifically code in there to handle the
>>>>>> re-login case.
>>>>>>
>>>>>> The next step would be getting some debug logging from your application
>>>>>> around UserGroupInformation or the JDK itself, or just spin up a
>>>>>> trivial example with a small relogin window to reproduce the problem.
>>>>>>
>>>>>> On 7/12/17 3:48 PM, James Srinivasan wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Yup, I'm going to spin up a vanilla 1.7.0 (maybe newer) install too to
>>>>>>> see if it behaves any differently. There is at least one patch
>>>>>>> included in their distro that isn't in the formal documentation, plus
>>>>>>> it makes matching line numbers in logs to src code rather difficult.
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> James
>>>>>>>
>>>>>>> On 12 July 2017 at 20:37, Sean Busbey <[email protected]> wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Hi James!
>>>>>>>>
>>>>>>>> It sounds like you may need to chase things down with your vendor,
>>>>>>>> since the precise combination of patches included will make looking
>>>>>>>> at things hard for the community.
>>>>>>>>
>>>>>>>> On Wed, Jul 12, 2017 at 11:01 AM, James Srinivasan
>>>>>>>> <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> So I've fired off a thread to perform the periodic
>>>>>>>>> checkTGTAndReloginFromKeytab call which seems to be running, but the
>>>>>>>>> connection still fails with GSS errors after precisely 10 hours.
>>>>>>>>>
>>>>>>>>> While I am running 1.7.0, it seems the vendor included the
>>>>>>>>> ACCUMULO-4069 patch, and immediately after the exception is thrown I
>>>>>>>>> see a log entry "Performing ticket-cache-based Kerberos re-login".
>>>>>>>>> However, it should be using a keytab - have turned up the logging to
>>>>>>>>> 11 and will leave running overnight...
>>>>>>>>>
>>>>>>>>> James
>>>>>>>>>
>>>>>>>>> On 11 July 2017 at 16:17, Josh Elser <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Nope, you've got it exactly right! That's the code I would've
>>>>>>>>>> pointed you at to copy :)
>>>>>>>>>>
>>>>>>>>>> If/when you do get to long-running MR jobs, see the
>>>>>>>>>> "general.delegation.token.*" configuration properties in this
>>>>>>>>>> table[1]. I think the docs are citing that one delegation token is
>>>>>>>>>> valid for 7 days, but it's been a long time since writing/testing
>>>>>>>>>> that code.
>>>>>>>>>>
>>>>>>>>>> - Josh
>>>>>>>>>>
>>>>>>>>>> [1] https://accumulo.apache.org/1.8/accumulo_user_manual.html#_server_configuration_2
>>>>>>>>>>
>>>>>>>>>> On 7/11/17 1:25 AM, James Srinivasan wrote:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Thanks both. I can't (easily) upgrade beyond 1.7.0, but have
>>>>>>>>>>> raised a support case with our Hadoop distribution vendor.
>>>>>>>>>>>
>>>>>>>>>>> I'm not (yet) worried about expiration with MapReduce - for now
>>>>>>>>>>> I'll try to keep such jobs to under 24h! Outside MR, sounds like I
>>>>>>>>>>> just need to periodically call
>>>>>>>>>>> UserGroupInformation.checkTGTAndReloginFromKeytab like
>>>>>>>>>>>
>>>>>>>>>>> https://github.com/apache/accumulo/blob/master/server/base/src/main/java/org/apache/accumulo/server/security/SecurityUtil.java#L121
>>>>>>>>>>>
>>>>>>>>>>> Or is the TGT associated with an Accumulo KerberosToken separate?
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>>
>>>>>>>>>>> James
>>>>>>>>>>>
>>>>>>>>>>> On 11 July 2017 at 02:59, Josh Elser <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> No, you are (likely) not running into ACCUMULO-4069. What you've
>>>>>>>>>>>> described sounds like your client's ticket expired. Accumulo does
>>>>>>>>>>>> not spawn any ticket renewal on the behalf of clients.
>>>>>>>>>>>>
>>>>>>>>>>>> Hadoop's UGI code will automatically spawn a renewal thread when
>>>>>>>>>>>> you log in using a ticket cache. This does not happen
>>>>>>>>>>>> automatically when you use a keytab (I have no explanation as to
>>>>>>>>>>>> why this is). This is the most likely cause of your error and
>>>>>>>>>>>> something you need to correct in your application (spawn a thread
>>>>>>>>>>>> to renew your application's ticket).
>>>>>>>>>>>>
>>>>>>>>>>>> If you are using MapReduce, you have yet another layer of
>>>>>>>>>>>> indirection with DelegationTokens, but that's probably not what
>>>>>>>>>>>> you're seeing (as DelegationTokens don't actually have a Kerberos
>>>>>>>>>>>> TGT).
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Jul 10, 2017 at 5:42 PM, Christopher
>>>>>>>>>>>> <[email protected]>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> It certainly sounds like the same issue. I'd recommend upgrading
>>>>>>>>>>>>> to 1.7.3 (currently the latest 1.7 version) to pick up all the
>>>>>>>>>>>>> fixes for bugs we've found in that release line.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, Jul 10, 2017 at 5:50 AM James Srinivasan
>>>>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I'm using Accumulo 1.7.0 and finding that after some period of
>>>>>>>>>>>>>> time (>8 hours, <3 days - happened over the weekend) my ingest
>>>>>>>>>>>>>> fails with errors regarding "Failed to find any Kerberos tgt". My
>>>>>>>>>>>>>> guess is that the ticket from the keytab has expired, and needs
>>>>>>>>>>>>>> to be renewed - from memory, I had seen a Kerberos tgt renewer
>>>>>>>>>>>>>> thread running in my client, so assumed it happened
>>>>>>>>>>>>>> automagically. Is that the case? Perhaps I am hitting this bug?
>>>>>>>>>>>>>> https://issues.apache.org/jira/browse/ACCUMULO-4069
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> James
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> busbey
