Hmm, so it seems updating the Hadoop version used by my processor from 2.6.0 to 2.7.3 has fixed the problem. Testing a little more just to make sure...
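For anyone finding this thread later, here is a minimal sketch of the kind of periodic renewer thread discussed in the quoted messages. It is not the Accumulo or NiFi code. The class name `KeytabRenewer`, its methods, and the 80% figure as a parameter are all made up for illustration, and the relogin action is passed in as a plain `Runnable` so the sketch runs without Hadoop on the classpath. In a real client that `Runnable` would presumably wrap `UserGroupInformation.getLoginUser().checkTGTAndReloginFromKeytab()`, which is not exercised here.

```java
import java.time.Duration;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical helper: schedules a relogin action on a daemon thread.
final class KeytabRenewer {

    // Delay before renewal, as a percentage of the ticket lifetime
    // (e.g. 80 for the 0.8 * 10 hours mentioned in the thread).
    static Duration renewalDelay(Duration ticketLifetime, int percent) {
        return ticketLifetime.multipliedBy(percent).dividedBy(100);
    }

    // Runs `relogin` repeatedly, once per `period`, starting after one period.
    static ScheduledExecutorService start(Runnable relogin, Duration period) {
        ScheduledExecutorService exec =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "keytab-renewer");
                t.setDaemon(true); // do not keep the JVM alive just for renewal
                return t;
            });
        exec.scheduleAtFixedRate(() -> {
            try {
                relogin.run();
            } catch (RuntimeException e) {
                // Swallow and log: an uncaught exception would cancel future runs.
                System.err.println("relogin attempt failed: " + e);
            }
        }, period.toMillis(), period.toMillis(), TimeUnit.MILLISECONDS);
        return exec;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("renew after " + renewalDelay(Duration.ofHours(10), 80));

        // Stand-in for the real relogin call; uses a short period for the demo.
        ScheduledExecutorService exec =
            start(() -> System.out.println("relogin check"), Duration.ofMillis(100));
        Thread.sleep(350);
        exec.shutdownNow();
    }
}
```

Passing the relogin action in as a `Runnable` also makes it easy to log something like `isFromKeytab` on every run, which is the check that exposed the problem here.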

On 14 July 2017 at 13:39, James Srinivasan <[email protected]> wrote:
> So when my code runs in a NiFi processor, the initial keytab
> authentication works fine but following that it seems to think keytabs
> aren't in use (UserGroupInformation.getCurrentUser.isFromKeytab is
> false), which explains why the renewal code never actually runs and
> why re-login is attempted using the ticket cache after the GSS
> exception. Over to the NiFi list I think...
>
> Making some progress!
>
> On 13 July 2017 at 18:28, Josh Elser <[email protected]> wrote:
>> Aha! That's an interesting wrinkle :)
>>
>> I have more experience with NiFi's use of Kerberos than I care to admit (due
>> to some folks who work in the physical office I do); I'm not aware of
>> anything that NiFi does which would cause problems, but that may be a
>> relevant detail.
>>
>> After I thought about it some more (to your #2 point): there's a little
>> failsafe in the Accumulo client implementation that, upon a SASL
>> authentication failure, it will attempt a relogin via Kerberos. This should
>> "catch" the cases where your client application is using a ticket cache
>> (because convention on the ticket cache location lets the jGSS client
>> library in Java itself do the relogin whereas Java doesn't know which keytab
>> to use). Still though -- a thread as you describe in #1 should have an
>> equivalent net-effect..
>>
>> On 7/13/17 11:45 AM, James Srinivasan wrote:
>>>
>>> Thanks, just checked that and it does seem renewable (tested using
>>> kinit -R).
>>> I'm running my code in two separate scenarios:
>>>
>>> 1) As part of a NiFi processor, which currently makes multiple
>>> Accumulo connections using the same keytab, each of which currently
>>> has a separate renewer thread
>>> 2) As part of a simple command line application - this seems to have
>>> no problem running for > 10 hours (even before I added the periodic
>>> renewal code)
>>>
>>> Will add extra logging to #2 and try to shorten the expiry from 10
>>> hours to 1 so I can see any difference in output.
>>>
>>> James
>>>
>>> On 13 July 2017 at 16:05, Josh Elser <[email protected]> wrote:
>>>>
>>>> It also may be worth mentioning to check the principal's configuration
>>>> that you're using in your client. Depending on which you're using and
>>>> how it was created, it may not actually support renewals.
>>>>
>>>> A quick test is to just `kinit` and then `kinit -R`. You can view the
>>>> explicit "configuration" for a principal using the `kadmin` console and
>>>> the `getprinc <principal>` command. Be sure to check the krbtgt/<REALM>
>>>> principal as well:
>>>>
>>>> e.g.
>>>>
>>>> kadmin.local: getprinc jelser
>>>> Principal: [email protected]
>>>> Maximum ticket life: 1 day 00:00:00
>>>> Maximum renewable life: 7 days 00:00:00
>>>>
>>>> kadmin.local: getprinc krbtgt/EXAMPLE.COM
>>>> Principal: krbtgt/[email protected]
>>>> Maximum ticket life: 1 day 00:00:00
>>>> Maximum renewable life: 7 days 00:00:00
>>>>
>>>> If the krbtgt/$REALM principal does not have a non-zero renewable
>>>> lifetime, any other principals created in that realm would also not be
>>>> allowed to be renewed. Since you have the working "service" principals,
>>>> you can cross-check those.
>>>>
>>>> On 7/13/17 10:56 AM, James Srinivasan wrote:
>>>>>
>>>>> Yup, I am indeed on HDP - thanks for the link. The services do log GSS
>>>>> exceptions every ten hours, but seem to sufficiently recover
>>>>> themselves.
>>>>> Having turned up logging on my client:
>>>>>
>>>>> 1) On client start, I see hadoop login messages
>>>>> 2) After 8 hours (0.8*10 hours) when the renewal is expected to take
>>>>> place, I don't see any hadoop login messages
>>>>> 3) After 10 hours, I see GSS exceptions
>>>>> 4) After each GSS exception, I see an attempt to renew but using
>>>>> ticket cache, rather than keytab.
>>>>>
>>>>> Currently working on shortening the 10 hour expiry time so I can catch
>>>>> it in a debugger!
>>>>>
>>>>> Thanks,
>>>>>
>>>>> James
>>>>>
>>>>> On 13 July 2017 at 15:20, Josh Elser <[email protected]> wrote:
>>>>>>
>>>>>> If you're using Hortonworks' HDP, you would probably benefit from
>>>>>> https://github.com/hortonworks/accumulo
>>>>>>
>>>>>> There is likely a git-tag for the exact version that you're running.
>>>>>> The line numbers would match there.
>>>>>>
>>>>>> To be clear, if your services (e.g. TabletServers) aren't failing after
>>>>>> 10hrs, you're not running into ACCUMULO-4069. Given my (limited)
>>>>>> understanding, your problem is purely client-side. It's possible that
>>>>>> the client-side RPC implementation isn't correctly handling the ticket
>>>>>> re-login, but I know there is specifically code in there to handle the
>>>>>> re-login case.
>>>>>>
>>>>>> The next step would be getting some debug logging from your application
>>>>>> around UserGroupInformation or the JDK itself, or just spin up a
>>>>>> trivial example with a small relogin window to reproduce the problem.
>>>>>>
>>>>>> On 7/12/17 3:48 PM, James Srinivasan wrote:
>>>>>>>
>>>>>>> Yup, I'm going to spin up a vanilla 1.7.0 (maybe newer) install too to
>>>>>>> see if it behaves any differently. There is at least one patch
>>>>>>> included in their distro that isn't in the formal documentation, plus
>>>>>>> it makes matching line numbers in logs to src code rather difficult.
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> James
>>>>>>>
>>>>>>> On 12 July 2017 at 20:37, Sean Busbey <[email protected]> wrote:
>>>>>>>>
>>>>>>>> Hi James!
>>>>>>>>
>>>>>>>> It sounds like you may need to chase things down with your vendor,
>>>>>>>> since the precise combination of patches included will make looking
>>>>>>>> at things hard for the community.
>>>>>>>>
>>>>>>>> On Wed, Jul 12, 2017 at 11:01 AM, James Srinivasan
>>>>>>>> <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> So I've fired off a thread to perform the periodic
>>>>>>>>> checkTGTAndReloginFromKeytab call which seems to be running, but the
>>>>>>>>> connection still fails with GSS errors after precisely 10 hours.
>>>>>>>>>
>>>>>>>>> While I am running 1.7.0, it seems the vendor included the
>>>>>>>>> ACCUMULO-4069 patch, and immediately after the exception is thrown I
>>>>>>>>> see a log entry "Performing ticket-cache-based Kerberos re-login".
>>>>>>>>> However, it should be using a keytab - have turned up the logging to
>>>>>>>>> 11 and will leave running overnight...
>>>>>>>>>
>>>>>>>>> James
>>>>>>>>>
>>>>>>>>> On 11 July 2017 at 16:17, Josh Elser <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>> Nope, you've got it exactly right! That's the code I would've
>>>>>>>>>> pointed you at to copy :)
>>>>>>>>>>
>>>>>>>>>> If/when you do get to long-running MR jobs, see the
>>>>>>>>>> "general.delegation.token.*" configuration properties in this
>>>>>>>>>> table[1]. I think the docs are citing that one delegation token is
>>>>>>>>>> valid for 7 days, but it's been a long time since writing/testing
>>>>>>>>>> that code.
>>>>>>>>>>
>>>>>>>>>> - Josh
>>>>>>>>>>
>>>>>>>>>> [1]
>>>>>>>>>> https://accumulo.apache.org/1.8/accumulo_user_manual.html#_server_configuration_2
>>>>>>>>>>
>>>>>>>>>> On 7/11/17 1:25 AM, James Srinivasan wrote:
>>>>>>>>>>>
>>>>>>>>>>> Thanks both. I can't (easily) upgrade beyond 1.7.0, but have
>>>>>>>>>>> raised a support case with our Hadoop distribution vendor.
>>>>>>>>>>>
>>>>>>>>>>> I'm not (yet) worried about expiration with MapReduce - for now
>>>>>>>>>>> I'll try to keep such jobs to under 24h! Outside MR, sounds like I
>>>>>>>>>>> just need to periodically call
>>>>>>>>>>> UserGroupInformation.checkTGTAndReloginFromKeytab like
>>>>>>>>>>>
>>>>>>>>>>> https://github.com/apache/accumulo/blob/master/server/base/src/main/java/org/apache/accumulo/server/security/SecurityUtil.java#L121
>>>>>>>>>>>
>>>>>>>>>>> Or is the TGT associated with an Accumulo KerberosToken separate?
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>>
>>>>>>>>>>> James
>>>>>>>>>>>
>>>>>>>>>>> On 11 July 2017 at 02:59, Josh Elser <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> No, you are (likely) not running into ACCUMULO-4069. What you've
>>>>>>>>>>>> described sounds like your client's ticket expired. Accumulo does
>>>>>>>>>>>> not spawn any ticket renewal on the behalf of clients.
>>>>>>>>>>>>
>>>>>>>>>>>> Hadoop's UGI code will automatically spawn a renewal thread when
>>>>>>>>>>>> you log in using a ticket cache. This does not happen
>>>>>>>>>>>> automatically when you use a keytab (I have no explanation as to
>>>>>>>>>>>> why this is). This is the most likely cause of your error and
>>>>>>>>>>>> something you need to correct in your application (spawn a thread
>>>>>>>>>>>> to renew your application's ticket).
>>>>>>>>>>>>
>>>>>>>>>>>> If you are using MapReduce, you have yet another layer of
>>>>>>>>>>>> indirection with DelegationTokens, but that's probably not what
>>>>>>>>>>>> you're seeing (as DelegationTokens don't actually have a Kerberos
>>>>>>>>>>>> TGT).
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Jul 10, 2017 at 5:42 PM, Christopher
>>>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> It certainly sounds like the same issue. I'd recommend upgrading
>>>>>>>>>>>>> to the latest 1.7.3 (currently the latest 1.7 version) to
>>>>>>>>>>>>> include all the bugs we've found and fixed in that release line.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, Jul 10, 2017 at 5:50 AM James Srinivasan
>>>>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I'm using Accumulo 1.7.0 and finding that after some period of
>>>>>>>>>>>>>> time (>8 hours, <3 days - happened over the weekend) my ingest
>>>>>>>>>>>>>> fails with errors regarding "Failed to find any Kerberos tgt".
>>>>>>>>>>>>>> My guess is that the ticket from the keytab has expired, and
>>>>>>>>>>>>>> needs to be renewed - from memory, I had seen a Kerberos tgt
>>>>>>>>>>>>>> renewer thread running in my client, so assumed it happened
>>>>>>>>>>>>>> automagically. Is that the case? Perhaps I am hitting this bug?
>>>>>>>>>>>>>> https://issues.apache.org/jira/browse/ACCUMULO-4069
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> James
>>>>>>>>
>>>>>>>> --
>>>>>>>> busbey
