Yup, I am indeed on HDP - thanks for the link. The services do log GSS exceptions every ten hours, but seem to sufficiently recover themselves. Having turned up logging on my client:
1) On client start, I see hadoop login messages
2) After 8 hours (0.8*10 hours) when the renewal is expected to take place, I don't see any hadoop login messages
3) After 10 hours, I see GSS exceptions
4) After each GSS exception, I see an attempt to renew but using ticket cache, rather than keytab.

Currently working on shortening the 10 hour expiry time so I can catch it in a debugger!

Thanks,

James

On 13 July 2017 at 15:20, Josh Elser <els...@apache.org> wrote:
> If you're using Hortonworks' HDP, you would probably benefit from
> https://github.com/hortonworks/accumulo
>
> There is likely a git-tag for the exact version that you're running. The
> line numbers would match there.
>
> To be clear, if your services (e.g. TabletServers) aren't failing after
> 10hrs, you're not running into ACCUMULO-4069. Given my (limited)
> understanding, your problem is purely client-side. It's possible that the
> client-side RPC implementation isn't correctly handling the ticket re-login,
> but I know there is specifically code in there to handle the re-login case.
>
> The next step would be getting some debug logging from your application
> around UserGroupInformation or the JDK itself, or just spin up a trivial
> example with a small relogin window to reproduce the problem.
>
> On 7/12/17 3:48 PM, James Srinivasan wrote:
>>
>> Yup, I'm going to spin up a vanilla 1.7.0 (maybe newer) install too to
>> see if it behaves any differently. There is at least one patch
>> included in their distro that isn't in the formal documentation, plus
>> it makes matching line numbers in logs to src code rather difficult.
>>
>> Thanks,
>>
>> James
>>
>> On 12 July 2017 at 20:37, Sean Busbey <bus...@cloudera.com> wrote:
>>>
>>> Hi James!
>>>
>>> It sounds like you may need to chase things down with your vendor,
>>> since the precise combination of patches included will make looking at
>>> things hard for the community.
>>>
>>> On Wed, Jul 12, 2017 at 11:01 AM, James Srinivasan
>>> <james.sriniva...@gmail.com> wrote:
>>>>
>>>> Hi,
>>>>
>>>> So I've fired off a thread to perform the periodic
>>>> checkTGTAndReloginFromKeytab call which seems to be running, but the
>>>> connection still fails with GSS errors after precisely 10 hours.
>>>>
>>>> While I am running 1.7.0, it seems the vendor included the
>>>> ACCUMULO-4069 patch, and immediately after the exception is thrown I
>>>> see a log entry "Performing ticket-cache-based Kerberos re-login".
>>>> However, it should be using a keytab - have turned up the logging to
>>>> 11 and will leave running overnight...
>>>>
>>>> James
>>>>
>>>> On 11 July 2017 at 16:17, Josh Elser <josh.el...@gmail.com> wrote:
>>>>>
>>>>> Nope, you've got it exactly right! That's the code I would've pointed
>>>>> you at to copy :)
>>>>>
>>>>> If/when you do get to long-running MR jobs, see the
>>>>> "general.delegation.token.*" configuration properties in this table[1].
>>>>> I think the docs are citing that one delegation token is valid for 7
>>>>> days, but it's been a long time since writing/testing that code.
>>>>>
>>>>> - Josh
>>>>>
>>>>> [1]
>>>>> https://accumulo.apache.org/1.8/accumulo_user_manual.html#_server_configuration_2
>>>>>
>>>>> On 7/11/17 1:25 AM, James Srinivasan wrote:
>>>>>>
>>>>>> Thanks both. I can't (easily) upgrade beyond 1.7.0, but have raised a
>>>>>> support case with our Hadoop distribution vendor.
>>>>>>
>>>>>> I'm not (yet) worried about expiration with MapReduce - for now I'll
>>>>>> try to keep such jobs to under 24h! Outside MR, sounds like I just
>>>>>> need to periodically call
>>>>>> UserGroupInformation.checkTGTAndReloginFromKeytab like
>>>>>>
>>>>>> https://github.com/apache/accumulo/blob/master/server/base/src/main/java/org/apache/accumulo/server/security/SecurityUtil.java#L121
>>>>>>
>>>>>> Or is the TGT associated with an Accumulo KerberosToken separate?
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> James
>>>>>>
>>>>>> On 11 July 2017 at 02:59, Josh Elser <josh.el...@gmail.com> wrote:
>>>>>>>
>>>>>>> No, you are (likely) not running into ACCUMULO-4069. What you've
>>>>>>> described sounds like your client's ticket expired. Accumulo does not
>>>>>>> spawn any ticket renewal on the behalf of clients.
>>>>>>>
>>>>>>> Hadoop's UGI code will automatically spawn a renewal thread when you
>>>>>>> log in using a ticket cache. This does not happen automatically when
>>>>>>> you use a keytab (I have no explanation as to why this is). This is
>>>>>>> the most likely cause of your error and something you need to correct
>>>>>>> in your application (spawn a thread to renew your application's
>>>>>>> ticket).
>>>>>>>
>>>>>>> If you are using MapReduce, you have yet another layer of indirection
>>>>>>> with DelegationTokens, but that's probably not what you're seeing (as
>>>>>>> DelegationTokens don't actually have a Kerberos TGT).
>>>>>>>
>>>>>>> On Mon, Jul 10, 2017 at 5:42 PM, Christopher <ctubb...@apache.org>
>>>>>>> wrote:
>>>>>>>>
>>>>>>>> It certainly sounds like the same issue. I'd recommend upgrading to
>>>>>>>> the latest 1.7.3 (currently the latest 1.7 version) to include all
>>>>>>>> the bugs we've found and fixed in that release line.
>>>>>>>>
>>>>>>>> On Mon, Jul 10, 2017 at 5:50 AM James Srinivasan
>>>>>>>> <james.sriniva...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> I'm using Accumulo 1.7.0 and finding that after some period of time
>>>>>>>>> (>8 hours, <3 days - happened over the weekend) my ingest fails
>>>>>>>>> with errors regarding "Failed to find any Kerberos tgt". My guess
>>>>>>>>> is that the ticket from the keytab has expired, and needs to be
>>>>>>>>> renewed - from memory, I had seen a Kerberos tgt renewer thread
>>>>>>>>> running in my client, so assumed it happened automagically. Is
>>>>>>>>> that the case?
>>>>>>>>> Perhaps I am hitting this bug?
>>>>>>>>> https://issues.apache.org/jira/browse/ACCUMULO-4069
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> James
>>>
>>>
>>> --
>>> busbey
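
For reference, the client-side approach discussed in this thread (a background thread that periodically attempts a keytab relogin, with renewal expected at 0.8 of the ticket lifetime) could look roughly like the sketch below. This is only an illustration under stated assumptions: `PeriodicRelogin`, `ReloginAction`, `renewalPoint` and `start` are hypothetical names, not part of Hadoop or Accumulo, and the relogin itself is passed in as a callback so the example runs without Hadoop on the classpath. In a real client the callback would presumably wrap `UserGroupInformation.getLoginUser().checkTGTAndReloginFromKeytab()`.

```java
import java.time.Duration;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical helper for illustration; not part of Hadoop or Accumulo. */
final class PeriodicRelogin {

    /** Stand-in for a relogin call that may throw (e.g. an IOException). */
    interface ReloginAction {
        void run() throws Exception;
    }

    /** Point in the ticket lifetime at which renewal is expected, e.g. 0.8 * 10h = 8h. */
    static Duration renewalPoint(Duration ticketLifetime, double fraction) {
        return Duration.ofMillis(Math.round(ticketLifetime.toMillis() * fraction));
    }

    /** Runs the action immediately and then at a fixed interval on a daemon thread. */
    static ScheduledExecutorService start(ReloginAction action, Duration interval) {
        ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "kerberos-relogin");
                t.setDaemon(true); // do not keep the JVM alive just for relogin
                return t;
            });
        scheduler.scheduleAtFixedRate(() -> {
            try {
                action.run();
            } catch (Exception e) {
                // An uncaught exception would cancel all later runs, so catch it here.
                // A real client would log this properly.
                System.err.println("relogin attempt failed: " + e);
            }
        }, 0, interval.toMillis(), TimeUnit.MILLISECONDS);
        return scheduler;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(renewalPoint(Duration.ofHours(10), 0.8)); // prints PT8H

        AtomicInteger attempts = new AtomicInteger();
        // Assumption: with hadoop-common available, the action would instead be
        //   () -> UserGroupInformation.getLoginUser().checkTGTAndReloginFromKeytab()
        ScheduledExecutorService s = start(attempts::incrementAndGet, Duration.ofMillis(50));
        Thread.sleep(300);
        s.shutdownNow();
        System.out.println("relogin attempts: " + attempts.get());
    }
}
```

Taking the relogin as a callback also makes it easy to try a very short interval against a short-lived test ticket, in the spirit of the "small relogin window" suggestion above. Note that this sketch does not address the symptom reported at the top of the thread, where the re-login falls back to the ticket cache instead of the keytab.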