I'm still working on this issue with the expired token error killing a long-running job. I noticed that the job failed shortly after 24 hours, and that there is a setting "yarn.resourcemanager.container-tokens.master-key-rolling-interval-secs" which by default is set to 24 hours. I could not find more information on this other than the description, which states that this is the interval at which the master key used to generate container tokens rolls over. Is it possible that this master key rolled over at the 24-hour mark and thus caused the expired token issue?
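If that rollover is indeed the cause, I assume one workaround would be to lengthen the rolling interval in yarn-site.xml on the ResourceManager host, something like the snippet below. The 7-day value is just an example I picked, and I'm not certain this is the right place to set it on a Cloudera-managed cluster, so please correct me if the placement is wrong:

  <property>
    <name>yarn.resourcemanager.container-tokens.master-key-rolling-interval-secs</name>
    <!-- example value only: 604800 seconds = 7 days -->
    <value>604800</value>
  </property>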
Unfortunately I could not find the "yarn.resourcemanager.container-tokens.master-key-rolling-interval-secs" setting in Cloudera Manager (I know that is Cloudera specific), but I think I can set it manually if anyone thinks that is worth trying.

Best Regards,
Ed Dorsey

On Fri, Oct 16, 2015 at 3:41 PM, ed <[email protected]> wrote:
> Hello,
>
> We just kicked off a large MR job that uses all the containers on our
> cluster. The job ran for 24 hours and then failed with the following error
> in the map phase (no reducers had started yet):
>
> 2015-10-16 12:38:17,781 ERROR [ContainerLauncher #2]
> org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl:
> Container launch failed for container_1444916180373_0003_01_089692 :
> org.apache.hadoop.yarn.exceptions.YarnException: Unauthorized request to
> start container.
> This token is expired. current time is 1445013467749 found 1445013416633
>         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>         at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>         at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>         at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>         at org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.instantiateException(SerializedExceptionPBImpl.java:152)
>         at org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.deSerialize(SerializedExceptionPBImpl.java:106)
>         at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$Container.launch(ContainerLauncherImpl.java:155)
>         at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$EventProcessor.run(ContainerLauncherImpl.java:369)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>
> The job has not had issues in the past, although this time it was running
> on a particularly large dataset. I checked all of our nodes and the times
> on the nodes are all properly synced with NTP. I found the JIRA issue
> YARN-1417, which seems to describe the problem we're having
> (https://issues.apache.org/jira/browse/YARN-1417), but this issue is marked
> resolved and the patch was included in CDH 5.0.0 (we are running 5.0.2), so
> we should not be having that particular problem.
>
> Could this be another bug in YARN related to expired tokens being
> assigned? I searched through JIRA but did not see any open issues that
> might relate to the error we're seeing. Are there any workarounds for this,
> or has anyone seen this happen before? Please let me know if there is any
> other information I can provide.
>
> Best Regards,
>
> Ed Dorsey
