It sometimes works and sometimes fails. If a job runs for more than an hour, usually at least one node fails and gets blacklisted; I have seen as many as five nodes blacklisted during a single job.
I cleared my tmp directory, tried using /var/tmp for dfs and mapred storage, and I have over 300 GB of free space on every node. Is there another log, or maybe a system log file, that I can look at to find the root cause of this issue?

On Tue, May 28, 2013 at 9:24 AM, Jean-Marc Spaggiari <[email protected]> wrote:

> That's strange.
>
> So now each time you run it, is it failing with the exception below?
> Or is it sometimes working, sometimes failing?
>
> Also, can you clear your tmp directory and make sure you have enough
> space in it before you retry?
>
> JM
>
>
> 2013/5/27 Jim Twensky <[email protected]>
>
>> Hi Jean,
>>
>> I switched to Oracle JDK 1.6 as you suggested and ran a job successfully
>> this afternoon which lasted about 3 hours - this job normally produced
>> errors with OpenJDK. I then stopped the cluster, restarted it, and tried
>> running the same job, but I got the same "failure to login" error with
>> the Oracle JDK. This is really weird and unusual. I am pasting the stack
>> trace below. It occurred on 3 different nodes out of 20. Any suggestions?
>>
>> Exception in thread "main" java.io.IOException: Exception reading
>> file:/var/tmp/jim/hadoop-jim/mapred/local/taskTracker/jim/jobcache/job_201305262239_0002/jobToken
>>     at org.apache.hadoop.security.Credentials.readTokenStorageFile(Credentials.java:135)
>>     at org.apache.hadoop.mapreduce.security.TokenCache.loadTokens(TokenCache.java:165)
>>     at org.apache.hadoop.mapred.Child.main(Child.java:92)
>> Caused by: java.io.IOException: failure to login
>>     at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:501)
>>     at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:463)
>>     at org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:1519)
>>     at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1420)
>>     at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:254)
>>     at org.apache.hadoop.fs.Path.getFileSystem(Path.java:187)
>>     at org.apache.hadoop.security.Credentials.readTokenStorageFile(Credentials.java:129)
>>     ... 2 more
>> Caused by: javax.security.auth.login.LoginException: java.lang.NullPointerException: invalid null input: name
>>     at com.sun.security.auth.UnixPrincipal.<init>(UnixPrincipal.java:53)
>>     at com.sun.security.auth.module.UnixLoginModule.login(UnixLoginModule.java:114)
>>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>     at java.lang.reflect.Method.invoke(Method.java:597)
>>     at javax.security.auth.login.LoginContext.invoke(LoginContext.java:769)
>>     at javax.security.auth.login.LoginContext.access$000(LoginContext.java:186)
>>     at javax.security.auth.login.LoginContext$5.run(LoginContext.java:706)
>>     at java.security.AccessController.doPrivileged(Native Method)
>>     at javax.security.auth.login.LoginContext.invokeCreatorPriv(LoginContext.java:703)
>>     at javax.security.auth.login.LoginContext.login(LoginContext.java:575)
>>     at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:482)
>>     at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:463)
>>     at org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:1519)
>>     at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1420)
>>     at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:254)
>>     at org.apache.hadoop.fs.Path.getFileSystem(Path.java:187)
>>     at org.apache.hadoop.security.Credentials.readTokenStorageFile(Credentials.java:129)
>>     at org.apache.hadoop.mapreduce.security.TokenCache.loadTokens(TokenCache.java:165)
>>     at org.apache.hadoop.mapred.Child.main(Child.java:92)
>>     at javax.security.auth.login.LoginContext.invoke(LoginContext.java:872)
>>     at javax.security.auth.login.LoginContext.access$000(LoginContext.java:186)
>>     at javax.security.auth.login.LoginContext$5.run(LoginContext.java:706)
>>     at java.security.AccessController.doPrivileged(Native Method)
>>     at javax.security.auth.login.LoginContext.invokeCreatorPriv(LoginContext.java:703)
>>     at javax.security.auth.login.LoginContext.login(LoginContext.java:575)
>>     at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:482)
>>
>> On Sat, May 25, 2013 at 12:14 PM, Jean-Marc Spaggiari <[email protected]> wrote:
>>
>>> Hi Jim,
>>>
>>> Will you be able to do the same test with Oracle JDK 1.6 instead of
>>> OpenJDK 1.7 to see if it makes a difference?
>>>
>>> JM
>>>
>>>
>>> 2013/5/25 Jim Twensky <[email protected]>
>>>
>>>> Hi Jean, thanks for replying. I'm using Java 1.7.0_21 on Ubuntu. Here
>>>> is the output:
>>>>
>>>> $ java -version
>>>> java version "1.7.0_21"
>>>> OpenJDK Runtime Environment (IcedTea 2.3.9) (7u21-2.3.9-0ubuntu0.12.10.1)
>>>> OpenJDK 64-Bit Server VM (build 23.7-b01, mixed mode)
>>>>
>>>> I don't get any OOME errors, and this error happens on random nodes,
>>>> not a particular one. Usually all tasks running on a given node fail
>>>> and that node gets blacklisted, yet the same node works just fine
>>>> during the next or previous jobs. Could it be a problem with the ssh
>>>> keys? What else can cause an IOException with a "failure to login"
>>>> message? I've been digging into this for two days but I'm almost
>>>> clueless.
>>>>
>>>> Thanks,
>>>> Jim
>>>>
>>>>
>>>> On Fri, May 24, 2013 at 10:32 PM, Jean-Marc Spaggiari <[email protected]> wrote:
>>>>
>>>>> Hi Jim,
>>>>>
>>>>> Which JVM are you using?
>>>>>
>>>>> I don't think you have a memory issue; otherwise you would have seen
>>>>> some OOME...
>>>>>
>>>>> JM
>>>>>
>>>>>
>>>>> 2013/5/24 Jim Twensky <[email protected]>
>>>>>
>>>>>> Hi again. In addition to my previous post, I was able to get some
>>>>>> error logs from the task tracker/data node this morning, and it
>>>>>> looks like it might be a Jetty issue:
>>>>>>
>>>>>> 2013-05-23 19:59:20,595 WARN org.apache.hadoop.mapred.TaskLog: Failed to retrieve stdout log for task: attempt_201305231647_0007_m_001096_0
>>>>>> java.io.IOException: Owner 'jim' for path /var/tmp/jim/hadoop-logs/userlogs/job_201305231647_0007/attempt_201305231647_0007_m_001096_0/stdout did not match expected owner '10929'
>>>>>>     at org.apache.hadoop.io.SecureIOUtils.checkStat(SecureIOUtils.java:177)
>>>>>>     at org.apache.hadoop.io.SecureIOUtils.openForRead(SecureIOUtils.java:117)
>>>>>>     at org.apache.hadoop.mapred.TaskLog$Reader.<init>(TaskLog.java:455)
>>>>>>     at org.apache.hadoop.mapred.TaskLogServlet.printTaskLog(TaskLogServlet.java:81)
>>>>>>     at org.apache.hadoop.mapred.TaskLogServlet.doGet(TaskLogServlet.java:296)
>>>>>>     at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
>>>>>>     at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
>>>>>>     at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
>>>>>>     at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221)
>>>>>>     at org.apache.hadoop.http.HttpServer$QuotingInputFilter.doFilter(HttpServer.java:848)
>>>>>>     at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
>>>>>>     at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
>>>>>>
>>>>>> I am wondering if I am hitting MAPREDUCE-2389
>>>>>> <https://issues.apache.org/jira/browse/MAPREDUCE-2389>. If so, how
>>>>>> do I downgrade my Jetty version? Should I just replace the Jetty jar
>>>>>> file in the lib directory with an earlier version and restart my
>>>>>> cluster?
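[A sketch that may help here: the owner mismatch above ('jim' vs. the numeric '10929') looks like a uid-to-name lookup failing on that node, which would also explain the "invalid null input: name" NPE from UnixLoginModule. The check below assumes Linux with GNU coreutils, and the log path is taken from the trace and may differ on your cluster.]

```shell
# Does the uid of the user running the TaskTracker resolve to a name?
# If NSS/LDAP/nscd lookups fail intermittently, this will too.
uid=$(id -u)
if getent passwd "$uid" > /dev/null; then
    echo "uid $uid resolves to $(id -un)"
else
    echo "uid $uid does NOT resolve -- check /etc/passwd, NSS, or LDAP/nscd"
fi

# Compare the owner of the task log dir as a name (%U) and as a uid (%u).
# Path is an assumption copied from the trace above; adjust as needed.
logdir=/var/tmp/jim/hadoop-logs/userlogs
if [ -d "$logdir" ]; then
    stat -c 'owner-name=%U owner-uid=%u' "$logdir"
else
    echo "no such directory on this node: $logdir"
fi
```

If `stat` prints a raw uid where a name should be, the node's user database is the problem, not Jetty itself.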
>>>>>>
>>>>>> Thank you.
>>>>>>
>>>>>>
>>>>>> On Thu, May 23, 2013 at 7:14 PM, Jim Twensky <[email protected]> wrote:
>>>>>>
>>>>>>> Hello, I have a 20-node Hadoop cluster where each node has 8GB of
>>>>>>> memory and an 8-core processor. I sometimes get the following error
>>>>>>> on a random basis:
>>>>>>>
>>>>>>> -----------------------------------------------------------------------------------------------------------
>>>>>>> Exception in thread "main" java.io.IOException: Exception reading
>>>>>>> file:/var/tmp/jim/hadoop-jim/mapred/local/taskTracker/jim/jobcache/job_201305231647_0005/jobToken
>>>>>>>     at org.apache.hadoop.security.Credentials.readTokenStorageFile(Credentials.java:135)
>>>>>>>     at org.apache.hadoop.mapreduce.security.TokenCache.loadTokens(TokenCache.java:165)
>>>>>>>     at org.apache.hadoop.mapred.Child.main(Child.java:92)
>>>>>>> Caused by: java.io.IOException: failure to login
>>>>>>>     at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:501)
>>>>>>>     at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:463)
>>>>>>>     at org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:1519)
>>>>>>>     at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1420)
>>>>>>>     at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:254)
>>>>>>>     at org.apache.hadoop.fs.Path.getFileSystem(Path.java:187)
>>>>>>>     at org.apache.hadoop.security.Credentials.readTokenStorageFile(Credentials.java:129)
>>>>>>>     ... 2 more
>>>>>>> Caused by: javax.security.auth.login.LoginException: java.lang.NullPointerException: invalid null input: name
>>>>>>>     at com.sun.security.auth.UnixPrincipal.<init>(UnixPrincipal.java:70)
>>>>>>>     at com.sun.security.auth.module.UnixLoginModule.login(UnixLoginModule.java:132)
>>>>>>>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>>>>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>>>>>>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>>>>     ......
>>>>>>> -----------------------------------------------------------------------------------------------------------
>>>>>>>
>>>>>>> This does not always happen, but I see a pattern: when the
>>>>>>> intermediate data is larger, it tends to occur more frequently. In
>>>>>>> the web log, I can see the following:
>>>>>>>
>>>>>>> java.lang.Throwable: Child Error
>>>>>>>     at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:271)
>>>>>>> Caused by: java.io.IOException: Task process exit with nonzero status of 1.
>>>>>>>     at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:258)
>>>>>>>
>>>>>>> From what I read online, a possible cause is not having enough
>>>>>>> memory for all the JVMs. My mapred-site.xml allocates 1100MB for
>>>>>>> each child, and the maximum numbers of map and reduce tasks are each
>>>>>>> set to 3 - so 6600MB for the child JVMs, plus (500MB * 2) for the
>>>>>>> data node and task tracker (as I set HADOOP_HEAP to 500 MB). I feel
>>>>>>> like memory is not the cause, but I haven't been able to rule it out
>>>>>>> so far.
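[The heap budget described above can be checked with quick arithmetic; the slot counts and sizes below are the ones quoted in the message, nothing is measured.]

```shell
# 3 map slots + 3 reduce slots, each child JVM capped at -Xmx1100M,
# plus roughly 500M each for the DataNode and TaskTracker daemon heaps.
children=$(( (3 + 3) * 1100 ))
daemons=$(( 2 * 500 ))
total=$(( children + daemons ))
echo "children=${children}M daemons=${daemons}M total=${total}M of 8192M"
# That leaves under 600M for the OS, page cache, and non-heap JVM
# overhead (thread stacks, permgen). Tight, but a shortfall here would
# normally surface as OOMEs or kernel OOM kills, not as a login error.
```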
>>>>>>>
>>>>>>> In case it helps, here are the relevant sections of my
>>>>>>> mapred-site.xml:
>>>>>>>
>>>>>>> -----------------------------------------------------------------------------------------------------------
>>>>>>> <name>mapred.tasktracker.map.tasks.maximum</name>
>>>>>>> <value>3</value>
>>>>>>>
>>>>>>> <name>mapred.tasktracker.reduce.tasks.maximum</name>
>>>>>>> <value>3</value>
>>>>>>>
>>>>>>> <name>mapred.child.java.opts</name>
>>>>>>> <value>-Xmx1100M -ea -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/tmp/soner</value>
>>>>>>>
>>>>>>> <name>mapred.reduce.parallel.copies</name>
>>>>>>> <value>5</value>
>>>>>>>
>>>>>>> <name>tasktracker.http.threads</name>
>>>>>>> <value>80</value>
>>>>>>> -----------------------------------------------------------------------------------------------------------
>>>>>>>
>>>>>>> My jobs still complete most of the time, though they occasionally
>>>>>>> fail, and I'm really puzzled at this point. I'd appreciate any help
>>>>>>> or ideas.
>>>>>>>
>>>>>>> Thanks
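[On the question at the top of the thread about another log to check: the per-task logs are exactly what the web UI fails to serve here, so the TaskTracker daemon log and the system log are the more promising places for the underlying lookup failure. A sketch with assumed paths and grep patterns; adjust HADOOP_LOG_DIR and the patterns to your install.]

```shell
# TaskTracker daemon log: child-launch and login failures are usually
# recorded here rather than in the per-task stdout/stderr files.
HADOOP_LOG_DIR=${HADOOP_LOG_DIR:-/var/tmp/jim/hadoop-logs}
for f in "$HADOOP_LOG_DIR"/hadoop-*-tasktracker-*.log; do
    [ -f "$f" ] && grep -n "failure to login\|Child Error" "$f" | tail -5
done

# Name-service failures that break uid-to-name lookups tend to land in
# syslog or auth.log on Ubuntu:
for sys in /var/log/syslog /var/log/auth.log; do
    [ -r "$sys" ] && grep -i "nscd\|nss\|getpwuid" "$sys" | tail -5
done
echo "log scan done"
```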
