It sometimes works and sometimes fails. Usually, if my tasks take more than
an hour, at least one node fails and gets blacklisted. I have seen as many
as five nodes fail in a single job.

I cleared my tmp directory, tried using /var/tmp for dfs and mapred storage,
and I have over 300 GB of free space on every node. Is there another log,
or maybe a system log file, that I can look at to find the root cause of
this issue?
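Incidentally, the "invalid null input: name" NPE in the trace appears to
come from UnixLoginModule failing to resolve the current UID to a user
name. One cheap per-node sanity check (a sketch, assuming getent is
available on these Ubuntu boxes) is to do that resolution directly:

```shell
# The NPE is thrown when UnixPrincipal receives a null user name, i.e.
# when the native lookup of the current UID returns nothing. Verify the
# lookup by hand:
uid=$(id -u)
name=$(getent passwd "$uid" | cut -d: -f1)
echo "uid=$uid resolves to '$name'"
```

If that lookup ever comes back empty (for example, nscd or an LDAP/NIS
name-service backend timing out under load), it would match the pattern of
random nodes failing and then working again on the next job.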


On Tue, May 28, 2013 at 9:24 AM, Jean-Marc Spaggiari <
[email protected]> wrote:

> That's strange.
>
> So now, each time you run it, is it failing with the exception below?
> Or is it sometimes working and sometimes failing?
>
> Also, can you clear your tmp directory and make sure you have enough
> space in it before you retry?
>
> JM
>
>
> 2013/5/27 Jim Twensky <[email protected]>
>
>> Hi Jean,
>>
>> I switched to the Oracle JDK 1.6 as you suggested and this afternoon ran
>> a job successfully for about 3 hours - a job that normally produced
>> errors under OpenJDK. I then stopped the cluster, restarted it, and tried
>> running the same job, but I got the same "failure to login" error with
>> the Oracle JDK. This is really weird and unusual. I am pasting the stack
>> trace below. It occurred on 3 different nodes out of 20. Any suggestions?
>>
>> Exception in thread "main" java.io.IOException: Exception reading
>> file:/var/tmp/jim/hadoop-jim/mapred/local/taskTracker/jim/jobcache/job_201305262239_0002/jobToken
>>
>>     at
>> org.apache.hadoop.security.Credentials.readTokenStorageFile(Credentials.java:135)
>>     at
>> org.apache.hadoop.mapreduce.security.TokenCache.loadTokens(TokenCache.java:165)
>>     at org.apache.hadoop.mapred.Child.main(Child.java:92)
>> Caused by: java.io.IOException: failure to login
>>     at
>> org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:501)
>>     at
>> org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:463)
>>     at
>> org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:1519)
>>     at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1420)
>>     at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:254)
>>     at org.apache.hadoop.fs.Path.getFileSystem(Path.java:187)
>>     at
>> org.apache.hadoop.security.Credentials.readTokenStorageFile(Credentials.java:129)
>>     ... 2 more
>> Caused by: javax.security.auth.login.LoginException:
>> java.lang.NullPointerException: invalid null input: name
>>     at com.sun.security.auth.UnixPrincipal.<init>(UnixPrincipal.java:53)
>>     at
>> com.sun.security.auth.module.UnixLoginModule.login(UnixLoginModule.java:114)
>>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>     at
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>     at
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>     at java.lang.reflect.Method.invoke(Method.java:597)
>>     at
>> javax.security.auth.login.LoginContext.invoke(LoginContext.java:769)
>>     at
>> javax.security.auth.login.LoginContext.access$000(LoginContext.java:186)
>>     at javax.security.auth.login.LoginContext$5.run(LoginContext.java:706)
>>     at java.security.AccessController.doPrivileged(Native Method)
>>     at
>> javax.security.auth.login.LoginContext.invokeCreatorPriv(LoginContext.java:703)
>>     at javax.security.auth.login.LoginContext.login(LoginContext.java:575)
>>     at
>> org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:482)
>>
>>     at
>> org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:463)
>>     at
>> org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:1519)
>>     at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1420)
>>     at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:254)
>>     at org.apache.hadoop.fs.Path.getFileSystem(Path.java:187)
>>     at
>> org.apache.hadoop.security.Credentials.readTokenStorageFile(Credentials.java:129)
>>     at
>> org.apache.hadoop.mapreduce.security.TokenCache.loadTokens(TokenCache.java:165)
>>     at org.apache.hadoop.mapred.Child.main(Child.java:92)
>>
>>     at
>> javax.security.auth.login.LoginContext.invoke(LoginContext.java:872)
>>     at
>> javax.security.auth.login.LoginContext.access$000(LoginContext.java:186)
>>     at javax.security.auth.login.LoginContext$5.run(LoginContext.java:706)
>>     at java.security.AccessController.doPrivileged(Native Method)
>>     at
>> javax.security.auth.login.LoginContext.invokeCreatorPriv(LoginContext.java:703)
>>     at javax.security.auth.login.LoginContext.login(LoginContext.java:575)
>>     at
>> org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:482)
>>
>>
>> On Sat, May 25, 2013 at 12:14 PM, Jean-Marc Spaggiari <
>> [email protected]> wrote:
>>
>>> Hi Jim,
>>>
>>> Will you be able to do the same test with Oracle JDK 1.6 instead of
>>> OpenJDK 1.7 to see if it makes a difference?
>>>
>>> JM
>>>
>>>
>>> 2013/5/25 Jim Twensky <[email protected]>
>>>
>>>> Hi Jean, thanks for replying. I'm using Java 1.7.0_21 on Ubuntu. Here
>>>> is the output:
>>>>
>>>> $ java -version
>>>> java version "1.7.0_21"
>>>> OpenJDK Runtime Environment (IcedTea 2.3.9)
>>>> (7u21-2.3.9-0ubuntu0.12.10.1)
>>>> OpenJDK 64-Bit Server VM (build 23.7-b01, mixed mode)
>>>>
>>>> I don't get any OOMEs, and this error happens on random nodes, not a
>>>> particular one. Usually all tasks running on a given node fail and that
>>>> node gets blacklisted. However, the same node works just fine during
>>>> the next or previous jobs. Could it be a problem with the SSH keys?
>>>> What else can cause an IOException with a "failure to login" message?
>>>> I've been digging into this for two days but I'm almost clueless.
>>>>
>>>> Thanks,
>>>> Jim
>>>>
>>>>
>>>>
>>>>
>>>> On Fri, May 24, 2013 at 10:32 PM, Jean-Marc Spaggiari <
>>>> [email protected]> wrote:
>>>>
>>>>> Hi Jim,
>>>>>
>>>>> Which JVM are you using?
>>>>>
>>>>> I don't think you have a memory issue. Otherwise you would have seen
>>>>> some OOMEs...
>>>>>
>>>>> JM
>>>>>
>>>>>
>>>>> 2013/5/24 Jim Twensky <[email protected]>
>>>>>
>>>>>> Hi again, in addition to my previous post, I was able to get some
>>>>>> error logs from the task tracker/data node this morning, and it looks
>>>>>> like it might be a Jetty issue:
>>>>>>
>>>>>> 2013-05-23 19:59:20,595 WARN org.apache.hadoop.mapred.TaskLog: Failed
>>>>>> to retrieve stdout log for task: attempt_201305231647_0007_m_001096_0
>>>>>> java.io.IOException: Owner 'jim' for path
>>>>>> /var/tmp/jim/hadoop-logs/userlogs/job_201305231647_0007/attempt_201305231647_0007_m_001096_0/stdout
>>>>>> did not match expected owner '10929'
>>>>>>   at
>>>>>> org.apache.hadoop.io.SecureIOUtils.checkStat(SecureIOUtils.java:177)
>>>>>>   at
>>>>>> org.apache.hadoop.io.SecureIOUtils.openForRead(SecureIOUtils.java:117)
>>>>>>   at org.apache.hadoop.mapred.TaskLog$Reader.<init>(TaskLog.java:455)
>>>>>>   at
>>>>>> org.apache.hadoop.mapred.TaskLogServlet.printTaskLog(TaskLogServlet.java:81)
>>>>>>   at
>>>>>> org.apache.hadoop.mapred.TaskLogServlet.doGet(TaskLogServlet.java:296)
>>>>>>   at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
>>>>>>   at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
>>>>>>   at
>>>>>> org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
>>>>>>   at
>>>>>> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221)
>>>>>>   at
>>>>>> org.apache.hadoop.http.HttpServer$QuotingInputFilter.doFilter(HttpServer.java:848)
>>>>>>   at
>>>>>> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
>>>>>>   at
>>>>>> org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
>>>>>>
>>>>>>
>>>>>> I am wondering if I am hitting MAPREDUCE-2389
>>>>>> (https://issues.apache.org/jira/browse/MAPREDUCE-2389). If so, how do
>>>>>> I downgrade my Jetty version? Should I just replace the Jetty jar
>>>>>> file in the lib directory with an earlier version and restart my
>>>>>> cluster?
>>>>>>
>>>>>> Thank you.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, May 23, 2013 at 7:14 PM, Jim Twensky 
>>>>>> <[email protected]>wrote:
>>>>>>
>>>>>>> Hello, I have a 20-node Hadoop cluster where each node has 8 GB of
>>>>>>> memory and an 8-core processor. I sometimes get the following error
>>>>>>> at random:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> -----------------------------------------------------------------------------------------------------------
>>>>>>>
>>>>>>> Exception in thread "main" java.io.IOException: Exception reading 
>>>>>>> file:/var/tmp/jim/hadoop-jim/mapred/local/taskTracker/jim/jobcache/job_201305231647_0005/jobToken
>>>>>>>         at 
>>>>>>> org.apache.hadoop.security.Credentials.readTokenStorageFile(Credentials.java:135)
>>>>>>>         at 
>>>>>>> org.apache.hadoop.mapreduce.security.TokenCache.loadTokens(TokenCache.java:165)
>>>>>>>         at org.apache.hadoop.mapred.Child.main(Child.java:92)
>>>>>>> Caused by: java.io.IOException: failure to login
>>>>>>>         at 
>>>>>>> org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:501)
>>>>>>>         at 
>>>>>>> org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:463)
>>>>>>>         at 
>>>>>>> org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:1519)
>>>>>>>         at 
>>>>>>> org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1420)
>>>>>>>         at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:254)
>>>>>>>         at org.apache.hadoop.fs.Path.getFileSystem(Path.java:187)
>>>>>>>         at 
>>>>>>> org.apache.hadoop.security.Credentials.readTokenStorageFile(Credentials.java:129)
>>>>>>>         ... 2 more
>>>>>>> Caused by: javax.security.auth.login.LoginException: 
>>>>>>> java.lang.NullPointerException: invalid null input: name
>>>>>>>         at 
>>>>>>> com.sun.security.auth.UnixPrincipal.<init>(UnixPrincipal.java:70)
>>>>>>>         at 
>>>>>>> com.sun.security.auth.module.UnixLoginModule.login(UnixLoginModule.java:132)
>>>>>>>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>>>>         at 
>>>>>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>>>>>>         at 
>>>>>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>>>>
>>>>>>> ......
>>>>>>>
>>>>>>>
>>>>>>> -----------------------------------------------------------------------------------------------------------
>>>>>>>
>>>>>>> This does not always happen, but I see a pattern: when the
>>>>>>> intermediate data is larger, it tends to occur more frequently. In
>>>>>>> the web log, I can see the following:
>>>>>>>
>>>>>>> java.lang.Throwable: Child Error
>>>>>>>         at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:271)
>>>>>>> Caused by: java.io.IOException: Task process exit with nonzero status 
>>>>>>> of 1.
>>>>>>>         at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:258)
>>>>>>>
>>>>>>> From what I read online, a possible cause is not having enough
>>>>>>> memory for all the JVMs. My mapred-site.xml allocates 1100 MB to
>>>>>>> each child, and the maximum numbers of map and reduce tasks are each
>>>>>>> set to 3 - so 6600 MB for the child JVMs plus (500 MB * 2) for the
>>>>>>> data node and task tracker (I set HADOOP_HEAPSIZE to 500 MB), about
>>>>>>> 7600 MB in total. I feel like memory is not the cause, but I haven't
>>>>>>> been able to rule it out so far.
>>>>>>> In case it helps, here are the relevant sections of my
>>>>>>> mapred-site.xml:
>>>>>>>
>>>>>>>
>>>>>>> -----------------------------------------------------------------------------------------------------------
>>>>>>>
>>>>>>>   <property>
>>>>>>>     <name>mapred.tasktracker.map.tasks.maximum</name>
>>>>>>>     <value>3</value>
>>>>>>>   </property>
>>>>>>>
>>>>>>>   <property>
>>>>>>>     <name>mapred.tasktracker.reduce.tasks.maximum</name>
>>>>>>>     <value>3</value>
>>>>>>>   </property>
>>>>>>>
>>>>>>>   <property>
>>>>>>>     <name>mapred.child.java.opts</name>
>>>>>>>     <value>-Xmx1100M -ea -XX:+HeapDumpOnOutOfMemoryError
>>>>>>> -XX:HeapDumpPath=/var/tmp/soner</value>
>>>>>>>   </property>
>>>>>>>
>>>>>>>   <property>
>>>>>>>     <name>mapred.reduce.parallel.copies</name>
>>>>>>>     <value>5</value>
>>>>>>>   </property>
>>>>>>>
>>>>>>>   <property>
>>>>>>>     <name>tasktracker.http.threads</name>
>>>>>>>     <value>80</value>
>>>>>>>   </property>
>>>>>>>
>>>>>>> -----------------------------------------------------------------------------------------------------------
>>>>>>>
>>>>>>> My jobs still complete most of the time, though they occasionally
>>>>>>> fail, and I'm really puzzled at this point. I'd appreciate any help
>>>>>>> or ideas.
>>>>>>>
>>>>>>> Thanks
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
