Looks like I'm hitting this issue: https://issues.apache.org/jira/browse/HDFS-7005
It seems we had a networking hiccup and Hadoop didn't handle it well: a thread got stuck on a socket.

On Tue, Jun 16, 2015 at 12:31 PM, Matt K <[email protected]> wrote:

It's getting stuck while initializing the distributed cache. Why, I'm not sure.

On Tue, Jun 16, 2015 at 12:18 PM, Matt K <[email protected]> wrote:

I made a little bit of progress with the investigation.

The TaskTracker has 2 consumer threads to kick off tasks: one for mappers and another for reducers. The mapper thread got stuck, so that consumer isn't picking up any more tasks, and they are timing out after 10 minutes.

Looks like it got stuck here:

https://github.com/matvey14/cdh5.0.2/blob/master/src/hadoop-mapreduce1-project/src/mapred/org/apache/hadoop/mapred/TaskTracker.java#L2754

And here's the stack trace. From it, the thread appears to be blocked reading from a socket:

Thread 9335: (state = IN_NATIVE)
 - sun.nio.ch.EPollArrayWrapper.epollWait(long, int, long, int) @bci=0 (Compiled frame; information may be imprecise)
 - sun.nio.ch.EPollArrayWrapper.poll(long) @bci=18, line=269 (Compiled frame)
 - sun.nio.ch.EPollSelectorImpl.doSelect(long) @bci=28, line=79 (Interpreted frame)
 - sun.nio.ch.SelectorImpl.lockAndDoSelect(long) @bci=37, line=87 (Interpreted frame)
 - sun.nio.ch.SelectorImpl.select(long) @bci=30, line=98 (Compiled frame)
 - org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(java.nio.channels.SelectableChannel, int, long) @bci=46, line=335 (Interpreted frame)
 - org.apache.hadoop.net.SocketIOWithTimeout.doIO(java.nio.ByteBuffer, int) @bci=80, line=157 (Interpreted frame)
 - org.apache.hadoop.net.SocketInputStream.read(java.nio.ByteBuffer) @bci=6, line=161 (Interpreted frame)
 - org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.readChannelFully(java.nio.channels.ReadableByteChannel, java.nio.ByteBuffer) @bci=9, line=258 (Interpreted frame)
 - org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(java.nio.channels.ReadableByteChannel, java.io.InputStream, java.nio.ByteBuffer) @bci=6, line=209 (Interpreted frame)
 - org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(java.nio.channels.ReadableByteChannel, java.io.InputStream) @bci=293, line=171 (Interpreted frame)
 - org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(java.nio.channels.ReadableByteChannel) @bci=3, line=102 (Interpreted frame)
 - org.apache.hadoop.hdfs.RemoteBlockReader2.read(byte[], int, int) @bci=27, line=138 (Interpreted frame)
 - org.apache.hadoop.hdfs.DFSInputStream$ByteArrayStrategy.doRead(org.apache.hadoop.hdfs.BlockReader, int, int, org.apache.hadoop.hdfs.DFSInputStream$ReadStatistics) @bci=7, line=686 (Interpreted frame)
 - org.apache.hadoop.hdfs.DFSInputStream.readBuffer(org.apache.hadoop.hdfs.DFSInputStream$ReaderStrategy, int, int, java.util.Map) @bci=14, line=742 (Interpreted frame)
 - org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(org.apache.hadoop.hdfs.DFSInputStream$ReaderStrategy, int, int) @bci=141, line=799 (Interpreted frame)
 - org.apache.hadoop.hdfs.DFSInputStream.read(byte[], int, int) @bci=15, line=840 (Interpreted frame)
 - java.io.DataInputStream.read(byte[]) @bci=8, line=100 (Interpreted frame)
 - org.apache.hadoop.io.IOUtils.copyBytes(java.io.InputStream, java.io.OutputStream, int) @bci=67, line=84 (Interpreted frame)
 - org.apache.hadoop.io.IOUtils.copyBytes(java.io.InputStream, java.io.OutputStream, int, boolean) @bci=3, line=52 (Interpreted frame)
 - org.apache.hadoop.io.IOUtils.copyBytes(java.io.InputStream, java.io.OutputStream, org.apache.hadoop.conf.Configuration, boolean) @bci=12, line=112 (Interpreted frame)
 - org.apache.hadoop.fs.FileUtil.copy(org.apache.hadoop.fs.FileSystem, org.apache.hadoop.fs.FileStatus, org.apache.hadoop.fs.FileSystem, org.apache.hadoop.fs.Path, boolean, boolean, org.apache.hadoop.conf.Configuration) @bci=138, line=366 (Interpreted frame)
 - org.apache.hadoop.fs.FileUtil.copy(org.apache.hadoop.fs.FileSystem, org.apache.hadoop.fs.FileStatus, org.apache.hadoop.fs.FileSystem, org.apache.hadoop.fs.Path, boolean, boolean, org.apache.hadoop.conf.Configuration) @bci=95, line=356 (Interpreted frame)
 - org.apache.hadoop.fs.FileUtil.copy(org.apache.hadoop.fs.FileSystem, org.apache.hadoop.fs.FileStatus, org.apache.hadoop.fs.FileSystem, org.apache.hadoop.fs.Path, boolean, boolean, org.apache.hadoop.conf.Configuration) @bci=95, line=356 (Interpreted frame)
 - org.apache.hadoop.fs.FileUtil.copy(org.apache.hadoop.fs.FileSystem, org.apache.hadoop.fs.Path, org.apache.hadoop.fs.FileSystem, org.apache.hadoop.fs.Path, boolean, boolean, org.apache.hadoop.conf.Configuration) @bci=18, line=338 (Interpreted frame)
 - org.apache.hadoop.fs.FileUtil.copy(org.apache.hadoop.fs.FileSystem, org.apache.hadoop.fs.Path, org.apache.hadoop.fs.FileSystem, org.apache.hadoop.fs.Path, boolean, org.apache.hadoop.conf.Configuration) @bci=9, line=289 (Interpreted frame)
 - org.apache.hadoop.fs.FileSystem.copyToLocalFile(boolean, org.apache.hadoop.fs.Path, org.apache.hadoop.fs.Path, boolean) @bci=42, line=1968 (Interpreted frame)
 - org.apache.hadoop.fs.FileSystem.copyToLocalFile(boolean, org.apache.hadoop.fs.Path, org.apache.hadoop.fs.Path) @bci=5, line=1937 (Interpreted frame)
 - org.apache.hadoop.fs.FileSystem.copyToLocalFile(org.apache.hadoop.fs.Path, org.apache.hadoop.fs.Path) @bci=4, line=1913 (Interpreted frame)
 - org.apache.hadoop.filecache.TrackerDistributedCacheManager.downloadCacheObject(org.apache.hadoop.conf.Configuration, java.net.URI, org.apache.hadoop.fs.Path, long, boolean, org.apache.hadoop.fs.permission.FsPermission) @bci=288, line=445 (Interpreted frame)
 - org.apache.hadoop.mapred.JobLocalizer.downloadPrivateCacheObjects(org.apache.hadoop.conf.Configuration, java.net.URI[], org.apache.hadoop.fs.Path[], long[], boolean[], boolean) @bci=150, line=328 (Interpreted frame)
 - org.apache.hadoop.mapred.JobLocalizer.downloadPrivateCache(org.apache.hadoop.conf.Configuration) @bci=18, line=346 (Interpreted frame)
 - org.apache.hadoop.mapred.JobLocalizer.localizeJobFiles(org.apache.hadoop.mapreduce.JobID, org.apache.hadoop.mapred.JobConf, org.apache.hadoop.fs.Path, org.apache.hadoop.fs.Path, org.apache.hadoop.mapred.TaskUmbilicalProtocol) @bci=51, line=391 (Interpreted frame)
 - org.apache.hadoop.mapred.JobLocalizer.localizeJobFiles(org.apache.hadoop.mapreduce.JobID, org.apache.hadoop.mapred.JobConf, org.apache.hadoop.fs.Path, org.apache.hadoop.mapred.TaskUmbilicalProtocol) @bci=21, line=370 (Interpreted frame)
 - org.apache.hadoop.mapred.DefaultTaskController.initializeJob(java.lang.String, java.lang.String, org.apache.hadoop.fs.Path, org.apache.hadoop.fs.Path, org.apache.hadoop.mapred.TaskUmbilicalProtocol, java.net.InetSocketAddress) @bci=115, line=231 (Interpreted frame)
 - org.apache.hadoop.mapred.TaskTracker$4.run() @bci=163, line=1472 (Interpreted frame)
 - java.security.AccessController.doPrivileged(java.security.PrivilegedExceptionAction, java.security.AccessControlContext) @bci=0 (Interpreted frame)
 - javax.security.auth.Subject.doAs(javax.security.auth.Subject, java.security.PrivilegedExceptionAction) @bci=42, line=415 (Interpreted frame)
 - org.apache.hadoop.security.UserGroupInformation.doAs(java.security.PrivilegedExceptionAction) @bci=14, line=1548 (Interpreted frame)
 - org.apache.hadoop.mapred.TaskTracker.initializeJob(org.apache.hadoop.mapred.Task, org.apache.hadoop.mapred.TaskTracker$RunningJob, java.net.InetSocketAddress) @bci=215, line=1447 (Interpreted frame)
 - org.apache.hadoop.mapred.TaskTracker.localizeJob(org.apache.hadoop.mapred.TaskTracker$TaskInProgress) @bci=96, line=1362 (Interpreted frame)
 - org.apache.hadoop.mapred.TaskTracker.startNewTask(org.apache.hadoop.mapred.TaskTracker$TaskInProgress) @bci=2, line=2790 (Interpreted frame)
 - org.apache.hadoop.mapred.TaskTracker$TaskLauncher.run() @bci=491, line=2754 (Interpreted frame)

There weren't any tasks running on that node when the thread dump was taken, so I'm pretty sure it's stuck here. However, there's a socket timeout of 10 seconds, so I don't understand why it didn't time out.

On Tue, Jun 16, 2015 at 7:51 AM, Matt K <[email protected]> wrote:

Hi Nat, I should've mentioned this before: we're running MRv1.

On Tue, Jun 16, 2015 at 2:24 AM, nataraj jonnalagadda <[email protected]> wrote:

Hey Matt,

It's possibly due to your YARN config: YARN/MapRed ACLs, the YARN scheduler config, or cgroups (if enabled) not set up correctly. We could dig in more if we have the yarn-site.xml and scheduler conf files.

Thanks,
Nat.

On Mon, Jun 15, 2015 at 10:39 PM, Matt K <[email protected]> wrote:

I see there are 2 threads: one that kicks off the mappers, and another that kicks off the reducers. The one that kicks off the mappers got stuck. It's not yet clear to me where exactly it got stuck.

On Tue, Jun 16, 2015 at 1:11 AM, Matt K <[email protected]> wrote:

Hi all,

I'm dealing with a production issue; any help would be appreciated. I am seeing very strange behavior in the TaskTrackers. After they pick up a task, it never comes out of the UNASSIGNED state, and the task just gets killed 10 minutes later.
2015-06-16 02:42:21,114 INFO org.apache.hadoop.mapred.TaskTracker: LaunchTaskAction (registerTask): attempt_201506152116_0046_m_000286_0 task's state:UNASSIGNED
2015-06-16 02:52:21,805 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201506152116_0046_m_000286_0: Task attempt_201506152116_0046_m_000286_0 failed to report status for 600 seconds. Killing!

Normally, I would see the following in the logs:

2015-06-16 04:30:32,328 INFO org.apache.hadoop.mapred.TaskTracker: Trying to launch : attempt_201506152116_0062_r_000004_0 which needs 1 slots

However, it doesn't get this far for these particular tasks. I am perusing the source code here, and this doesn't seem to be possible:

http://grepcode.com/file/repository.cloudera.com/content/repositories/releases/com.cloudera.hadoop/hadoop-core/0.20.2-737/org/apache/hadoop/mapred/TaskTracker.java#TaskTracker.TaskLauncher.0tasksToLaunch

The code does something like this:

public void addToTaskQueue(LaunchTaskAction action) {
  synchronized (tasksToLaunch) {
    TaskInProgress tip = registerTask(action, this);
    tasksToLaunch.add(tip);
    tasksToLaunch.notifyAll();
  }
}

The following should pick it up:

public void run() {
  while (!Thread.interrupted()) {
    try {
      TaskInProgress tip;
      Task task;
      synchronized (tasksToLaunch) {
        while (tasksToLaunch.isEmpty()) {
          tasksToLaunch.wait();
        }
        // get the TIP
        tip = tasksToLaunch.remove(0);
        task = tip.getTask();
        LOG.info("Trying to launch : " + tip.getTask().getTaskID() +
                 " which needs " + task.getNumSlotsRequired() + " slots");
      }

What's even stranger is that this is happening for Map tasks only. Reduce tasks are fine.
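For reference, the hand-off quoted above boils down to a standard single-consumer wait()/notifyAll() queue. Here is a minimal runnable sketch (class and task names are made up for illustration; this is not the actual TaskTracker code). The point it demonstrates: only one consumer thread per task type drains the queue, so if that thread blocks inside the launch step, every task queued behind it just sits there.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal single-consumer queue in the same shape as TaskTracker's
// tasksToLaunch hand-off. Names are hypothetical.
public class TaskLauncherSketch {
    private final List<String> tasksToLaunch = new ArrayList<>();

    // Producer side, analogous to addToTaskQueue(): enqueue and wake the consumer.
    public void addToTaskQueue(String taskId) {
        synchronized (tasksToLaunch) {
            tasksToLaunch.add(taskId);
            tasksToLaunch.notifyAll();
        }
    }

    // Consumer side, analogous to the loop in TaskLauncher.run(): wait until
    // the queue is non-empty, then take the head. Only ONE thread per task
    // type calls this, so if it blocks downstream (e.g. in job localization),
    // nothing else drains the queue.
    public String takeNextTask() throws InterruptedException {
        synchronized (tasksToLaunch) {
            while (tasksToLaunch.isEmpty()) {
                tasksToLaunch.wait();
            }
            return tasksToLaunch.remove(0);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        TaskLauncherSketch launcher = new TaskLauncherSketch();
        launcher.addToTaskQueue("attempt_0001_m_000001");
        launcher.addToTaskQueue("attempt_0001_m_000002");
        // Drains in FIFO order; had the consumer blocked while launching the
        // first task, the second would wait indefinitely (UNASSIGNED).
        System.out.println(launcher.takeNextTask()); // attempt_0001_m_000001
        System.out.println(launcher.takeNextTask()); // attempt_0001_m_000002
    }
}
```

This shape is also consistent with only map tasks being affected: map and reduce launches go through separate launcher threads, so a blocked map launcher leaves the reduce side untouched.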
This is only happening on a handful of the nodes, but enough to either slow down jobs or cause them to fail.

We're running Hadoop 2.3.0-cdh5.0.2.

Thanks,

-Matt

--
www.calcmachine.com - easy online calculator.
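The failure mode described above (a read blocked in epollWait that never times out after a network hiccup) can be reproduced in miniature with plain java.net sockets. This sketch is purely illustrative (hypothetical names, not Hadoop code): the peer is alive but never sends a byte, so without an explicit read timeout the read() would park the thread indefinitely, which is what a half-dead connection looks like; with SO_TIMEOUT set, the read fails fast instead.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Demonstrates a blocking read against a peer that never sends anything.
// All names are made up for illustration.
public class ReadTimeoutDemo {

    // Connects to a local server that accepts but never writes, then tries to
    // read with a 500 ms SO_TIMEOUT. Returns "timed out" when the timeout fires.
    public static String attemptRead() throws IOException {
        try (ServerSocket server = new ServerSocket(0);
             Socket client = new Socket()) {
            client.connect(new InetSocketAddress("127.0.0.1", server.getLocalPort()), 1000);
            client.setSoTimeout(500); // without this, read() below can block forever
            try (Socket peer = server.accept()) {
                try {
                    client.getInputStream().read(); // peer never writes a byte
                    return "got data";
                } catch (SocketTimeoutException e) {
                    return "timed out";
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(attemptRead()); // timed out
    }
}
```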
