It's getting stuck while initializing the distributed cache. Why, I'm not sure.

On Tue, Jun 16, 2015 at 12:18 PM, Matt K <[email protected]> wrote:
> I made a little bit of progress with the investigation.
>
> TaskTracker has 2 consumer threads to kick off tasks - one for mappers and
> another for reducers. The mapper thread got stuck, so the consumer isn't
> picking up any more tasks, and they are timing out after 10 minutes.
>
> Looks like it got stuck here:
>
> https://github.com/matvey14/cdh5.0.2/blob/master/src/hadoop-mapreduce1-project/src/mapred/org/apache/hadoop/mapred/TaskTracker.java#L2754
>
> And here's the stack-trace. From that it looks like it got stuck listening
> on a socket.
>
> Thread 9335: (state = IN_NATIVE)
>  - sun.nio.ch.EPollArrayWrapper.epollWait(long, int, long, int) @bci=0 (Compiled frame; information may be imprecise)
>  - sun.nio.ch.EPollArrayWrapper.poll(long) @bci=18, line=269 (Compiled frame)
>  - sun.nio.ch.EPollSelectorImpl.doSelect(long) @bci=28, line=79 (Interpreted frame)
>  - sun.nio.ch.SelectorImpl.lockAndDoSelect(long) @bci=37, line=87 (Interpreted frame)
>  - sun.nio.ch.SelectorImpl.select(long) @bci=30, line=98 (Compiled frame)
>  - org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(java.nio.channels.SelectableChannel, int, long) @bci=46, line=335 (Interpreted frame)
>  - org.apache.hadoop.net.SocketIOWithTimeout.doIO(java.nio.ByteBuffer, int) @bci=80, line=157 (Interpreted frame)
>  - org.apache.hadoop.net.SocketInputStream.read(java.nio.ByteBuffer) @bci=6, line=161 (Interpreted frame)
>  - org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.readChannelFully(java.nio.channels.ReadableByteChannel, java.nio.ByteBuffer) @bci=9, line=258 (Interpreted frame)
>  - org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(java.nio.channels.ReadableByteChannel, java.io.InputStream, java.nio.ByteBuffer) @bci=6, line=209 (Interpreted frame)
>  - org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(java.nio.channels.ReadableByteChannel, java.io.InputStream) @bci=293, line=171 (Interpreted frame)
>  - org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(java.nio.channels.ReadableByteChannel) @bci=3, line=102 (Interpreted frame)
>  - org.apache.hadoop.hdfs.RemoteBlockReader2.read(byte[], int, int) @bci=27, line=138 (Interpreted frame)
>  - org.apache.hadoop.hdfs.DFSInputStream$ByteArrayStrategy.doRead(org.apache.hadoop.hdfs.BlockReader, int, int, org.apache.hadoop.hdfs.DFSInputStream$ReadStatistics) @bci=7, line=686 (Interpreted frame)
>  - org.apache.hadoop.hdfs.DFSInputStream.readBuffer(org.apache.hadoop.hdfs.DFSInputStream$ReaderStrategy, int, int, java.util.Map) @bci=14, line=742 (Interpreted frame)
>  - org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(org.apache.hadoop.hdfs.DFSInputStream$ReaderStrategy, int, int) @bci=141, line=799 (Interpreted frame)
>  - org.apache.hadoop.hdfs.DFSInputStream.read(byte[], int, int) @bci=15, line=840 (Interpreted frame)
>  - java.io.DataInputStream.read(byte[]) @bci=8, line=100 (Interpreted frame)
>  - org.apache.hadoop.io.IOUtils.copyBytes(java.io.InputStream, java.io.OutputStream, int) @bci=67, line=84 (Interpreted frame)
>  - org.apache.hadoop.io.IOUtils.copyBytes(java.io.InputStream, java.io.OutputStream, int, boolean) @bci=3, line=52 (Interpreted frame)
>  - org.apache.hadoop.io.IOUtils.copyBytes(java.io.InputStream, java.io.OutputStream, org.apache.hadoop.conf.Configuration, boolean) @bci=12, line=112 (Interpreted frame)
>  - org.apache.hadoop.fs.FileUtil.copy(org.apache.hadoop.fs.FileSystem, org.apache.hadoop.fs.FileStatus, org.apache.hadoop.fs.FileSystem, org.apache.hadoop.fs.Path, boolean, boolean, org.apache.hadoop.conf.Configuration) @bci=138, line=366 (Interpreted frame)
>  - org.apache.hadoop.fs.FileUtil.copy(org.apache.hadoop.fs.FileSystem, org.apache.hadoop.fs.FileStatus, org.apache.hadoop.fs.FileSystem, org.apache.hadoop.fs.Path, boolean, boolean, org.apache.hadoop.conf.Configuration) @bci=95, line=356 (Interpreted frame)
>  - org.apache.hadoop.fs.FileUtil.copy(org.apache.hadoop.fs.FileSystem, org.apache.hadoop.fs.FileStatus, org.apache.hadoop.fs.FileSystem, org.apache.hadoop.fs.Path, boolean, boolean, org.apache.hadoop.conf.Configuration) @bci=95, line=356 (Interpreted frame)
>  - org.apache.hadoop.fs.FileUtil.copy(org.apache.hadoop.fs.FileSystem, org.apache.hadoop.fs.Path, org.apache.hadoop.fs.FileSystem, org.apache.hadoop.fs.Path, boolean, boolean, org.apache.hadoop.conf.Configuration) @bci=18, line=338 (Interpreted frame)
>  - org.apache.hadoop.fs.FileUtil.copy(org.apache.hadoop.fs.FileSystem, org.apache.hadoop.fs.Path, org.apache.hadoop.fs.FileSystem, org.apache.hadoop.fs.Path, boolean, org.apache.hadoop.conf.Configuration) @bci=9, line=289 (Interpreted frame)
>  - org.apache.hadoop.fs.FileSystem.copyToLocalFile(boolean, org.apache.hadoop.fs.Path, org.apache.hadoop.fs.Path, boolean) @bci=42, line=1968 (Interpreted frame)
>  - org.apache.hadoop.fs.FileSystem.copyToLocalFile(boolean, org.apache.hadoop.fs.Path, org.apache.hadoop.fs.Path) @bci=5, line=1937 (Interpreted frame)
>  - org.apache.hadoop.fs.FileSystem.copyToLocalFile(org.apache.hadoop.fs.Path, org.apache.hadoop.fs.Path) @bci=4, line=1913 (Interpreted frame)
>  - org.apache.hadoop.filecache.TrackerDistributedCacheManager.downloadCacheObject(org.apache.hadoop.conf.Configuration, java.net.URI, org.apache.hadoop.fs.Path, long, boolean, org.apache.hadoop.fs.permission.FsPermission) @bci=288, line=445 (Interpreted frame)
>  - org.apache.hadoop.mapred.JobLocalizer.downloadPrivateCacheObjects(org.apache.hadoop.conf.Configuration, java.net.URI[], org.apache.hadoop.fs.Path[], long[], boolean[], boolean) @bci=150, line=328 (Interpreted frame)
>  - org.apache.hadoop.mapred.JobLocalizer.downloadPrivateCache(org.apache.hadoop.conf.Configuration) @bci=18, line=346 (Interpreted frame)
>  - org.apache.hadoop.mapred.JobLocalizer.localizeJobFiles(org.apache.hadoop.mapreduce.JobID, org.apache.hadoop.mapred.JobConf, org.apache.hadoop.fs.Path, org.apache.hadoop.fs.Path, org.apache.hadoop.mapred.TaskUmbilicalProtocol) @bci=51, line=391 (Interpreted frame)
>  - org.apache.hadoop.mapred.JobLocalizer.localizeJobFiles(org.apache.hadoop.mapreduce.JobID, org.apache.hadoop.mapred.JobConf, org.apache.hadoop.fs.Path, org.apache.hadoop.mapred.TaskUmbilicalProtocol) @bci=21, line=370 (Interpreted frame)
>  - org.apache.hadoop.mapred.DefaultTaskController.initializeJob(java.lang.String, java.lang.String, org.apache.hadoop.fs.Path, org.apache.hadoop.fs.Path, org.apache.hadoop.mapred.TaskUmbilicalProtocol, java.net.InetSocketAddress) @bci=115, line=231 (Interpreted frame)
>  - org.apache.hadoop.mapred.TaskTracker$4.run() @bci=163, line=1472 (Interpreted frame)
>  - java.security.AccessController.doPrivileged(java.security.PrivilegedExceptionAction, java.security.AccessControlContext) @bci=0 (Interpreted frame)
>  - javax.security.auth.Subject.doAs(javax.security.auth.Subject, java.security.PrivilegedExceptionAction) @bci=42, line=415 (Interpreted frame)
>  - org.apache.hadoop.security.UserGroupInformation.doAs(java.security.PrivilegedExceptionAction) @bci=14, line=1548 (Interpreted frame)
>  - org.apache.hadoop.mapred.TaskTracker.initializeJob(org.apache.hadoop.mapred.Task, org.apache.hadoop.mapred.TaskTracker$RunningJob, java.net.InetSocketAddress) @bci=215, line=1447 (Interpreted frame)
>  - org.apache.hadoop.mapred.TaskTracker.localizeJob(org.apache.hadoop.mapred.TaskTracker$TaskInProgress) @bci=96, line=1362 (Interpreted frame)
>  - org.apache.hadoop.mapred.TaskTracker.startNewTask(org.apache.hadoop.mapred.TaskTracker$TaskInProgress) @bci=2, line=2790 (Interpreted frame)
>  - org.apache.hadoop.mapred.TaskTracker$TaskLauncher.run() @bci=491, line=2754 (Interpreted frame)
>
> There aren't any tasks running on
> that node when the thread-dump was taken, so I'm pretty sure it's stuck
> here. However, there's a socket timeout of 10 seconds, so I don't
> understand why it didn't time out.
>
> On Tue, Jun 16, 2015 at 7:51 AM, Matt K <[email protected]> wrote:
>
>> Hi Nat, I should've mentioned this before. We're running MRv1.
>>
>> On Tue, Jun 16, 2015 at 2:24 AM, nataraj jonnalagadda <[email protected]> wrote:
>>
>>> Hey Matt,
>>>
>>> It's possibly due to your YARN config: YARN/Mapred ACLs, the YARN
>>> scheduler config, or Cgroups (in case enabled) not set up correctly.
>>> We could dig in more if we have the yarn-site.xml and scheduler conf
>>> files.
>>>
>>> Thanks,
>>> Nat.
>>>
>>> On Mon, Jun 15, 2015 at 10:39 PM, Matt K <[email protected]> wrote:
>>>
>>>> I see there are 2 threads - one that kicks off the mappers, and
>>>> another that kicks off the reducers. The one that kicks off the
>>>> mappers got stuck. It's not yet clear to me where it got stuck
>>>> exactly.
>>>>
>>>> On Tue, Jun 16, 2015 at 1:11 AM, Matt K <[email protected]> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I'm dealing with a production issue, any help would be appreciated.
>>>>> I am seeing very strange behavior in the TaskTrackers. After they
>>>>> pick up a task, it never comes out of the UNASSIGNED state, and the
>>>>> task just gets killed 10 minutes later.
>>>>>
>>>>> 2015-06-16 02:42:21,114 INFO org.apache.hadoop.mapred.TaskTracker: LaunchTaskAction (registerTask): attempt_201506152116_0046_m_000286_0 task's state:UNASSIGNED
>>>>> 2015-06-16 02:52:21,805 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201506152116_0046_m_000286_0: Task attempt_201506152116_0046_m_000286_0 failed to report status for 600 seconds. Killing!
>>>>> Normally, I would see the following in the logs:
>>>>>
>>>>> 2015-06-16 04:30:32,328 INFO org.apache.hadoop.mapred.TaskTracker: Trying to launch : attempt_201506152116_0062_r_000004_0 which needs 1 slots
>>>>>
>>>>> However, it doesn't get this far for these particular tasks. I am
>>>>> perusing the source code here, and this doesn't seem to be possible:
>>>>>
>>>>> http://grepcode.com/file/repository.cloudera.com/content/repositories/releases/com.cloudera.hadoop/hadoop-core/0.20.2-737/org/apache/hadoop/mapred/TaskTracker.java#TaskTracker.TaskLauncher.0tasksToLaunch
>>>>>
>>>>> The code does something like this:
>>>>>
>>>>> public void addToTaskQueue(LaunchTaskAction action) {
>>>>>   synchronized (tasksToLaunch) {
>>>>>     TaskInProgress tip = registerTask(action, this);
>>>>>     tasksToLaunch.add(tip);
>>>>>     tasksToLaunch.notifyAll();
>>>>>   }
>>>>> }
>>>>>
>>>>> The following should pick it up:
>>>>>
>>>>> public void run() {
>>>>>   while (!Thread.interrupted()) {
>>>>>     try {
>>>>>       TaskInProgress tip;
>>>>>       Task task;
>>>>>       synchronized (tasksToLaunch) {
>>>>>         while (tasksToLaunch.isEmpty()) {
>>>>>           tasksToLaunch.wait();
>>>>>         }
>>>>>         // get the TIP
>>>>>         tip = tasksToLaunch.remove(0);
>>>>>         task = tip.getTask();
>>>>>         LOG.info("Trying to launch : " + tip.getTask().getTaskID() +
>>>>>                  " which needs " + task.getNumSlotsRequired() + " slots");
>>>>>       }
>>>>>
>>>>> What's even stranger is that this is happening for Map tasks only.
>>>>> Reduce tasks are fine.
>>>>>
>>>>> This is only happening on a handful of the nodes, but enough to
>>>>> either slow down jobs or cause them to fail.
>>>>>
>>>>> We're running Hadoop 2.3.0-cdh5.0.2
>>>>>
>>>>> Thanks,
>>>>>
>>>>> -Matt
>>>>
>>>> --
>>>> www.calcmachine.com - easy online calculator.
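[Editor's note] The failure mode the thread converges on is worth making concrete: TaskLauncher is a single consumer thread draining tasksToLaunch via wait()/notifyAll(), so one launch that blocks indefinitely (here, inside the distributed-cache download) starves every task queued behind it, which is exactly the UNASSIGNED-then-killed-after-600-seconds pattern in the logs. A self-contained sketch of that dynamic, with simplified names, Strings standing in for TaskInProgress, and a latch standing in for the stuck HDFS copy (none of this is actual Hadoop code):

```java
import java.util.LinkedList;
import java.util.List;
import java.util.concurrent.CountDownLatch;

public class StuckLauncher {
    static final List<String> tasksToLaunch = new LinkedList<>();
    static final CountDownLatch pickedUp = new CountDownLatch(1);
    static final CountDownLatch stuckForever = new CountDownLatch(1); // never counted down

    // Producer side: mirrors addToTaskQueue() - this always succeeds,
    // which is why the tasks are logged as registered (UNASSIGNED).
    static void addToTaskQueue(String task) {
        synchronized (tasksToLaunch) {
            tasksToLaunch.add(task);
            tasksToLaunch.notifyAll();
        }
    }

    // Starts one consumer that takes a single task and then blocks,
    // simulating TaskLauncher stuck in job localization; returns how
    // many later tasks are still sitting in the queue.
    static int queuedAfterConsumerStuck() throws Exception {
        Thread launcher = new Thread(() -> {
            try {
                synchronized (tasksToLaunch) {
                    while (tasksToLaunch.isEmpty()) tasksToLaunch.wait();
                    tasksToLaunch.remove(0);           // "Trying to launch" never logs again
                }
                pickedUp.countDown();
                stuckForever.await();                  // stands in for the stuck cache download
            } catch (InterruptedException e) { /* exit */ }
        });
        launcher.setDaemon(true);
        launcher.start();

        addToTaskQueue("attempt_m_000001");            // gets picked up, then hangs
        pickedUp.await();
        addToTaskQueue("attempt_m_000002");            // stays UNASSIGNED
        addToTaskQueue("attempt_m_000003");            // stays UNASSIGNED
        Thread.sleep(200);                             // a window the launcher will never use
        synchronized (tasksToLaunch) { return tasksToLaunch.size(); }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(queuedAfterConsumerStuck() + " tasks stranded behind the stuck launch");
    }
}
```

This also fits the observation that only map tasks were affected: mappers and reducers each have their own TaskLauncher thread, so a blocked map-side launcher leaves the reduce-side one healthy.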
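[Editor's note] On the puzzle of the 10-second socket timeout never firing: a per-read timeout of this kind (SO_TIMEOUT on a java.net socket, and similarly the timeout Hadoop feeds into SocketIOWithTimeout) bounds each individual read, not the transfer as a whole, so a datanode that keeps trickling packets can hold a copy open far past 10 seconds; and if the effective timeout is configured as 0, the select blocks forever, which would match a thread parked in epollWait for 10 minutes. The sketch below uses plain java.net sockets with made-up port and delay values, not Hadoop code, to show a transfer whose total duration is several times its read timeout completing with no SocketTimeoutException:

```java
import java.io.InputStream;
import java.io.OutputStream;
import java.net.ServerSocket;
import java.net.Socket;

public class TrickleRead {
    // Reads from a socket with a 500 ms per-read timeout while the peer
    // sends one byte every 100 ms; returns the number of bytes read.
    static int readWithTrickle() throws Exception {
        try (ServerSocket server = new ServerSocket(0)) {
            Thread writer = new Thread(() -> {
                try (Socket peer = server.accept();
                     OutputStream out = peer.getOutputStream()) {
                    for (int i = 0; i < 10; i++) {
                        out.write(i);
                        out.flush();
                        Thread.sleep(100);        // each byte arrives well inside the timeout
                    }
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            });
            writer.start();
            try (Socket client = new Socket("localhost", server.getLocalPort())) {
                client.setSoTimeout(500);         // a per-read timeout, not a total deadline
                InputStream in = client.getInputStream();
                int n = 0;
                while (in.read() != -1) {
                    n++;                          // the 500 ms clock restarts on every byte
                }
                writer.join();
                return n;                         // ~1 s elapsed, no SocketTimeoutException
            }
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(readWithTrickle() + " bytes read without a timeout");
    }
}
```

Either explanation (a trickling datanode, or an effective timeout of 0) would be consistent with the stack trace; checking the dfs client read timeout actually in effect on the affected nodes would distinguish them.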
