Stephan Erb created AURORA-1907:
-----------------------------------

             Summary: Thermos unresponsive on hosts with many active tasks
                 Key: AURORA-1907
                 URL: https://issues.apache.org/jira/browse/AURORA-1907
             Project: Aurora
          Issue Type: Story
          Components: Observer
            Reporter: Stephan Erb


We have noticed that on hosts with lots of active tasks (~100) and many 
terminated tasks (~1500) the Thermos UI is not usable. Thermos spins at 300% 
CPU but does not respond to any HTTP requests.

Dumping {{/threads}} indicates we might be blocked by the hundred or so 
{{TaskResourceMonitor}} threads trying to read values from {{/proc}}:
{code}
# Thread (daemon): TaskResourceMonitor (TaskResourceMonitor[mytask-id] [TID=45241], 140682825963264)
  File: "/usr/lib/python2.7/threading.py", line 525, in __bootstrap
    self.__bootstrap_inner()
  File: "/usr/lib/python2.7/threading.py", line 552, in __bootstrap_inner
    self.run()
  File: "/.pex/install/twitter.common.decorators-0.3.7-py2-none-any.whl.b23f2874a4392741fca582d9e0528c08e0335c68/twitter.common.decorators-0.3.7-py2-none-any.whl/twitter/common/decorators/threads.py", line 115, in identified
    return instancemethod(self, *args, **kwargs)
  File: "/.pex/install/twitter.common.exceptions-0.3.7-py2-none-any.whl.f6376bcca9bfda5eba4396de2676af5dfe36237d/twitter.common.exceptions-0.3.7-py2-none-any.whl/twitter/common/exceptions/__init__.py", line 126, in _excepting_run
    self.__real_run(*args, **kw)
  File: "apache/thermos/monitoring/resource.py", line 204, in run
    collector.sample()
  File: "apache/thermos/monitoring/process_collector_psutil.py", line 70, in sample
    for child in parent.children(recursive=True)
  File: "/.pex/install/psutil-4.3.0-cp27-cp27mu-linux_x86_64.whl.f4f23a781c020a8b8cb5cba2da0161d0db6452b1/psutil-4.3.0-cp27-cp27mu-linux_x86_64.whl/psutil/__init__.py", line 326, in wrapper
    return fun(self, *args, **kwargs)
  File: "/.pex/install/psutil-4.3.0-cp27-cp27mu-linux_x86_64.whl.f4f23a781c020a8b8cb5cba2da0161d0db6452b1/psutil-4.3.0-cp27-cp27mu-linux_x86_64.whl/psutil/__init__.py", line 861, in children
    table[p.ppid()].append(p)
  File: "/.pex/install/psutil-4.3.0-cp27-cp27mu-linux_x86_64.whl.f4f23a781c020a8b8cb5cba2da0161d0db6452b1/psutil-4.3.0-cp27-cp27mu-linux_x86_64.whl/psutil/__init__.py", line 545, in ppid
    return self._proc.ppid()
  File: "/.pex/install/psutil-4.3.0-cp27-cp27mu-linux_x86_64.whl.f4f23a781c020a8b8cb5cba2da0161d0db6452b1/psutil-4.3.0-cp27-cp27mu-linux_x86_64.whl/psutil/_pslinux.py", line 962, in wrapper
    return fun(self, *args, **kwargs)
  File: "/.pex/install/psutil-4.3.0-cp27-cp27mu-linux_x86_64.whl.f4f23a781c020a8b8cb5cba2da0161d0db6452b1/psutil-4.3.0-cp27-cp27mu-linux_x86_64.whl/psutil/_pslinux.py", line 1459, in ppid
    return int(self._parse_stat_file()[2])
  File: "/.pex/install/psutil-4.3.0-cp27-cp27mu-linux_x86_64.whl.f4f23a781c020a8b8cb5cba2da0161d0db6452b1/psutil-4.3.0-cp27-cp27mu-linux_x86_64.whl/psutil/_pslinux.py", line 1001, in _parse_stat_file
    return [name] + fields_after_name
{code}
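
The trace suggests that each {{children(recursive=True)}} call builds its own parent-to-children table by reading the parent PID of every process on the host, so about a hundred monitor threads would repeat that full scan independently. As a rough illustration only, not a proposed patch, the sketch below builds such a table once from a single snapshot and answers several descendant lookups from it. The names {{build_children_table}}, {{descendants}} and the sample snapshot are hypothetical and do not come from Thermos or psutil.

```python
from collections import defaultdict


def build_children_table(pid_ppid_pairs):
    """Map each parent PID to the list of its direct child PIDs."""
    table = defaultdict(list)
    for pid, ppid in pid_ppid_pairs:
        table[ppid].append(pid)
    return table


def descendants(table, root_pid):
    """Return all PIDs below root_pid, walking the table iteratively."""
    found = []
    seen = {root_pid}
    stack = [root_pid]
    while stack:
        current = stack.pop()
        # .get() avoids creating empty entries in the defaultdict.
        for child in table.get(current, ()):
            if child not in seen:
                seen.add(child)
                found.append(child)
                stack.append(child)
    return found


# Hypothetical snapshot of (pid, ppid) pairs, taken once per sampling interval.
snapshot = [(1, 0), (100, 1), (101, 100), (102, 100), (103, 101), (200, 1)]
table = build_children_table(snapshot)

# Two task roots are resolved from the same table without rescanning.
task_a = sorted(descendants(table, 100))
task_b = sorted(descendants(table, 200))
```

With a shared table the per-interval cost would be one scan of the process list plus one cheap walk per task, instead of one full scan per task. Whether this fits the monitor's threading model is an open question.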



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
