Rohith Sharma K S created YARN-7147:
---------------------------------------
Summary: ATS1.5 crash due to OOM
Key: YARN-7147
URL: https://issues.apache.org/jira/browse/YARN-7147
Project: Hadoop YARN
Issue Type: Bug
Components: timelineserver
Reporter: Rohith Sharma K S
Assignee: Rohith Sharma K S
In a production cluster, the ATS server goes down with OOM even though
_app-cache-size_ is set to a minimal value, i.e. less than 5. The
_entity-group-fs-store.cache-store-class_ is configured with
MemoryTimelineStore, which is the default. The heap size configured for the
ATS daemon is 8GB.
This happens because ATS parses the entity log file per domain and caches it.
If the domain has a lot of entity information, the in-memory cache store loads
all of it, which causes the OOM. After a restart, ATS caches the same domain
again and goes OOM again.
There are two possible ways to handle it:
# Put a threshold on the number of entities loaded into the in-memory cache.
This can still lead to OOM if the data size is huge.
# Put a threshold based on the data size in the store.
We hit the first case, where the number of entities is very large.
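
To make the two options concrete, here is a minimal sketch of a loader that
stops caching once either an entity-count limit or an estimated-size limit is
reached. It is an illustration only, not the actual fix: BoundedEntityLoader,
LoadResult, maxEntities, maxBytes and estimateSize are hypothetical names that
do not exist in the timeline server code, and entities are modelled as plain
serialized strings.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Hypothetical sketch; not part of the Hadoop timeline server API. */
final class BoundedEntityLoader {

    /** Outcome of a bounded load: what was kept and whether loading stopped early. */
    static final class LoadResult {
        final List<String> loaded;
        final boolean truncated;
        final long bytesUsed;

        LoadResult(List<String> loaded, boolean truncated, long bytesUsed) {
            this.loaded = loaded;
            this.truncated = truncated;
            this.bytesUsed = bytesUsed;
        }
    }

    /**
     * Reads entities until the source is exhausted or a limit would be exceeded.
     * maxEntities corresponds to option 1 (count threshold) and maxBytes to
     * option 2 (data-size threshold); both are assumed, illustrative limits.
     */
    static LoadResult load(Iterator<String> entities, int maxEntities, long maxBytes) {
        List<String> loaded = new ArrayList<>();
        long bytesUsed = 0;
        while (entities.hasNext()) {
            String entity = entities.next();
            long size = estimateSize(entity);
            // Stop before adding an entity that would break either limit.
            if (loaded.size() >= maxEntities || bytesUsed + size > maxBytes) {
                return new LoadResult(loaded, true, bytesUsed);
            }
            loaded.add(entity);
            bytesUsed += size;
        }
        return new LoadResult(loaded, false, bytesUsed);
    }

    /** Rough estimate assuming 2 bytes per character; real heap cost would be higher. */
    static long estimateSize(String serializedEntity) {
        return 2L * serializedEntity.length();
    }
}
```

A count limit alone caps the number of cached objects but not their size,
which is why the first option can still run out of heap when individual
entities are large. Checking both limits in the same loop covers that case,
at the cost of serving an incomplete view of the domain when the load is
truncated.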