[ 
https://issues.apache.org/jira/browse/YARN-3978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14641768#comment-14641768
 ] 

Eric Payne commented on YARN-3978:
----------------------------------

Use Case: A user launches an application on a secured cluster. It runs for 
some time and then fails within the AM (perhaps due to an OOM in the AM), 
leaving no history in the job history server. The user doesn't notice that the 
job has failed until after the application has dropped out of the RM's 
application store. At this point, if no information was stored in the Generic 
Application History Service, the user must rely on a privileged system 
administrator to access the AM logs for them.

It is desirable to activate the Generic Application History service within the 
timeline server so that users can access their application's information even 
after the RM has forgotten about their application. This app information should 
be kept in the GAHS for 1 week, as is done, for example, for logs in the job 
history server.

One way that the Generic AHS stores metadata about an application is in an 
Entity levelDB. This includes information about each container for each 
application. Based on my analysis, the levelDB size grows by at least 2500 
bytes per container (uncompressed). This is a conservative estimate as the size 
could be much bigger based on the amount of diagnostic information associated 
with failed containers.

On very large and busy clusters, the amount needed on the timeline server's 
local disk would be between 0.6 TB and 1.0 TB (uncompressed). Even if we assume 
90% compression, that's still between 60 GB and 100 GB that will be needed on 
the local disk. In addition to this, between 80 GB and 143 GB of metadata 
(uncompressed) will need to be cleaned up every day from the levelDB, which 
will delay other processing in the timeline server.
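
The arithmetic behind these figures can be sketched roughly as follows. This 
is only a back-of-the-envelope model, not code from the timeline server: the 
function name weekly_footprint is made up for illustration, the container 
counts are back-derived from the totals above (they are not stated in this 
comment), and 1 TB is taken as 10^12 bytes.

```python
# Back-of-the-envelope model of the levelDB footprint described above.
# Only the 2500 bytes/container, 90% compression and 1-week retention
# figures come from the comment; everything else is an assumption.

BYTES_PER_CONTAINER = 2500   # conservative lower bound from the comment
RETENTION_DAYS = 7           # "kept in the GAHS for 1 week"
TB = 10**12                  # assumption: decimal terabytes


def weekly_footprint(containers_per_week,
                     bytes_per_container=BYTES_PER_CONTAINER,
                     compression_percent=90,
                     retention_days=RETENTION_DAYS):
    """Return raw, compressed and per-day cleanup sizes in bytes."""
    raw = containers_per_week * bytes_per_container
    return {
        "raw_bytes": raw,
        # Integer math keeps the result exact for whole percentages.
        "compressed_bytes": raw * (100 - compression_percent) // 100,
        # With a rolling retention window, roughly one day's worth of
        # data ages out (and must be deleted) every day.
        "daily_cleanup_bytes": raw // retention_days,
    }


if __name__ == "__main__":
    # Hypothetical container count that reproduces the 1.0 TB upper bound.
    upper = weekly_footprint(400_000_000)
    for name, value in upper.items():
        print(f"{name}: {value / 10**9:.1f} GB")
```

Under these assumptions, the 1.0 TB upper bound corresponds to roughly 400 
million containers per week, and one seventh of it is about 143 GB of daily 
cleanup, which lines up with the upper figures quoted above.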

The proposal of this JIRA is to add a configuration property that controls 
whether the GAHS stores container information in the levelDB. With container 
information turned off, I estimate that the local disk usage would be about 
5700 bytes per job, or about 10 GB (uncompressed) per week. Additionally, the 
daily cleanup load would only be about 1.5 GB per day.
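
As a rough cross-check of the job-level numbers, the sketch below works 
backwards from the 10 GB per week figure. The helper names are invented for 
illustration, the job count is implied rather than measured, and 1 GB is 
assumed to be 10^9 bytes.

```python
# Cross-check of the per-job estimate when container info is not stored.
# Only 5700 bytes/job, 10 GB/week and the 1-week retention come from the
# comment; the implied job count is derived, not a measured value.

BYTES_PER_JOB = 5700
RETENTION_DAYS = 7
GB = 10**9  # assumption: decimal gigabytes


def jobs_implied(weekly_bytes, bytes_per_job=BYTES_PER_JOB):
    """Number of jobs per week that would produce weekly_bytes of metadata."""
    return weekly_bytes // bytes_per_job


def daily_cleanup(weekly_bytes, retention_days=RETENTION_DAYS):
    """Bytes that age out of a rolling retention window each day."""
    return weekly_bytes // retention_days


if __name__ == "__main__":
    weekly = 10 * GB
    print("implied jobs per week:", jobs_implied(weekly))
    print(f"daily cleanup: {daily_cleanup(weekly) / GB:.2f} GB")
```

With these inputs the model implies on the order of 1.75 million jobs per 
week and a little over 1.4 GB of cleanup per day, which is in line with the 
"about 1.5 GB" estimate above.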


> Configurably turn off the saving of container info in Generic AHS
> -----------------------------------------------------------------
>
>                 Key: YARN-3978
>                 URL: https://issues.apache.org/jira/browse/YARN-3978
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: timelineserver, yarn
>            Reporter: Eric Payne
>            Assignee: Eric Payne
>
> Depending on how each application's metadata is stored, one week's worth of 
> data stored in the Generic Application History Server's database can grow to 
> be almost a terabyte of local disk space. In order to alleviate this, I 
> suggest that there is a need for a configuration option to turn off saving of 
> non-AM container metadata in the GAHS data store.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
