[ 
https://issues.apache.org/jira/browse/YARN-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15240244#comment-15240244
 ] 

Daniel Templeton commented on YARN-4882:
----------------------------------------

I'm going to assume the silence means I captured it pretty well.

bq. We don't want to flood the logs with an intractable number of log messages 
during recovery

This one is clearly solved by either approach: adding an extra log file or 
just dialing down the log level.

bq. We need to be able to identify bad applications in the case that recovery 
fails

As long as we don't dial down the log level for recovery failures, both 
solutions seem to address this objective as well.  On the point that sometimes 
knowing what didn't fail is useful in a failed recovery, let me ask a question. 
 If the recovery fails, the RM fails to start, right?  If the RM fails to 
start, it's possible to change the log level before starting it again, if 
getting a list of the successful recoveries is helpful, right?  And since that 
recovery will also fail, it's possible to reset the log level before the final 
restart after resolving the issue.  Is there a scenario where the RM starts 
successfully but the list of recovered apps is still useful?

I dislike the idea of adding an extra log file and a property to enable it to 
the admin's plate for the sole purpose of logging successful recoveries, when 
that information is not commonly useful, and when the same information could be 
retrieved through a well known existing mechanism (changing the log level).

I propose that we streamline the log messages to be useful and succinct.  The 
RM should detail the recovery statistics at the info level, recovery failures 
at the warn or error level, and recovery successes at the debug level.  The log 
messages should also be reworked to include as much information as possible to 
assist in debugging failures while being less chatty.
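
To make the proposed level split concrete, here is a minimal sketch. It is not 
the RM's actual code: it assumes java.util.logging as a stand-in for the RM's 
logging facade (FINE for debug, WARNING for warn), and the class, method, and 
field names, as well as the second application id, are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.logging.Level;
import java.util.logging.Logger;

// Sketch only: java.util.logging stands in for the RM's real logging facade.
// FINE ~ debug, WARNING ~ warn, INFO ~ info. All names here are hypothetical.
class RecoveryLogSketch {
    private static final Logger LOG =
        Logger.getLogger(RecoveryLogSketch.class.getName());

    /** Hypothetical outcome of recovering one app; error is null on success. */
    static final class Outcome {
        final String appId;
        final String finalState;
        final Exception error;

        Outcome(String appId, String finalState, Exception error) {
            this.appId = appId;
            this.finalState = finalState;
            this.error = error;
        }
    }

    /** Logs each outcome at the proposed level; returns {recovered, failed}. */
    static int[] logRecovery(List<Outcome> outcomes) {
        int recovered = 0;
        int failed = 0;
        for (Outcome o : outcomes) {
            if (o.error == null) {
                recovered++;
                // Successes: one succinct line, visible only at debug level.
                if (LOG.isLoggable(Level.FINE)) {
                    LOG.fine("Recovered " + o.appId
                        + " with final state " + o.finalState);
                }
            } else {
                failed++;
                // Failures: always visible, with the cause attached.
                LOG.log(Level.WARNING, "Failed to recover " + o.appId
                    + " (final state " + o.finalState + ")", o.error);
            }
        }
        // Statistics: a single info-level summary per recovery pass.
        LOG.info("Recovery complete: " + recovered + " recovered, "
            + failed + " failed, " + outcomes.size() + " total");
        return new int[] {recovered, failed};
    }

    public static void main(String[] args) {
        logRecovery(Arrays.asList(
            new Outcome("application_1456298208485_21507", "FINISHED", null),
            // Hypothetical failing application, for illustration only.
            new Outcome("application_1456298208485_21508", "FAILED",
                new IllegalStateException("hypothetical bad state"))));
    }
}
```

With the default level, only the failure and the summary line are emitted, so 
a pass over 10K completed applications would add one line rather than 60K. 
Raising the level to debug brings back the per-application lines on demand.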

Any other thoughts?

> Change the log level to DEBUG for recovering completed applications
> -------------------------------------------------------------------
>
>                 Key: YARN-4882
>                 URL: https://issues.apache.org/jira/browse/YARN-4882
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>            Reporter: Rohith Sharma K S
>            Assignee: Daniel Templeton
>
> I think there is no need to log the recovery of completed applications at 
> INFO; it can be made DEBUG.  The problem seen on a large cluster is that if 
> any issue happens during RM startup and the RM keeps switching, the RM logs 
> are filled mostly with application recovery messages. 
> Six lines are logged for each application, as shown in the logs below. 
> Given that the RM default value for max-completed applications is 10K, each 
> switch adds 10K*6=60K lines, which I feel is not useful.
> {noformat}
> 2016-03-01 10:20:59,077 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager: Default priority 
> level is set to application:application_1456298208485_21507
> 2016-03-01 10:20:59,094 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Recovering 
> app: application_1456298208485_21507 with 1 attempts and final state = 
> FINISHED
> 2016-03-01 10:20:59,100 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
> Recovering attempt: appattempt_1456298208485_21507_000001 with final state: 
> FINISHED
> 2016-03-01 10:20:59,107 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: 
> appattempt_1456298208485_21507_000001 State change from NEW to FINISHED
> 2016-03-01 10:20:59,111 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: 
> application_1456298208485_21507 State change from NEW to FINISHED
> 2016-03-01 10:20:59,112 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=rohith   
> OPERATION=Application Finished - Succeeded      TARGET=RMAppManager     
> RESULT=SUCCESS  APPID=application_1456298208485_21507
> {noformat}
> The main problem is that important information from before the RM became 
> unstable goes missing from the logs. Even if the log rollover count is 50 or 
> 100, all of those logs are rolled out within a short period, and the logs 
> that remain contain only RM switching information, and mostly application 
> recovery at that. 
> I suggest that at least completed application recovery be logged at DEBUG.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
