[ 
https://issues.apache.org/jira/browse/YARN-5445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15398141#comment-15398141
 ] 

Chackaravarthy commented on YARN-5445:
--------------------------------------

Environment : HDP-2.4 (hadoop-2.7.1)

The use case is as follows:

The cluster has 1200 nodes and runs around 15k jobs per day on average. 
Keeping application logs in the same cluster puts too much pressure on the NN 
because of the small-files problem: around 5 million files are created per day 
(normal load), leading to 10 million FS objects just to keep a single day's 
logs. The requirement is to retain at least one week of logs, hence the 
decision to move them to a different cluster or a different namespace (NN 
federation).

In these cases, we would like minimal impact on jobs running in the cluster 
even if the other cluster is completely down (though it is configured with HA). 
Currently, however, the client makes 15 attempts ({{dfs.client.failover.max.attempts}}) 
to connect to the NN before giving up, adding a latency of 2 to 2.5 minutes to 
each container launch (per node manager) and thereby hurting overall job 
completion time.
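The 2 to 2.5 minute figure is consistent with the default client failover 
backoff settings ({{dfs.client.failover.sleep.base.millis}} = 500, 
{{dfs.client.failover.sleep.max.millis}} = 15000). A rough worst-case sketch, 
ignoring the randomization and connect timeouts the real client adds:

```python
# Rough worst-case sleep when the remote NN pair is unreachable, assuming
# an exponential backoff of min(max_sleep, base * 2**i) between attempts.
# Real Hadoop sleeps are randomized, so this is only an estimate.
BASE_MS = 500      # dfs.client.failover.sleep.base.millis (default)
MAX_MS = 15_000    # dfs.client.failover.sleep.max.millis (default)
ATTEMPTS = 15      # dfs.client.failover.max.attempts (default)

total_ms = sum(min(MAX_MS, BASE_MS * 2**i) for i in range(ATTEMPTS))
print(total_ms / 60_000)  # roughly 2.8 minutes of backoff sleep alone
```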

(Aware of YARN-2942, which is still in progress, and MAPREDUCE-6415, which is in 2.8.0.)

Can we have a new config that is passed as {{dfs.client.failover.max.attempts}} 
when creating the FileSystem instance in LogAggregationService, so that it can 
be configured to fail fast? Or are there any existing configs that already 
handle this case?
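For reference, these are the client-side failover properties that govern the 
retry loop. Lowering them in the NodeManager's client configuration would fail 
fast today, but note that this affects every HDFS client reading that 
configuration, not just log aggregation (hence the ask for a dedicated config):

{noformat}
<!-- hdfs-site.xml (client side, e.g. on each NodeManager) -->
<property>
  <name>dfs.client.failover.max.attempts</name>
  <value>3</value> <!-- default: 15 -->
</property>
<property>
  <name>dfs.client.failover.sleep.base.millis</name>
  <value>500</value> <!-- default -->
</property>
<property>
  <name>dfs.client.failover.sleep.max.millis</name>
  <value>15000</value> <!-- default -->
</property>
{noformat}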

> Log aggregation configured to different namenode can fail fast
> --------------------------------------------------------------
>
>                 Key: YARN-5445
>                 URL: https://issues.apache.org/jira/browse/YARN-5445
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>            Reporter: Chackaravarthy
>
> Log aggregation is enabled and configured to write application logs to a 
> different cluster or a different namespace (NN federation). In these cases, 
> it would be good to have some configs on attempts or retries so the client 
> can fail fast when the other cluster is completely down.
> Currently it takes the default {{dfs.client.failover.max.attempts}} of 15, 
> adding a latency of 2 to 2.5 minutes to each container launch (per node 
> manager).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
