[ 
https://issues.apache.org/jira/browse/YARN-11322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17611092#comment-17611092
 ] 

fanshilun edited comment on YARN-11322 at 9/29/22 3:32 PM:
-----------------------------------------------------------

Hello, thank you very much for your feedback. From my point of view, I think 
this happens because the number of sub-clusters you configured is smaller than 
the number of retries, so every sub-cluster ends up blacklisted and no 
sub-cluster is available.

It is actually very difficult to report an exception to the client.

1. During the retry process, the first few retries may fail while the last 
retry succeeds. In this case, there is no need to return an exception.

2. An exception should be returned only when all retries fail, and if they all 
fail, it does mean that the cluster is unavailable. So the error message 
received by the client is reasonable.

For example, if we have 4 sub-clusters and the submission fails on all of them, 
should we return all 4 error messages? I don't think that is reasonable.
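
To make the behaviour described above concrete, here is a minimal sketch in 
Java. It is not the actual FederationClientInterceptor code: the class name, 
method name, parameters and exception text are all hypothetical and chosen only 
for illustration. It assumes each failed sub-cluster is blacklisted and that an 
exception is raised only once no sub-cluster is left or the retries run out.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

// Hypothetical sketch; none of these names come from the Hadoop code base.
class SubmitRetrySketch {

    /**
     * Tries sub-clusters in order, blacklisting each one that rejects the
     * submission. Returns the id of the sub-cluster that accepted it.
     * Failures of earlier attempts are not reported if a later one succeeds.
     */
    static String submitWithRetries(List<String> subClusters,
                                    int numRetries,
                                    Predicate<String> trySubmit) {
        Set<String> blacklist = new HashSet<>();
        for (int attempt = 0; attempt <= numRetries; attempt++) {
            // Pick the first sub-cluster that has not been blacklisted yet.
            String target = null;
            for (String sc : subClusters) {
                if (!blacklist.contains(sc)) {
                    target = sc;
                    break;
                }
            }
            if (target == null) {
                // Fewer sub-clusters than retries: all of them are blacklisted.
                throw new IllegalStateException(
                    "No active SubCluster available after "
                        + blacklist.size() + " failed attempt(s)");
            }
            if (trySubmit.test(target)) {
                return target;
            }
            blacklist.add(target);
        }
        throw new IllegalStateException(
            "Submission failed after " + (numRetries + 1) + " attempt(s)");
    }

    public static void main(String[] args) {
        List<String> clusters = Arrays.asList("SC-1", "SC-2");

        // SC-1 fails, SC-2 succeeds: the caller never sees the first failure.
        System.out.println(
            submitWithRetries(clusters, 3, sc -> sc.equals("SC-2")));

        // Every sub-cluster rejects the request and there are more retries
        // than sub-clusters, so the generic "no active SubCluster" error is
        // the only thing the caller sees.
        try {
            submitWithRetries(clusters, 3, sc -> false);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

In this sketch the reason for each individual failure is discarded, which is 
the trade-off under discussion: the caller gets one short message instead of 
one error per sub-cluster.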

In actual cluster operation and maintenance, we want users to be exposed to as 
little information as possible. If a lot of information is returned, they will 
not know how to deal with it.

Thanks again for your feedback, I don't think this needs to be changed.

 


was (Author: slfan1989):
Hello, thank you very much for your feedback. From my point of view, I think 
this happens because the number of sub-clusters you configured is smaller than 
the number of retries, so every sub-cluster ends up blacklisted and no 
sub-cluster is available.

It is actually very difficult to report an exception to the client.

1. During the retry process, the first few retries may fail while the last 
retry succeeds. In this case, there is no need to return an exception.

2. An exception should be returned only when all retries fail, and if they all 
fail, it does mean that the cluster is unavailable. So the error message 
received by the client is reasonable.

For example, if we have 4 sub-clusters and the submission fails on all of them, 
should we return all 4 error messages? I don't think that is reasonable.

 

 

> Improve router FederationClientInterceptor#submitApplication exception
> ----------------------------------------------------------------------
>
>                 Key: YARN-11322
>                 URL: https://issues.apache.org/jira/browse/YARN-11322
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: router
>            Reporter: FanXiaoyu
>            Priority: Major
>         Attachments: exception in router log.png, exception received by 
> client.png, image-2022-09-29-22-38-48-655.png
>
>
> If an application submission fails because of the user's configuration (e.g. 
> an invalid resource request), the client will try each SubCluster until an 
> exception is thrown, which states that no active SubCluster is available. 
> This message misleads the user into believing that something is wrong with 
> the sub-clusters.
> This issue aims to make the exception more understandable by passing the RM 
> exception message to the router.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

