[ 
https://issues.apache.org/jira/browse/YARN-11322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17611092#comment-17611092
 ] 

fanshilun edited comment on YARN-11322 at 9/29/22 10:56 PM:
------------------------------------------------------------

Hello, thank you very much for your feedback. In my view, this happens
because the cluster you configured has fewer sub-clusters than the number of
retries. Each failed attempt causes a sub-cluster to be blacklisted, and
eventually no sub-cluster is available. The error information returned by the
Router to the client is accurate, because in that situation the cluster
really does not have an available sub-cluster.

It is actually very difficult to report a meaningful exception to the client.

1. During retries, the first few attempts may fail while the last one
succeeds. In this case, there is no need to return an exception.

2. An exception should be returned only when all retries fail, and if they
all fail, it does mean that the cluster is unavailable. Returning "no
sub-clusters available" at this point is simple and accurate.
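
The retry behaviour described above can be sketched roughly as follows. This
is only a minimal illustration, not the Router code: the class RetrySketch,
the method submit, the exception type and both messages are hypothetical
names, and the real FederationClientInterceptor selects sub-clusters through
its policy rather than by list order.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

// Hypothetical sketch: none of these names come from the Hadoop code base.
class RetrySketch {

    /** Raised when no sub-cluster is left to try, or retries run out. */
    static class NoActiveSubClusterException extends RuntimeException {
        NoActiveSubClusterException(String message) {
            super(message);
        }
    }

    /**
     * Tries one sub-cluster per attempt. A sub-cluster that rejects the
     * submission is blacklisted and skipped on later attempts.
     *
     * @param subClusters candidate sub-cluster ids
     * @param maxRetries  retries allowed after the first attempt
     * @param trySubmit   returns true if the sub-cluster accepts the app
     * @return the id of the sub-cluster that accepted the submission
     */
    static String submit(List<String> subClusters, int maxRetries,
                         Predicate<String> trySubmit) {
        Set<String> blacklist = new HashSet<>();
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            // Pick the first sub-cluster that has not failed yet.
            String target = null;
            for (String id : subClusters) {
                if (!blacklist.contains(id)) {
                    target = id;
                    break;
                }
            }
            if (target == null) {
                // Fewer sub-clusters than attempts: everything is blacklisted.
                throw new NoActiveSubClusterException(
                    "no active sub-cluster available");
            }
            if (trySubmit.test(target)) {
                return target; // earlier failures are not reported
            }
            blacklist.add(target);
        }
        throw new NoActiveSubClusterException("retries exhausted");
    }
}
```

With two sub-clusters and three retries, a submission that every RM rejects
blacklists both sub-clusters after two attempts, so the third attempt finds
no candidate left and the client sees only the generic message.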

For example, suppose we have 4 sub-clusters and every submission fails.
Should we return all of the error messages? That would be too many, but if we
return the error messages of only a few sub-clusters, the client will also be
misled. When all sub-clusters are unavailable, every app would receive a
large number of stack traces, which does not seem like a good approach.

In actual cluster operation and maintenance, we would prefer that clients
see less of this information.

Thanks again for your feedback, but I don't think this needs to be changed.

For some errors we have to look at the RM log to resolve them anyway, and
passing a large error stack to the client is not good for performance.


was (Author: slfan1989):
Hello, thank you very much for your feedback. In my view, this happens
because the cluster you configured has fewer sub-clusters than the number of
retries. Each failed attempt causes a sub-cluster to be blacklisted, and
eventually no sub-cluster is available. The error information returned by the
Router to the client is accurate, because in that situation the cluster
really does not have an available sub-cluster.

It is actually very difficult to report a meaningful exception to the client.

1. During retries, the first few attempts may fail while the last one
succeeds. In this case, there is no need to return an exception.

2. An exception should be returned only when all retries fail, and if they
all fail, it does mean that the cluster is unavailable. Returning "no
sub-clusters available" at this point is simple and accurate.

For example, suppose we have 4 sub-clusters and every submission fails.
Should we return all of the error messages? That would be too many, but if we
return the error messages of only a few sub-clusters, the client will also be
misled.

In actual cluster operation and maintenance, we would prefer that customers
see less of this information.

Thanks again for your feedback, but I don't think this needs to be changed.

For some errors we have to look at the RM log to resolve them anyway, and
passing a large error stack to the client is not good for performance.

> Improve router FederationClientInterceptor#submitApplication exception
> ----------------------------------------------------------------------
>
>                 Key: YARN-11322
>                 URL: https://issues.apache.org/jira/browse/YARN-11322
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: router
>            Reporter: FanXiaoyu
>            Priority: Major
>         Attachments: exception in router log.png, exception received by 
> client.png, image-2022-09-29-22-38-48-655.png
>
>
> If an application submission fails because of the user's own configuration
> (e.g. an invalid resource request), the client will try each SubCluster
> until an exception is thrown saying that no active SubCluster is available.
> This message misleads the user into believing that something is wrong with
> the sub-clusters.
> This issue aims to make the exception more understandable by passing the RM
> exception message to the router.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

