Kang Lee created YARN-11396:
-------------------------------
Summary: Used resource of user may be incorrect when flink's job
manager retry
Key: YARN-11396
URL: https://issues.apache.org/jira/browse/YARN-11396
Project: Hadoop YARN
Issue Type: Bug
Components: resourcemanager
Affects Versions: 2.10.1
Reporter: Kang Lee
Attachments: image-2022-12-14-14-37-09-463.png
When a Flink job runs on YARN 2.10.1 with the capacity scheduler, the user's
used resource becomes incorrect after the job manager fails and a new
application attempt starts.
Steps to reproduce:
1. Create a capacity_test queue. The queue resources are as follows:
{code:java}
Queue State: RUNNING
Used Capacity: <memory:4096, vCores:4> (84.7%)
Configured Capacity: <memory:0, vCores:0>
Configured Max Capacity: unlimited
Effective Capacity: <memory:20479, vCores:4> (4.0%)
Effective Max Capacity: <memory:512000, vCores:118> (100.0%)
Absolute Used Capacity: 3.4%
Absolute Configured Capacity: 4.0%
Absolute Configured Max Capacity: 100.0%
Used Resources: <memory:4096, vCores:4>
{code}
2. Submit a Flink job to YARN with a parallelism of 10 and a container resource
of 1 vCore and 1024 MB.
{code:java}
flink run -m yarn-cluster -yjm 1024 -ytm 1024 -parallelism 10 -yqu capacity_test /cloud/service/flink/examples/streaming/WindowJoin.ja
{code}
Because the user's max resource in this queue is 4c, 10g, the job can run only
5 containers. At this point, the user's used resource is as follows:
||User Name||Max Resource||Weight||Used Resource||Max AM Resource||Used AM Resource||Schedulable Apps||Non-Schedulable Apps||
|hadoop|*<memory:20480, vCores:4>*|1.0|<memory:5120, vCores:5>|<memory:10240, vCores:2>|<memory:2048, vCores:2>|2|
3. Kill the job manager process with kill -9. YARN then removes this
application attempt, and the user is removed from UsersManager as well.
In LeafQueue#removeApplicationAttempt, when the user's total number of
applications drops to 0, the user is removed from usersManager.
{code:java}
private void removeApplicationAttempt(
    FiCaSchedulerApp application, String userName) {
  try {
    writeLock.lock();
    //...
    user.finishApplication(wasActive);
    // Once the user has no applications left, the User object is dropped.
    if (user.getTotalApplications() == 0) {
      usersManager.removeUser(application.getUser());
    }
    //...
  } finally {
    writeLock.unlock();
  }
}
{code}
4. A new job manager attempt is started, so the User object for hadoop is
recreated with its used resource initialized to 0. At the same time, because
the Flink job sets
ApplicationSubmissionContextProto#keep_containers_across_application_attempts
to true, the old containers keep running, and their resources are not counted
for the recreated user. As a result, the user's used resource is incorrect, and
the resources actually in use can exceed the max resource the user is allowed
to use, like this:
!image-2022-12-14-14-37-09-463.png|width=1192,height=532!
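The sequence in steps 3 and 4 can be illustrated with a small stand-alone model. This is only a sketch under stated assumptions, not the real scheduler code: the classes UserAccountingSketch, User and Queue are hypothetical stand-ins for LeafQueue, UsersManager and its User object, and the figures (5 containers of 1024 MB) are illustrative. It only shows how dropping the user record at zero applications loses the usage of containers that are kept across attempts.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of per-user accounting. Hypothetical; not the real YARN classes. */
class UserAccountingSketch {

    /** Hypothetical stand-in for the scheduler's per-user record. */
    static final class User {
        int totalApplications;
        int usedMemoryMb;
    }

    /** Hypothetical stand-in for a leaf queue and its user bookkeeping. */
    static final class Queue {
        final Map<String, User> users = new HashMap<>();
        // Memory held by containers that are really running on the cluster.
        int liveContainerMemoryMb;

        User getUserAndAddIfAbsent(String name) {
            return users.computeIfAbsent(name, n -> new User());
        }

        void addApplicationAttempt(String userName) {
            getUserAndAddIfAbsent(userName).totalApplications++;
        }

        void allocate(String userName, int memoryMb) {
            getUserAndAddIfAbsent(userName).usedMemoryMb += memoryMb;
            liveContainerMemoryMb += memoryMb;
        }

        // Same pattern as the quoted excerpt: the user record is dropped when
        // its application count reaches 0. Containers are assumed to be kept
        // across attempts, so liveContainerMemoryMb is left untouched.
        void removeApplicationAttempt(String userName) {
            User user = getUserAndAddIfAbsent(userName);
            user.totalApplications--;
            if (user.totalApplications == 0) {
                users.remove(userName);
            }
        }
    }

    /** Returns {memory tracked for the user, memory held by live containers}. */
    static int[] simulate() {
        Queue queue = new Queue();
        queue.addApplicationAttempt("hadoop");
        for (int i = 0; i < 5; i++) {
            queue.allocate("hadoop", 1024); // illustrative container size
        }
        queue.removeApplicationAttempt("hadoop"); // job manager killed
        queue.addApplicationAttempt("hadoop");    // retry recreates the user
        User recreated = queue.getUserAndAddIfAbsent("hadoop");
        return new int[] {recreated.usedMemoryMb, queue.liveContainerMemoryMb};
    }

    public static void main(String[] args) {
        int[] result = simulate();
        System.out.println("tracked=" + result[0] + " MB, live=" + result[1] + " MB");
    }
}
```

In this model the recreated user starts at 0 MB while 5120 MB of containers are still alive, so a limit check against the tracked value would admit further containers beyond the user's max resource.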