[
https://issues.apache.org/jira/browse/YARN-9947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Hu Ziqian updated YARN-9947:
----------------------------
Description:
This issue introduces a way to lazily initialize appLogAggregatorImpl so that it
accesses HDFS as late as possible (usually when the app finishes). This avoids
all NMs accessing HDFS at the same time when the NMs in a cluster are restarted,
and reduces the pressure on HDFS. Details follow.
In the current version, the app log aggregator checks HDFS and tries to create
the remote log directory when an app is initialized. This causes a problem when
NMs are restarted in a large cluster whose HDFS is heavily loaded. A restarting
NM initializes all apps on that NM, and the NM tries to connect to HDFS. If HDFS
is already heavily loaded, many NMs restarting at the same time can make HDFS
stop responding. Each NM then waits for HDFS's response, the RM stops receiving
the NM's heartbeats, and the RM treats all of the NM's containers as timed out.
In our production environment with 3500+ NMs, we found that restarting the NMs
puts heavy pressure on HDFS and that the app-init operation blocks on accessing
HDFS (stack attached below), which causes all containers to fail (the number of
containers on one NM drops to zero).
!https://teambition-file.alibaba-inc.com/storage/011mcaf1aebf84f02a5d2c2c5fa85af80f5b?download=upload_tfs_by_description.png&Signature=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJBcHBJRCI6IjVjZDkwOTdmYjNhNDMyMjk3OTBhN2EyZiIsIl9hcHBJZCI6IjVjZDkwOTdmYjNhNDMyMjk3OTBhN2EyZiIsIl9vcmdhbml6YXRpb25JZCI6IjVjNDA1N2YwYmU4MjViMzkwNjY3YWJlZSIsImV4cCI6MTU3MjgzNzQxMywiaWF0IjoxNTcyODM3MTEzLCJyZXNvdXJjZSI6Ii9zdG9yYWdlLzAxMW1jYWYxYWViZjg0ZjAyYTVkMmMyYzVmYTg1YWY4MGY1YiJ9.JJQoQvjWdAQItQkjtdxa1SnkqJWuij_w2xq2Unoaktg!
!https://teambition-file.alibaba-inc.com/storage/011m873079212ee7fe507ddbe163a0c07fb1?download=upload_tfs_by_description.png&Signature=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJBcHBJRCI6IjVjZDkwOTdmYjNhNDMyMjk3OTBhN2EyZiIsIl9hcHBJZCI6IjVjZDkwOTdmYjNhNDMyMjk3OTBhN2EyZiIsIl9vcmdhbml6YXRpb25JZCI6IjVjNDA1N2YwYmU4MjViMzkwNjY3YWJlZSIsImV4cCI6MTU3MjgzNzQxMywiaWF0IjoxNTcyODM3MTEzLCJyZXNvdXJjZSI6Ii9zdG9yYWdlLzAxMW04NzMwNzkyMTJlZTdmZTUwN2RkYmUxNjNhMGMwN2ZiMSJ9.kH73n6bdx8ETXsrWcBGgXGay2WP3z9nzuDlE8-RvQzs!
We solve this problem by introducing lazy initialization in
appLogAggregatorImpl. When an app is initialized, we only create the
appLogAggregatorImpl object, without calling verifyAndCreateRemoteLogDir(). We
call verifyAndCreateRemoteLogDir() when the app starts aggregating logs. Because
apps generally do not finish or aggregate logs at the same time, the
verifyAndCreateRemoteLogDir() calls are spread out over time, which means NMs do
not access HDFS simultaneously even when they are restarted at the same time.
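
The deferral described above can be illustrated with a minimal, self-contained
sketch. This is not the actual patch or the real NodeManager code: the names
RemoteLogStore, CountingStore, LazyAppLogAggregator and ensureRemoteDir are
hypothetical, and the counter stands in for real HDFS calls. Only
verifyAndCreateRemoteLogDir is a name taken from the description.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class LazyInitSketch {

    /** Hypothetical stand-in for the remote file system (HDFS in the issue). */
    interface RemoteLogStore {
        void verifyAndCreateRemoteLogDir(String appId);
    }

    /** Counts remote calls instead of contacting a real file system. */
    static final class CountingStore implements RemoteLogStore {
        final AtomicInteger calls = new AtomicInteger();

        @Override
        public void verifyAndCreateRemoteLogDir(String appId) {
            calls.incrementAndGet(); // a real implementation would talk to HDFS here
        }
    }

    /** Hypothetical aggregator: construction is cheap, remote setup is deferred. */
    static final class LazyAppLogAggregator {
        private final String appId;
        private final RemoteLogStore store;
        private boolean remoteDirReady; // guarded by "this"

        LazyAppLogAggregator(String appId, RemoteLogStore store) {
            this.appId = appId;
            this.store = store; // note: no remote call at app-init time
        }

        /** Runs the remote check once, the first time logs are aggregated. */
        private synchronized void ensureRemoteDir() {
            if (!remoteDirReady) {
                store.verifyAndCreateRemoteLogDir(appId);
                remoteDirReady = true; // stays false if the call above throws
            }
        }

        void aggregateLogs() {
            ensureRemoteDir();
            // uploading the container logs is omitted in this sketch
        }
    }

    public static void main(String[] args) {
        CountingStore store = new CountingStore();
        LazyAppLogAggregator[] aggregators = new LazyAppLogAggregator[100];
        for (int i = 0; i < aggregators.length; i++) {
            aggregators[i] = new LazyAppLogAggregator("app_" + i, store);
        }
        // Simulated NM restart: all apps are initialized, nothing touched the store.
        System.out.println("remote calls after init: " + store.calls.get()); // 0

        aggregators[0].aggregateLogs();
        aggregators[0].aggregateLogs();
        // Only the app that actually aggregated paid for the remote check, once.
        System.out.println("remote calls after aggregation: " + store.calls.get()); // 1
    }
}
```

In this sketch a failed remote check leaves the flag unset, so the next
aggregation attempt retries it; whether the real change should retry is a design
decision this example does not settle.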
YARN-8418 solved the problem of leaked container log directories by adding a way
to update the NM's credentials. With lazy initialization of
appLogAggregatorImpl, we no longer need YARN-8418's logic, because lazy
initialization happens after the addCredentials logic, so the credentials are
always refreshed before we use them.
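
The ordering argument above can be shown with a small sketch. It assumes a
simplified model in which credentials are a single string token; the classes
NodeCredentials, EagerAggregator and LazyAggregator are hypothetical and do not
correspond to the real Hadoop credential classes or to the YARN-8418 code. Only
the name addCredentials is borrowed from the description.

```java
import java.util.concurrent.atomic.AtomicReference;

public class CredentialOrderingSketch {

    /** Hypothetical holder for the token used to reach the remote file system. */
    static final class NodeCredentials {
        private final AtomicReference<String> token =
            new AtomicReference<>("stale-token");

        void addCredentials(String fresh) {
            token.set(fresh);
        }

        String current() {
            return token.get();
        }
    }

    /** Eager variant: reads the token while the app is being initialized. */
    static final class EagerAggregator {
        final String tokenUsedForRemoteDir;

        EagerAggregator(NodeCredentials creds) {
            this.tokenUsedForRemoteDir = creds.current();
        }
    }

    /** Lazy variant: reads the token only when aggregation starts. */
    static final class LazyAggregator {
        private final NodeCredentials creds;
        private String tokenUsedForRemoteDir;

        LazyAggregator(NodeCredentials creds) {
            this.creds = creds;
        }

        synchronized String startAggregation() {
            if (tokenUsedForRemoteDir == null) {
                tokenUsedForRemoteDir = creds.current();
            }
            return tokenUsedForRemoteDir;
        }
    }

    public static void main(String[] args) {
        NodeCredentials creds = new NodeCredentials();
        EagerAggregator eager = new EagerAggregator(creds); // app init
        LazyAggregator lazy = new LazyAggregator(creds);    // app init
        creds.addCredentials("fresh-token");                // refresh arrives later

        System.out.println("eager used: " + eager.tokenUsedForRemoteDir); // stale-token
        System.out.println("lazy used:  " + lazy.startAggregation());     // fresh-token
    }
}
```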
In summary, this issue does two things:
# Introduces lazy initialization in appLogAggregatorImpl to avoid concentrated
HDFS access when all NMs in a cluster are restarted.
# Reverts YARN-8418, because lazy initialization guarantees that the
credentials are refreshed before use.
was:
This issue introduces a method to lazy init
In the current version, the app log aggregator checks HDFS and tries to create
the remote log directory when an app is initialized. This causes a problem when
NMs are restarted in a large cluster whose HDFS is heavily loaded. A restarting
NM initializes all apps on that NM, and the NM tries to connect to HDFS. If HDFS
is already heavily loaded, many NMs restarting at the same time can make HDFS
stop responding. Each NM then waits for HDFS's response, the RM stops receiving
the NM's heartbeats, and the RM treats all of the NM's containers as timed out.
In our production environment with 3500+ NMs, we found that restarting the NMs
puts heavy pressure on HDFS and that the app-init operation blocks on accessing
HDFS (stack attached below), which causes all containers to fail (the number of
containers on one NM drops to zero).
!https://teambition-file.alibaba-inc.com/storage/011mcaf1aebf84f02a5d2c2c5fa85af80f5b?download=upload_tfs_by_description.png&Signature=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJBcHBJRCI6IjVjZDkwOTdmYjNhNDMyMjk3OTBhN2EyZiIsIl9hcHBJZCI6IjVjZDkwOTdmYjNhNDMyMjk3OTBhN2EyZiIsIl9vcmdhbml6YXRpb25JZCI6IjVjNDA1N2YwYmU4MjViMzkwNjY3YWJlZSIsImV4cCI6MTU3MjgzNzQxMywiaWF0IjoxNTcyODM3MTEzLCJyZXNvdXJjZSI6Ii9zdG9yYWdlLzAxMW1jYWYxYWViZjg0ZjAyYTVkMmMyYzVmYTg1YWY4MGY1YiJ9.JJQoQvjWdAQItQkjtdxa1SnkqJWuij_w2xq2Unoaktg!
!https://teambition-file.alibaba-inc.com/storage/011m873079212ee7fe507ddbe163a0c07fb1?download=upload_tfs_by_description.png&Signature=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJBcHBJRCI6IjVjZDkwOTdmYjNhNDMyMjk3OTBhN2EyZiIsIl9hcHBJZCI6IjVjZDkwOTdmYjNhNDMyMjk3OTBhN2EyZiIsIl9vcmdhbml6YXRpb25JZCI6IjVjNDA1N2YwYmU4MjViMzkwNjY3YWJlZSIsImV4cCI6MTU3MjgzNzQxMywiaWF0IjoxNTcyODM3MTEzLCJyZXNvdXJjZSI6Ii9zdG9yYWdlLzAxMW04NzMwNzkyMTJlZTdmZTUwN2RkYmUxNjNhMGMwN2ZiMSJ9.kH73n6bdx8ETXsrWcBGgXGay2WP3z9nzuDlE8-RvQzs!
> lazy init appLogAggregatorImpl when log aggregation
> ---------------------------------------------------
>
> Key: YARN-9947
> URL: https://issues.apache.org/jira/browse/YARN-9947
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: nodemanager
> Affects Versions: 3.1.3
> Reporter: Hu Ziqian
> Assignee: Hu Ziqian
> Priority: Major
--
This message was sent by Atlassian Jira
(v8.3.4#803005)