[
https://issues.apache.org/jira/browse/YARN-9157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
yimeng updated YARN-9157:
-------------------------
Description:
A YARN task failed to execute: the excessive number of files under the path
yarn.nodemanager.local-dirs exhausted the inodes, which caused the task to fail.
The NM logs show that many localized dirs could not be deleted because the
user was not found in the security system.
_2018-12-21 06:06:40,723 | INFO | AsyncDispatcher event handler | Cache Size Before Clean: 240859897, Total Deleted: 85003, Public Deleted: 0, Private Deleted: 85003 | ResourceLocalizationService.java:522_
_2018-12-21 06:06:40,744 | ERROR | DeletionService #1 | DeleteAsUser for /srv/BigData/hadoop/data1/nm/localdir/usercache/odaeuser/filecache/48339 returned with exit code: 255 | LinuxContainerExecutor.java:565_
_ExitCodeException exitCode=255:_
_at org.apache.hadoop.util.Shell.runCommand(Shell.java:664)_
_at org.apache.hadoop.util.Shell.run(Shell.java:553)_
_at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:866)_
_at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.deleteAsUser(LinuxContainerExecutor.java:559)_
_at org.apache.hadoop.yarn.server.nodemanager.DeletionService$FileDeletionTask.run(DeletionService.java:276)_
_at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)_
_at java.util.concurrent.FutureTask.run(FutureTask.java:266)_
_at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)_
_at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)_
_at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)_
_at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)_
_at java.lang.Thread.run(Thread.java:748)_
_2018-12-21 06:06:40,744 | ERROR | DeletionService #1 | Output from LinuxContainerExecutor's deleteAsUser follows: | LinuxContainerExecutor.java:567_
_2018-12-21 06:06:40,744 | INFO | DeletionService #1 | main : command provided 3 | ContainerExecutor.java:322_
_2018-12-21 06:06:40,744 | INFO | DeletionService #1 | main : run as user is odaeuser | ContainerExecutor.java:322_
_2018-12-21 06:06:40,744 | INFO | DeletionService #1 | main : requested yarn user is odaeuser | ContainerExecutor.java:322_
_2018-12-21 06:06:40,744 | INFO | DeletionService #1 | User odaeuser not found | ContainerExecutor.java:322_
_2018-12-21 06:06:40,745 | INFO | DeletionService #1 | Deleting absolute path : /srv/BigData/hadoop/data1/nm/localdir/usercache/odaeuser/filecache/48342 | LinuxContainerExecutor.java:543_
_2018-12-21 06:06:40,749 | ERROR | DeletionService #2 | DeleteAsUser for /srv/BigData/hadoop/data1/nm/localdir/usercache/odaeuser/filecache/48334 returned with exit code: 255 | LinuxContainerExecutor.java:565_
_ExitCodeException exitCode=255:_
_at org.apache.hadoop.util.Shell.runCommand(Shell.java:664)_
_at org.apache.hadoop.util.Shell.run(Shell.java:553)_
_at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:866)_
_at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.deleteAsUser(LinuxContainerExecutor.java:559)_
_at org.apache.hadoop.yarn.server.nodemanager.DeletionService$FileDeletionTask.run(DeletionService.java:276)_
_at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)_
_at java.util.concurrent.FutureTask.run(FutureTask.java:266)_
_at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)_
_at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)_
_at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)_
_at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)_
_at java.lang.Thread.run(Thread.java:748)_
_2018-12-21 06:06:40,749 | ERROR | DeletionService #2 | Output from LinuxContainerExecutor's deleteAsUser follows: | LinuxContainerExecutor.java:567_
_2018-12-21 06:06:40,749 | INFO | DeletionService #2 | main : command provided 3 | ContainerExecutor.java:322_
_2018-12-21 06:06:40,749 | INFO | DeletionService #2 | main : run as user is odaeuser | ContainerExecutor.java:322_
_2018-12-21 06:06:40,749 | INFO | DeletionService #2 | main : requested yarn user is odaeuser | ContainerExecutor.java:322_
_2018-12-21 06:06:40,749 | INFO | DeletionService #2 | User odaeuser not found | ContainerExecutor.java:322_
In fact, the files in the local dirs occupy 4.4 GB, not the 240859897 bytes
printed in the log.
!image-2018-12-25-10-03-51-070.png!
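To make the size of that gap concrete, here is a small arithmetic sketch. It assumes the reported "4.4GB" means 4.4 GiB (1024-based); the class and method names are made up for illustration and are not part of Hadoop. Under that assumption the logged cache size is roughly 229.7 MiB, about one twentieth of what is actually on disk.

```java
/** Illustrative arithmetic only; names here are hypothetical, not Hadoop code. */
class CacheSizeCheck {

    /** Converts a byte count to mebibytes (1 MiB = 1024 * 1024 bytes). */
    static double bytesToMiB(long bytes) {
        return bytes / (1024.0 * 1024.0);
    }

    /** Ratio between the size observed on disk and the size the NM logged. */
    static double ratio(double onDiskBytes, long loggedBytes) {
        return onDiskBytes / loggedBytes;
    }

    public static void main(String[] args) {
        long loggedBytes = 240859897L;                    // "Cache Size Before Clean" from the log
        double onDiskBytes = 4.4 * 1024 * 1024 * 1024;    // assumption: "4.4GB" read as GiB

        System.out.println("logged cache size (MiB): " + bytesToMiB(loggedBytes));
        System.out.println("on-disk / logged ratio:  " + ratio(onDiskBytes, loggedBytes));
    }
}
```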
The "user not found" error occurs because our user information is stored in an
LDAP database. When the LDAP service has a problem, the user lookup fails (not
because the user was deleted). Once the LDAP server recovers, the user
information can be retrieved again.
The problem is that even after the user lookup works again, the dirs whose
deletion failed earlier are never deleted, because they have already been
removed from the tracker list. This causes the dirs to accumulate.
I think the NM ResourceLocalizationService should check whether the
DeletionService thread actually deleted the directory before removing it from
the tracker list and LevelDB. If the deletion failed, the directory should be
added back to the tracker list, and cleanup should continue with the next dirs
until the local dirs size is below
yarn.nodemanager.localizer.cache.target-size-mb.
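The proposal above could look roughly like the following sketch. It is not NodeManager code: CacheCleanupSketch, Entry, track and clean are hypothetical names, and the deletion is modeled as a synchronous callback that returns true only when the directory was really removed, whereas the real DeletionService runs asynchronously. The point it illustrates is that an entry is dropped from tracking only after a confirmed delete, and a failed entry is put back so a later cleanup pass can retry it.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical sketch of "untrack only after a confirmed delete". */
class CacheCleanupSketch {

    /** One localized directory and the size it contributes to the cache. */
    static final class Entry {
        final String path;
        final long sizeBytes;

        Entry(String path, long sizeBytes) {
            this.path = path;
            this.sizeBytes = sizeBytes;
        }
    }

    private final Deque<Entry> tracked = new ArrayDeque<>();
    private long trackedBytes = 0;

    void track(Entry e) {
        tracked.addLast(e);
        trackedBytes += e.sizeBytes;
    }

    long trackedBytes() { return trackedBytes; }

    int trackedCount() { return tracked.size(); }

    /**
     * Evicts oldest-first until the tracked size is at or below targetBytes.
     * An entry whose deletion fails stays counted and is re-added to the
     * tracker afterwards, so the next pass can retry it.
     *
     * @return paths whose deletion failed in this pass
     */
    List<String> clean(long targetBytes, Predicate<Entry> deleter) {
        List<Entry> failed = new ArrayList<>();
        while (trackedBytes > targetBytes && !tracked.isEmpty()) {
            Entry e = tracked.pollFirst();
            if (deleter.test(e)) {
                trackedBytes -= e.sizeBytes;   // confirmed gone: stop tracking it
            } else {
                failed.add(e);                 // still on disk: keep its size counted
            }
        }
        List<String> failedPaths = new ArrayList<>();
        for (Entry e : failed) {
            tracked.addLast(e);                // add back for a later retry
            failedPaths.add(e.path);
        }
        return failedPaths;
    }

    public static void main(String[] args) {
        CacheCleanupSketch cache = new CacheCleanupSketch();
        cache.track(new Entry("usercache/odaeuser/filecache/1", 100));
        cache.track(new Entry("usercache/otheruser/filecache/2", 100));
        cache.track(new Entry("usercache/otheruser/filecache/3", 100));

        // First pass: simulate the LDAP outage, so deletes for odaeuser fail.
        List<String> failed = cache.clean(100, e -> !e.path.contains("odaeuser"));
        System.out.println("failed: " + failed + ", still tracked: " + cache.trackedCount());

        // Second pass: the lookup works again and the retained entry is retried.
        failed = cache.clean(0, e -> true);
        System.out.println("failed: " + failed + ", still tracked: " + cache.trackedCount());
    }
}
```

In the first pass the failed directory is skipped and the next ones are deleted until the target is met, which matches the "then delete the next dirs" part of the proposal. Because the failed entry keeps contributing to the tracked size, the reported cache size also stays closer to what is on disk.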
was:
A YARN task failed to execute: the excessive number of files under the path
yarn.nodemanager.local-dirs exhausted the inodes, which caused the task to fail.
!image-2018-12-25-09-53-15-067.png!
The NM logs show that many localized dirs could not be deleted because the
user was not found in the security system.
In fact, the files in the local dirs occupy 4.4 GB, not the 240859897 bytes
printed in the log.
!image-2018-12-25-10-03-51-070.png!
The "user not found" error occurs because our user information is stored in an
LDAP database. When the LDAP service has a problem, the user lookup fails (not
because the user was deleted). Once the LDAP server recovers, the user
information can be retrieved again.
The problem is that even after the user lookup works again, the dirs whose
deletion failed earlier are never deleted, because they have already been
removed from the tracker list. This causes the dirs to accumulate.
I think the NM ResourceLocalizationService should check whether the
DeletionService thread actually deleted the directory before removing it from
the tracker list and LevelDB. If the deletion failed, the directory should be
added back to the tracker list, and cleanup should continue with the next dirs
until the local dirs size is below
yarn.nodemanager.localizer.cache.target-size-mb.
> Failed directory deletions in yarn.nodemanager.local-dirs cause files to
> accumulate under that path and deplete the operating system's inodes
> ---------------------------------------------------------------------------
>
> Key: YARN-9157
> URL: https://issues.apache.org/jira/browse/YARN-9157
> Project: Hadoop YARN
> Issue Type: Bug
> Components: applications/distributed-shell
> Affects Versions: 2.7.2, 2.8.3, 2.7.5, 3.0.1, 3.1.1
> Reporter: yimeng
> Priority: Major
> Attachments: image-2018-12-25-10-03-51-070.png
>