Hello,
If a resource is localised on a disk and that disk goes bad after
localisation, subsequent containers cannot find the resource, and the NM
does not download it again.
The problem is that the stat system call still succeeds on the bad path,
which causes file.exists() to return true.
But ls on the same path returns an I/O error.
LocalResourcesTrackerImpl.java:
    case REQUEST:
      if (rsrc != null && (!isResourcePresent(rsrc))) {
        LOG.info("Resource " + rsrc.getLocalPath()
            + " is missing, localizing it again");
        removeResource(req);
        rsrc = null;
      }
      if (null == rsrc) {
        rsrc = new LocalizedResource(req, dispatcher);
        localrsrc.put(req, rsrc);
      }
      break;
isResourcePresent() calls file.exists(), which natively calls stat64. That
call succeeds, so exists() returns true. But the disk is actually bad, and
nothing can be read from or written to that path.
Example:
>>stat /data/d3/yarn/local
File: `/data/d3/yarn/local'
Size: 4096 Blocks: 8 IO Block: 4096 directory
Device: 830h/2096d Inode: 107307009 Links: 3
Access: (0755/drwxr-xr-x) Uid: ( 110/ yarn) Gid: ( 118/ hadoop)
Access: 2014-11-18 13:57:19.000000000 +0000
Modify: 2014-11-19 11:15:15.000000000 +0000
Change: 2014-11-19 11:15:15.000000000 +0000
Birth: -
whereas ls says:
ls: reading directory /data/d3/mapred: Input/output error
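One possible direction, offered only as a minimal sketch and not as a tested patch: instead of relying on file.exists() alone, the presence check could actually try to read from the path, so that an I/O error like the one ls reports above is treated as "resource missing". The class ResourceProbe and the method isReadable below are hypothetical names used for illustration; they are not part of YARN, and I have not verified how this behaves on every kind of disk failure.

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.DirectoryIteratorException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper for illustration only; not part of the YARN code base.
final class ResourceProbe {

    /**
     * Returns true only if the path exists AND can actually be read:
     * a directory must be listable, a regular file must allow a one-byte read.
     * A stat-only check (File.exists()) may still succeed on a failed disk.
     */
    static boolean isReadable(File file) {
        Path path = file.toPath();
        if (!Files.exists(path)) {
            return false;
        }
        try {
            if (Files.isDirectory(path)) {
                // Opening and iterating the directory forces a real read,
                // roughly what ls does.
                try (DirectoryStream<Path> entries = Files.newDirectoryStream(path)) {
                    entries.iterator().hasNext();
                }
            } else {
                // Reading a single byte forces real I/O on the file.
                try (InputStream in = Files.newInputStream(path)) {
                    in.read();
                }
            }
            return true;
        } catch (IOException | DirectoryIteratorException e) {
            // Any I/O failure is treated as "not present".
            return false;
        }
    }
}
```

Under this assumption, isResourcePresent() could call something like ResourceProbe.isReadable(file) in place of file.exists(), so the REQUEST branch above would remove the stale entry and localise the resource again. The trade-off is one extra read per request.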
Any thoughts?
Thanks