Decommissioned datanodes counted as in-service cause datanode allocation failure
Hi,

When the DFSClient allocates a datanode for a write with load consideration enabled, it checks whether the datanode is overloaded by computing the average xceiver count across all in-service datanodes. But a datanode that is decommissioned and then becomes dead is still treated as in service, which skews the average load well away from the real one, especially when the number of decommissioned datanodes is large. In our cluster of 180 datanodes, 100 are decommissioned, and the computed average load is 17. This failed all datanode allocations.

I have created a jira, HDFS-12820, but I am not sure about the solution. My question is: why don't we exclude decommissioned datanodes from nodesInService in the first place? If we simply remove them, could there be any side effect?

private void subtract(final DatanodeDescriptor node) {
  capacityUsed -= node.getDfsUsed();
  blockPoolUsed -= node.getBlockPoolUsed();
  xceiverCount -= node.getXceiverCount();
  if (!(node.isDecommissionInProgress() || node.isDecommissioned())) {
    nodesInService--;
    nodesInServiceXceiverCount -= node.getXceiverCount();
    capacityTotal -= node.getCapacity();
    capacityRemaining -= node.getRemaining();
  } else {
    capacityTotal -= node.getDfsUsed();
  }
  cacheCapacity -= node.getCacheCapacity();
  cacheUsed -= node.getCacheUsed();
}

-- Xie Gang
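A back-of-the-envelope sketch (plain Java, not HDFS code) of how counting dead, decommissioned datanodes as in service skews the average that the 2x overload check compares against. The node counts are the ones from this thread; the per-live-node xceiver load of 40 is an assumed figure chosen so the skewed average lands near the reported 17.

```java
public class LoadAverageSketch {
    // Average xceivers per "in service" node, as the check effectively computes it.
    static double inServiceAverage(int totalXceivers, int nodesInService) {
        return (double) totalXceivers / nodesInService;
    }

    // Overload test in the style of the placement policy: reject nodes above 2x average.
    static boolean isOverloaded(int nodeXceivers, double avgLoad) {
        return nodeXceivers > 2.0 * avgLoad;
    }

    public static void main(String[] args) {
        int liveNodes = 80, deadDecommissioned = 100;
        int xceiversPerLiveNode = 40;                    // assumed real load
        int totalXceivers = liveNodes * xceiversPerLiveNode;

        // Skewed denominator: dead decommissioned nodes still counted in service.
        double skewedAvg = inServiceAverage(totalXceivers, liveNodes + deadDecommissioned);
        // Live-only denominator: what the average should be.
        double liveAvg = inServiceAverage(totalXceivers, liveNodes);

        // skewedAvg comes out near 17.8 (matching the ~17 in the thread),
        // so every live node (40 xceivers) exceeds the 2x threshold (~35.6).
        System.out.printf("skewed avg=%.1f, live-only avg=%.1f%n", skewedAvg, liveAvg);
        System.out.println("live node rejected: " + isOverloaded(xceiversPerLiveNode, skewedAvg));
    }
}
```

With the skewed denominator every live node fails the overload check, which is consistent with all allocations failing.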
Block invalid IOException causes the DFSClient domain socket to be disabled
Hi,

We use HDFS 2.4 & 2.6, and recently hit an issue where the DFSClient domain socket is disabled when the datanode throws a block-invalid exception.

The block is invalidated for some reason on the datanode, which is fine by itself. The DFSClient then tries to access this block on this datanode via the domain socket. This triggers an IOException. On the DFSClient side, when it gets an IOException with error code 'ERROR', it disables the domain socket and falls back to TCP. Worst of all, it seems to never recover the socket.

I think this is a defect: on such a "block invalid" exception we should not disable the domain socket, because there is nothing wrong with the domain socket service itself.

Any thoughts?

The code:

private ShortCircuitReplicaInfo requestFileDescriptors(DomainPeer peer,
    Slot slot) throws IOException {
  ShortCircuitCache cache = clientContext.getShortCircuitCache();
  final DataOutputStream out =
      new DataOutputStream(new BufferedOutputStream(peer.getOutputStream()));
  SlotId slotId = slot == null ? null : slot.getSlotId();
  new Sender(out).requestShortCircuitFds(block, token, slotId, 1);
  DataInputStream in = new DataInputStream(peer.getInputStream());
  BlockOpResponseProto resp = BlockOpResponseProto.parseFrom(
      PBHelper.vintPrefixed(in));
  DomainSocket sock = peer.getDomainSocket();
  switch (resp.getStatus()) {
  case SUCCESS:
    byte buf[] = new byte[1];
    FileInputStream fis[] = new FileInputStream[2];
    sock.recvFileInputStreams(fis, buf, 0, buf.length);
    ShortCircuitReplica replica = null;
    try {
      ExtendedBlockId key =
          new ExtendedBlockId(block.getBlockId(), block.getBlockPoolId());
      replica = new ShortCircuitReplica(key, fis[0], fis[1], cache,
          Time.monotonicNow(), slot);
    } catch (IOException e) {
      // This indicates an error reading from disk, or a format error. Since
      // it's not a socket communication problem, we return null rather than
      // throwing an exception.
      LOG.warn(this + ": error creating ShortCircuitReplica.", e);
      return null;
    } finally {
      if (replica == null) {
        IOUtils.cleanup(DFSClient.LOG, fis[0], fis[1]);
      }
    }
    return new ShortCircuitReplicaInfo(replica);
  case ERROR_UNSUPPORTED:
    if (!resp.hasShortCircuitAccessVersion()) {
      LOG.warn("short-circuit read access is disabled for " +
          "DataNode " + datanode + ". reason: " + resp.getMessage());
      clientContext.getDomainSocketFactory()
          .disableShortCircuitForPath(pathInfo.getPath());
    } else {
      LOG.warn("short-circuit read access for the file " +
          fileName + " is disabled for DataNode " + datanode +
          ". reason: " + resp.getMessage());
    }
    return null;
  case ERROR_ACCESS_TOKEN:
    String msg = "access control error while " +
        "attempting to set up short-circuit access to " +
        fileName + resp.getMessage();
    if (LOG.isDebugEnabled()) {
      LOG.debug(this + ":" + msg);
    }
    return new ShortCircuitReplicaInfo(new InvalidToken(msg));
  default:
    LOG.warn(this + ": unknown response code " + resp.getStatus() +
        " while attempting to set up short-circuit access. " +
        resp.getMessage());
    clientContext.getDomainSocketFactory()
        .disableShortCircuitForPath(pathInfo.getPath());
    return null;
  }
}

-- Xie Gang
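One way to frame the proposed fix (a sketch with invented names, not HDFS code): classify each response status by whether it implicates the domain-socket mechanism itself, and only disable the short-circuit path for those that do. A per-block error such as an invalidated block would then fail just that read.

```java
public class ScrErrorPolicy {
    enum Status { SUCCESS, ERROR, ERROR_UNSUPPORTED, ERROR_ACCESS_TOKEN }

    // true => the client should disable the short-circuit path for this socket
    static boolean shouldDisableDomainSocket(Status s) {
        switch (s) {
            case ERROR:
                // e.g. "block invalid": the block is gone on this datanode,
                // but the domain-socket service itself is healthy
                return false;
            case ERROR_UNSUPPORTED:
                // the DataNode cannot serve file descriptors at all
                return true;
            default:
                // SUCCESS and token errors do not implicate the socket
                return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(shouldDisableDomainSocket(Status.ERROR));
        System.out.println(shouldDisableDomainSocket(Status.ERROR_UNSUPPORTED));
    }
}
```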
Re: Block invalid IOException causes the DFSClient domain socket to be disabled
Shall I create the jira directly?

On Thu, Oct 26, 2017 at 12:34 PM, Xie Gang <xiegang...@gmail.com> wrote:
> [quoted text trimmed]

-- Xie Gang
Re: Inconsistency between the datanode volume info and OS df
Got the root cause; it's a dup of HDFS-8072: https://issues.apache.org/jira/browse/HDFS-8072

On Wed, Jan 10, 2018 at 2:20 PM, Xie Gang <xiegang...@gmail.com> wrote:
> [quoted text trimmed]

-- Xie Gang
Inconsistency between the datanode volume info and OS df
Hi,

Recently we hit an issue where there is a difference between the freeSpace in the datanode volume info and the OS df output.

For example, the JMX of the DN shows:

"VolumeInfo" : "{\"/\":{\"freeSpace\":1445398864500,\"usedSpace\":228138206927,\"reservedSpace\":53687091200}}",

But df shows:

/dev/sda 2146676656 253778008 1785508084 13% /...

There is a gap of about 400 GB, which gets reported as Non DFS Used. Strangest of all, after I restart the DN process the gap disappears, and after some days it shows up again.

YARN shares the same server as the DN and keeps some file cache. Could that be related?

The direct cause is that the freeSpace from the DN is quite different from the available space from df. After tracking down the code, the DN's freeSpace comes from dirFile.getUsableSpace(). Could that have some problem? Have we hit this issue before?

Thanks,
Gang

-- Xie Gang
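For anyone reproducing this, here is a minimal runnable probe of the same java.io.File primitive that the DN's freeSpace traces back to; run it against a data directory and compare with df on the same mount (the path argument is yours to supply):

```java
import java.io.File;

public class SpaceProbe {
    // Returns {total, free, usable} for the partition holding dir.
    static long[] probe(File dir) {
        return new long[] {
            dir.getTotalSpace(),   // raw size of the partition
            dir.getFreeSpace(),    // unallocated bytes, incl. root-reserved blocks
            dir.getUsableSpace()   // bytes available to this JVM/user
        };
    }

    public static void main(String[] args) {
        File dir = new File(args.length > 0 ? args[0] : ".");
        long[] v = probe(dir);
        // df's "Available" column usually corresponds to usable, not free;
        // reserved filesystem blocks explain part of any free-vs-usable gap.
        System.out.printf("total=%d free=%d usable=%d%n", v[0], v[1], v[2]);
    }
}
```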
Re: does it make sense to get remaining space by summing all the volumes of the datanode
In 2.4 and 2.6:

public long getRemaining(StorageType t) {
  long remaining = 0;
  for(DatanodeStorageInfo s : getStorageInfos()) {
    if (s.getStorageType() == t) {
      remaining += s.getRemaining();
    }
  }
  return remaining;
}

On Mon, Jan 29, 2018 at 6:12 PM, Vinayakumar B <vinayakum...@apache.org> wrote:
> in which version of Hadoop you are seeing this?
>
> -Vinay
>
> On 29 Jan 2018 3:26 pm, "Xie Gang" <xiegang...@gmail.com> wrote:
> [quoted text trimmed]

-- Xie Gang
does it make sense to get remaining space by summing all the volumes of the datanode
Hello,

We recently hit an issue where almost all the disks of a datanode got full even though we configured du.reserved.

After tracking down the code, we found that when we choose a target datanode and check whether it is a good candidate for block allocation (isGoodTarget()), it only checks the total space left across all the volumes (of the same storage type), not each volume. This logic makes the per-volume reservation useless.

Is this a problem, or do I have a misunderstanding?

final long remaining = node.getRemaining(storage.getStorageType());
if (requiredSize > remaining - scheduledSize) {
  logNodeIsNotChosen(storage, "the node does not have enough "
      + storage.getStorageType() + " space"
      + " (required=" + requiredSize
      + ", scheduled=" + scheduledSize
      + ", remaining=" + remaining + ")");
  stats.incrOverScheduled();
  return false;
}

-- Xie Gang
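A sketch (plain Java, not HDFS code; all numbers illustrative) of the failure mode described above: a node-level remaining-space check passes even though no single volume can hold the block, because summing hides full volumes.

```java
import java.util.Arrays;

public class RemainingSketch {
    // Node-level check, as isGoodTarget() effectively does: sum across volumes.
    static boolean nodeHasSpace(long[] volumeRemaining, long requiredSize) {
        return Arrays.stream(volumeRemaining).sum() >= requiredSize;
    }

    // Per-volume check: does any single volume actually fit the block?
    static boolean anyVolumeHasSpace(long[] volumeRemaining, long requiredSize) {
        return Arrays.stream(volumeRemaining).anyMatch(r -> r >= requiredSize);
    }

    public static void main(String[] args) {
        long blockSize = 128L << 20;                            // 128 MB block
        long[] volumes = {100L << 20, 100L << 20, 100L << 20};  // 100 MB left on each

        // Passes: 300 MB total remaining >= 128 MB required.
        System.out.println(nodeHasSpace(volumes, blockSize));
        // Fails: no single volume can take the block.
        System.out.println(anyVolumeHasSpace(volumes, blockSize));
    }
}
```

The gap between the two answers is exactly why the node gets chosen and the write then eats into each volume's reservation.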
enable SC local read of UC blocks to optimize read perf
Hello,

We disable SC (short-circuit) local reads of UC (under-construction) blocks in the current implementation. This is not optimal for applications whose access pattern is mostly to read what was written recently. With this access pattern, short-circuit read usage drops greatly and performance is impacted.

An idea is to enable SC local reads of UC blocks, provided the application can ensure that:
1. hsync is always called before each read of the file, and
2. the range of the read does not go beyond what has been written.

I think if the two conditions above are met, there should be no problem. Does this make sense?

Actually, I built a prototype and it worked, and I will look into it further. But I am not sure whether this has been tried before.

-- Xie Gang
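A toy model (plain Java, no HDFS involved; all names invented) of the two conditions above: a reader of an under-construction block may only touch bytes up to the last hsync'ed length.

```java
public class UcReadModel {
    private final byte[] block = new byte[1024];
    private int writtenLen = 0;  // bytes the writer has appended
    private int syncedLen = 0;   // bytes made visible to readers by hsync

    void write(byte[] data) {
        System.arraycopy(data, 0, block, writtenLen, data.length);
        writtenLen += data.length;
    }

    void hsync() {               // condition 1: sync before readers proceed
        syncedLen = writtenLen;
    }

    int read(byte[] buf, int off, int len) {
        if (off + len > syncedLen) {   // condition 2: stay within synced range
            throw new IllegalStateException("read beyond hsync'ed length");
        }
        System.arraycopy(block, off, buf, 0, len);
        return len;
    }

    int syncedLength() { return syncedLen; }
}
```

In this model a read that races ahead of hsync fails fast instead of returning unsynced bytes, which is the invariant the proposal asks the application to uphold.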
Why set the socket read timeout to n * socketTimeout in data transfer?
Hello,

I find that when we transfer data (DataTransfer), we set the socket read timeout to targets.length * socketTimeout, not to socketTimeout + READ_TIMEOUT_EXTENSION * (targets.length - 1) as we do when setting up the pipeline. What is the purpose of this setting?

public void doRun() throws IOException {
  xmitsInProgress.getAndIncrement();
  Socket sock = null;
  DataOutputStream out = null;
  DataInputStream in = null;
  BlockSender blockSender = null;
  final boolean isClient = clientname.length() > 0;
  try {
    final String dnAddr = targets[0].getXferAddr(connectToDnViaHostname);
    InetSocketAddress curTarget = NetUtils.createSocketAddr(dnAddr);
    if (LOG.isDebugEnabled()) {
      LOG.debug("Connecting to datanode " + dnAddr);
    }
    sock = newSocket();
    NetUtils.connect(sock, curTarget, dnConf.socketTimeout);
    sock.setSoTimeout(targets.length * dnConf.socketTimeout);  // <<<< here
    long writeTimeout = dnConf.socketWriteTimeout +
        HdfsServerConstants.WRITE_TIMEOUT_EXTENSION * (targets.length - 1);

-- Xie Gang
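To see how far the two formulas diverge, a quick arithmetic sketch (the 60 s read timeout and 5 s extension are illustrative defaults assumed here, not read from any config):

```java
public class TimeoutSketch {
    // DataTransfer style: multiply the whole timeout per downstream target.
    static long multiplied(int targets, long socketTimeoutMs) {
        return targets * socketTimeoutMs;
    }

    // Pipeline-setup style: base timeout plus a small extension per extra target.
    static long extended(int targets, long socketTimeoutMs, long extensionMs) {
        return socketTimeoutMs + extensionMs * (targets - 1);
    }

    public static void main(String[] args) {
        long socketTimeoutMs = 60_000;  // assumed read timeout
        long extensionMs = 5_000;       // assumed per-target extension
        int targets = 3;                // pipeline of 3 datanodes

        System.out.println(multiplied(targets, socketTimeoutMs));               // 180000
        System.out.println(extended(targets, socketTimeoutMs, extensionMs));    // 70000
    }
}
```

With a 3-node pipeline the multiplied form waits more than twice as long as the extended form, which is the asymmetry the question is about.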
Why always allocate a shm slot on local read even if no zero-copy is needed?
Hello,

It seems that we always allocate a shm slot on a local read, even when we only do the SC read without zero-copy. Can we save this allocation when no zero-copy is needed? As far as I understand, the shm slot is not used unless we do zero-copy on a local read; is that right?

public ShortCircuitReplica(ExtendedBlockId key,
    FileInputStream dataStream, FileInputStream metaStream,
    ShortCircuitCache cache, long creationTimeMs,
    Slot slot) throws IOException {

-- Xie Gang
How to access 2 HDFS clusters with different versions in one app
Hi,

There is a requirement that we need to access two HDFS clusters with different versions (2.0 and 2.4) from one application. We shaded the 2.0 client. But we found that, because of the shading (the change of the package path), the namenode/datanode (which are not shaded) could not handle the RPC with the unknown package name.

So, is there any other way to do this? A rough idea is to change the RPC engine to map the shaded package name back to the original one, but I am not sure whether that would work.

-- Xie Gang