Decommissioned datanode is counted as in service, causing datanode allocation failure

2017-11-15 Thread Xie Gang
Hi,

When allocating a datanode for a DFSClient write with load consideration enabled, we
check whether the datanode is overloaded by calculating the average xceiver count of
all in-service datanodes. But if a datanode is decommissioned and becomes dead, it is
still treated as in service, which makes the computed average load much higher than
the real one, especially when the number of decommissioned datanodes is large. In our
cluster of 180 datanodes, 100 of them are decommissioned, and the average load is 17.
This fails all datanode allocations.


I have created JIRA HDFS-12820, but I am not very sure about the solution. My
concern is: why don't we exclude decommissioned datanodes from nodesInService in the
first place? If we simply remove this, could there be any side effects?

private void subtract(final DatanodeDescriptor node) {
  capacityUsed -= node.getDfsUsed();
  blockPoolUsed -= node.getBlockPoolUsed();
  xceiverCount -= node.getXceiverCount();
  if (!(node.isDecommissionInProgress() || node.isDecommissioned())) {
    nodesInService--;
    nodesInServiceXceiverCount -= node.getXceiverCount();
    capacityTotal -= node.getCapacity();
    capacityRemaining -= node.getRemaining();
  } else {
    capacityTotal -= node.getDfsUsed();
  }
  cacheCapacity -= node.getCacheCapacity();
  cacheUsed -= node.getCacheUsed();
}
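
For context, this is roughly how that average feeds the allocation decision on the
placement side (a paraphrased sketch of the 2.x BlockPlacementPolicyDefault /
FSClusterStats path, not the exact source; the helper method is mine):

boolean isTooBusy(DatanodeDescriptor node, FSClusterStats stats) {
  // Average xceivers over whatever the stats object still counts as
  // "in service" -- the quantity that gets skewed by the problem above.
  final double maxLoad = 2.0 * stats.getInServiceXceiverAverage();
  return node.getXceiverCount() > maxLoad;  // true => target rejected
}

If stale decommissioned nodes skew that average, the comparison no longer reflects
the live nodes' real load.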

-- 
Xie Gang


Block invalid IOException causes the DFSClient domain socket to be disabled

2017-10-25 Thread Xie Gang
Hi,

We use HDFS 2.4 & 2.6, and recently hit an issue where the DFSClient domain socket
is disabled when the datanode throws a block invalid exception.

The block is invalidated for some reason on the datanode, which is fine. Then the
DFSClient tries to access this block on this datanode via the domain socket. This
triggers an IOException. On the DFSClient side, when it gets an IOException with
error code 'ERROR', it disables the domain socket and falls back to TCP. The worst
part is that it never seems to recover the domain socket.

I think this is a defect: for such a "block invalid" exception, we should not
disable the domain socket, because there is nothing wrong with the domain socket
service itself.

Any thoughts?

The code:

private ShortCircuitReplicaInfo requestFileDescriptors(DomainPeer peer,
    Slot slot) throws IOException {
  ShortCircuitCache cache = clientContext.getShortCircuitCache();
  final DataOutputStream out =
      new DataOutputStream(new BufferedOutputStream(peer.getOutputStream()));
  SlotId slotId = slot == null ? null : slot.getSlotId();
  new Sender(out).requestShortCircuitFds(block, token, slotId, 1);
  DataInputStream in = new DataInputStream(peer.getInputStream());
  BlockOpResponseProto resp = BlockOpResponseProto.parseFrom(
      PBHelper.vintPrefixed(in));
  DomainSocket sock = peer.getDomainSocket();
  switch (resp.getStatus()) {
  case SUCCESS:
    byte buf[] = new byte[1];
    FileInputStream fis[] = new FileInputStream[2];
    sock.recvFileInputStreams(fis, buf, 0, buf.length);
    ShortCircuitReplica replica = null;
    try {
      ExtendedBlockId key =
          new ExtendedBlockId(block.getBlockId(), block.getBlockPoolId());
      replica = new ShortCircuitReplica(key, fis[0], fis[1], cache,
          Time.monotonicNow(), slot);
    } catch (IOException e) {
      // This indicates an error reading from disk, or a format error.  Since
      // it's not a socket communication problem, we return null rather than
      // throwing an exception.
      LOG.warn(this + ": error creating ShortCircuitReplica.", e);
      return null;
    } finally {
      if (replica == null) {
        IOUtils.cleanup(DFSClient.LOG, fis[0], fis[1]);
      }
    }
    return new ShortCircuitReplicaInfo(replica);
  case ERROR_UNSUPPORTED:
    if (!resp.hasShortCircuitAccessVersion()) {
      LOG.warn("short-circuit read access is disabled for " +
          "DataNode " + datanode + ".  reason: " + resp.getMessage());
      clientContext.getDomainSocketFactory()
          .disableShortCircuitForPath(pathInfo.getPath());
    } else {
      LOG.warn("short-circuit read access for the file " +
          fileName + " is disabled for DataNode " + datanode +
          ".  reason: " + resp.getMessage());
    }
    return null;
  case ERROR_ACCESS_TOKEN:
    String msg = "access control error while " +
        "attempting to set up short-circuit access to " +
        fileName + resp.getMessage();
    if (LOG.isDebugEnabled()) {
      LOG.debug(this + ":" + msg);
    }
    return new ShortCircuitReplicaInfo(new InvalidToken(msg));
  default:
    // A plain ERROR status (e.g. the replica was invalidated on the DataNode)
    // falls through to here and disables short-circuit reads for the path.
    LOG.warn(this + ": unknown response code " + resp.getStatus() +
        " while attempting to set up short-circuit access. " +
        resp.getMessage());
    clientContext.getDomainSocketFactory()
        .disableShortCircuitForPath(pathInfo.getPath());
    return null;
  }
}
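
One possible direction, purely as a rough sketch (ERROR is an existing Status value
in the data transfer protocol, but the handling below is hypothetical and untested):
treat a plain ERROR response as a per-block failure instead of poisoning the whole
domain socket path, e.g. by adding a case ahead of the default:

  case ERROR:
    // Hypothetical: the DataNode rejected this particular block (for example
    // because the replica was invalidated), so let the caller fall back to a
    // TCP read for this block, but keep short-circuit enabled for the path.
    LOG.warn(this + ": the DataNode reported an error for " + block +
        ": " + resp.getMessage() + "; not disabling short-circuit reads.");
    return null;

Whether that is the right fix probably belongs in the JIRA discussion.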



-- 
Xie Gang


Re: Block invalid IOException causes the DFSClient domain socket to be disabled

2017-10-26 Thread Xie Gang
Shall I create the jira directly?

On Thu, Oct 26, 2017 at 12:34 PM, Xie Gang <xiegang...@gmail.com> wrote:

> Hi,
>
> We use HDFS 2.4 & 2.6, and recently hit an issue where the DFSClient domain
> socket is disabled when the datanode throws a block invalid exception.
>
> The block is invalidated for some reason on the datanode, which is fine. Then the
> DFSClient tries to access this block on this datanode via the domain socket. This
> triggers an IOException. On the DFSClient side, when it gets an IOException with
> error code 'ERROR', it disables the domain socket and falls back to TCP. The worst
> part is that it never seems to recover the domain socket.
>
> I think this is a defect: for such a "block invalid" exception, we should not
> disable the domain socket, because there is nothing wrong with the domain socket
> service itself.
>
> Any thoughts?
>
> The code:
>
> [... requestFileDescriptors() code snipped; it is quoted in full in the original
> message above ...]
>
> --
> Xie Gang
>



-- 
Xie Gang


Re: Inconsistency between the datanode volume info and OS df

2018-01-15 Thread Xie Gang
Got the root cause; it's a duplicate of HDFS-8072:

https://issues.apache.org/jira/browse/HDFS-8072
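
For the record, the connection is roughly this (paraphrased from FsVolumeImpl in the
2.x line; field and method names may differ slightly by release): the freeSpace
reported in the VolumeInfo JMX bean comes from the volume's getAvailable(), which
subtracts the in-memory reservedForRbw counter, and HDFS-8072 is about that counter
not being released when a client dies mid-write. A leaked reservation therefore eats
into the reported free space until a DN restart resets the counter, which matches the
behavior we observed.

public long getAvailable() throws IOException {
  // Capacity minus DFS-used minus space still reserved for in-flight (RBW)
  // writes; reservedForRbw only lives in memory, so a restart clears any
  // leaked reservations and the gap disappears.
  long remaining = getCapacity() - getDfsUsed() - reservedForRbw.get();
  long available = usage.getAvailable();  // roughly what the OS df reports
  if (remaining > available) {
    remaining = available;
  }
  return (remaining > 0) ? remaining : 0;
}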

On Wed, Jan 10, 2018 at 2:20 PM, Xie Gang <xiegang...@gmail.com> wrote:

> Hi,
>
> Recently, we hit an issue where there is a difference between the
> freeSpace reported in the datanode volume info and the OS df output:
>
> For example:
> the jmx of the dn shows:
>
>  "VolumeInfo" : 
> "{\"/\":{\"freeSpace\":1445398864500,\"usedSpace\":228138206927,\"reservedSpace\":53687091200}}",
>
> But the df shows:
> /dev/sda 2146676656 253778008 1785508084 13% /...
>
> There is about a 400GB gap, which is regarded as non-DFS used. And the
> strangest thing is that after I restart the DN process, the gap disappears.
> After some days, the gap shows up again.
>
> YARN shares the same server as the DN and has some file cache. Could
> it be related?
>
> The direct cause is that the freeSpace from the DN is quite different from the
> available space from df. After tracking down the code, the freeSpace of the DN
> comes from dirFile.getUsableSpace(). Could it have some problem? Have we hit
> this issue before?
>
> Thanks,
> Gang
>
>
> --
> Xie Gang
>



-- 
Xie Gang


Inconsistency between the datanode volume info and OS df

2018-01-09 Thread Xie Gang
Hi,

Recently, we hit an issue where there is a difference between the freeSpace
reported in the datanode volume info and the OS df output:

For example:
the jmx of the dn shows:

 "VolumeInfo" :
"{\"/\":{\"freeSpace\":1445398864500,\"usedSpace\":228138206927,\"reservedSpace\":53687091200}}",

But the df shows:
/dev/sda 2146676656 253778008 1785508084 13% /...

There is about a 400GB gap, which is regarded as non-DFS used. And the strangest
thing is that after I restart the DN process, the gap disappears. After some days,
the gap shows up again.
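
(A quick sanity check on the units, in case it helps: the df "Available" column is
in 1K blocks, so 1785508084 * 1024 is roughly 1.83 TB available according to the OS,
versus the DN's reported freeSpace of 1445398864500 bytes, roughly 1.45 TB -- which
is about the 400GB gap.)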

YARN shares the same server as the DN and has some file cache. Could it be
related?

The direct cause is that the freeSpace from the DN is quite different from the
available space from df. After tracking down the code, the freeSpace of the DN
comes from dirFile.getUsableSpace(). Could it have some problem? Have we hit this
issue before?

Thanks,
Gang


-- 
Xie Gang


Re: does it make sense to get remaining space by summing all the volumes of the datanode

2018-01-29 Thread Xie Gang
2.4 and 2.6:

public long getRemaining(StorageType t) {
  long remaining = 0;
  for(DatanodeStorageInfo s : getStorageInfos()) {
if (s.getStorageType() == t) {
  remaining += s.getRemaining();
}
  }
  return remaining;
}


On Mon, Jan 29, 2018 at 6:12 PM, Vinayakumar B <vinayakum...@apache.org>
wrote:

> in which version of Hadoop you are seeing this?
>
> -Vinay
>
> On 29 Jan 2018 3:26 pm, "Xie Gang" <xiegang...@gmail.com> wrote:
>
> Hello,
>
> We recently hit an issue where almost all the disks of the datanode got full
> even though we configured du.reserved.
>
> After tracking down the code, we found that when we choose a target datanode
> and check whether it is a good candidate for block allocation (isGoodTarget()),
> we only check the total remaining space of all the volumes (of the same storage
> type), not each volume. This logic makes the per-volume reservation useless.
> Is this a problem, or do I have a misunderstanding?
>
> final long remaining = node.getRemaining(storage.getStorageType());
> if (requiredSize > remaining - scheduledSize) {
>   logNodeIsNotChosen(storage, "the node does not have enough "
>   + storage.getStorageType() + " space"
>   + " (required=" + requiredSize
>   + ", scheduled=" + scheduledSize
>   + ", remaining=" + remaining + ")");
>   stats.incrOverScheduled();
>   return false;
> }
>
>
>
> --
> Xie Gang
>



-- 
Xie Gang


does it make sense to get remaining space by summing all the volumes of the datanode

2018-01-29 Thread Xie Gang
Hello,

We recently hit an issue where almost all the disks of the datanode got full even
though we configured du.reserved.

After tracking down the code, we found that when we choose a target datanode and
check whether it is a good candidate for block allocation (isGoodTarget()), we only
check the total remaining space of all the volumes (of the same storage type), not
each volume. This logic makes the per-volume reservation useless. Is this a problem,
or do I have a misunderstanding?

final long remaining = node.getRemaining(storage.getStorageType());
if (requiredSize > remaining - scheduledSize) {
  logNodeIsNotChosen(storage, "the node does not have enough "
  + storage.getStorageType() + " space"
  + " (required=" + requiredSize
  + ", scheduled=" + scheduledSize
  + ", remaining=" + remaining + ")");
  stats.incrOverScheduled();
  return false;
}
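
To make the concern concrete, here is a hypothetical illustration of the aggregate
check above (the numbers are made up):

// Two volumes of the same storage type on one datanode.
long[] volRemaining = { 10L * 1024 * 1024 * 1024,   // volume A: ~10 GB free
                        100L * 1024 * 1024 };       // volume B: ~100 MB free
long remaining = 0;
for (long r : volRemaining) {
  remaining += r;                        // getRemaining(t) sums across volumes
}
long requiredSize = 128L * 1024 * 1024;  // one default-sized block
long scheduledSize = 0;
// The node-level check passes even though volume B alone could not hold the
// block; whether the write actually lands on volume B then depends on the
// volume choosing policy, not on this check.
boolean goodTarget = requiredSize <= remaining - scheduledSize;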



-- 
Xie Gang


enable SC local reads of UC blocks to optimize read perf

2018-02-01 Thread Xie Gang
Hello,

The current implementation disables short-circuit (SC) local reads of
under-construction (UC) blocks. This is suboptimal for applications whose access
pattern is to mostly read what was written recently. With this access pattern, the
usage of SC reads drops greatly and read performance is impacted.

An idea is to enable SC local reads of UC blocks, provided the application can
ensure that:
1. hsync is always called before each read of the file, and
2. the range of the read does not go beyond what has been written.

I think that if the two conditions above are met, there should be no problem (a
minimal sketch of the assumed pattern is below). Does this make sense? Actually, I
built a prototype and it worked, and I will look into it further. But I am not sure
whether this has been tried before.
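
The writer/reader pattern the proposal assumes looks roughly like this (a
hypothetical sketch; the path and sizes are made up):

FileSystem fs = FileSystem.get(new Configuration());
Path p = new Path("/tmp/uc-read-demo");

FSDataOutputStream out = fs.create(p);
byte[] batch = new byte[4096];
out.write(batch);
out.hsync();                        // condition 1: sync before any read
long safeLength = out.getPos();     // the length the reader may rely on

// The file is still open for write, so its last block is under construction.
FSDataInputStream in = fs.open(p);
byte[] buf = new byte[(int) Math.min(safeLength, 4096)];
in.readFully(0, buf);               // condition 2: stay within safeLength
in.close();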
-- 
Xie Gang


Why is the socket read timeout set to n*socketTimeout in data transfer?

2018-02-04 Thread Xie Gang
Hello,

I find that when we transfer data (DataTransfer), we set the socket read timeout to
targets.length * socketTimeout, rather than socketTimeout + READ_TIMEOUT_EXTENSION *
(targets.length - 1) as we do when setting up the write pipeline.

What's the purpose of this setting?
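
For concreteness (using the defaults I see in HdfsServerConstants, so treat the
numbers as approximate): with socketTimeout = 60s and READ_TIMEOUT_EXTENSION = 5s, a
transfer to 3 targets gets a read timeout of 3 * 60 = 180s here, whereas the
pipeline-style formula would give 60 + 5 * 2 = 70s.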


public void doRun() throws IOException {
  xmitsInProgress.getAndIncrement();
  Socket sock = null;
  DataOutputStream out = null;
  DataInputStream in = null;
  BlockSender blockSender = null;
  final boolean isClient = clientname.length() > 0;

  try {
final String dnAddr = targets[0].getXferAddr(connectToDnViaHostname);
InetSocketAddress curTarget = NetUtils.createSocketAddr(dnAddr);
if (LOG.isDebugEnabled()) {
  LOG.debug("Connecting to datanode " + dnAddr);
}
sock = newSocket();
NetUtils.connect(sock, curTarget, dnConf.socketTimeout);
    sock.setSoTimeout(targets.length * dnConf.socketTimeout);  // <-- the setting in question

long writeTimeout = dnConf.socketWriteTimeout +
HdfsServerConstants.WRITE_TIMEOUT_EXTENSION *
(targets.length-1);

-- 
Xie Gang


Why do we always allocate a shm slot for local reads even if no zero copy is needed?

2018-02-08 Thread Xie Gang
Hello,

It seems that we always allocate a shm slot for a local read, even if we only do
the SC read without zero copy. Can we save this allocation if no zero copy is
needed?

According to my understanding, the shm slot is not used if we don't do zero copy on
a local read. Is that right?

public ShortCircuitReplica(ExtendedBlockId key,
    FileInputStream dataStream, FileInputStream metaStream,
    ShortCircuitCache cache, long creationTimeMs, Slot slot)
    throws IOException {


-- 
Xie Gang


How to access 2 HDFS clusters with different versions in one app

2017-12-28 Thread Xie Gang
Hi,

There is a requirement that we access 2 HDFS clusters with different versions (2.0
and 2.4) from one application. We shade the 2.0 client on the client side. But we
found that, due to the shading (the change of the package path), the
namenode/datanode (without any shading) could not handle the RPC with the unknown
package name.

So, is there any other way to do this?

The rough idea is to change the RPC engine to map the shaded package name back to
the original one, but I am not sure if it could work.


-- 
Xie Gang