[jira] [Created] (HDFS-17508) RBF: MembershipStateStore can overwrite valid records when refreshing the local cache

2024-04-29 Thread Danny Becker (Jira)
Danny Becker created HDFS-17508:
---

 Summary: RBF: MembershipStateStore can overwrite valid records 
when refreshing the local cache
 Key: HDFS-17508
 URL: https://issues.apache.org/jira/browse/HDFS-17508
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: rbf
Affects Versions: 3.3.4
Reporter: Danny Becker


When the MembershipStore refreshes its local cache, it also calls the 
overrideExpiredRecords() method, which writes the EXPIRED state back to the 
state store. Due to a race condition, this logic can overwrite valid records. 
overrideExpiredRecords() should first check whether the record's state is 
already set to "EXPIRED" before it writes the record back to the state store.






[jira] [Commented] (HDFS-17477) IncrementalBlockReport race condition additional edge cases

2024-04-24 Thread Danny Becker (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17840537#comment-17840537
 ] 

Danny Becker commented on HDFS-17477:
-

[~ayushtkn] [~zhanghaobo] I have created a JIRA to address the memory leak 
introduced by this change 
[HDFS-17499|https://issues.apache.org/jira/browse/HDFS-17499]

> IncrementalBlockReport race condition additional edge cases
> ---
>
> Key: HDFS-17477
> URL: https://issues.apache.org/jira/browse/HDFS-17477
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: auto-failover, ha, namenode
>Affects Versions: 3.3.5, 3.3.4, 3.3.6
>Reporter: Danny Becker
>Assignee: Danny Becker
>Priority: Major
>  Labels: pull-request-available
>
> HDFS-17453 fixes a race condition between IncrementalBlockReports (IBR) and 
> the Edit Log Tailer which can cause the Standby NameNode (SNN) to incorrectly 
> mark blocks as corrupt when it transitions to Active. There are a few edge 
> cases that HDFS-17453 does not cover.
> For Example:
> 1. SNN1 loads the edits for b1gs1 and b1gs2.
> 2. DN1 reports b1gs1 to SNN1, so it gets queued for later processing.
> 3. DN1 reports b1gs2 to SNN1 so it gets added to the blocks map.
> 4. SNN1 transitions to Active (ANN1).
> 5. ANN1 processes the pending DN message queue and marks DN1->b1gs1 as 
> corrupt because it was still in the queue.






[jira] [Created] (HDFS-17499) removeQueuedBlock in PendingDataNodeMessages has memory leak

2024-04-24 Thread Danny Becker (Jira)
Danny Becker created HDFS-17499:
---

 Summary: removeQueuedBlock in PendingDataNodeMessages has memory 
leak
 Key: HDFS-17499
 URL: https://issues.apache.org/jira/browse/HDFS-17499
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode
Reporter: Danny Becker


Introduced by HDFS-17477: PendingDataNodeMessages#removeQueuedBlock() creates an 
empty list in queueByBlockId for every incremental block report processed by the 
BlockManager. These empty entries are never removed, so the map grows without 
bound and leaks memory.
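
A hedged sketch of the leak pattern and one possible fix (a simplified stand-in 
for PendingDataNodeMessages, not the actual Hadoop code):
{code:java}
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

class PendingMessages<K, V> {
  private final Map<K, Queue<V>> queueByBlockId = new HashMap<>();

  // Leaky variant: materializes (and keeps) an empty queue even when nothing
  // was ever queued for this block, so the map grows with every IBR.
  void removeQueuedLeaky(K blockId, V message) {
    Queue<V> q = queueByBlockId.computeIfAbsent(blockId, k -> new ArrayDeque<>());
    q.remove(message);
  }

  // Fixed variant: never creates a queue just to remove from it, and drops the
  // mapping once the queue drains, so the map cannot grow without bound.
  void removeQueued(K blockId, V message) {
    Queue<V> q = queueByBlockId.get(blockId);
    if (q == null) {
      return;
    }
    q.remove(message);
    if (q.isEmpty()) {
      queueByBlockId.remove(blockId);
    }
  }
}
{code}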






[jira] [Assigned] (HDFS-17477) IncrementalBlockReport race condition additional edge cases

2024-04-17 Thread Danny Becker (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Becker reassigned HDFS-17477:
---

Assignee: Danny Becker

> IncrementalBlockReport race condition additional edge cases
> ---
>
> Key: HDFS-17477
> URL: https://issues.apache.org/jira/browse/HDFS-17477
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: auto-failover, ha, namenode
>Affects Versions: 3.3.5, 3.3.4, 3.3.6
>Reporter: Danny Becker
>Assignee: Danny Becker
>Priority: Major
>
> HDFS-17453 fixes a race condition between IncrementalBlockReports (IBR) and 
> the Edit Log Tailer which can cause the Standby NameNode (SNN) to incorrectly 
> mark blocks as corrupt when it transitions to Active. There are a few edge 
> cases that HDFS-17453 does not cover.
> For Example:
> 1. SNN1 loads the edits for b1gs1 and b1gs2.
> 2. DN1 reports b1gs1 to SNN1, so it gets queued for later processing.
> 3. DN1 reports b1gs2 to SNN1 so it gets added to the blocks map.
> 4. SNN1 transitions to Active (ANN1).
> 5. ANN1 processes the pending DN message queue and marks DN1->b1gs1 as 
> corrupt because it was still in the queue.






[jira] [Created] (HDFS-17477) IncrementalBlockReport race condition additional edge cases

2024-04-17 Thread Danny Becker (Jira)
Danny Becker created HDFS-17477:
---

 Summary: IncrementalBlockReport race condition additional edge 
cases
 Key: HDFS-17477
 URL: https://issues.apache.org/jira/browse/HDFS-17477
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: auto-failover, ha, namenode
Affects Versions: 3.3.6, 3.3.4, 3.3.5
Reporter: Danny Becker


HDFS-17453 fixes a race condition between IncrementalBlockReports (IBR) and the 
Edit Log Tailer which can cause the Standby NameNode (SNN) to incorrectly mark 
blocks as corrupt when it transitions to Active. There are a few edge cases 
that HDFS-17453 does not cover.

For Example:
1. SNN1 loads the edits for b1gs1 and b1gs2.
2. DN1 reports b1gs1 to SNN1, so it gets queued for later processing.
3. DN1 reports b1gs2 to SNN1 so it gets added to the blocks map.
4. SNN1 transitions to Active (ANN1).
5. ANN1 processes the pending DN message queue and marks DN1->b1gs1 as corrupt 
because it was still in the queue.
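
One way to view the failure in step 5: when the pending queue is drained, a queued 
report whose generation stamp is older than what the same DN/storage has already 
contributed to the blocks map is stale and could be discarded instead of being 
marked corrupt. A hedged sketch with toy types (not the actual 
BlockManager/PendingDataNodeMessages code):
{code:java}
import java.util.ArrayDeque;
import java.util.Queue;

public class StaleQueuedReportDemo {
  record Reported(String storageId, long genStamp) {}

  // A queued report is stale if the same storage already reported a newer genstamp.
  static boolean isStale(Reported queued, long newestGenStampFromSameStorage) {
    return queued.genStamp() < newestGenStampFromSameStorage;
  }

  public static void main(String[] args) {
    Queue<Reported> queuedForB1 = new ArrayDeque<>();
    queuedForB1.add(new Reported("DN1-storage", 1)); // b1gs1, queued in step 2
    long blocksMapGenStamp = 2;                      // b1gs2 applied in step 3

    // Step 5 (after failover): drain the queue, dropping stale entries rather
    // than marking them corrupt.
    while (!queuedForB1.isEmpty()) {
      Reported r = queuedForB1.poll();
      System.out.println((isStale(r, blocksMapGenStamp) ? "dropping stale " : "processing ") + r);
    }
  }
}
{code}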






[jira] [Updated] (HDFS-17453) IncrementalBlockReport can have race condition with Edit Log Tailer

2024-04-04 Thread Danny Becker (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Becker updated HDFS-17453:

Description: 
h2. Summary

There is a race condition between IncrementalBlockReports (IBR) and 
EditLogTailer in Standby NameNode (SNN) which can lead to leaked IBRs and false 
corrupt blocks after HA Failover. The race condition occurs when the SNN loads 
the edit logs before it receives the block reports from DataNode (DN).
h2. Example

In the following example there is a block (b1) with 3 generation stamps (gs1, 
gs2, gs3).
 # SNN1 loads edit logs for b1gs1 and b1gs2.
 # DN1 sends the IBR for b1gs1 to SNN1.
 # SNN1 will determine that the reported block b1gs1 from DN1 is corrupt and it 
will be queued for later. 
[BlockManager.java|https://github.com/apache/hadoop/blob/6ed73896f6e8b4b7c720eff64193cb30b3e77fb2/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java#L3447C1-L3464C6]
{code:java}
    BlockToMarkCorrupt c = checkReplicaCorrupt(
        block, reportedState, storedBlock, ucState, dn);
    if (c != null) {
      if (shouldPostponeBlocksFromFuture) {
        // If the block is an out-of-date generation stamp or state,
        // but we're the standby, we shouldn't treat it as corrupt,
        // but instead just queue it for later processing.
        // Storing the reported block for later processing, as that is what
        // comes from the IBR / FBR and hence what we should use to compare
        // against the memory state.
        // See HDFS-6289 and HDFS-15422 for more context.
        queueReportedBlock(storageInfo, block, reportedState,
            QUEUE_REASON_CORRUPT_STATE);
      } else {
        toCorrupt.add(c);
      }
      return storedBlock;
    } {code}

 # DN1 sends IBR for b1gs2 and b1gs3 to SNN1.
 # SNN1 processes b1gs2 and updates the blocks map.
 # SNN1 queues b1gs3 for later because it determines that b1gs3 is a future 
genstamp.
 # SNN1 loads b1gs3 edit logs and processes the queued reports for b1.
 # SNN1 processes b1gs1 first and puts it back in the queue.
 # SNN1 processes b1gs3 next and updates the blocks map.
 # Later, SNN1 becomes the Active NameNode (ANN) during an HA Failover.
 # SNN1 will catch up to the latest edit logs, then process all queued block 
reports to become the ANN.
 # ANN1 will process b1gs1 and mark it as corrupt.

If the example above happens for every DN which stores b1, then when the HA 
failover happens, b1 will be incorrectly marked as corrupt. This will be fixed 
when the first DN sends a FullBlockReport or an IBR for b1.
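
The sequence above can be reproduced with a toy model (plain Java, not Hadoop 
code) that tracks only the edit-log genstamp, the blocks-map genstamp, and the 
pending queue for b1; the stale gs1 report is what remains to be processed after 
failover:
{code:java}
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class IbrRaceDemo {
  public static void main(String[] args) {
    long editLogGenStamp = 2;            // step 1: edits for b1gs1 and b1gs2 loaded
    long blocksMapGenStamp = 0;          // nothing reported into the blocks map yet
    Deque<Long> pending = new ArrayDeque<>();   // queued reports for b1
    List<Long> corrupt = new ArrayList<>();

    // Steps 2-6: gs1 is out of date -> queued; gs2 is applied; gs3 is from the future -> queued.
    pending.add(1L);
    blocksMapGenStamp = 2;
    pending.add(3L);

    // Steps 7-9: edits for gs3 arrive; the queued reports for b1 are re-processed.
    editLogGenStamp = 3;
    Deque<Long> requeued = new ArrayDeque<>();
    while (!pending.isEmpty()) {
      long gs = pending.poll();
      if (gs > editLogGenStamp) {
        requeued.add(gs);                // still from the future: keep postponing
      } else if (gs >= blocksMapGenStamp) {
        blocksMapGenStamp = gs;          // gs3 updates the blocks map
      } else {
        requeued.add(gs);                // gs1 is stale but, on the standby, is re-queued
      }
    }
    pending = requeued;

    // Steps 10-12: failover; out-of-date queued reports are now treated as corrupt.
    while (!pending.isEmpty()) {
      long gs = pending.poll();
      if (gs < blocksMapGenStamp) {
        corrupt.add(gs);                 // b1gs1 is marked corrupt here
      }
    }
    // prints: blocks map genstamp = 3, marked corrupt = [1]
    System.out.println("blocks map genstamp = " + blocksMapGenStamp + ", marked corrupt = " + corrupt);
  }
}
{code}
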
h2. Logs from Active Cluster

I added the following logs to confirm this issue in an active cluster:
{code:java}
BlockToMarkCorrupt c = checkReplicaCorrupt(
block, reportedState, storedBlock, ucState, dn);
if (c != null) {
  DatanodeStorageInfo storedStorageInfo = storedBlock.findStorageInfo(dn);
  LOG.info("Found corrupt block {} [{}, {}] from DN {}. Stored block {} from DN 
{}",
  block, reportedState.name(), ucState.name(), storageInfo, storedBlock, 
storedStorageInfo);
  if (storageInfo.equals(storedStorageInfo) &&
storedBlock.getGenerationStamp() > block.getGenerationStamp()) {
LOG.info("Stored Block {} from the same DN {} has a newer GenStamp." +
storedBlock, storedStorageInfo);
  }
  if (shouldPostponeBlocksFromFuture) {
// If the block is an out-of-date generation stamp or state,
// but we're the standby, we shouldn't treat it as corrupt,
// but instead just queue it for later processing.
// Storing the reported block for later processing, as that is what
// comes from the IBR / FBR and hence what we should use to compare
// against the memory state.
// See HDFS-6289 and HDFS-15422 for more context.
queueReportedBlock(storageInfo, block, reportedState,
QUEUE_REASON_CORRUPT_STATE);
LOG.info("Queueing the block {} for later processing", block);
  } else {
toCorrupt.add(c);
LOG.info("Marking the block {} as corrupt", block);
  }
  return storedBlock;
} {code}
 

Logs from nn1 (Active):
{code:java}
2024-04-03T03:00:52.524-0700,INFO,[IPC Server handler 6 on default port 443],org.apache.hadoop.hdfs.server.namenode.FSNamesystem,"updatePipeline(blk_66092666802_65700910634, newGS=65700925027, newLength=10485760, newNodes=[[DN1]:10010, [DN2]:10010, [DN3]:10010, client=client1)"
2024-04-03T03:00:52.539-0700,INFO,[IPC Server handler 6 on default port 443],org.apache.hadoop.hdfs.server.namenode.FSNamesystem,"updatePipeline(blk_66092666802_65700910634 => blk_66092666802_65700925027) success"
2024-04-03T03:01:07.413-0700,INFO,[IPC Server handler 6 on default port 443],org.apache.hadoop.hdfs.server.namenode.FSNamesystem,"updatePipeline(blk_66092666802_65700925027, newGS=65700933553, newLength=20971520, newNodes=[[DN1]:10010, [DN2]:10010, [DN3]:10010, client=client1)"

[jira] [Commented] (HDFS-17453) IncrementalBlockReport can have race condition with Edit Log Tailer

2024-04-04 Thread Danny Becker (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17834063#comment-17834063
 ] 

Danny Becker commented on HDFS-17453:
-

These JIRAs mention the same race condition.

> IncrementalBlockReport can have race condition with Edit Log Tailer
> ---
>
> Key: HDFS-17453
> URL: https://issues.apache.org/jira/browse/HDFS-17453
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: auto-failover, ha, hdfs, namenode
>Affects Versions: 3.3.0, 3.3.1, 2.10.2, 3.3.2, 3.3.5, 3.3.4, 3.3.6
>Reporter: Danny Becker
>Assignee: Danny Becker
>Priority: Major
>
> h2. Summary
> There is a race condition between IncrementalBlockReports (IBR) and 
> EditLogTailer in Standby NameNode (SNN) which can lead to leaked IBRs and 
> false corrupt blocks after HA Failover. The race condition occurs when the 
> SNN loads the edit logs before it receives the block reports from DataNode 
> (DN).
> h2. Example
> In the following example there is a block (b1) with 3 generation stamps (gs1, 
> gs2, gs3).
>  # SNN1 loads edit logs for b1gs1 and b1gs2.
>  # DN1 sends the IBR for b1gs1 to SNN1.
>  # SNN1 will determine that the reported block b1gs1 from DN1 is corrupt and 
> it will be queued for later. 
> [BlockManager.java|https://github.com/apache/hadoop/blob/6ed73896f6e8b4b7c720eff64193cb30b3e77fb2/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java#L3447C1-L3464C6]
> {code:java}
>     BlockToMarkCorrupt c = checkReplicaCorrupt(
>         block, reportedState, storedBlock, ucState, dn);
>     if (c != null) {
>       if (shouldPostponeBlocksFromFuture) {
>         // If the block is an out-of-date generation stamp or state,
>         // but we're the standby, we shouldn't treat it as corrupt,
>         // but instead just queue it for later processing.
>         // Storing the reported block for later processing, as that is what
>         // comes from the IBR / FBR and hence what we should use to compare
>         // against the memory state.
>         // See HDFS-6289 and HDFS-15422 for more context.
>         queueReportedBlock(storageInfo, block, reportedState,
>             QUEUE_REASON_CORRUPT_STATE);
>       } else {
>         toCorrupt.add(c);
>       }
>       return storedBlock;
>     } {code}
>  # DN1 sends IBR for b1gs2 and b1gs3 to SNN1.
>  # SNN1 processes b1gs2 and updates the blocks map.
>  # SNN1 queues b1gs3 for later because it determines that b1gs3 is a future 
> genstamp.
>  # SNN1 loads b1gs3 edit logs and processes the queued reports for b1.
>  # SNN1 processes b1gs1 first and puts it back in the queue.
>  # SNN1 processes b1gs3 next and updates the blocks map.
>  # Later, SNN1 becomes the Active NameNode (ANN) during an HA Failover.
>  # SNN1 will catch up to the latest edit logs, then process all queued block 
> reports to become the ANN.
>  # ANN1 will process b1gs1 and mark it as corrupt.
> If the example above happens for every DN which stores b1, then when the HA 
> failover happens, b1 will be incorrectly marked as corrupt. This will be 
> fixed when the first DN sends a FullBlockReport or an IBR for b1.
> h2. Logs from Active Cluster
> I added the following logs to confirm this issue in an active cluster:
> {code:java}
> BlockToMarkCorrupt c = checkReplicaCorrupt(
> block, reportedState, storedBlock, ucState, dn);
> if (c != null) {
>   DatanodeStorageInfo storedStorageInfo = storedBlock.findStorageInfo(dn);
>   LOG.info("Found corrupt block {} [{}, {}] from DN {}. Stored block {} from 
> DN {}",
>   block, reportedState.name(), ucState.name(), storageInfo, storedBlock, 
> storedStorageInfo);
>   if (storageInfo.equals(storedStorageInfo) &&
> storedBlock.getGenerationStamp() > block.getGenerationStamp()) {
> LOG.info("Stored Block {} from the same DN {} has a newer GenStamp." +
> storedBlock, storedStorageInfo);
>   }
>   if (shouldPostponeBlocksFromFuture) {
> // If the block is an out-of-date generation stamp or state,
> // but we're the standby, we shouldn't treat it as corrupt,
> // but instead just queue it for later processing.
> // Storing the reported block for later processing, as that is what
> // comes from the IBR / FBR and hence what we should use to compare
> // against the memory state.
> // See HDFS-6289 and HDFS-15422 for more context.
> queueReportedBlock(storageInfo, block, reportedState,
> QUEUE_REASON_CORRUPT_STATE);
> LOG.info("Queueing the block {} for later processing", block);
>   } else {
> toCorrupt.add(c);
> LOG.info("Marking the block {} as corrupt", block);
>   }
>   return storedBlock;
> } {code}
>  
> Logs from nn1 (Active):
> {code:java}
> 

[jira] [Updated] (HDFS-17453) IncrementalBlockReport can have race condition with Edit Log Tailer

2024-04-04 Thread Danny Becker (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Becker updated HDFS-17453:

Description: 
h2. Summary

There is a race condition between IncrementalBlockReports (IBR) and 
EditLogTailer in Standby NameNode (SNN) which can lead to leaked IBRs and false 
corrupt blocks after HA Failover. The race condition occurs when the SNN loads 
the edit logs before it receives the block reports from DataNode (DN).
h2. Example

In the following example there is a block (b1) with 3 generation stamps (gs1, 
gs2, gs3).
 # SNN1 loads edit logs for b1gs1 and b1gs2.
 # DN1 sends the IBR for b1gs1 to SNN1.
 # SNN1 will determine that the reported block b1gs1 from DN1 is corrupt and it 
will be queued for later. 
[BlockManager.java|https://github.com/apache/hadoop/blob/6ed73896f6e8b4b7c720eff64193cb30b3e77fb2/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java#L3447C1-L3464C6]
{code:java}
    BlockToMarkCorrupt c = checkReplicaCorrupt(
        block, reportedState, storedBlock, ucState, dn);
    if (c != null) {
      if (shouldPostponeBlocksFromFuture) {
        // If the block is an out-of-date generation stamp or state,
        // but we're the standby, we shouldn't treat it as corrupt,
        // but instead just queue it for later processing.
        // Storing the reported block for later processing, as that is what
        // comes from the IBR / FBR and hence what we should use to compare
        // against the memory state.
        // See HDFS-6289 and HDFS-15422 for more context.
        queueReportedBlock(storageInfo, block, reportedState,
            QUEUE_REASON_CORRUPT_STATE);
      } else {
        toCorrupt.add(c);
      }
      return storedBlock;
    } {code}

 # DN1 sends IBR for b1gs2 and b1gs3 to SNN1.
 # SNN1 processes b1gs2 and updates the blocks map.
 # SNN1 queues b1gs3 for later because it determines that b1gs3 is a future 
genstamp.
 # SNN1 loads b1gs3 edit logs and processes the queued reports for b1.
 # SNN1 processes b1gs1 first and puts it back in the queue.
 # SNN1 processes b1gs3 next and updates the blocks map.
 # Later, SNN1 becomes the Active NameNode (ANN) during an HA Failover.
 # SNN1 will catch up to the latest edit logs, then process all queued block 
reports to become the ANN.
 # ANN1 will process b1gs1 and mark it as corrupt.

If the example above happens for every DN which stores b1, then when the HA 
failover happens, b1 will be incorrectly marked as corrupt. This will be fixed 
when the first DN sends a FullBlockReport or an IBR for b1.
h2. Logs from Active Cluster

I added the following logs to confirm this issue in an active cluster:
{code:java}
BlockToMarkCorrupt c = checkReplicaCorrupt(
block, reportedState, storedBlock, ucState, dn);
if (c != null) {
  DatanodeStorageInfo storedStorageInfo = storedBlock.findStorageInfo(dn);
  LOG.info("Found corrupt block {} [{}, {}] from DN {}. Stored block {} from DN 
{}",
  block, reportedState.name(), ucState.name(), storageInfo, storedBlock, 
storedStorageInfo);
  if (storageInfo.equals(storedStorageInfo) &&
storedBlock.getGenerationStamp() > block.getGenerationStamp()) {
LOG.info("Stored Block {} from the same DN {} has a newer GenStamp." +
storedBlock, storedStorageInfo);
  }
  if (shouldPostponeBlocksFromFuture) {
// If the block is an out-of-date generation stamp or state,
// but we're the standby, we shouldn't treat it as corrupt,
// but instead just queue it for later processing.
// Storing the reported block for later processing, as that is what
// comes from the IBR / FBR and hence what we should use to compare
// against the memory state.
// See HDFS-6289 and HDFS-15422 for more context.
queueReportedBlock(storageInfo, block, reportedState,
QUEUE_REASON_CORRUPT_STATE);
LOG.info("Queueing the block {} for later processing", block);
  } else {
toCorrupt.add(c);
LOG.info("Marking the block {} as corrupt", block);
  }
  return storedBlock;
} {code}
 

Logs from nn1 (Active):
{code:java}
2024-04-03T03:00:52.524-0700,INFO,[IPC Server handler 6 on default port 443],org.apache.hadoop.hdfs.server.namenode.FSNamesystem,"updatePipeline(blk_66092666802_65700910634, newGS=65700925027, newLength=10485760, newNodes=[[DN1]:10010, [DN2]:10010, [DN3]:10010, client=client1)"
2024-04-03T03:00:52.539-0700,INFO,[IPC Server handler 6 on default port 443],org.apache.hadoop.hdfs.server.namenode.FSNamesystem,"updatePipeline(blk_66092666802_65700910634 => blk_66092666802_65700925027) success"
2024-04-03T03:01:07.413-0700,INFO,[IPC Server handler 6 on default port 443],org.apache.hadoop.hdfs.server.namenode.FSNamesystem,"updatePipeline(blk_66092666802_65700925027, newGS=65700933553, newLength=20971520, newNodes=[[DN1]:10010, [DN2]:10010, [DN3]:10010, client=client1)"

[jira] [Updated] (HDFS-17453) IncrementalBlockReport can have race condition with Edit Log Tailer

2024-04-04 Thread Danny Becker (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Becker updated HDFS-17453:

Description: 
h2. Summary

There is a race condition between IncrementalBlockReports (IBR) and 
EditLogTailer in Standby NameNode (SNN) which can lead to leaked IBRs and false 
corrupt blocks after HA Failover. The race condition occurs when the SNN loads 
the edit logs before it receives the block reports from DataNode (DN).
h2. Example

In the following example there is a block (b1) with 3 generation stamps (gs1, 
gs2, gs3).
 # SNN1 loads edit logs for b1gs1 and b1gs2.
 # DN1 sends the IBR for b1gs1 to SNN1.
 # SNN1 will determine that the reported block b1gs1 from DN1 is corrupt and it 
will be queued for later. 
[BlockManager.java|https://github.com/apache/hadoop/blob/6ed73896f6e8b4b7c720eff64193cb30b3e77fb2/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java#L3447C1-L3464C6]
{code:java}
    BlockToMarkCorrupt c = checkReplicaCorrupt(
        block, reportedState, storedBlock, ucState, dn);
    if (c != null) {
      if (shouldPostponeBlocksFromFuture) {
        // If the block is an out-of-date generation stamp or state,
        // but we're the standby, we shouldn't treat it as corrupt,
        // but instead just queue it for later processing.
        // Storing the reported block for later processing, as that is what
        // comes from the IBR / FBR and hence what we should use to compare
        // against the memory state.
        // See HDFS-6289 and HDFS-15422 for more context.
        queueReportedBlock(storageInfo, block, reportedState,
            QUEUE_REASON_CORRUPT_STATE);
      } else {
        toCorrupt.add(c);
      }
      return storedBlock;
    } {code}

 # DN1 sends IBR for b1gs2 and b1gs3 to SNN1.
 # SNN1 processes b1gs2 and updates the blocks map.
 # SNN1 queues b1gs3 for later because it determines that b1gs3 is a future 
genstamp.
 # SNN1 loads b1gs3 edit logs and processes the queued reports for b1.
 # SNN1 processes b1gs1 first and puts it back in the queue.
 # SNN1 processes b1gs3 next and updates the blocks map.
 # Later, SNN1 becomes the Active NameNode (ANN) during an HA Failover.
 # SNN1 will catch up to the latest edit logs, then process all queued block 
reports to become the ANN.
 # ANN1 will process b1gs1 and mark it as corrupt.

If the example above happens for every DN which stores b1, then when the HA 
failover happens, b1 will be incorrectly marked as corrupt. This will be fixed 
when the first DN sends a FullBlockReport or an IBR for b1.
h2. Logs from Active Cluster

I added the following logs to confirm this issue in an active cluster:
{code:java}
BlockToMarkCorrupt c = checkReplicaCorrupt(
block, reportedState, storedBlock, ucState, dn);
if (c != null) {
  DatanodeStorageInfo storedStorageInfo = storedBlock.findStorageInfo(dn);
  LOG.info("Found corrupt block {} [{}, {}] from DN {}. Stored block {} from DN 
{}",
  block, reportedState.name(), ucState.name(), storageInfo, storedBlock, 
storedStorageInfo);
  if (storageInfo.equals(storedStorageInfo) &&
storedBlock.getGenerationStamp() > block.getGenerationStamp()) {
LOG.info("Stored Block {} from the same DN {} has a newer GenStamp." +
storedBlock, storedStorageInfo);
  }
  if (shouldPostponeBlocksFromFuture) {
// If the block is an out-of-date generation stamp or state,
// but we're the standby, we shouldn't treat it as corrupt,
// but instead just queue it for later processing.
// Storing the reported block for later processing, as that is what
// comes from the IBR / FBR and hence what we should use to compare
// against the memory state.
// See HDFS-6289 and HDFS-15422 for more context.
queueReportedBlock(storageInfo, block, reportedState,
QUEUE_REASON_CORRUPT_STATE);
LOG.info("Queueing the block {} for later processing", block);
  } else {
toCorrupt.add(c);
LOG.info("Marking the block {} as corrupt", block);
  }
  return storedBlock;
} {code}
Logs from nn1 (Active):

 
{code:java}
2024-04-03T03:00:52.524-0700,INFO,[IPC Server handler 6 on default port 443],org.apache.hadoop.hdfs.server.namenode.FSNamesystem,"updatePipeline(blk_66092666802_65700910634, newGS=65700925027, newLength=10485760, newNodes=[[DN1]:10010, [DN2]:10010, [DN3]:10010, client=client1)"
2024-04-03T03:00:52.539-0700,INFO,[IPC Server handler 6 on default port 443],org.apache.hadoop.hdfs.server.namenode.FSNamesystem,"updatePipeline(blk_66092666802_65700910634 => blk_66092666802_65700925027) success"
2024-04-03T03:01:07.413-0700,INFO,[IPC Server handler 6 on default port 443],org.apache.hadoop.hdfs.server.namenode.FSNamesystem,"updatePipeline(blk_66092666802_65700925027, newGS=65700933553, newLength=20971520, newNodes=[[DN1]:10010, [DN2]:10010, [DN3]:10010, client=client1)"

[jira] [Created] (HDFS-17453) IncrementalBlockReport can have race condition with Edit Log Tailer

2024-04-04 Thread Danny Becker (Jira)
Danny Becker created HDFS-17453:
---

 Summary: IncrementalBlockReport can have race condition with Edit 
Log Tailer
 Key: HDFS-17453
 URL: https://issues.apache.org/jira/browse/HDFS-17453
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: auto-failover, ha, hdfs, namenode
Affects Versions: 3.3.6, 3.3.4, 3.3.5, 3.3.2, 2.10.2, 3.3.1, 3.3.0
Reporter: Danny Becker
Assignee: Danny Becker


h2. Summary

There is a race condition between IncrementalBlockReports (IBR) and 
EditLogTailer in Standby NameNode (SNN) which can lead to leaked IBRs and false 
corrupt blocks after HA Failover. The race condition occurs when the SNN loads 
the edit logs before it receives the block reports from DataNode (DN).
h2. Example

In the following example there is a block (b1) with 3 generation stamps (gs1, 
gs2, gs3).
 # SNN1 loads edit logs for b1gs1 and b1gs2.
 # DN1 sends the IBR for b1gs1 to SNN1.
 # SNN1 will determine that the reported block b1gs1 from DN1 is corrupt and it 
will be queued for later. 
[BlockManager.java|https://github.com/apache/hadoop/blob/6ed73896f6e8b4b7c720eff64193cb30b3e77fb2/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java#L3447C1-L3464C6]
{code:java}
    BlockToMarkCorrupt c = checkReplicaCorrupt(
        block, reportedState, storedBlock, ucState, dn);
    if (c != null) {
      if (shouldPostponeBlocksFromFuture) {
        // If the block is an out-of-date generation stamp or state,
        // but we're the standby, we shouldn't treat it as corrupt,
        // but instead just queue it for later processing.
        // Storing the reported block for later processing, as that is what
        // comes from the IBR / FBR and hence what we should use to compare
        // against the memory state.
        // See HDFS-6289 and HDFS-15422 for more context.
        queueReportedBlock(storageInfo, block, reportedState,
            QUEUE_REASON_CORRUPT_STATE);
      } else {
        toCorrupt.add(c);
      }
      return storedBlock;
    } {code}

 # DN1 sends IBR for b1gs2 and b1gs3 to SNN1.
 # SNN1 processes b1gs2 and updates the blocks map.
 # SNN1 queues b1gs3 for later because it determines that b1gs3 is a future 
genstamp.
 # SNN1 loads b1gs3 edit logs and processes the queued reports for b1.
 # SNN1 processes b1gs1 first and puts it back in the queue.
 # SNN1 processes b1gs3 next and updates the blocks map.
 # Later, SNN1 becomes the Active NameNode (ANN) during an HA Failover.
 # SNN1 will catch up to the latest edit logs, then process all queued block 
reports to become the ANN.
 # ANN1 will process b1gs1 and mark it as corrupt.

If the example above happens for every DN which stores b1, then when the HA 
failover happens, b1 will be incorrectly marked as corrupt. This will be fixed 
when the first DN sends a FullBlockReport or an IBR for b1.
h2. Logs from Active Cluster

I added the following logs to confirm this issue in an active cluster:

 
{code:java}
BlockToMarkCorrupt c = checkReplicaCorrupt(
block, reportedState, storedBlock, ucState, dn);
if (c != null) {
  DatanodeStorageInfo storedStorageInfo = storedBlock.findStorageInfo(dn);
  LOG.info("Found corrupt block {} [{}, {}] from DN {}. Stored block {} from DN 
{}",
  block, reportedState.name(), ucState.name(), storageInfo, storedBlock, 
storedStorageInfo);
  if (storageInfo.equals(storedStorageInfo) &&
storedBlock.getGenerationStamp() > block.getGenerationStamp()) {
LOG.info("Stored Block {} from the same DN {} has a newer GenStamp." +
storedBlock, storedStorageInfo);
  }
  if (shouldPostponeBlocksFromFuture) {
// If the block is an out-of-date generation stamp or state,
// but we're the standby, we shouldn't treat it as corrupt,
// but instead just queue it for later processing.
// Storing the reported block for later processing, as that is what
// comes from the IBR / FBR and hence what we should use to compare
// against the memory state.
// See HDFS-6289 and HDFS-15422 for more context.
queueReportedBlock(storageInfo, block, reportedState,
QUEUE_REASON_CORRUPT_STATE);
LOG.info("Queueing the block {} for later processing", block);
  } else {
toCorrupt.add(c);
LOG.info("Marking the block {} as corrupt", block);
  }
  return storedBlock;
} {code}
Logs from nn1 (Active):

 
{code:java}
2024-04-03T03:00:52.524-0700,INFO,[IPC Server handler 6 on default port 443],org.apache.hadoop.hdfs.server.namenode.FSNamesystem,"updatePipeline(blk_66092666802_65700910634, newGS=65700925027, newLength=10485760, newNodes=[[DN1]:10010, [DN2]:10010, [DN3]:10010, client=client1)"
2024-04-03T03:00:52.539-0700,INFO,[IPC Server handler 6 on default port 443],org.apache.hadoop.hdfs.server.namenode.FSNamesystem,"updatePipeline(blk_66092666802_65700910634 => blk_66092666802_65700925027) 

[jira] [Updated] (HDFS-17167) Observer NameNode -observer startup option conflicts with -rollingUpgrade startup option

2023-09-04 Thread Danny Becker (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Becker updated HDFS-17167:

Attachment: (was: HDFS-17167.001.patch)

> Observer NameNode -observer startup option conflicts with -rollingUpgrade 
> startup option
> 
>
> Key: HDFS-17167
> URL: https://issues.apache.org/jira/browse/HDFS-17167
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.3.4
>Reporter: Danny Becker
>Assignee: Danny Becker
>Priority: Minor
> Fix For: 3.4.0
>
>
> NameNode currently uses a "StartupOption" to decide whether to start a node 
> as an Observer NameNode. This causes an issue during a rolling upgrade 
> because "rollingUpgrade" is also a StartupOption and NameNode will only allow 
> 1 startup option, choosing the last startup option in the list. Observer in 
> our environment starts with the following startup options: 
> ["-rollingUpgrade", "started", "-observer"]. This means that the rolling 
> upgrade gets ignored which causes Observer to have an issue when an actual 
> rolling upgrade is ongoing:
> {code:java}
> 2023-08-23T14:59:03.486-0700,WARN,[main],org.apache.hadoop.hdfs.server.namenode.FSNamesystem,"Encountered
>  exception loading fsimage
> java.io.IOException: 
> File system image contains an old layout version -63.
> An upgrade to version -66 is required.
> Please restart NameNode with the \'-rollingUpgrade started\' option if a 
> rolling upgrade is already started; or restart NameNode with the \'-upgrade\' 
> option to start a new upgrade.
> at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:271)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1116)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:724)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:681)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:763)
> at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:1013)
> at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:992)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1743)
> at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1811)
> "{code}






[jira] [Updated] (HDFS-17167) Observer NameNode -observer startup option conflicts with -rollingUpgrade startup option

2023-09-04 Thread Danny Becker (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Becker updated HDFS-17167:

Fix Version/s: 3.4.0

> Observer NameNode -observer startup option conflicts with -rollingUpgrade 
> startup option
> 
>
> Key: HDFS-17167
> URL: https://issues.apache.org/jira/browse/HDFS-17167
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.3.4
>Reporter: Danny Becker
>Assignee: Danny Becker
>Priority: Minor
> Fix For: 3.4.0
>
>
> NameNode currently uses a "StartupOption" to decide whether to start a node 
> as an Observer NameNode. This causes an issue during a rolling upgrade 
> because "rollingUpgrade" is also a StartupOption and NameNode will only allow 
> 1 startup option, choosing the last startup option in the list. Observer in 
> our environment starts with the following startup options: 
> ["-rollingUpgrade", "started", "-observer"]. This means that the rolling 
> upgrade gets ignored which causes Observer to have an issue when an actual 
> rolling upgrade is ongoing:
> {code:java}
> 2023-08-23T14:59:03.486-0700,WARN,[main],org.apache.hadoop.hdfs.server.namenode.FSNamesystem,"Encountered
>  exception loading fsimage
> java.io.IOException: 
> File system image contains an old layout version -63.
> An upgrade to version -66 is required.
> Please restart NameNode with the \'-rollingUpgrade started\' option if a 
> rolling upgrade is already started; or restart NameNode with the \'-upgrade\' 
> option to start a new upgrade.
> at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:271)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1116)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:724)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:681)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:763)
> at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:1013)
> at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:992)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1743)
> at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1811)
> "{code}






[jira] [Updated] (HDFS-17178) BootstrapStandby needs to handle RollingUpgrade

2023-09-04 Thread Danny Becker (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Becker updated HDFS-17178:

Fix Version/s: 3.4.0
   (was: 3.3.6)

> BootstrapStandby needs to handle RollingUpgrade 
> 
>
> Key: HDFS-17178
> URL: https://issues.apache.org/jira/browse/HDFS-17178
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: Danny Becker
>Assignee: Danny Becker
>Priority: Minor
> Fix For: 3.4.0
>
>
> During rollingUpgrade, bootstrapStandby will fail with an exception due to 
> different NameNodeLayoutVersions. We can ignore this safely during 
> RollingUpgrade because different NameNodeLayoutVersions are expected.
>  * NameNodes will not be able to recover with BootstrapStandby if they go 
> through destructive repair before the rollingUpgrade has been finalized.
> Error during BootstrapStandby before change:
> {code:java}
> =
> About to bootstrap Standby ID nn2 from:
>Nameservice ID: MTPrime-MWHE01-0
> Other Namenode ID: nn1
>   Other NN's HTTP address: https://MWHEEEAP002D9A2:81
>   Other NN's IPC  address: MWHEEEAP002D9A2.ap.gbl/10.59.208.18:8020
>  Namespace ID: 895912530
> Block pool ID: BP-1556042256-10.99.154.61-1663325602669
>Cluster ID: MWHE01
>Layout version: -64
>isUpgradeFinalized: true
> =
> 2023-08-28T19:35:06,940 ERROR [main] namenode.NameNode: Failed to start 
> namenode.
> java.io.IOException: java.lang.RuntimeException: 
> org.apache.hadoop.hdfs.server.namenode.TransferFsImage$HttpGetFailedException:
>  Image transfer servlet at 
> https://MWHEEEAP002D9A2:81/imagetransfer?getimage=1=25683470=-64:895912530:1663325602669:MWHE01=true
>  failed with status code 403
> Response message:
> This namenode has storage info -63:895912530:1663325602669:MWHE01 but the 
> secondary expected -64:895912530:1663325602669:MWHE01
> at 
> org.apache.hadoop.hdfs.server.namenode.ha.BootstrapStandby.run(BootstrapStandby.java:583)
>  ~[hadoop-hdfs-2.9.2-MT-SNAPSHOT.jar:?]
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1717)
>  ~[hadoop-hdfs-2.9.2-MT-SNAPSHOT.jar:?]
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1819) 
> [hadoop-hdfs-2.9.2-MT-SNAPSHOT.jar:?]
> Caused by: java.lang.RuntimeException: 
> org.apache.hadoop.hdfs.server.namenode.TransferFsImage$HttpGetFailedException:
>  Image transfer servlet at https://MWHEEEAP002D9A2:81{code}
> This happens because the namespaceInfo sent from the proxy node does not 
> include the effective layout version, which causes BootstrapStandby to send a 
> request with a storageinfo param using the service layout version. This 
> causes the proxy node to refuse the request, because it compares the 
> storageinfo param against its storage info, which uses the effective layout 
> version, not the service layout version. 
> To fix this we can modify the proxy.versionRequest() call stack to set the 
> layout version using the effective layout version on the proxy node. We can 
> then add logic to BootstrapStandby to properly handle the case where the 
> proxy node is in rolling upgrade.
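
A hedged sketch of the tolerance check described above; the helper below is 
hypothetical and self-contained, not the actual BootstrapStandby patch:
{code:java}
// Hypothetical illustration: during a rolling upgrade the remote NameNode advertises
// its effective (pre-upgrade) layout version, so a mismatch with the local software
// layout version is expected and should not abort bootstrapStandby.
final class LayoutVersionCheck {
  static boolean canBootstrap(int localSoftwareLayoutVersion,
                              int remoteEffectiveLayoutVersion,
                              boolean rollingUpgradeInProgress) {
    if (localSoftwareLayoutVersion == remoteEffectiveLayoutVersion) {
      return true;                          // normal case: layout versions agree
    }
    return rollingUpgradeInProgress;        // tolerate the mismatch mid-upgrade
  }

  public static void main(String[] args) {
    // Mirrors the failure in the log above: local build expects -64, the remote
    // node still reports -63 while the rolling upgrade has not been finalized.
    System.out.println(canBootstrap(-64, -63, true));   // true  -> proceed
    System.out.println(canBootstrap(-64, -63, false));  // false -> fail as today
  }
}
{code}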






[jira] [Updated] (HDFS-17178) BootstrapStandby needs to handle RollingUpgrade

2023-09-04 Thread Danny Becker (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Becker updated HDFS-17178:

Target Version/s: 3.3.9  (was: 3.3.4)

> BootstrapStandby needs to handle RollingUpgrade 
> 
>
> Key: HDFS-17178
> URL: https://issues.apache.org/jira/browse/HDFS-17178
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: Danny Becker
>Assignee: Danny Becker
>Priority: Minor
> Fix For: 3.3.6
>
>
> During rollingUpgrade, bootstrapStandby will fail with an exception due to 
> different NameNodeLayoutVersions. We can ignore this safely during 
> RollingUpgrade because different NameNodeLayoutVersions are expected.
>  * NameNodes will not be able to recover with BootstrapStandby if they go 
> through destructive repair before the rollingUpgrade has been finalized.
> Error during BootstrapStandby before change:
> {code:java}
> =
> About to bootstrap Standby ID nn2 from:
>Nameservice ID: MTPrime-MWHE01-0
> Other Namenode ID: nn1
>   Other NN's HTTP address: https://MWHEEEAP002D9A2:81
>   Other NN's IPC  address: MWHEEEAP002D9A2.ap.gbl/10.59.208.18:8020
>  Namespace ID: 895912530
> Block pool ID: BP-1556042256-10.99.154.61-1663325602669
>Cluster ID: MWHE01
>Layout version: -64
>isUpgradeFinalized: true
> =
> 2023-08-28T19:35:06,940 ERROR [main] namenode.NameNode: Failed to start 
> namenode.
> java.io.IOException: java.lang.RuntimeException: 
> org.apache.hadoop.hdfs.server.namenode.TransferFsImage$HttpGetFailedException:
>  Image transfer servlet at 
> https://MWHEEEAP002D9A2:81/imagetransfer?getimage=1=25683470=-64:895912530:1663325602669:MWHE01=true
>  failed with status code 403
> Response message:
> This namenode has storage info -63:895912530:1663325602669:MWHE01 but the 
> secondary expected -64:895912530:1663325602669:MWHE01
> at 
> org.apache.hadoop.hdfs.server.namenode.ha.BootstrapStandby.run(BootstrapStandby.java:583)
>  ~[hadoop-hdfs-2.9.2-MT-SNAPSHOT.jar:?]
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1717)
>  ~[hadoop-hdfs-2.9.2-MT-SNAPSHOT.jar:?]
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1819) 
> [hadoop-hdfs-2.9.2-MT-SNAPSHOT.jar:?]
> Caused by: java.lang.RuntimeException: 
> org.apache.hadoop.hdfs.server.namenode.TransferFsImage$HttpGetFailedException:
>  Image transfer servlet at https://MWHEEEAP002D9A2:81{code}
> This happens because the namespaceInfo sent from the proxy node does not 
> include the effective layout version, which causes BootstrapStandby to send a 
> request with a storageinfo param using the service layout version. This 
> causes the proxy node to refuse the request, because it compares the 
> storageinfo param against its storage info, which uses the effective layout 
> version, not the service layout version. 
> To fix this we can modify the proxy.versionRequest() call stack to set the 
> layout version using the effective layout version on the proxy node. We can 
> then add logic to BootstrapStandby to properly handle the case where the 
> proxy node is in rolling upgrade.






[jira] [Updated] (HDFS-17167) Observer NameNode -observer startup option conflicts with -rollingUpgrade startup option

2023-09-04 Thread Danny Becker (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Becker updated HDFS-17167:

Target Version/s: 3.4.0  (was: 3.3.4)

> Observer NameNode -observer startup option conflicts with -rollingUpgrade 
> startup option
> 
>
> Key: HDFS-17167
> URL: https://issues.apache.org/jira/browse/HDFS-17167
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.3.4
>Reporter: Danny Becker
>Assignee: Danny Becker
>Priority: Minor
> Attachments: HDFS-17167.001.patch
>
>
> NameNode currently uses a "StartupOption" to decide whether to start a node 
> as an Observer NameNode. This causes an issue during a rolling upgrade 
> because "rollingUpgrade" is also a StartupOption and NameNode will only allow 
> 1 startup option, choosing the last startup option in the list. Observer in 
> our environment starts with the following startup options: 
> ["-rollingUpgrade", "started", "-observer"]. This means that the rolling 
> upgrade gets ignored which causes Observer to have an issue when an actual 
> rolling upgrade is ongoing:
> {code:java}
> 2023-08-23T14:59:03.486-0700,WARN,[main],org.apache.hadoop.hdfs.server.namenode.FSNamesystem,"Encountered
>  exception loading fsimage
> java.io.IOException: 
> File system image contains an old layout version -63.
> An upgrade to version -66 is required.
> Please restart NameNode with the \'-rollingUpgrade started\' option if a 
> rolling upgrade is already started; or restart NameNode with the \'-upgrade\' 
> option to start a new upgrade.
> at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:271)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1116)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:724)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:681)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:763)
> at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:1013)
> at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:992)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1743)
> at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1811)
> "{code}






[jira] [Updated] (HDFS-17178) BootstrapStandby needs to handle RollingUpgrade

2023-09-04 Thread Danny Becker (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Becker updated HDFS-17178:

Target Version/s: 3.4.0  (was: 3.3.9)

> BootstrapStandby needs to handle RollingUpgrade 
> 
>
> Key: HDFS-17178
> URL: https://issues.apache.org/jira/browse/HDFS-17178
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: Danny Becker
>Assignee: Danny Becker
>Priority: Minor
> Fix For: 3.3.6
>
>
> During rollingUpgrade, bootstrapStandby will fail with an exception due to 
> different NameNodeLayoutVersions. We can ignore this safely during 
> RollingUpgrade because different NameNodeLayoutVersions are expected.
>  * NameNodes will not be able to recover with BootstrapStandby if they go 
> through destructive repair before the rollingUpgrade has been finalized.
> Error during BootstrapStandby before change:
> {code:java}
> =
> About to bootstrap Standby ID nn2 from:
>Nameservice ID: MTPrime-MWHE01-0
> Other Namenode ID: nn1
>   Other NN's HTTP address: https://MWHEEEAP002D9A2:81
>   Other NN's IPC  address: MWHEEEAP002D9A2.ap.gbl/10.59.208.18:8020
>  Namespace ID: 895912530
> Block pool ID: BP-1556042256-10.99.154.61-1663325602669
>Cluster ID: MWHE01
>Layout version: -64
>isUpgradeFinalized: true
> =
> 2023-08-28T19:35:06,940 ERROR [main] namenode.NameNode: Failed to start 
> namenode.
> java.io.IOException: java.lang.RuntimeException: 
> org.apache.hadoop.hdfs.server.namenode.TransferFsImage$HttpGetFailedException:
>  Image transfer servlet at 
> https://MWHEEEAP002D9A2:81/imagetransfer?getimage=1=25683470=-64:895912530:1663325602669:MWHE01=true
>  failed with status code 403
> Response message:
> This namenode has storage info -63:895912530:1663325602669:MWHE01 but the 
> secondary expected -64:895912530:1663325602669:MWHE01
> at 
> org.apache.hadoop.hdfs.server.namenode.ha.BootstrapStandby.run(BootstrapStandby.java:583)
>  ~[hadoop-hdfs-2.9.2-MT-SNAPSHOT.jar:?]
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1717)
>  ~[hadoop-hdfs-2.9.2-MT-SNAPSHOT.jar:?]
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1819) 
> [hadoop-hdfs-2.9.2-MT-SNAPSHOT.jar:?]
> Caused by: java.lang.RuntimeException: 
> org.apache.hadoop.hdfs.server.namenode.TransferFsImage$HttpGetFailedException:
>  Image transfer servlet at https://MWHEEEAP002D9A2:81{code}
> This happens because the namespaceInfo sent from the proxy node does not 
> include the effective layout version, which causes BootstrapStandby to send a 
> request with a storageinfo param using the service layout version. This 
> causes the proxy node to refuse the request, because it compares the 
> storageinfo param against its storage info, which uses the effective layout 
> version, not the service layout version. 
> To fix this we can modify the proxy.versionRequest() call stack to set the 
> layout version using the effective layout version on the proxy node. We can 
> then add logic to BootstrapStandby to properly handle the case where the 
> proxy node is in rolling upgrade.






[jira] [Updated] (HDFS-17178) BootstrapStandby needs to handle RollingUpgrade

2023-09-04 Thread Danny Becker (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Becker updated HDFS-17178:

Fix Version/s: 3.3.6
   (was: 3.3.4)

> BootstrapStandby needs to handle RollingUpgrade 
> 
>
> Key: HDFS-17178
> URL: https://issues.apache.org/jira/browse/HDFS-17178
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: Danny Becker
>Assignee: Danny Becker
>Priority: Minor
> Fix For: 3.3.6
>
>
> During rollingUpgrade, bootstrapStandby will fail with an exception due to 
> different NameNodeLayoutVersions. We can ignore this safely during 
> RollingUpgrade because different NameNodeLayoutVersions are expected.
>  * NameNodes will not be able to recover with BootstrapStandby if they go 
> through destructive repair before the rollingUpgrade has been finalized.
> Error during BootstrapStandby before change:
> {code:java}
> =
> About to bootstrap Standby ID nn2 from:
>Nameservice ID: MTPrime-MWHE01-0
> Other Namenode ID: nn1
>   Other NN's HTTP address: https://MWHEEEAP002D9A2:81
>   Other NN's IPC  address: MWHEEEAP002D9A2.ap.gbl/10.59.208.18:8020
>  Namespace ID: 895912530
> Block pool ID: BP-1556042256-10.99.154.61-1663325602669
>Cluster ID: MWHE01
>Layout version: -64
>isUpgradeFinalized: true
> =
> 2023-08-28T19:35:06,940 ERROR [main] namenode.NameNode: Failed to start 
> namenode.
> java.io.IOException: java.lang.RuntimeException: 
> org.apache.hadoop.hdfs.server.namenode.TransferFsImage$HttpGetFailedException:
>  Image transfer servlet at 
> https://MWHEEEAP002D9A2:81/imagetransfer?getimage=1=25683470=-64:895912530:1663325602669:MWHE01=true
>  failed with status code 403
> Response message:
> This namenode has storage info -63:895912530:1663325602669:MWHE01 but the 
> secondary expected -64:895912530:1663325602669:MWHE01
> at 
> org.apache.hadoop.hdfs.server.namenode.ha.BootstrapStandby.run(BootstrapStandby.java:583)
>  ~[hadoop-hdfs-2.9.2-MT-SNAPSHOT.jar:?]
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1717)
>  ~[hadoop-hdfs-2.9.2-MT-SNAPSHOT.jar:?]
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1819) 
> [hadoop-hdfs-2.9.2-MT-SNAPSHOT.jar:?]
> Caused by: java.lang.RuntimeException: 
> org.apache.hadoop.hdfs.server.namenode.TransferFsImage$HttpGetFailedException:
>  Image transfer servlet at https://MWHEEEAP002D9A2:81{code}
> This happens because the namespaceInfo sent from the proxy node does not 
> include the effective layout version, which causes BootstrapStandby to send a 
> request with a storageinfo param using the service layout version. This 
> causes the proxy node to refuse the request, because it compares the 
> storageinfo param against its storage info, which uses the effective layout 
> version, not the service layout version. 
> To fix this we can modify the proxy.versionRequest() call stack to set the 
> layout version using the effective layout version on the proxy node. We can 
> then add logic to BootstrapStandby to properly handle the case where the 
> proxy node is in rolling upgrade.






[jira] [Updated] (HDFS-17178) BootstrapStandby needs to handle RollingUpgrade

2023-09-03 Thread Danny Becker (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Becker updated HDFS-17178:

Summary: BootstrapStandby needs to handle RollingUpgrade   (was: Bootstrap 
Standby needs to handle RollingUpgrade )

> BootstrapStandby needs to handle RollingUpgrade 
> 
>
> Key: HDFS-17178
> URL: https://issues.apache.org/jira/browse/HDFS-17178
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: Danny Becker
>Assignee: Danny Becker
>Priority: Minor
> Fix For: 3.3.4
>
>
> During rollingUpgrade, bootstrapStandby will fail with an exception due to 
> different NameNodeLayoutVersions. We can ignore this safely during 
> RollingUpgrade because different NameNodeLayoutVersions are expected.
>  * NameNodes will not be able to recover with BootstrapStandby if they go 
> through destructive repair before the rollingUpgrade has been finalized.
> Error during BootstrapStandby before change:
> {code:java}
> =
> About to bootstrap Standby ID nn2 from:
>Nameservice ID: MTPrime-MWHE01-0
> Other Namenode ID: nn1
>   Other NN's HTTP address: https://MWHEEEAP002D9A2:81
>   Other NN's IPC  address: MWHEEEAP002D9A2.ap.gbl/10.59.208.18:8020
>  Namespace ID: 895912530
> Block pool ID: BP-1556042256-10.99.154.61-1663325602669
>Cluster ID: MWHE01
>Layout version: -64
>isUpgradeFinalized: true
> =
> 2023-08-28T19:35:06,940 ERROR [main] namenode.NameNode: Failed to start 
> namenode.
> java.io.IOException: java.lang.RuntimeException: 
> org.apache.hadoop.hdfs.server.namenode.TransferFsImage$HttpGetFailedException:
>  Image transfer servlet at 
> https://MWHEEEAP002D9A2:81/imagetransfer?getimage=1=25683470=-64:895912530:1663325602669:MWHE01=true
>  failed with status code 403
> Response message:
> This namenode has storage info -63:895912530:1663325602669:MWHE01 but the 
> secondary expected -64:895912530:1663325602669:MWHE01
> at 
> org.apache.hadoop.hdfs.server.namenode.ha.BootstrapStandby.run(BootstrapStandby.java:583)
>  ~[hadoop-hdfs-2.9.2-MT-SNAPSHOT.jar:?]
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1717)
>  ~[hadoop-hdfs-2.9.2-MT-SNAPSHOT.jar:?]
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1819) 
> [hadoop-hdfs-2.9.2-MT-SNAPSHOT.jar:?]
> Caused by: java.lang.RuntimeException: 
> org.apache.hadoop.hdfs.server.namenode.TransferFsImage$HttpGetFailedException:
>  Image transfer servlet at https://MWHEEEAP002D9A2:81{code}
> This happens because the namespaceInfo sent from the proxy node does not 
> include the effective layout version, which causes BootstrapStandby to send a 
> request with a storageinfo param using the service layout version. This 
> causes the proxy node to refuse the request, because it compares the 
> storageinfo param against its storage info, which uses the effective layout 
> version, not the service layout version. 
> To fix this we can modify the proxy.versionRequest() call stack to set the 
> layout version using the effective layout version on the proxy node. We can 
> then add logic to BootstrapStandby to properly handle the case where the 
> proxy node is in rolling upgrade.






[jira] [Assigned] (HDFS-17167) Observer NameNode -observer startup option conflicts with -rollingUpgrade startup option

2023-09-03 Thread Danny Becker (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Becker reassigned HDFS-17167:
---

Assignee: Danny Becker

> Observer NameNode -observer startup option conflicts with -rollingUpgrade 
> startup option
> 
>
> Key: HDFS-17167
> URL: https://issues.apache.org/jira/browse/HDFS-17167
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.3.4
>Reporter: Danny Becker
>Assignee: Danny Becker
>Priority: Minor
> Attachments: HDFS-17167.001.patch
>
>
> NameNode currently uses a "StartupOption" to decide whether to start a node 
> as an Observer NameNode. This causes an issue during a rolling upgrade 
> because "rollingUpgrade" is also a StartupOption and NameNode will only allow 
> 1 startup option, choosing the last startup option in the list. Observer in 
> our environment starts with the following startup options: 
> ["-rollingUpgrade", "started", "-observer"]. This means that the rolling 
> upgrade gets ignored which causes Observer to have an issue when an actual 
> rolling upgrade is ongoing:
> {code:java}
> 2023-08-23T14:59:03.486-0700,WARN,[main],org.apache.hadoop.hdfs.server.namenode.FSNamesystem,"Encountered
>  exception loading fsimage
> java.io.IOException: 
> File system image contains an old layout version -63.
> An upgrade to version -66 is required.
> Please restart NameNode with the \'-rollingUpgrade started\' option if a 
> rolling upgrade is already started; or restart NameNode with the \'-upgrade\' 
> option to start a new upgrade.
> at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:271)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1116)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:724)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:681)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:763)
> at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:1013)
> at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:992)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1743)
> at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1811)
> "{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-17178) Bootstrap Standby needs to handle RollingUpgrade

2023-09-03 Thread Danny Becker (Jira)
Danny Becker created HDFS-17178:
---

 Summary: Bootstrap Standby needs to handle RollingUpgrade 
 Key: HDFS-17178
 URL: https://issues.apache.org/jira/browse/HDFS-17178
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: namenode
Reporter: Danny Becker
Assignee: Danny Becker
 Fix For: 3.3.4


During rollingUpgrade, bootstrapStandby will fail with an exception due to 
different NameNodeLayoutVersions. We can ignore this safely during 
RollingUpgrade because different NameNodeLayoutVersions are expected.
 * NameNodes will not be able to recover with BootstrapStandby if they go 
through destructive repair before the rollingUpgrade has been finalized.

Error during BootstrapStandby before change:
{code:java}
=
About to bootstrap Standby ID nn2 from:
   Nameservice ID: MTPrime-MWHE01-0
Other Namenode ID: nn1
  Other NN's HTTP address: https://MWHEEEAP002D9A2:81
  Other NN's IPC  address: MWHEEEAP002D9A2.ap.gbl/10.59.208.18:8020
 Namespace ID: 895912530
Block pool ID: BP-1556042256-10.99.154.61-1663325602669
   Cluster ID: MWHE01
   Layout version: -64
   isUpgradeFinalized: true
=
2023-08-28T19:35:06,940 ERROR [main] namenode.NameNode: Failed to start 
namenode.
java.io.IOException: java.lang.RuntimeException: 
org.apache.hadoop.hdfs.server.namenode.TransferFsImage$HttpGetFailedException: 
Image transfer servlet at 
https://MWHEEEAP002D9A2:81/imagetransfer?getimage=1=25683470=-64:895912530:1663325602669:MWHE01=true
 failed with status code 403
Response message:
This namenode has storage info -63:895912530:1663325602669:MWHE01 but the 
secondary expected -64:895912530:1663325602669:MWHE01
at 
org.apache.hadoop.hdfs.server.namenode.ha.BootstrapStandby.run(BootstrapStandby.java:583)
 ~[hadoop-hdfs-2.9.2-MT-SNAPSHOT.jar:?]
at 
org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1717)
 ~[hadoop-hdfs-2.9.2-MT-SNAPSHOT.jar:?]
at 
org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1819) 
[hadoop-hdfs-2.9.2-MT-SNAPSHOT.jar:?]
Caused by: java.lang.RuntimeException: 
org.apache.hadoop.hdfs.server.namenode.TransferFsImage$HttpGetFailedException: 
Image transfer servlet at https://MWHEEEAP002D9A2:81{code}
This happens because the namespaceInfo sent from the proxy node does not 
include the effective layout version, so BootstrapStandby sends a request with 
a storageinfo param built from the service layout version. The proxy node then 
refuses the request, because it compares the storageinfo param against its own 
storage info, which uses the effective layout version, not the service layout 
version. 

To fix this we can modify the proxy.versionRequest() call stack to set the 
layout version to the effective layout version on the proxy node, and then add 
logic to BootstrapStandby to properly handle the case where the proxy node is 
in a rolling upgrade.
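
As a rough illustration of that second part (hypothetical names only, not the 
actual BootstrapStandby implementation), the check could tolerate a 
layout-version mismatch only while the remote node reports a rolling upgrade 
in progress:
{code:java}
/**
 * Minimal sketch, assuming hypothetical names: decide whether a layout-version
 * mismatch between the local node and the remote (proxy) NameNode should abort
 * bootstrapping. Not the real BootstrapStandby code.
 */
public class LayoutVersionCheckSketch {

  /**
   * Outside of a rolling upgrade the layout versions must match exactly;
   * during a rolling upgrade a mismatch is expected and tolerated.
   */
  static boolean canProceed(int localLayoutVersion,
                            int remoteEffectiveLayoutVersion,
                            boolean remoteInRollingUpgrade) {
    if (localLayoutVersion == remoteEffectiveLayoutVersion) {
      return true;
    }
    // The remote node intentionally keeps the older effective layout version
    // until the rolling upgrade is finalized, so the mismatch is not fatal.
    return remoteInRollingUpgrade;
  }

  public static void main(String[] args) {
    // Mirrors the failure above: local layout -64, remote effective layout -63.
    System.out.println(canProceed(-64, -63, false)); // false: abort (current behavior)
    System.out.println(canProceed(-64, -63, true));  // true: tolerated mid-upgrade
  }
}
{code}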



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-17167) Observer NameNode -observer startup option conflicts with -rollingUpgrade startup option

2023-08-31 Thread Danny Becker (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17761157#comment-17761157
 ] 

Danny Becker commented on HDFS-17167:
-

[HDFS-17167. Add config to startup NameNode as Observer by dannytbecker · Pull 
Request #6013 · apache/hadoop 
(github.com)|https://github.com/apache/hadoop/pull/6013]

> Observer NameNode -observer startup option conflicts with -rollingUpgrade 
> startup option
> 
>
> Key: HDFS-17167
> URL: https://issues.apache.org/jira/browse/HDFS-17167
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.3.4
>Reporter: Danny Becker
>Priority: Minor
> Attachments: HDFS-17167.001.patch
>
>
> NameNode currently uses a "StartupOption" to decide whether to start a node 
> as an Observer NameNode. This causes an issue during a rolling upgrade 
> because "rollingUpgrade" is also a StartupOption and NameNode will only allow 
> 1 startup option, choosing the last startup option in the list. Observer in 
> our environment starts with the following startup options: 
> ["-rollingUpgrade", "started", "-observer"]. This means that the rolling 
> upgrade gets ignored which causes Observer to have an issue when an actual 
> rolling upgrade is ongoing:
> {code:java}
> 2023-08-23T14:59:03.486-0700,WARN,[main],org.apache.hadoop.hdfs.server.namenode.FSNamesystem,"Encountered
>  exception loading fsimage
> java.io.IOException: 
> File system image contains an old layout version -63.
> An upgrade to version -66 is required.
> Please restart NameNode with the \'-rollingUpgrade started\' option if a 
> rolling upgrade is already started; or restart NameNode with the \'-upgrade\' 
> option to start a new upgrade.
> at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:271)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1116)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:724)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:681)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:763)
> at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:1013)
> at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:992)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1743)
> at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1811)
> "{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-17167) Observer NameNode -observer startup option conflicts with -rollingUpgrade startup option

2023-08-25 Thread Danny Becker (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Becker updated HDFS-17167:

Summary: Observer NameNode -observer startup option conflicts with 
-rollingUpgrade startup option  (was: Observer NameNode startup option)

> Observer NameNode -observer startup option conflicts with -rollingUpgrade 
> startup option
> 
>
> Key: HDFS-17167
> URL: https://issues.apache.org/jira/browse/HDFS-17167
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.3.4
>Reporter: Danny Becker
>Priority: Minor
> Attachments: HDFS-17167.001.patch
>
>
> NameNode currently uses a "StartupOption" to decide whether to start a node 
> as an Observer NameNode. This causes an issue during a rolling upgrade 
> because "rollingUpgrade" is also a StartupOption and NameNode will only allow 
> 1 startup option, choosing the last startup option in the list. Observer in 
> our environment starts with the following startup options: 
> ["-rollingUpgrade", "started", "-observer"]. This means that the rolling 
> upgrade gets ignored which causes Observer to have an issue when an actual 
> rolling upgrade is ongoing:
> {code:java}
> 2023-08-23T14:59:03.486-0700,WARN,[main],org.apache.hadoop.hdfs.server.namenode.FSNamesystem,"Encountered
>  exception loading fsimage
> java.io.IOException: 
> File system image contains an old layout version -63.
> An upgrade to version -66 is required.
> Please restart NameNode with the \'-rollingUpgrade started\' option if a 
> rolling upgrade is already started; or restart NameNode with the \'-upgrade\' 
> option to start a new upgrade.
> at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:271)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1116)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:724)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:681)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:763)
> at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:1013)
> at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:992)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1743)
> at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1811)
> "{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-17167) Observer NameNode startup option

2023-08-25 Thread Danny Becker (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Becker updated HDFS-17167:

Attachment: HDFS-17167.001.patch
Status: Patch Available  (was: Open)

> Observer NameNode startup option
> 
>
> Key: HDFS-17167
> URL: https://issues.apache.org/jira/browse/HDFS-17167
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.3.4
>Reporter: Danny Becker
>Priority: Minor
> Attachments: HDFS-17167.001.patch
>
>
> NameNode currently uses a "StartupOption" to decide whether to start a node 
> as an Observer NameNode. This causes an issue during a rolling upgrade 
> because "rollingUpgrade" is also a StartupOption and NameNode will only allow 
> 1 startup option, choosing the last startup option in the list. Observer in 
> our environment starts with the following startup options: 
> ["-rollingUpgrade", "started", "-observer"]. This means that the rolling 
> upgrade gets ignored which causes Observer to have an issue when an actual 
> rolling upgrade is ongoing:
> {code:java}
> 2023-08-23T14:59:03.486-0700,WARN,[main],org.apache.hadoop.hdfs.server.namenode.FSNamesystem,"Encountered
>  exception loading fsimage
> java.io.IOException: 
> File system image contains an old layout version -63.
> An upgrade to version -66 is required.
> Please restart NameNode with the \'-rollingUpgrade started\' option if a 
> rolling upgrade is already started; or restart NameNode with the \'-upgrade\' 
> option to start a new upgrade.
> at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:271)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1116)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:724)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:681)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:763)
> at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:1013)
> at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:992)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1743)
> at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1811)
> "{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-17167) Observer NameNode startup option

2023-08-25 Thread Danny Becker (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Becker updated HDFS-17167:

Description: 
NameNode currently uses a "StartupOption" to decide whether to start a node as 
an Observer NameNode. This causes an issue during a rolling upgrade because 
"rollingUpgrade" is also a StartupOption and NameNode will only allow 1 startup 
option, choosing the last startup option in the list. Observer in our 
environment starts with the following startup options: ["-rollingUpgrade", 
"started", "-observer"]. This means that the rolling upgrade gets ignored which 
causes Observer to have an issue when an actual rolling upgrade is ongoing:
{code:java}
2023-08-23T14:59:03.486-0700,WARN,[main],org.apache.hadoop.hdfs.server.namenode.FSNamesystem,"Encountered
 exception loading fsimage
java.io.IOException: 
File system image contains an old layout version -63.
An upgrade to version -66 is required.
Please restart NameNode with the \'-rollingUpgrade started\' option if a 
rolling upgrade is already started; or restart NameNode with the \'-upgrade\' 
option to start a new upgrade.
at 
org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:271)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1116)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:724)
at 
org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:681)
at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:763)
at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:1013)
at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:992)
at 
org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1743)
at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1811)
"{code}

  was:
NameNode currently uses a "StartupOption" to decide whether to start a node as 
an Observer NameNode. This causes an issue during a rolling upgrade because 
"rollingUpgrade" is also a StartupOption and NameNode will only allow 1 startup 
option, choosing the last startup option in the list. Observer in our 
environment starts with the following startup options: ["-rollingUpgrade", 
"started", "-observer"]. This means that the rolling upgrade gets ignored which 
causes Observer to have an issue when an actual rolling upgrade is ongoing:
2023-08-23T14:59:03.486-0700,WARN,[main],org.apache.hadoop.hdfs.server.namenode.FSNamesystem,"Encountered
 exception loading fsimage
java.io.IOException: 
File system image contains an old layout version -63.
An upgrade to version -66 is required.
Please restart NameNode with the \'-rollingUpgrade started\' option if a 
rolling upgrade is already started; or restart NameNode with the \'-upgrade\' 
option to start a new upgrade.
at 
org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:271)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1116)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:724)
at 
org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:681)
at 
org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:763)
at 
org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:1013)
at 
org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:992)
at 
org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1743)
at 
org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1811)
"
{{}}


> Observer NameNode startup option
> 
>
> Key: HDFS-17167
> URL: https://issues.apache.org/jira/browse/HDFS-17167
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.3.4
>Reporter: Danny Becker
>Priority: Minor
>
> NameNode currently uses a "StartupOption" to decide whether to start a node 
> as an Observer NameNode. This causes an issue during a rolling upgrade 
> because "rollingUpgrade" is also a StartupOption and NameNode will only allow 
> 1 startup option, choosing the last startup option in the list. Observer in 
> our environment starts with the following startup options: 
> ["-rollingUpgrade", "started", "-observer"]. This means that the rolling 
> upgrade gets ignored which causes Observer to have an issue when an actual 
> rolling upgrade is ongoing:
> {code:java}
> 2023-08-23T14:59:03.486-0700,WARN,[main],org.apache.hadoop.hdfs.server.namenode.FSNamesystem,"Encountered
>  exception loading fsimage
> java.io.IOException: 
> File system image contains an old layout version -63.
> An upgrade to version -66 is required.
> Please restart NameNode 

[jira] [Created] (HDFS-17167) Observer NameNode startup option

2023-08-25 Thread Danny Becker (Jira)
Danny Becker created HDFS-17167:
---

 Summary: Observer NameNode startup option
 Key: HDFS-17167
 URL: https://issues.apache.org/jira/browse/HDFS-17167
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode
Affects Versions: 3.3.4
Reporter: Danny Becker


NameNode currently uses a "StartupOption" to decide whether to start a node as 
an Observer NameNode. This causes an issue during a rolling upgrade because 
"rollingUpgrade" is also a StartupOption and NameNode only allows one startup 
option, choosing the last startup option in the list. The Observer in our 
environment starts with the following startup options: ["-rollingUpgrade", 
"started", "-observer"]. This means that the rolling upgrade option gets 
ignored, which causes the Observer to fail when an actual rolling upgrade is 
ongoing:
2023-08-23T14:59:03.486-0700,WARN,[main],org.apache.hadoop.hdfs.server.namenode.FSNamesystem,"Encountered
 exception loading fsimage
java.io.IOException: 
File system image contains an old layout version -63.
An upgrade to version -66 is required.
Please restart NameNode with the \'-rollingUpgrade started\' option if a 
rolling upgrade is already started; or restart NameNode with the \'-upgrade\' 
option to start a new upgrade.
at 
org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:271)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1116)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:724)
at 
org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:681)
at 
org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:763)
at 
org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:1013)
at 
org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:992)
at 
org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1743)
at 
org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1811)
"
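To make the failure mode concrete, here is a minimal, hypothetical sketch of 
single-slot startup-option parsing in which the last recognized flag wins, so 
a trailing "-observer" silently drops the earlier "-rollingUpgrade started" 
option. This is illustrative only, not the actual NameNode argument parser.
{code:java}
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch: only one startup option is kept, and the last one wins. */
public class StartupOptionSketch {

  enum StartupOption { REGULAR, ROLLINGUPGRADE_STARTED, OBSERVER }

  static StartupOption parse(List<String> args) {
    StartupOption opt = StartupOption.REGULAR;
    for (int i = 0; i < args.size(); i++) {
      String a = args.get(i);
      if ("-rollingUpgrade".equalsIgnoreCase(a)) {
        opt = StartupOption.ROLLINGUPGRADE_STARTED;
        i++; // consume the following "started" token
      } else if ("-observer".equalsIgnoreCase(a)) {
        opt = StartupOption.OBSERVER; // overwrites the rolling-upgrade option
      }
    }
    return opt;
  }

  public static void main(String[] args) {
    // Same ordering as in our environment: the rolling upgrade flag is lost,
    // so the fsimage layout-version check later fails as shown above.
    System.out.println(parse(Arrays.asList("-rollingUpgrade", "started", "-observer")));
  }
}
{code}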



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15071) Add DataNode Read and Write throughput percentile metrics

2020-01-17 Thread Danny Becker (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Becker updated HDFS-15071:

Attachment: HDFS-15071.004.patch

> Add DataNode Read and Write throughput percentile metrics
> -
>
> Key: HDFS-15071
> URL: https://issues.apache.org/jira/browse/HDFS-15071
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode, hdfs, metrics
>Reporter: Danny Becker
>Assignee: Danny Becker
>Priority: Minor
> Attachments: HDFS-15071.000.patch, HDFS-15071.001.patch, 
> HDFS-15071.002.patch, HDFS-15071.003.patch, HDFS-15071.004.patch
>
>
> Add DataNode throughput metrics for read and write.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15071) Add DataNode Read and Write throughput percentile metrics

2020-01-09 Thread Danny Becker (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17012176#comment-17012176
 ] 

Danny Becker commented on HDFS-15071:
-

v003 should fix all checkstyle errors except for two "line longer than 80 
characters" errors; those two lines follow the convention of the surrounding 
code.

> Add DataNode Read and Write throughput percentile metrics
> -
>
> Key: HDFS-15071
> URL: https://issues.apache.org/jira/browse/HDFS-15071
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode, hdfs, metrics
>Reporter: Danny Becker
>Assignee: Danny Becker
>Priority: Minor
> Attachments: HDFS-15071.000.patch, HDFS-15071.001.patch, 
> HDFS-15071.002.patch, HDFS-15071.003.patch
>
>
> Add DataNode throughput metrics for read and write.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15071) Add DataNode Read and Write throughput percentile metrics

2020-01-09 Thread Danny Becker (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Becker updated HDFS-15071:

Attachment: HDFS-15071.003.patch

> Add DataNode Read and Write throughput percentile metrics
> -
>
> Key: HDFS-15071
> URL: https://issues.apache.org/jira/browse/HDFS-15071
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode, hdfs, metrics
>Reporter: Danny Becker
>Assignee: Danny Becker
>Priority: Minor
> Attachments: HDFS-15071.000.patch, HDFS-15071.001.patch, 
> HDFS-15071.002.patch, HDFS-15071.003.patch
>
>
> Add DataNode throughput metrics for read and write.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15071) Add DataNode Read and Write throughput percentile metrics

2020-01-08 Thread Danny Becker (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Becker updated HDFS-15071:

Attachment: HDFS-15071.002.patch

> Add DataNode Read and Write throughput percentile metrics
> -
>
> Key: HDFS-15071
> URL: https://issues.apache.org/jira/browse/HDFS-15071
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode, hdfs, metrics
>Reporter: Danny Becker
>Assignee: Danny Becker
>Priority: Minor
> Attachments: HDFS-15071.000.patch, HDFS-15071.001.patch, 
> HDFS-15071.002.patch
>
>
> Add DataNode throughput metrics for read and write.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-8591) Remove support for deprecated configuration key dfs.namenode.decommission.nodes.per.interval

2020-01-02 Thread Danny Becker (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-8591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17007166#comment-17007166
 ] 

Danny Becker commented on HDFS-8591:


There is an issue with the logic here which can cause the decommissioner to get 
stuck in a nearly infinite loop. When the decommissioner checks a datanode that 
is in_maintenance, no blocks are checked, so no progress is made. The 
decommissioner keeps looping over that datanode until it is no longer 
in_maintenance or the iteration counter reaches Integer.MAX_VALUE.
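
A hedged sketch of the loop shape being described (made-up names, not the 
actual DatanodeAdminManager code): an in_maintenance node contributes zero 
checked blocks, so the progress counter never advances and only the 
Integer.MAX_VALUE guard ends the loop.
{code:java}
/**
 * Hypothetical sketch of the near-infinite loop: when the node is
 * IN_MAINTENANCE no blocks are checked, so blocksChecked stays at 0 and the
 * loop can only end via the Integer.MAX_VALUE guard.
 */
public class DecommissionLoopSketch {

  static long spinUntilGuard(boolean nodeInMaintenance, int blocksOnNode) {
    final int target = 500_000; // hypothetical per-tick block budget
    long iterations = 0;
    int blocksChecked = 0;

    while (blocksChecked < target && iterations < Integer.MAX_VALUE) {
      if (!nodeInMaintenance) {
        blocksChecked += blocksOnNode;
      }
      iterations++;
    }
    return iterations;
  }

  public static void main(String[] args) {
    // A healthy node with blocks finishes quickly:
    System.out.println(spinUntilGuard(false, 1_000)); // 500 iterations
    // An IN_MAINTENANCE node with nothing to check would spin ~2^31 times:
    // System.out.println(spinUntilGuard(true, 0));
  }
}
{code}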

> Remove support for deprecated configuration key 
> dfs.namenode.decommission.nodes.per.interval
> 
>
> Key: HDFS-8591
> URL: https://issues.apache.org/jira/browse/HDFS-8591
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Affects Versions: 3.0.0-alpha1
>Reporter: Andrew Wang
>Assignee: Andrew Wang
>Priority: Minor
> Fix For: 3.0.0-alpha1
>
> Attachments: hdfs-8591.001.patch
>
>
> dfs.namenode.decommission.nodes.per.interval is deprecated in branch-2 and 
> can be removed in trunk.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15071) Add DataNode Read and Write throughput percentile metrics

2019-12-18 Thread Danny Becker (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Becker updated HDFS-15071:

Attachment: HDFS-15071.001.patch

> Add DataNode Read and Write throughput percentile metrics
> -
>
> Key: HDFS-15071
> URL: https://issues.apache.org/jira/browse/HDFS-15071
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode, hdfs, metrics
>Reporter: Danny Becker
>Assignee: Danny Becker
>Priority: Minor
> Attachments: HDFS-15071.000.patch, HDFS-15071.001.patch
>
>
> Add DataNode throughput metrics for read and write.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15071) Add DataNode Read and Write throughput percentile metrics

2019-12-18 Thread Danny Becker (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Becker updated HDFS-15071:

Attachment: HDFS-15071.000.patch

> Add DataNode Read and Write throughput percentile metrics
> -
>
> Key: HDFS-15071
> URL: https://issues.apache.org/jira/browse/HDFS-15071
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode, hdfs, metrics
>Reporter: Danny Becker
>Assignee: Danny Becker
>Priority: Minor
> Attachments: HDFS-15071.000.patch
>
>
> Add DataNode throughput metrics for read and write.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDFS-15071) Add DataNode Read and Write throughput percentile metrics

2019-12-18 Thread Danny Becker (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Becker reassigned HDFS-15071:
---

Assignee: Danny Becker

> Add DataNode Read and Write throughput percentile metrics
> -
>
> Key: HDFS-15071
> URL: https://issues.apache.org/jira/browse/HDFS-15071
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode, hdfs, metrics
>Reporter: Danny Becker
>Assignee: Danny Becker
>Priority: Minor
>
> Add DataNode throughput metrics for read and write.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-15071) Add DataNode Read and Write throughput percentile metrics

2019-12-18 Thread Danny Becker (Jira)
Danny Becker created HDFS-15071:
---

 Summary: Add DataNode Read and Write throughput percentile metrics
 Key: HDFS-15071
 URL: https://issues.apache.org/jira/browse/HDFS-15071
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: datanode, hdfs, metrics
Reporter: Danny Becker


Add DataNode throughput metrics for read and write.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15031) Allow BootstrapStandby to download FSImage if the directory is already formatted

2019-12-18 Thread Danny Becker (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Becker updated HDFS-15031:

Attachment: HDFS-15031.007.patch

> Allow BootstrapStandby to download FSImage if the directory is already 
> formatted
> 
>
> Key: HDFS-15031
> URL: https://issues.apache.org/jira/browse/HDFS-15031
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs, namenode
>Reporter: Danny Becker
>Assignee: Danny Becker
>Priority: Minor
> Attachments: HDFS-15031.000.patch, HDFS-15031.001.patch, 
> HDFS-15031.002.patch, HDFS-15031.003.patch, HDFS-15031.005.patch, 
> HDFS-15031.006.patch, HDFS-15031.007.patch
>
>
> Currently, BootstrapStandby will only download the latest FSImage if it has 
> formatted the local image directory. This can be an issue when there are out 
> of date FSImages on a Standby NameNode, as the non-interactive mode will not 
> format the image directory, and BootstrapStandby will return an error code. 
> The changes here simply allow BootstrapStandby to download the latest FSImage 
> to the image directory, without needing to format first.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15031) Allow BootstrapStandby to download FSImage if the directory is already formatted

2019-12-06 Thread Danny Becker (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Becker updated HDFS-15031:

Attachment: HDFS-15031.006.patch

> Allow BootstrapStandby to download FSImage if the directory is already 
> formatted
> 
>
> Key: HDFS-15031
> URL: https://issues.apache.org/jira/browse/HDFS-15031
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs, namenode
>Reporter: Danny Becker
>Assignee: Danny Becker
>Priority: Minor
> Attachments: HDFS-15031.000.patch, HDFS-15031.001.patch, 
> HDFS-15031.002.patch, HDFS-15031.003.patch, HDFS-15031.005.patch, 
> HDFS-15031.006.patch
>
>
> Currently, BootstrapStandby will only download the latest FSImage if it has 
> formatted the local image directory. This can be an issue when there are out 
> of date FSImages on a Standby NameNode, as the non-interactive mode will not 
> format the image directory, and BootstrapStandby will return an error code. 
> The changes here simply allow BootstrapStandby to download the latest FSImage 
> to the image directory, without needing to format first.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15031) Allow BootstrapStandby to download FSImage if the directory is already formatted

2019-12-05 Thread Danny Becker (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Becker updated HDFS-15031:

Attachment: HDFS-15031.005.patch

> Allow BootstrapStandby to download FSImage if the directory is already 
> formatted
> 
>
> Key: HDFS-15031
> URL: https://issues.apache.org/jira/browse/HDFS-15031
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs, namenode
>Reporter: Danny Becker
>Assignee: Danny Becker
>Priority: Minor
> Attachments: HDFS-15031.000.patch, HDFS-15031.001.patch, 
> HDFS-15031.002.patch, HDFS-15031.003.patch, HDFS-15031.005.patch
>
>
> Currently, BootstrapStandby will only download the latest FSImage if it has 
> formatted the local image directory. This can be an issue when there are out 
> of date FSImages on a Standby NameNode, as the non-interactive mode will not 
> format the image directory, and BootstrapStandby will return an error code. 
> The changes here simply allow BootstrapStandby to download the latest FSImage 
> to the image directory, without needing to format first.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15031) Allow BootstrapStandby to download FSImage if the directory is already formatted

2019-12-05 Thread Danny Becker (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Becker updated HDFS-15031:

Attachment: HDFS-15031.003.patch

> Allow BootstrapStandby to download FSImage if the directory is already 
> formatted
> 
>
> Key: HDFS-15031
> URL: https://issues.apache.org/jira/browse/HDFS-15031
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs, namenode
>Reporter: Danny Becker
>Assignee: Danny Becker
>Priority: Minor
> Attachments: HDFS-15031.000.patch, HDFS-15031.001.patch, 
> HDFS-15031.002.patch, HDFS-15031.003.patch
>
>
> Currently, BootstrapStandby will only download the latest FSImage if it has 
> formatted the local image directory. This can be an issue when there are out 
> of date FSImages on a Standby NameNode, as the non-interactive mode will not 
> format the image directory, and BootstrapStandby will return an error code. 
> The changes here simply allow BootstrapStandby to download the latest FSImage 
> to the image directory, without needing to format first.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15031) Allow BootstrapStandby to download FSImage if the directory is already formatted

2019-12-04 Thread Danny Becker (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Becker updated HDFS-15031:

Attachment: HDFS-15031.002.patch

> Allow BootstrapStandby to download FSImage if the directory is already 
> formatted
> 
>
> Key: HDFS-15031
> URL: https://issues.apache.org/jira/browse/HDFS-15031
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs, namenode
>Reporter: Danny Becker
>Assignee: Danny Becker
>Priority: Minor
> Attachments: HDFS-15031.000.patch, HDFS-15031.001.patch, 
> HDFS-15031.002.patch
>
>
> Currently, BootstrapStandby will only download the latest FSImage if it has 
> formatted the local image directory. This can be an issue when there are out 
> of date FSImages on a Standby NameNode, as the non-interactive mode will not 
> format the image directory, and BootstrapStandby will return an error code. 
> The changes here simply allow BootstrapStandby to download the latest FSImage 
> to the image directory, without needing to format first.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15031) Allow BootstrapStandby to download FSImage if the directory is already formatted

2019-12-04 Thread Danny Becker (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Becker updated HDFS-15031:

Attachment: HDFS-15031.001.patch

> Allow BootstrapStandby to download FSImage if the directory is already 
> formatted
> 
>
> Key: HDFS-15031
> URL: https://issues.apache.org/jira/browse/HDFS-15031
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs, namenode
>Reporter: Danny Becker
>Assignee: Danny Becker
>Priority: Minor
> Attachments: HDFS-15031.000.patch, HDFS-15031.001.patch
>
>
> Currently, BootstrapStandby will only download the latest FSImage if it has 
> formatted the local image directory. This can be an issue when there are out 
> of date FSImages on a Standby NameNode, as the non-interactive mode will not 
> format the image directory, and BootstrapStandby will return an error code. 
> The changes here simply allow BootstrapStandby to download the latest FSImage 
> to the image directory, without needing to format first.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15031) Allow BootstrapStandby to download FSImage if the directory is already formatted

2019-12-03 Thread Danny Becker (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Becker updated HDFS-15031:

Attachment: HDFS-15031.000.patch

> Allow BootstrapStandby to download FSImage if the directory is already 
> formatted
> 
>
> Key: HDFS-15031
> URL: https://issues.apache.org/jira/browse/HDFS-15031
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs, namenode
>Reporter: Danny Becker
>Assignee: Danny Becker
>Priority: Minor
> Attachments: HDFS-15031.000.patch
>
>
> Currently, BootstrapStandby will only download the latest FSImage if it has 
> formatted the local image directory. This can be an issue when there are out 
> of date FSImages on a Standby NameNode, as the non-interactive mode will not 
> format the image directory, and BootstrapStandby will return an error code. 
> The changes here simply allow BootstrapStandby to download the latest FSImage 
> to the image directory, without needing to format first.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDFS-15031) Allow BootstrapStandby to download FSImage if the directory is already formatted

2019-12-03 Thread Danny Becker (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Becker reassigned HDFS-15031:
---

Assignee: Danny Becker

> Allow BootstrapStandby to download FSImage if the directory is already 
> formatted
> 
>
> Key: HDFS-15031
> URL: https://issues.apache.org/jira/browse/HDFS-15031
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs, namenode
>Reporter: Danny Becker
>Assignee: Danny Becker
>Priority: Minor
>
> Currently, BootstrapStandby will only download the latest FSImage if it has 
> formatted the local image directory. This can be an issue when there are out 
> of date FSImages on a Standby NameNode, as the non-interactive mode will not 
> format the image directory, and BootstrapStandby will return an error code. 
> The changes here simply allow BootstrapStandby to download the latest FSImage 
> to the image directory, without needing to format first.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-15031) Allow BootstrapStandby to download FSImage if the directory is already formatted

2019-12-03 Thread Danny Becker (Jira)
Danny Becker created HDFS-15031:
---

 Summary: Allow BootstrapStandby to download FSImage if the 
directory is already formatted
 Key: HDFS-15031
 URL: https://issues.apache.org/jira/browse/HDFS-15031
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: hdfs, namenode
Reporter: Danny Becker


Currently, BootstrapStandby will only download the latest FSImage if it has 
formatted the local image directory. This can be an issue when there are out of 
date FSImages on a Standby NameNode, as the non-interactive mode will not 
format the image directory, and BootstrapStandby will return an error code. The 
changes here simply allow BootstrapStandby to download the latest FSImage to 
the image directory, without needing to format first.
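
A hedged sketch of the intended control-flow change (hypothetical method names; 
the real BootstrapStandby also handles prompting, NNStorage, and shared edits): 
an already-formatted image directory should fall through to the download 
instead of short-circuiting to an error code.
{code:java}
/**
 * Hypothetical sketch of the change: previously a formatted, non-interactive
 * run returned an error before downloading; afterwards it simply skips the
 * format step and still downloads the latest FSImage.
 */
public class BootstrapStandbySketch {

  static final int SUCCESS = 0;
  static final int ERR_ALREADY_FORMATTED = 5; // made-up return code

  /** Old behavior: bail out when the directory is already formatted. */
  static int runOld(boolean dirFormatted, boolean interactive) {
    if (dirFormatted && !interactive) {
      return ERR_ALREADY_FORMATTED; // never reaches the download
    }
    return downloadLatestFsImage();
  }

  /** New behavior: only format when needed, always refresh the image. */
  static int runNew(boolean dirFormatted) {
    if (!dirFormatted) {
      formatImageDirectory();
    }
    return downloadLatestFsImage();
  }

  static void formatImageDirectory() {
    // Placeholder for formatting the local image directory.
  }

  static int downloadLatestFsImage() {
    // Placeholder for fetching the latest FSImage from the active NameNode.
    return SUCCESS;
  }

  public static void main(String[] args) {
    System.out.println(runOld(true, false)); // 5: error, stale image left in place
    System.out.println(runNew(true));        // 0: latest image downloaded
  }
}
{code}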



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-14851) WebHdfs Returns 200 Status Code for Open of Files with Corrupt Blocks

2019-09-20 Thread Danny Becker (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16934861#comment-16934861
 ] 

Danny Becker edited comment on HDFS-14851 at 9/21/19 1:05 AM:
--

[~jojochuang] HDFS already does this behavior. The problem here is that WebHdfs 
will return a 200 code and then fail the read because it has a missing block. 
The HTTP protocol does not allow us to send an error response once it is 
expecting to read data, so we need to catch this before sending the response code.


was (Author: dannytbecker):
[~jojochuang] HDFS already does this. behavior. The problem here is that 
WebHdfs will return a 200 code an then fail the read because it has a missing 
block. HTTP protocol does not allow us to send an error response when it is 
expecting to read data. So we need to catch this before sending the response 
code.

> WebHdfs Returns 200 Status Code for Open of Files with Corrupt Blocks
> -
>
> Key: HDFS-14851
> URL: https://issues.apache.org/jira/browse/HDFS-14851
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: webhdfs
>Reporter: Danny Becker
>Assignee: Danny Becker
>Priority: Minor
> Attachments: HDFS-14851.001.patch
>
>
> WebHdfs returns 200 status code for Open operations on files with missing or 
> corrupt blocks.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14851) WebHdfs Returns 200 Status Code for Open of Files with Corrupt Blocks

2019-09-20 Thread Danny Becker (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16934861#comment-16934861
 ] 

Danny Becker commented on HDFS-14851:
-

[~jojochuang] HDFS already does this behavior. The problem here is that 
WebHdfs will return a 200 code and then fail the read because it has a missing 
block. The HTTP protocol does not allow us to send an error response once it is 
expecting to read data, so we need to catch this before sending the response 
code.
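
A minimal sketch of that idea, with hypothetical names (not the actual WebHDFS 
servlet code): inspect the located blocks that the read will touch before the 
HTTP status line is committed. Checking only the overlapping blocks also fits 
the partial-read concern discussed in the other comments.
{code:java}
import java.util.List;

/** Hypothetical sketch: pre-check blocks in the requested range before replying 200. */
public class CorruptBlockPrecheckSketch {

  static class LocatedBlockInfo {
    final long offset;   // start offset of the block within the file
    final long length;   // block length in bytes
    final boolean corrupt;
    LocatedBlockInfo(long offset, long length, boolean corrupt) {
      this.offset = offset;
      this.length = length;
      this.corrupt = corrupt;
    }
  }

  /**
   * Returns true if any block overlapping [readOffset, readOffset + readLength)
   * is flagged corrupt; in that case the handler should send an error status
   * instead of 200 followed by a failed body stream.
   */
  static boolean hasCorruptBlockInRange(List<LocatedBlockInfo> blocks,
                                        long readOffset, long readLength) {
    long readEnd = readOffset + readLength;
    for (LocatedBlockInfo b : blocks) {
      long blockEnd = b.offset + b.length;
      boolean overlaps = b.offset < readEnd && blockEnd > readOffset;
      if (overlaps && b.corrupt) {
        return true;
      }
    }
    return false;
  }
}
{code}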

> WebHdfs Returns 200 Status Code for Open of Files with Corrupt Blocks
> -
>
> Key: HDFS-14851
> URL: https://issues.apache.org/jira/browse/HDFS-14851
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: webhdfs
>Reporter: Danny Becker
>Assignee: Danny Becker
>Priority: Minor
> Attachments: HDFS-14851.001.patch
>
>
> WebHdfs returns 200 status code for Open operations on files with missing or 
> corrupt blocks.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14851) WebHdfs Returns 200 Status Code for Open of Files with Corrupt Blocks

2019-09-20 Thread Danny Becker (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16934846#comment-16934846
 ] 

Danny Becker commented on HDFS-14851:
-

[~crh] I understand the concern about iterating over every block to check for a 
corrupt flag, but I disagree with the solution of failing if any block in the 
file is corrupt. What if the user reads only a portion of the file that doesn't 
contain corrupt blocks? I also don't think parallelism would help us here, 
because it isn't a computationally heavy task, just checking a boolean flag.

> WebHdfs Returns 200 Status Code for Open of Files with Corrupt Blocks
> -
>
> Key: HDFS-14851
> URL: https://issues.apache.org/jira/browse/HDFS-14851
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: webhdfs
>Reporter: Danny Becker
>Assignee: Danny Becker
>Priority: Minor
> Attachments: HDFS-14851.001.patch
>
>
> WebHdfs returns 200 status code for Open operations on files with missing or 
> corrupt blocks.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDFS-14851) WebHdfs Returns 200 Status Code for Open of Files with Corrupt Blocks

2019-09-16 Thread Danny Becker (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Becker reassigned HDFS-14851:
---

Assignee: Danny Becker

> WebHdfs Returns 200 Status Code for Open of Files with Corrupt Blocks
> -
>
> Key: HDFS-14851
> URL: https://issues.apache.org/jira/browse/HDFS-14851
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: webhdfs
>Reporter: Danny Becker
>Assignee: Danny Becker
>Priority: Minor
>
> WebHdfs returns 200 status code for Open operations on files with missing or 
> corrupt blocks.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14851) WebHdfs Returns 200 Status Code for Open of Files with Corrupt Blocks

2019-09-16 Thread Danny Becker (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Becker updated HDFS-14851:

Attachment: HDFS-14851.001.patch

> WebHdfs Returns 200 Status Code for Open of Files with Corrupt Blocks
> -
>
> Key: HDFS-14851
> URL: https://issues.apache.org/jira/browse/HDFS-14851
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: webhdfs
>Reporter: Danny Becker
>Assignee: Danny Becker
>Priority: Minor
> Attachments: HDFS-14851.001.patch
>
>
> WebHdfs returns 200 status code for Open operations on files with missing or 
> corrupt blocks.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-14851) WebHdfs Returns 200 Status Code for Open of Files with Corrupt Blocks

2019-09-16 Thread Danny Becker (Jira)
Danny Becker created HDFS-14851:
---

 Summary: WebHdfs Returns 200 Status Code for Open of Files with 
Corrupt Blocks
 Key: HDFS-14851
 URL: https://issues.apache.org/jira/browse/HDFS-14851
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: webhdfs
Reporter: Danny Becker


WebHdfs returns 200 status code for Open operations on files with missing or 
corrupt blocks.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-14377) Incorrect unit abbreviations shown for fmt_bytes

2019-03-19 Thread Danny Becker (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16796399#comment-16796399
 ] 

Danny Becker edited comment on HDFS-14377 at 3/19/19 6:34 PM:
--

I would prefer using the SI units here (Kilo, Mega, Giga, etc.) since they are 
easier to convert between orders of magnitude: 1.5 TB = 1500 GB, while 1.5 TiB 
= 1536 GiB. I could keep the units as they are in the patch, or change them to 
SI units and change the divisor from 1024 to 1000. Either course of action 
makes sense to me. I don't think we need to extend this change to other areas, 
since this is a user-interface change that may be seen by less experienced 
users, so clear and accurate units matter more here.

I agree with [~anu] that we should not go into the CLI or other areas with this 
change.


was (Author: dannytbecker):
I would prefer using the SI units here (Kilo, Mega, Giga, etc.) since they are 
easier to convert to higher orders since 1.5 TB = 1500 GB while 1.5 TiB = 1536 
GiB. I could keep the units as they are in the patch or change them to SI units 
and change the divisor from 1024 to 1000. Either course of action makes sense 
to me. I don't think we need to extend this change to other areas since this is 
a change to user interface and can be seen by less experienced users, so 
clearer and accurate units here are more important. I agree with [~anu] that we 
should not go into the CLI or other areas with this change.

> Incorrect unit abbreviations shown for fmt_bytes
> 
>
> Key: HDFS-14377
> URL: https://issues.apache.org/jira/browse/HDFS-14377
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Danny Becker
>Assignee: Danny Becker
>Priority: Trivial
> Attachments: HDFS-14377.000.patch
>
>
> The function fmt_bytes shows the abbreviations for Terabyte, Petabyte, etc., 
> the standard metric (SI) units for data storage. However, the function 
> divides by a factor of 1024, which is the factor used for Tebibyte, Pebibyte, 
> etc. Change the abbreviations from TB, PB, etc. to TiB, PiB, etc.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14377) Incorrect unit abbreviations shown for fmt_bytes

2019-03-19 Thread Danny Becker (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16796399#comment-16796399
 ] 

Danny Becker commented on HDFS-14377:
-

I would prefer using the SI units here (Kilo, Mega, Giga, etc.) since they are 
easier to convert between orders of magnitude: 1.5 TB = 1500 GB, while 1.5 TiB 
= 1536 GiB. I could keep the units as they are in the patch, or change them to 
SI units and change the divisor from 1024 to 1000. Either course of action 
makes sense to me. I don't think we need to extend this change to other areas, 
since this is a user-interface change that may be seen by less experienced 
users, so clear and accurate units matter more here. I agree with [~anu] that 
we should not go into the CLI or other areas with this change.

> Incorrect unit abbreviations shown for fmt_bytes
> 
>
> Key: HDFS-14377
> URL: https://issues.apache.org/jira/browse/HDFS-14377
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Danny Becker
>Assignee: Danny Becker
>Priority: Trivial
> Attachments: HDFS-14377.000.patch
>
>
> The function fmt_bytes shows the abbreviations for Terabyte, Petabyte, etc., 
> the standard metric (SI) units for data storage. However, the function 
> divides by a factor of 1024, which is the factor used for Tebibyte, Pebibyte, 
> etc. Change the abbreviations from TB, PB, etc. to TiB, PiB, etc.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-14377) Incorrect unit abbreviations shown for fmt_bytes

2019-03-18 Thread Danny Becker (JIRA)
Danny Becker created HDFS-14377:
---

 Summary: Incorrect unit abbreviations shown for fmt_bytes
 Key: HDFS-14377
 URL: https://issues.apache.org/jira/browse/HDFS-14377
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: Danny Becker


The function fmt_bytes shows the abbreviations for Terabyte, Petabyte, etc., the 
standard metric (SI) units for data storage. However, the function divides by a 
factor of 1024, which is the factor used for Tebibyte, Pebibyte, etc. Change the 
abbreviations from TB, PB, etc. to TiB, PiB, etc.
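
For reference, a small Java sketch of the formatting logic in question (the 
real fmt_bytes is a JavaScript helper in the HDFS web UI; this is only an 
illustration): dividing by 1024 pairs with the binary IEC abbreviations KiB, 
MiB, GiB, TiB, PiB.
{code:java}
/**
 * Illustrative sketch only: formatting with binary (IEC) prefixes. Dividing by
 * 1024 matches KiB/MiB/GiB/TiB/PiB; SI units (KB/MB/GB/TB/PB) would instead
 * divide by 1000.
 */
public class FmtBytesSketch {

  private static final String[] IEC_UNITS = {"B", "KiB", "MiB", "GiB", "TiB", "PiB", "EiB"};

  static String fmtBytes(double bytes) {
    int unit = 0;
    while (bytes >= 1024 && unit < IEC_UNITS.length - 1) {
      bytes /= 1024;
      unit++;
    }
    return String.format("%.2f %s", bytes, IEC_UNITS[unit]);
  }

  public static void main(String[] args) {
    // 1.5 TiB is 1536 GiB, whereas 1.5 TB (SI) would be 1500 GB.
    System.out.println(fmtBytes(1.5 * 1024 * 1024 * 1024 * 1024)); // "1.50 TiB"
  }
}
{code}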



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14377) Incorrect unit abbreviations shown for fmt_bytes

2019-03-18 Thread Danny Becker (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-14377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Becker updated HDFS-14377:

Attachment: HDFS-14377.000.patch

> Incorrect unit abbreviations shown for fmt_bytes
> 
>
> Key: HDFS-14377
> URL: https://issues.apache.org/jira/browse/HDFS-14377
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Danny Becker
>Priority: Trivial
> Attachments: HDFS-14377.000.patch
>
>
> The function fmt_bytes shows the abbreviations for Terabyte, Petabyte, etc., 
> the standard metric (SI) units for data storage. However, the function 
> divides by a factor of 1024, which is the factor used for Tebibyte, Pebibyte, 
> etc. Change the abbreviations from TB, PB, etc. to TiB, PiB, etc.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14084) Need for more stats in DFSClient

2019-03-07 Thread Danny Becker (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16787165#comment-16787165
 ] 

Danny Becker commented on HDFS-14084:
-

[~xkrogen] It looks like that error is caused by the use of Arrays.toString() 
on clientId, which is a byte array. I see that in other parts of Client.java, 
StringUtils.byteToHexString() is used on clientId to print it or convert it to 
a string. Should that be used instead of Arrays.toString() for converting the 
byte array into a string?
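
For illustration, the difference in output looks roughly like this; the hex 
helper below is only a local stand-in for Hadoop's 
StringUtils.byteToHexString() so the example runs without Hadoop on the 
classpath.
{code:java}
import java.util.Arrays;

/** Illustration of signed-decimal vs. hex rendering of a client ID byte array. */
public class ClientIdFormattingSketch {

  // Local stand-in for StringUtils.byteToHexString(byte[]).
  static String toHexString(byte[] bytes) {
    StringBuilder sb = new StringBuilder(bytes.length * 2);
    for (byte b : bytes) {
      sb.append(String.format("%02x", b));
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    byte[] clientId = {(byte) 0xDE, (byte) 0xAD, (byte) 0xBE, (byte) 0xEF};
    System.out.println(Arrays.toString(clientId)); // [-34, -83, -66, -17]
    System.out.println(toHexString(clientId));     // deadbeef
  }
}
{code}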

> Need for more stats in DFSClient
> 
>
> Key: HDFS-14084
> URL: https://issues.apache.org/jira/browse/HDFS-14084
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Affects Versions: 3.0.0
>Reporter: Pranay Singh
>Assignee: Erik Krogen
>Priority: Minor
> Attachments: HDFS-14084.001.patch, HDFS-14084.002.patch, 
> HDFS-14084.003.patch, HDFS-14084.004.patch, HDFS-14084.005.patch, 
> HDFS-14084.006.patch, HDFS-14084.007.patch, HDFS-14084.008.patch, 
> HDFS-14084.009.patch, HDFS-14084.010.patch, HDFS-14084.011.patch, 
> HDFS-14084.012.patch, HDFS-14084.013.patch, HDFS-14084.014.patch, 
> HDFS-14084.015.patch, HDFS-14084.016.patch, HDFS-14084.017.patch, 
> HDFS-14084.018.patch
>
>
> The usage of HDFS has changed from being used as a map-reduce filesystem, now 
> it's becoming more of like a general purpose filesystem. In most of the cases 
> there are issues with the Namenode so we have metrics to know the workload or 
> stress on Namenode.
> However, there is a need to have more statistics collected for different 
> operations/RPCs in DFSClient to know which RPC operations are taking longer 
> time or to know what is the frequency of the operation.These statistics can 
> be exposed to the users of DFS Client and they can periodically log or do 
> some sort of flow control if the response is slow. This will also help to 
> isolate HDFS issue in a mixed environment where on a node say we have Spark, 
> HBase and Impala running together. We can check the throughput of different 
> operation across client and isolate the problem caused because of noisy 
> neighbor or network congestion or shared JVM.
> We have dealt with several problems from the field for which there is no 
> conclusive evidence as to what caused the problem. If we had metrics or stats 
> in DFSClient we would be better equipped to solve such complex problems.
> List of jiras for reference:
> -
>  HADOOP-15538 HADOOP-15530 ( client side deadlock)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14326) Add CorruptFilesCount to JMX

2019-03-05 Thread Danny Becker (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-14326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Becker updated HDFS-14326:

Attachment: HDFS-14326.004.patch

> Add CorruptFilesCount to JMX
> 
>
> Key: HDFS-14326
> URL: https://issues.apache.org/jira/browse/HDFS-14326
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: fs, metrics, namenode
>Reporter: Danny Becker
>Assignee: Danny Becker
>Priority: Minor
> Attachments: HDFS-14326.000.patch, HDFS-14326.001.patch, 
> HDFS-14326.002.patch, HDFS-14326.003.patch, HDFS-14326.004.patch
>
>
> Add CorruptFilesCount to JMX



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14326) Add CorruptFilesCount to JMX

2019-03-05 Thread Danny Becker (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784800#comment-16784800
 ] 

Danny Becker commented on HDFS-14326:
-

I have uploaded a patch based on the HDFS-14336 changes.

> Add CorruptFilesCount to JMX
> 
>
> Key: HDFS-14326
> URL: https://issues.apache.org/jira/browse/HDFS-14326
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: fs, metrics, namenode
>Reporter: Danny Becker
>Assignee: Danny Becker
>Priority: Minor
> Attachments: HDFS-14326.000.patch, HDFS-14326.001.patch, 
> HDFS-14326.002.patch, HDFS-14326.003.patch
>
>
> Add CorruptFilesCount to JMX



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14326) Add CorruptFilesCount to JMX

2019-03-05 Thread Danny Becker (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-14326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Becker updated HDFS-14326:

Attachment: HDFS-14326.003.patch

> Add CorruptFilesCount to JMX
> 
>
> Key: HDFS-14326
> URL: https://issues.apache.org/jira/browse/HDFS-14326
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: fs, metrics, namenode
>Reporter: Danny Becker
>Assignee: Danny Becker
>Priority: Minor
> Attachments: HDFS-14326.000.patch, HDFS-14326.001.patch, 
> HDFS-14326.002.patch, HDFS-14326.003.patch
>
>
> Add CorruptFilesCount to JMX



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14336) Fix checkstyle for NameNodeMXBean.java

2019-03-04 Thread Danny Becker (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16783937#comment-16783937
 ] 

Danny Becker commented on HDFS-14336:
-

No testing was added to this patch because it only fixes checkstyle issues.

> Fix checkstyle for NameNodeMXBean.java
> --
>
> Key: HDFS-14336
> URL: https://issues.apache.org/jira/browse/HDFS-14336
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: Danny Becker
>Assignee: Danny Becker
>Priority: Trivial
> Attachments: HDFS-14336.000.patch, HDFS-14336.001.patch
>
>
> Fix checkstyle in NameNodeMXBean.java and make it more uniform.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14336) Fix checkstyle for NameNodeMXBean.java

2019-03-04 Thread Danny Becker (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-14336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Becker updated HDFS-14336:

Attachment: HDFS-14336.001.patch

> Fix checkstyle for NameNodeMXBean.java
> --
>
> Key: HDFS-14336
> URL: https://issues.apache.org/jira/browse/HDFS-14336
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: Danny Becker
>Assignee: Danny Becker
>Priority: Trivial
> Attachments: HDFS-14336.000.patch, HDFS-14336.001.patch
>
>
> Fix checkstyle in NameNodeMXBean.java and make it more uniform.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14336) Fix checkstyle for NameNodeMXBean.java

2019-03-04 Thread Danny Becker (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-14336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Becker updated HDFS-14336:

Attachment: HDFS-14336.000.patch
Status: Patch Available  (was: Open)

> Fix checkstyle for NameNodeMXBean.java
> --
>
> Key: HDFS-14336
> URL: https://issues.apache.org/jira/browse/HDFS-14336
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: Danny Becker
>Assignee: Danny Becker
>Priority: Trivial
> Attachments: HDFS-14336.000.patch
>
>
> Fix checkstyle in NameNodeMXBean.java and make it more uniform.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-14336) Fix checkstyle for NameNodeMXBean.java

2019-03-04 Thread Danny Becker (JIRA)
Danny Becker created HDFS-14336:
---

 Summary: Fix checkstyle for NameNodeMXBean.java
 Key: HDFS-14336
 URL: https://issues.apache.org/jira/browse/HDFS-14336
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: namenode
Reporter: Danny Becker
Assignee: Danny Becker


Fix checkstyle in NameNodeMXBean.java and make it more uniform.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14326) Add CorruptFilesCount to JMX

2019-03-01 Thread Danny Becker (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-14326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Becker updated HDFS-14326:

Attachment: HDFS-14326.002.patch

> Add CorruptFilesCount to JMX
> 
>
> Key: HDFS-14326
> URL: https://issues.apache.org/jira/browse/HDFS-14326
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: fs, metrics, namenode
>Reporter: Danny Becker
>Assignee: Danny Becker
>Priority: Minor
> Attachments: HDFS-14326.000.patch, HDFS-14326.001.patch, 
> HDFS-14326.002.patch
>
>
> Add CorruptFilesCount to JMX



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14326) Add CorruptFilesCount to JMX

2019-02-28 Thread Danny Becker (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-14326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Becker updated HDFS-14326:

Attachment: HDFS-14326.001.patch

> Add CorruptFilesCount to JMX
> 
>
> Key: HDFS-14326
> URL: https://issues.apache.org/jira/browse/HDFS-14326
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: fs, metrics, namenode
>Reporter: Danny Becker
>Assignee: Danny Becker
>Priority: Minor
> Attachments: HDFS-14326.000.patch, HDFS-14326.001.patch
>
>
> Add CorruptFilesCount to JMX



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14326) Add CorruptFilesCount to JMX

2019-02-28 Thread Danny Becker (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16780962#comment-16780962
 ] 

Danny Becker commented on HDFS-14326:
-

[~elgoiri] The difference between getCorruptFilesCount() and 
getCorruptReplicatedBlocks() is that the former returns an integer representing 
the number of corrupted files, while the latter returns a list of corrupted 
replicated blocks. The number of corrupted replicated blocks can be greater than 
the number of corrupted files because a single file can contain multiple 
corrupted blocks. Does this help show the difference between the two?
I will refactor the two of them to avoid duplicating code.
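
To make the distinction concrete, a toy example (not NameNode code; the data and 
names are illustrative only):

{code:java}
import java.util.HashSet;
import java.util.Set;

// Toy example: a file with several corrupt blocks contributes several corrupt
// blocks but only one corrupt file, so the block count can exceed the file count.
public class CorruptCountsDemo {
  public static void main(String[] args) {
    // (blockId, filePath) pairs for corrupt replicated blocks; made-up data.
    String[][] corruptBlocks = {
        {"1001", "/data/a"},
        {"1002", "/data/a"},   // second corrupt block in the same file
        {"2001", "/data/b"}
    };

    Set<String> corruptFiles = new HashSet<>();
    for (String[] block : corruptBlocks) {
      corruptFiles.add(block[1]);
    }

    System.out.println("corrupt replicated blocks = " + corruptBlocks.length); // 3
    System.out.println("corrupt files count       = " + corruptFiles.size());  // 2
  }
}
{code}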

> Add CorruptFilesCount to JMX
> 
>
> Key: HDFS-14326
> URL: https://issues.apache.org/jira/browse/HDFS-14326
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: fs, metrics, namenode
>Reporter: Danny Becker
>Assignee: Danny Becker
>Priority: Minor
> Attachments: HDFS-14326.000.patch
>
>
> Add CorruptFilesCount to JMX



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14326) Add CorruptFilesCount to JMX

2019-02-28 Thread Danny Becker (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-14326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Becker updated HDFS-14326:

Attachment: HDFS-14326.000.patch
Status: Patch Available  (was: Open)

> Add CorruptFilesCount to JMX
> 
>
> Key: HDFS-14326
> URL: https://issues.apache.org/jira/browse/HDFS-14326
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: fs, metrics, namenode
>Reporter: Danny Becker
>Assignee: Danny Becker
>Priority: Minor
> Attachments: HDFS-14326.000.patch
>
>
> Add CorruptFilesCount to JMX



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-14326) Add CorruptFilesCount to JMX

2019-02-28 Thread Danny Becker (JIRA)
Danny Becker created HDFS-14326:
---

 Summary: Add CorruptFilesCount to JMX
 Key: HDFS-14326
 URL: https://issues.apache.org/jira/browse/HDFS-14326
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: fs, metrics, namenode
Reporter: Danny Becker
Assignee: Danny Becker


Add CorruptFilesCount to JMX



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14069) Better debuggability for datanode decommissioning

2018-12-03 Thread Danny Becker (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-14069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Becker updated HDFS-14069:

Attachment: HDFS-14069.002.patch

> Better debuggability for datanode decommissioning
> -
>
> Key: HDFS-14069
> URL: https://issues.apache.org/jira/browse/HDFS-14069
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode, hdfs, namenode
>Reporter: Danny Becker
>Assignee: Danny Becker
>Priority: Major
> Attachments: HDFS-14069.000.patch, HDFS-14069.001.patch, 
> HDFS-14069.002.patch, HDFS-14069.002.patch
>
>
> Currently, we don't provide any debugging info for decommissioning DN, it is 
> difficult to determine which blocks are on their last replica. We have two 
> design options:
>  # Add block info for blocks with low replication (configurable)
>  ** Advantages:
>  *** Initial debugging information would be more thorough
>  *** Easier initial implementation
>  ** Disadvantages:
>  *** Add load to normal NN operation by checking every time a DN is 
> decommissioned
>  *** More difficult to add debugging information later on
>  # Create a new api for querying more detailed info about one DN
>  ** Advantages:
>  *** We wouldnt be adding more load to the NN in normal operation
>  *** Much easier to extend in the future with more info
>  ** Disadvantages:
>  *** Getting the info on demand for this case will be much more expensive 
> actually, cause we will have to find all the blocks on that DN, and then go 
> through all the blocks again and count how many replicas we have etc.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14069) Better debuggability for datanode decommissioning

2018-12-03 Thread Danny Becker (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-14069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Becker updated HDFS-14069:

Attachment: (was: HDFS-14069.002.patch)

> Better debuggability for datanode decommissioning
> -
>
> Key: HDFS-14069
> URL: https://issues.apache.org/jira/browse/HDFS-14069
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode, hdfs, namenode
>Reporter: Danny Becker
>Assignee: Danny Becker
>Priority: Major
> Attachments: HDFS-14069.000.patch, HDFS-14069.001.patch, 
> HDFS-14069.002.patch
>
>
> Currently, we don't provide any debugging info for decommissioning DN, it is 
> difficult to determine which blocks are on their last replica. We have two 
> design options:
>  # Add block info for blocks with low replication (configurable)
>  ** Advantages:
>  *** Initial debugging information would be more thorough
>  *** Easier initial implementation
>  ** Disadvantages:
>  *** Add load to normal NN operation by checking every time a DN is 
> decommissioned
>  *** More difficult to add debugging information later on
>  # Create a new api for querying more detailed info about one DN
>  ** Advantages:
>  *** We wouldnt be adding more load to the NN in normal operation
>  *** Much easier to extend in the future with more info
>  ** Disadvantages:
>  *** Getting the info on demand for this case will be much more expensive 
> actually, cause we will have to find all the blocks on that DN, and then go 
> through all the blocks again and count how many replicas we have etc.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14069) Better debuggability for datanode decommissioning

2018-11-30 Thread Danny Becker (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-14069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Becker updated HDFS-14069:

Attachment: HDFS-14069.002.patch

> Better debuggability for datanode decommissioning
> -
>
> Key: HDFS-14069
> URL: https://issues.apache.org/jira/browse/HDFS-14069
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode, hdfs, namenode
>Reporter: Danny Becker
>Assignee: Danny Becker
>Priority: Major
> Attachments: HDFS-14069.000.patch, HDFS-14069.001.patch, 
> HDFS-14069.002.patch
>
>
> Currently, we don't provide any debugging info for decommissioning DN, it is 
> difficult to determine which blocks are on their last replica. We have two 
> design options:
>  # Add block info for blocks with low replication (configurable)
>  ** Advantages:
>  *** Initial debugging information would be more thorough
>  *** Easier initial implementation
>  ** Disadvantages:
>  *** Add load to normal NN operation by checking every time a DN is 
> decommissioned
>  *** More difficult to add debugging information later on
>  # Create a new api for querying more detailed info about one DN
>  ** Advantages:
>  *** We wouldnt be adding more load to the NN in normal operation
>  *** Much easier to extend in the future with more info
>  ** Disadvantages:
>  *** Getting the info on demand for this case will be much more expensive 
> actually, cause we will have to find all the blocks on that DN, and then go 
> through all the blocks again and count how many replicas we have etc.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14069) Better debuggability for datanode decommissioning

2018-11-28 Thread Danny Becker (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-14069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16702530#comment-16702530
 ] 

Danny Becker commented on HDFS-14069:
-

[~manojg] [~xkrogen] Is this something you would be interested in or could 
provide feedback on?

> Better debuggability for datanode decommissioning
> -
>
> Key: HDFS-14069
> URL: https://issues.apache.org/jira/browse/HDFS-14069
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode, hdfs, namenode
>Reporter: Danny Becker
>Assignee: Danny Becker
>Priority: Major
> Attachments: HDFS-14069.000.patch, HDFS-14069.001.patch
>
>
> Currently, we don't provide any debugging info for decommissioning DN, it is 
> difficult to determine which blocks are on their last replica. We have two 
> design options:
>  # Add block info for blocks with low replication (configurable)
>  ** Advantages:
>  *** Initial debugging information would be more thorough
>  *** Easier initial implementation
>  ** Disadvantages:
>  *** Add load to normal NN operation by checking every time a DN is 
> decommissioned
>  *** More difficult to add debugging information later on
>  # Create a new api for querying more detailed info about one DN
>  ** Advantages:
>  *** We wouldnt be adding more load to the NN in normal operation
>  *** Much easier to extend in the future with more info
>  ** Disadvantages:
>  *** Getting the info on demand for this case will be much more expensive 
> actually, cause we will have to find all the blocks on that DN, and then go 
> through all the blocks again and count how many replicas we have etc.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14069) Better debuggability for datanode decommissioning

2018-11-13 Thread Danny Becker (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-14069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Becker updated HDFS-14069:

Attachment: HDFS-14069.001.patch

> Better debuggability for datanode decommissioning
> -
>
> Key: HDFS-14069
> URL: https://issues.apache.org/jira/browse/HDFS-14069
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode, hdfs, namenode
>Reporter: Danny Becker
>Assignee: Danny Becker
>Priority: Major
> Attachments: HDFS-14069.000.patch, HDFS-14069.001.patch
>
>
> Currently, we don't provide any debugging info for decommissioning DN, it is 
> difficult to determine which blocks are on their last replica. We have two 
> design options:
>  # Add block info for blocks with low replication (configurable)
>  ** Advantages:
>  *** Initial debugging information would be more thorough
>  *** Easier initial implementation
>  ** Disadvantages:
>  *** Add load to normal NN operation by checking every time a DN is 
> decommissioned
>  *** More difficult to add debugging information later on
>  # Create a new api for querying more detailed info about one DN
>  ** Advantages:
>  *** We wouldnt be adding more load to the NN in normal operation
>  *** Much easier to extend in the future with more info
>  ** Disadvantages:
>  *** Getting the info on demand for this case will be much more expensive 
> actually, cause we will have to find all the blocks on that DN, and then go 
> through all the blocks again and count how many replicas we have etc.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14069) Better debuggability for datanode decommissioning

2018-11-12 Thread Danny Becker (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-14069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Becker updated HDFS-14069:

Description: 
Currently, we don't provide any debugging info for a decommissioning DN, so it is 
difficult to determine which blocks are on their last replica. We have two 
design options:
 # Add block info for blocks with low replication (configurable)
 ** Advantages:
 *** Initial debugging information would be more thorough
 *** Easier initial implementation
 ** Disadvantages:
 *** Add load to normal NN operation by checking every time a DN is 
decommissioned
 *** More difficult to add debugging information later on
 # Create a new api for querying more detailed info about one DN
 ** Advantages:
 *** We wouldn't be adding more load to the NN in normal operation
 *** Much easier to extend in the future with more info
 ** Disadvantages:
 *** Getting the info on demand for this case will actually be much more expensive, 
because we will have to find all the blocks on that DN and then go through all the 
blocks again to count how many replicas we have, etc. (a rough sketch of what this 
query involves is below)
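
A rough sketch of what that on-demand query would have to do; all names below are 
hypothetical stand-ins for NameNode internals, not an actual API:

{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: enumerate a DN's blocks, then count live replicas for
// each one to find the blocks whose last live replica is on this DN.
public class LastReplicaQuerySketch {
  static List<Long> blocksOnLastReplica(String datanodeUuid,
                                        Map<String, List<Long>> blocksOnDatanode,
                                        Map<Long, Integer> liveReplicaCount) {
    List<Long> result = new ArrayList<>();
    // Pass 1: every block stored on the decommissioning DN.
    for (long blockId : blocksOnDatanode.getOrDefault(datanodeUuid, List.of())) {
      // Pass 2: count that block's live replicas across the cluster.
      if (liveReplicaCount.getOrDefault(blockId, 0) <= 1) {
        result.add(blockId);  // this DN holds the only remaining live replica
      }
    }
    return result;
  }
}
{code}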

  was:
Currently, we only provide "minLiveReplicas" per DN that is being 
decommissioned, but this is not enough info because it is difficult to determine 
which blocks are on their last replica. We have two design options:
 # Add it to the existing report, on top of minLiveReplicas
 ** Advantages:
 *** Initial debugging information would be more thorough
 *** Easier initial implementation
 ** Disadvantages:
 *** Add load to normal NN operation by checking every time a DN is 
decommissioned
 *** More difficult to add debugging information later on
 # Create a new api for querying more detailed info about one DN
 ** Advantages:
 *** We wouldn't be adding more load to the NN in normal operation
 *** Much easier to extend in the future with more info
 ** Disadvantages:
 *** Getting the info on demand for this case will actually be much more expensive, 
because we will have to find all the blocks on that DN and then go through all the 
blocks again to count how many replicas we have, etc.


> Better debuggability for datanode decommissioning
> -
>
> Key: HDFS-14069
> URL: https://issues.apache.org/jira/browse/HDFS-14069
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode, hdfs, namenode
>Reporter: Danny Becker
>Assignee: Danny Becker
>Priority: Major
> Attachments: HDFS-14069.000.patch
>
>
> Currently, we don't provide any debugging info for decommissioning DN, it is 
> difficult to determine which blocks are on their last replica. We have two 
> design options:
>  # Add block info for blocks with low replication (configurable)
>  ** Advantages:
>  *** Initial debugging information would be more thorough
>  *** Easier initial implementation
>  ** Disadvantages:
>  *** Add load to normal NN operation by checking every time a DN is 
> decommissioned
>  *** More difficult to add debugging information later on
>  # Create a new api for querying more detailed info about one DN
>  ** Advantages:
>  *** We wouldnt be adding more load to the NN in normal operation
>  *** Much easier to extend in the future with more info
>  ** Disadvantages:
>  *** Getting the info on demand for this case will be much more expensive 
> actually, cause we will have to find all the blocks on that DN, and then go 
> through all the blocks again and count how many replicas we have etc.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14069) Better debuggability for datanode decomissioning

2018-11-12 Thread Danny Becker (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-14069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Becker updated HDFS-14069:

Description: 
Currently, we only provide "minLiveReplicas" per DN that is being 
decommissioned, but this is not enough info because it is difficult to determine 
which blocks are on their last replica. We have two design options:
 # Add it to the existing report, on top of minLiveReplicas
 ** Advantages:
 *** Initial debugging information would be more thorough
 ** Disadvantages:
 *** Add load to normal NN operation by checking every time a DN is 
decommissioned
 # Create a new api for querying more detailed info about one DN
 ** Advantages:
 *** We wouldn't be adding more load to the NN in normal operation
 *** Much easier to extend in the future with more info
 ** Disadvantages:
 *** Getting the info on demand for this case will actually be much more expensive, 
because we will have to find all the blocks on that DN and then go through all the 
blocks again to count how many replicas we have, etc.

  was:
Currently, we only provide "minLiveReplicas" per DN that is being 
decommissioned, but this is not enough info because it is difficult to determine 
which blocks are on their last replica. We have two design options:
 # Add it to the existing report, on top of minLiveReplicas
 ** Advantages:
 *** Initial debugging information would be more thorough
 ** Disadvantages:
 *** 
 # Create a new api for querying more detailed info about one DN
 ** Advantages:
 *** We wouldn't be adding more load to the NN in normal operation
 *** Much easier to extend in the future with more info
 ** Disadvantages:
 *** Getting the info on demand for this case will actually be much more expensive, 
because we will have to find all the blocks on that DN and then go through all the 
blocks again to count how many replicas we have, etc.


> Better debuggability for datanode decomissioning
> 
>
> Key: HDFS-14069
> URL: https://issues.apache.org/jira/browse/HDFS-14069
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode, hdfs, namenode
>Reporter: Danny Becker
>Priority: Major
> Attachments: HDFS-14069.000.patch
>
>
> Currently, we only provide "minLiveReplicas" per DN that is being 
> decommissioned, this is not enough info because it is difficult to determine 
> which blocks are on their last replica. We have two design options:
>  # Add it to the existing report, on top of minLiveReplicas
>  ** Advantages:
>  *** Initial debugging information would be more thorough
>  ** Disadvantages:
>  *** Add load to normal NN operation by checking every time a DN is 
> decommissioned
>  # Create a new api for querying more detailed info about one DN
>  ** Advantages:
>  *** We wouldnt be adding more load to the NN in normal operation
>  *** Much easier to extend in the future with more info
>  ** Disadvantages:
>  *** Getting the info on demand for this case will be much more expensive 
> actually, cause we will have to find all the blocks on that DN, and then go 
> through all the blocks again and count how many replicas we have etc.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14069) Better debuggability for datanode decommissioning

2018-11-12 Thread Danny Becker (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-14069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Becker updated HDFS-14069:

Description: 
Currently, we only provide "minLiveReplicas" per DN that is being 
decommissioned, but this is not enough info because it is difficult to determine 
which blocks are on their last replica. We have two design options:
 # Add it to the existing report, on top of minLiveReplicas
 ** Advantages:
 *** Initial debugging information would be more thorough
 *** Easier initial implementation
 ** Disadvantages:
 *** Add load to normal NN operation by checking every time a DN is 
decommissioned
 *** More difficult to add debugging information later on
 # Create a new api for querying more detailed info about one DN
 ** Advantages:
 *** We wouldn't be adding more load to the NN in normal operation
 *** Much easier to extend in the future with more info
 ** Disadvantages:
 *** Getting the info on demand for this case will actually be much more expensive, 
because we will have to find all the blocks on that DN and then go through all the 
blocks again to count how many replicas we have, etc.

  was:
Currently, we only provide "minLiveReplicas" per DN that is being 
decommissioned, but this is not enough info because it is difficult to determine 
which blocks are on their last replica. We have two design options:
 # Add it to the existing report, on top of minLiveReplicas
 ** Advantages:
 *** Initial debugging information would be more thorough
 ** Disadvantages:
 *** Add load to normal NN operation by checking every time a DN is 
decommissioned
 # Create a new api for querying more detailed info about one DN
 ** Advantages:
 *** We wouldn't be adding more load to the NN in normal operation
 *** Much easier to extend in the future with more info
 ** Disadvantages:
 *** Getting the info on demand for this case will actually be much more expensive, 
because we will have to find all the blocks on that DN and then go through all the 
blocks again to count how many replicas we have, etc.


> Better debuggability for datanode decommissioning
> -
>
> Key: HDFS-14069
> URL: https://issues.apache.org/jira/browse/HDFS-14069
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode, hdfs, namenode
>Reporter: Danny Becker
>Priority: Major
> Attachments: HDFS-14069.000.patch
>
>
> Currently, we only provide "minLiveReplicas" per DN that is being 
> decommissioned, this is not enough info because it is difficult to determine 
> which blocks are on their last replica. We have two design options:
>  # Add it to the existing report, on top of minLiveReplicas
>  ** Advantages:
>  *** Initial debugging information would be more thorough
>  *** Easier initial implementation
>  ** Disadvantages:
>  *** Add load to normal NN operation by checking every time a DN is 
> decommissioned
>  *** More difficult to add debugging information later on
>  # Create a new api for querying more detailed info about one DN
>  ** Advantages:
>  *** We wouldnt be adding more load to the NN in normal operation
>  *** Much easier to extend in the future with more info
>  ** Disadvantages:
>  *** Getting the info on demand for this case will be much more expensive 
> actually, cause we will have to find all the blocks on that DN, and then go 
> through all the blocks again and count how many replicas we have etc.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14069) Better debuggability for datanode decommissioning

2018-11-12 Thread Danny Becker (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-14069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Becker updated HDFS-14069:

Summary: Better debuggability for datanode decommissioning  (was: Better 
debuggability for datanode decomissioning)

> Better debuggability for datanode decommissioning
> -
>
> Key: HDFS-14069
> URL: https://issues.apache.org/jira/browse/HDFS-14069
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode, hdfs, namenode
>Reporter: Danny Becker
>Priority: Major
> Attachments: HDFS-14069.000.patch
>
>
> Currently, we only provide "minLiveReplicas" per DN that is being 
> decommissioned, this is not enough info because it is difficult to determine 
> which blocks are on their last replica. We have two design options:
>  # Add it to the existing report, on top of minLiveReplicas
>  ** Advantages:
>  *** Initial debugging information would be more thorough
>  ** Disadvantages:
>  *** Add load to normal NN operation by checking every time a DN is 
> decommissioned
>  # Create a new api for querying more detailed info about one DN
>  ** Advantages:
>  *** We wouldnt be adding more load to the NN in normal operation
>  *** Much easier to extend in the future with more info
>  ** Disadvantages:
>  *** Getting the info on demand for this case will be much more expensive 
> actually, cause we will have to find all the blocks on that DN, and then go 
> through all the blocks again and count how many replicas we have etc.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14069) Better debuggability for datanode decomissioning

2018-11-12 Thread Danny Becker (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-14069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Becker updated HDFS-14069:

Description: 
Currently, we only provide "minLiveReplicas" per DN that is being 
decommissioned, but this is not enough info because it is difficult to determine 
which blocks are on their last replica. We have two design options:
 # Add it to the existing report, on top of minLiveReplicas
 ** Advantages:
 *** Initial debugging information would be more thorough
 ** Disadvantages:
 *** 
 # Create a new api for querying more detailed info about one DN
 ** Advantages:
 *** We wouldn't be adding more load to the NN in normal operation
 *** Much easier to extend in the future with more info
 ** Disadvantages:
 *** Getting the info on demand for this case will actually be much more expensive, 
because we will have to find all the blocks on that DN and then go through all the 
blocks again to count how many replicas we have, etc.

  was:
Currently, we only provide "minLiveReplicas" per DN that is being 
decommissioned, but this is not enough info because it is difficult to determine 
which blocks are on their last replica. We have two design options:
 # Add it to the existing report, on top of minLiveReplicas
 ** Advantages:
 *** 
 ** Disadvantages:
 *** 
 # Create a new api for querying more detailed info about one DN
 ** Advantages:
 *** We wouldn't be adding more load to the NN in normal operation
 *** Much easier to extend in the future with more info
 ** Disadvantages:
 *** Getting the info on demand for this case will actually be much more expensive, 
because we will have to find all the blocks on that DN and then go through all the 
blocks again to count how many replicas we have, etc.


> Better debuggability for datanode decomissioning
> 
>
> Key: HDFS-14069
> URL: https://issues.apache.org/jira/browse/HDFS-14069
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode, hdfs, namenode
>Reporter: Danny Becker
>Priority: Major
> Attachments: HDFS-14069.000.patch
>
>
> Currently, we only provide "minLiveReplicas" per DN that is being 
> decommissioned, this is not enough info because it is difficult to determine 
> which blocks are on their last replica. We have two design options:
>  # Add it to the existing report, on top of minLiveReplicas
>  ** Advantages:
>  *** Initial debugging information would be more thorough
>  ** Disadvantages:
>  *** 
>  # Create a new api for querying more detailed info about one DN
>  ** Advantages:
>  *** We wouldnt be adding more load to the NN in normal operation
>  *** Much easier to extend in the future with more info
>  ** Disadvantages:
>  *** Getting the info on demand for this case will be much more expensive 
> actually, cause we will have to find all the blocks on that DN, and then go 
> through all the blocks again and count how many replicas we have etc.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14069) Better debuggability for datanode decomissioning

2018-11-12 Thread Danny Becker (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-14069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Becker updated HDFS-14069:

Description: 
Currently, we only provide "minLiveReplicas" per DN that is being 
decommissioned, but this is not enough info because it is difficult to determine 
which blocks are on their last replica. We have two design options:
 # Add it to the existing report, on top of minLiveReplicas
 ** Advantages:
 *** 
 ** Disadvantages:
 *** 
 # Create a new api for querying more detailed info about one DN
 ** Advantages:
 *** We wouldn't be adding more load to the NN in normal operation
 *** Much easier to extend in the future with more info
 ** Disadvantages:
 *** Getting the info on demand for this case will actually be much more expensive, 
because we will have to find all the blocks on that DN and then go through all the 
blocks again to count how many replicas we have, etc.

  was:
Currently, we only provide "minLiveReplicas" per DN that is being 
decommissioned, but this is not enough info because it is difficult to determine 
which blocks are on their last replica. We have two design options:
 # Add it to the existing report, on top of minLiveReplicas
 ** Advantages:
 ***
 ** Disadvantages:
 *** 
 # Create a new api for querying more detailed info about one DN
 ** Advantages:
 *** We wouldn't be adding more load to the NN in normal operation
 *** Much easier to extend in the future with more info
 ** Disadvantages:
 *** Getting the info on demand for this case will actually be much more expensive, 
because we will have to find all the blocks on that DN and then go through all the 
blocks again to count how many replicas we have, etc.


> Better debuggability for datanode decomissioning
> 
>
> Key: HDFS-14069
> URL: https://issues.apache.org/jira/browse/HDFS-14069
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode, hdfs, namenode
>Reporter: Danny Becker
>Priority: Major
> Attachments: HDFS-14069.000.patch
>
>
> Currently, we only provide "minLiveReplicas" per DN that is being 
> decommissioned, this is not enough info because it is difficult to determine 
> which blocks are on their last replica. We have two design options:
>  # Add it to the existing report, on top of minLiveReplicas
>  ** Advantages:
>  *** 
>  ** Disadvantages:
>  *** 
>  # Create a new api for querying more detailed info about one DN
>  ** Advantages:
>  *** We wouldnt be adding more load to the NN in normal operation
>  *** Much easier to extend in the future with more info
>  ** Disadvantages:
>  *** Getting the info on demand for this case will be much more expensive 
> actually, cause we will have to find all the blocks on that DN, and then go 
> through all the blocks again and count how many replicas we have etc.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14069) Better debuggability for datanode decomissioning

2018-11-12 Thread Danny Becker (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-14069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Becker updated HDFS-14069:

Description: 
Currently, we only provide "minLiveReplicas" per DN that is being 
decommissioned, but this is not enough info because it is difficult to determine 
which blocks are on their last replica. We have two design options:
 # Add it to the existing report, on top of minLiveReplicas
 # Create a new api for querying more detailed info about one DN
 ** Advantages:
 *** We wouldn't be adding more load to the NN in normal operation
 *** Much easier to extend in the future with more info
 ** Disadvantages:
 *** Getting the info on demand for this case will actually be much more expensive, 
because we will have to find all the blocks on that DN and then go through all the 
blocks again to count how many replicas we have, etc.

  was:
Currently, we only provide "minLiveReplicas" per DN that is being 
decommissioned, but this is not enough info because it is difficult to determine 
which blocks are on their last replica. We have two design options:
 # Add it to the existing report, on top of minLiveReplicas
 # Create a new api for querying more detailed info about one DN
 ** Advantages:
 *** we wouldn't be adding more load to the NN in normal operation
 *** much easier to extend in the future with more info
 ** Disadvantages:


> Better debuggability for datanode decomissioning
> 
>
> Key: HDFS-14069
> URL: https://issues.apache.org/jira/browse/HDFS-14069
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode, hdfs, namenode
>Reporter: Danny Becker
>Priority: Major
> Attachments: HDFS-14069.000.patch
>
>
> Currently, we only provide "minLiveReplicas" per DN that is being 
> decommissioned, this is not enough info because it is difficult to determine 
> which blocks are on their last replica. We have two design options:
>  # Add it to the existing report, on top of minLiveReplicas
>  # Create a new api for querying more detailed info about one DN
>  ** Advantages:
>  *** We wouldnt be adding more load to the NN in normal operation
>  *** Much easier to extend in the future with more info
>  ** Disadvantages:
>  *** Getting the info on demand for this case will be much more expensive 
> actually, cause we will have to find all the blocks on that DN, and then go 
> through all the blocks again and count how many replicas we have etc.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14069) Better debuggability for datanode decomissioning

2018-11-12 Thread Danny Becker (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-14069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Becker updated HDFS-14069:

Summary: Better debuggability for datanode decomissioning  (was: Better 
debuggability for datanode decommissioning)

> Better debuggability for datanode decomissioning
> 
>
> Key: HDFS-14069
> URL: https://issues.apache.org/jira/browse/HDFS-14069
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode, hdfs, namenode
>Reporter: Danny Becker
>Priority: Major
> Attachments: HDFS-14069.000.patch
>
>
> Currently, we only provide "minLiveReplicas" per DN that is being decomission
> Add totalAccessibleBlocks to NumberReplicas
>  Add logic to track blocks that have less than the maxReplicasTracked
>  Add Map of low replica blockids to DatanodeDescriptor



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14069) Better debuggability for datanode decomissioning

2018-11-12 Thread Danny Becker (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-14069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Becker updated HDFS-14069:

Description: 
Currently, we only provide "minLiveReplicas" per DN that is being 
decommissioned, but this is not enough info because it is difficult to determine 
which blocks are on their last replica. We have two design options:
 # Add it to the existing report, on top of minLiveReplicas
 # Create a new api for querying more detailed info about one DN
 ** Advantages:
 *** we wouldn't be adding more load to the NN in normal operation
 *** much easier to extend in the future with more info
 ** Disadvantages:

  was:
Currently, we only provide "minLiveReplicas" per DN that is being 
decommissioned, but this is not enough info because it is difficult to determine 
which blocks are on their last replica. We have two design options:
\t


> Better debuggability for datanode decomissioning
> 
>
> Key: HDFS-14069
> URL: https://issues.apache.org/jira/browse/HDFS-14069
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode, hdfs, namenode
>Reporter: Danny Becker
>Priority: Major
> Attachments: HDFS-14069.000.patch
>
>
> Currently, we only provide "minLiveReplicas" per DN that is being 
> decommissioned, this is not enough info because it is difficult to determine 
> which blocks are on their last replica. We have two design options:
>  # Add it to the existing report, on top of minLiveReplicas
>  # Create a new api for querying more detailed info about one DN
>  ** Advantages:
>  *** we wouldnt be adding more load to the NN in normal operation
>  *** much easier to extend in the future with more info
>  ** Disadvantages:



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14069) Better debuggability for datanode decomissioning

2018-11-12 Thread Danny Becker (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-14069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Becker updated HDFS-14069:

Description: 
Currently, we only provide "minLiveReplicas" per DN that is being 
decommissioned, but this is not enough info because it is difficult to determine 
which blocks are on their last replica. We have two design options:
\t

  was:
Currently, we only provide "minLiveReplicas" per DN that is being decommissioned

Add totalAccessibleBlocks to NumberReplicas
 Add logic to track blocks that have less than the maxReplicasTracked
 Add Map of low replica blockids to DatanodeDescriptor


> Better debuggability for datanode decomissioning
> 
>
> Key: HDFS-14069
> URL: https://issues.apache.org/jira/browse/HDFS-14069
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode, hdfs, namenode
>Reporter: Danny Becker
>Priority: Major
> Attachments: HDFS-14069.000.patch
>
>
> Currently, we only provide "minLiveReplicas" per DN that is being 
> decommissioned, this is not enough info because it is difficult to determine 
> which blocks are on their last replica. We have two design options:
> \t



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14069) Better debuggability for datanode decommissioning

2018-11-12 Thread Danny Becker (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-14069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Becker updated HDFS-14069:

Description: 
Currently, we only provide "minLiveReplicas" per DN that is being decommissioned

Add totalAccessibleBlocks to NumberReplicas
 Add logic to track blocks that have less than the maxReplicasTracked
 Add Map of low replica block IDs to DatanodeDescriptor (a rough sketch is below)
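
A minimal sketch of the proposed tracking structure; this is not the real 
DatanodeDescriptor, and the names and shape below are assumptions for illustration:

{code:java}
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of per-datanode tracking of low-replica blocks.
public class LowReplicaBlockTracker {
  // Configurable cut-off: only blocks with fewer live replicas are tracked.
  private final int maxReplicasTracked;

  // blockId -> current live replica count, for blocks below the cut-off.
  private final Map<Long, Integer> lowReplicaBlocks = new HashMap<>();

  public LowReplicaBlockTracker(int maxReplicasTracked) {
    this.maxReplicasTracked = maxReplicasTracked;
  }

  // Called as the decommission scan counts live replicas for each block
  // stored on this datanode.
  public void recordReplicaCount(long blockId, int liveReplicas) {
    if (liveReplicas < maxReplicasTracked) {
      lowReplicaBlocks.put(blockId, liveReplicas);
    } else {
      lowReplicaBlocks.remove(blockId);  // replication recovered; stop tracking
    }
  }

  // What would be exposed, e.g. through JMX, for debugging.
  public Map<Long, Integer> getLowReplicaBlocks() {
    return Collections.unmodifiableMap(lowReplicaBlocks);
  }
}
{code}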

  was:
 

Add totalAccessibleBlocks to NumberReplicas
 Add logic to track blocks that have less than the maxReplicasTracked
 Add Map of low replica blockids to DatanodeDescriptor


> Better debuggability for datanode decommissioning
> -
>
> Key: HDFS-14069
> URL: https://issues.apache.org/jira/browse/HDFS-14069
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode, hdfs, namenode
>Reporter: Danny Becker
>Priority: Major
> Attachments: HDFS-14069.000.patch
>
>
> Currently, we only provide "minLiveReplicas" per DN that is being decomission
> Add totalAccessibleBlocks to NumberReplicas
>  Add logic to track blocks that have less than the maxReplicasTracked
>  Add Map of low replica blockids to DatanodeDescriptor



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14069) Add BlockIds to JMX info

2018-11-12 Thread Danny Becker (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-14069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Becker updated HDFS-14069:

Description: 
 

Add totalAccessibleBlocks to NumberReplicas
 Add logic to track blocks that have less than the maxReplicasTracked
 Add Map of low replica blockids to DatanodeDescriptor

  was:
Add totalAccessibleBlocks to NumberReplicas
Add logic to track blocks that have less than the maxReplicasTracked
Add Map of low replica blockids to DatanodeDescriptor


> Add BlockIds to JMX info
> 
>
> Key: HDFS-14069
> URL: https://issues.apache.org/jira/browse/HDFS-14069
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode, hdfs, namenode
>Reporter: Danny Becker
>Priority: Major
> Attachments: HDFS-14069.000.patch
>
>
>  
> Add totalAccessibleBlocks to NumberReplicas
>  Add logic to track blocks that have less than the maxReplicasTracked
>  Add Map of low replica blockids to DatanodeDescriptor



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14069) Better debuggability for datanode decommissioning

2018-11-12 Thread Danny Becker (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-14069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Becker updated HDFS-14069:

Summary: Better debuggability for datanode decommissioning  (was: Add 
BlockIds to JMX info)

> Better debuggability for datanode decommissioning
> -
>
> Key: HDFS-14069
> URL: https://issues.apache.org/jira/browse/HDFS-14069
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode, hdfs, namenode
>Reporter: Danny Becker
>Priority: Major
> Attachments: HDFS-14069.000.patch
>
>
>  
> Add totalAccessibleBlocks to NumberReplicas
>  Add logic to track blocks that have less than the maxReplicasTracked
>  Add Map of low replica blockids to DatanodeDescriptor



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14069) Add BlockIds to JMX info

2018-11-12 Thread Danny Becker (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-14069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Becker updated HDFS-14069:

Description: 
Add totalAccessibleBlocks to NumberReplicas
Add logic to track blocks that have less than the maxReplicasTracked
Add Map of low replica blockids to DatanodeDescriptor

> Add BlockIds to JMX info
> 
>
> Key: HDFS-14069
> URL: https://issues.apache.org/jira/browse/HDFS-14069
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode, hdfs, namenode
>Reporter: Danny Becker
>Priority: Major
> Attachments: HDFS-14069.000.patch
>
>
> Add totalAccessibleBlocks to NumberReplicas
> Add logic to track blocks that have less than the maxReplicasTracked
> Add Map of low replica blockids to DatanodeDescriptor



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14069) Add BlockIds to JMX info

2018-11-12 Thread Danny Becker (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-14069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Becker updated HDFS-14069:

Attachment: HDFS-14069.000.patch

> Add BlockIds to JMX info
> 
>
> Key: HDFS-14069
> URL: https://issues.apache.org/jira/browse/HDFS-14069
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode, hdfs, namenode
>Reporter: Danny Becker
>Priority: Major
> Attachments: HDFS-14069.000.patch
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-14069) Add BlockIds to JMX info

2018-11-12 Thread Danny Becker (JIRA)
Danny Becker created HDFS-14069:
---

 Summary: Add BlockIds to JMX info
 Key: HDFS-14069
 URL: https://issues.apache.org/jira/browse/HDFS-14069
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: datanode, hdfs, namenode
Reporter: Danny Becker






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org