[jira] [Commented] (HDDS-4385) It would be nice if there is a search function using container ID on Recon Missing Container page

2020-10-26 Thread Glen Geng (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-4385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17221079#comment-17221079
 ] 

Glen Geng commented on HDDS-4385:
-

cc [~avijayan]

> It would be nice if there is a search function using container ID on Recon 
> Missing Container page
> -
>
> Key: HDDS-4385
> URL: https://issues.apache.org/jira/browse/HDDS-4385
> Project: Hadoop Distributed Data Store
>  Issue Type: Improvement
>  Components: Ozone Recon
>Reporter: Sammi Chen
>Priority: Major
> Attachments: image-2020-10-23-12-08-12-705.png
>
>
> In a production cluster, there can be many missing containers to investigate.
> It would be nice to have a search filter using the Container ID.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org



[jira] [Commented] (HDDS-4355) Deleted container is marked as missing on recon UI

2020-10-26 Thread Glen Geng (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-4355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17221081#comment-17221081
 ] 

Glen Geng commented on HDDS-4355:
-

cc [~avijayan]

> Deleted container is marked as missing on recon UI
> --
>
> Key: HDDS-4355
> URL: https://issues.apache.org/jira/browse/HDDS-4355
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>Reporter: Sammi Chen
>Priority: Major
> Attachments: screenshot-1.png
>
>
> {noformat}
>  ~/ozoneenv/ozone]$ bin/ozone admin container info 104825
> Container id: 104825
> Pipeline id: 10955a24-2047-416f-85ac-94523cfe8d40
> Container State: DELETED
> Datanodes: []
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-4386) Each EndpointStateMachine uses its own thread pool to talk with SCM/Recon

2020-10-22 Thread Glen Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-4386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Glen Geng updated HDDS-4386:

Description: 
In our Tencent production environment, a while after starting Recon, we got 
warnings that all DNs became stale/dead on the SCM side. After killing Recon, 
all DNs became healthy again within a very short time.

 

*The root cause is:*

1) The EndpointStateMachine for SCM and the one for Recon share the thread pool 
created by DatanodeStateMachine, which is a fixed-size thread pool:
{code:java}
executorService = Executors.newFixedThreadPool(
    getEndPointTaskThreadPoolSize(),
    new ThreadFactoryBuilder()
        .setNameFormat("Datanode State Machine Task Thread - %d").build());

private int getEndPointTaskThreadPoolSize() {
  // TODO(runzhiwang): current only support one recon, if support multiple
  //  recon in future reconServerCount should be the real number of recon
  int reconServerCount = 1;
  int totalServerCount = reconServerCount;

  try {
    totalServerCount += HddsUtils.getSCMAddresses(conf).size();
  } catch (Exception e) {
    LOG.error("Fail to get scm addresses", e);
  }

  return totalServerCount;
}
{code}
Meanwhile, the current Recon has a performance issue: after running for hours it 
became slower and slower, and eventually crashed due to OOM.

2) The communication between the DN and Recon soon exhausts all the threads in 
DatanodeStateMachine.executorService, leaving no threads available for the DN to 
talk to SCM.

3) All DNs become stale/dead on the SCM side.

 

*The fix is quite straightforward:*

Each EndpointStateMachine uses its own thread pool to talk with SCM/Recon, so a 
slow Recon won't interfere with the communication between the DN and SCM, and 
vice versa.
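
As a rough sketch of the idea (an illustrative, hypothetical PerEndpointExecutors 
helper, not the actual patch): each endpoint address, whether SCM or Recon, lazily 
gets its own single-thread executor, so tasks for a stalled endpoint pile up on 
that endpoint's own queue instead of starving the shared pool.
{code:java}
// Hypothetical illustration only; names and structure are assumptions,
// not the actual HDDS-4386 change.
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PerEndpointExecutors {

  // One executor per endpoint address (SCM or Recon), created lazily.
  private final Map<String, ExecutorService> executors = new ConcurrentHashMap<>();

  private ExecutorService executorFor(String endpointAddress) {
    return executors.computeIfAbsent(endpointAddress,
        addr -> Executors.newSingleThreadExecutor(
            r -> new Thread(r, "EndpointStateMachine Task Thread - " + addr)));
  }

  /** Runs the task on the pool dedicated to the given endpoint. */
  public <T> Future<T> submit(String endpointAddress, Callable<T> task) {
    return executorFor(endpointAddress).submit(task);
  }

  /** Stops all per-endpoint pools, e.g. on datanode shutdown. */
  public void shutdown() {
    executors.values().forEach(ExecutorService::shutdownNow);
  }
}
{code}
With one bounded pool per endpoint, a hung Recon heartbeat only blocks its own 
thread, while the SCM endpoints keep making progress.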

 

*P.S.*

The first version of DatanodeStateMachine.executorService was a cached thread 
pool; with a slow SCM/Recon, more and more threads get created, and the DN 
eventually OOMs because tens of thousands of threads end up being created.
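
For comparison, a small self-contained demo of why the cached-pool approach is 
dangerous (illustration only, not the original code): every submission against a 
hung endpoint pins a thread forever, so the thread count keeps growing.
{code:java}
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class CachedPoolGrowthDemo {
  public static void main(String[] args) throws InterruptedException {
    ExecutorService cached = Executors.newCachedThreadPool();
    // Simulates an SCM/Recon endpoint that never answers.
    CountDownLatch neverReleased = new CountDownLatch(1);

    for (int i = 0; i < 100; i++) {
      cached.submit(() -> {
        try {
          neverReleased.await();   // each task blocks forever, pinning its thread
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        }
      });
    }

    Thread.sleep(1000);
    // Every blocked task holds its own thread; with periodic heartbeats this
    // grows without bound until the DN runs out of memory.
    System.out.println("Live threads: " + Thread.activeCount());
    cached.shutdownNow();
  }
}
{code}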

  was:
In our Tencent production environment, a while after starting Recon, we got 
warnings that all DNs became stale/dead on the SCM side. After killing Recon, 
all DNs became healthy again within a very short time.

 

*The root cause is:*

1) The EndpointStateMachine for SCM and the one for Recon share the thread pool 
created by DatanodeStateMachine, which is a fixed-size thread pool:
{code:java}
executorService = Executors.newFixedThreadPool(
getEndPointTaskThreadPoolSize(),
new ThreadFactoryBuilder()
.setNameFormat("Datanode State Machine Task Thread - %d").build());

private int getEndPointTaskThreadPoolSize() {
  // TODO(runzhiwang): current only support one recon, if support multiple
  //  recon in future reconServerCount should be the real number of recon
  int reconServerCount = 1;
  int totalServerCount = reconServerCount;

  try {
totalServerCount += HddsUtils.getSCMAddresses(conf).size();
  } catch (Exception e) {
LOG.error("Fail to get scm addresses", e);
  }

  return totalServerCount;
}
{code}
Meanwhile, the current Recon has a performance issue: after running for hours it 
became slower and slower, and eventually crashed due to OOM.

2) The communication between the DN and Recon soon exhausts all the threads in 
DatanodeStateMachine.executorService, leaving no threads available for the DN to 
talk to SCM.

3) All DNs become stale/dead on the SCM side.

 

*The fix is quite straightforward:*

Each EndpointStateMachine uses its own thread pool to talk with SCM/Recon.

 

*P.S.*

The first version of DatanodeStateMachine.executorService was a cached thread 
pool; with a slow SCM/Recon, more and more threads get created, and the DN 
eventually OOMs because tens of thousands of threads end up being created.


> Each EndpointStateMachine uses its own thread pool to talk with SCM/Recon
> -
>
> Key: HDDS-4386
> URL: https://issues.apache.org/jira/browse/HDDS-4386
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>Affects Versions: 1.1.0
>Reporter: Glen Geng
>Assignee: Glen Geng
>Priority: Blocker
>
> In Tencent production environment, after start Recon for a while, we got 
> warnings that all DNs become stale/dead at SCM side. After kill recon, all 
> DNs become healthy in a very short time.
>  
> *The root cause is:*
> 1) EndpointStateMachine for SCM and that for Recon share the thread pool 
> created by DatanodeStateMachine, which is a fixed size thread pool:
> {code:java}
> executorService = Executors.newFixedThreadPool(
> getEndPointTaskThreadPoolSize(),
> new ThreadFactoryBuilder()
> .setNameFormat("Datanode State Machine Task Thread - %d").build());
> private int getEndPointTaskThreadPoolSize() {
>   // TODO(runzhiwang): current only support one recon, if support multiple
>   //  recon in future reconServerCount should be the real numbe

[jira] [Updated] (HDDS-4386) Each EndpointStateMachine uses its own thread pool to talk with SCM/Recon

2020-10-22 Thread Glen Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-4386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Glen Geng updated HDDS-4386:

Description: 
In our Tencent production environment, a while after starting Recon, we got 
warnings that all DNs became stale/dead on the SCM side. After killing Recon, 
all DNs became healthy again within a very short time.

 

*The root cause is:*

1) EndpointStateMachine for SCM and that for Recon share the thread pool 
created by DatanodeStateMachine, which is a fixed size thread pool:
{code:java}
executorService = Executors.newFixedThreadPool(
getEndPointTaskThreadPoolSize(),
new ThreadFactoryBuilder()
.setNameFormat("Datanode State Machine Task Thread - %d").build());

private int getEndPointTaskThreadPoolSize() {
  // TODO(runzhiwang): current only support one recon, if support multiple
  //  recon in future reconServerCount should be the real number of recon
  int reconServerCount = 1;
  int totalServerCount = reconServerCount;

  try {
totalServerCount += HddsUtils.getSCMAddresses(conf).size();
  } catch (Exception e) {
LOG.error("Fail to get scm addresses", e);
  }

  return totalServerCount;
}
{code}
Meanwhile, the current Recon has a performance issue: after running for hours it 
became slower and slower, and eventually crashed due to OOM.

2) The communication between the DN and Recon soon exhausts all the threads in 
DatanodeStateMachine.executorService, leaving no threads available for the DN to 
talk to SCM.

3) All DNs become stale/dead on the SCM side.

 

*The fix is quite straightforward:*

Each EndpointStateMachine uses its own thread pool to talk with SCM/Recon, so a 
slow Recon won't interfere with the communication between the DN and SCM, and 
vice versa.

 

*P.S.*

The first version of DatanodeStateMachine.executorService was a cached thread 
pool; with a slow SCM/Recon, more and more threads get created, and the DN 
eventually OOMs because tens of thousands of threads end up being created.

  was:
In Tencent production environment, after start Recon for a while, we got 
warnings that all DNs become stale/dead at SCM side. After kill recon, all DNs 
become healthy in a very short time.

 

*The root cause is:*

1) EndpointStateMachine for SCM and that for Recon share the thread pool 
created by DatanodeStateMachine, which is a fixed size thread pool:
{code:java}
executorService = Executors.newFixedThreadPool(
getEndPointTaskThreadPoolSize(),
new ThreadFactoryBuilder()
.setNameFormat("Datanode State Machine Task Thread - %d").build());

private int getEndPointTaskThreadPoolSize() {
  // TODO(runzhiwang): current only support one recon, if support multiple
  //  recon in future reconServerCount should be the real number of recon
  int reconServerCount = 1;
  int totalServerCount = reconServerCount;

  try {
totalServerCount += HddsUtils.getSCMAddresses(conf).size();
  } catch (Exception e) {
LOG.error("Fail to get scm addresses", e);
  }

  return totalServerCount;
}
{code}
meanwhile, current Recon has some performance issue, after running for hours, 
it became slower and slower, and crashed due to OOM. 

2) The communication between DN and Recon will soon exhaust all the threads in 
DatanodeStateMachine.executorService, there will be no available threads for DN 
to talk SCM. 

3) all DNs become stale/dead at SCM side.

 

*The fix is quite straightforward:*

Each EndpointStateMachine uses its own thread pool to talk with SCM/Recon, a 
slow Recon won't interfere communication between DN and SCM, or vice versa.

 

*P.S.*

The first edition for DatanodeStateMachine.executorService is a cached thread 
pool, if there exists a slow SCM/Recon, more and more threads will be created, 
and DN will OOM eventually, due to tens of thousands of threads are created.


> Each EndpointStateMachine uses its own thread pool to talk with SCM/Recon
> -
>
> Key: HDDS-4386
> URL: https://issues.apache.org/jira/browse/HDDS-4386
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>Affects Versions: 1.1.0
>Reporter: Glen Geng
>Assignee: Glen Geng
>Priority: Blocker
>
> In Tencent production environment, after start Recon for a while, we got 
> warnings that all DNs become stale/dead at SCM side. After kill recon, all 
> DNs become healthy in a very short time.
>  
> *The root cause is:*
> 1) EndpointStateMachine for SCM and that for Recon share the thread pool 
> created by DatanodeStateMachine, which is a fixed size thread pool:
> {code:java}
> executorService = Executors.newFixedThreadPool(
> getEndPointTaskThreadPoolSize(),
> new ThreadFactoryBuilder()
> .setNameFormat("Datanode State Machine Task Thread - %d").build());
> private int getEndPointTaskThreadPoolSize() {
>   // TODO(runzhiwang): current only support one recon, if

[jira] [Updated] (HDDS-4386) Each EndpointStateMachine uses its own thread pool to talk with SCM/Recon

2020-10-22 Thread Glen Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-4386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Glen Geng updated HDDS-4386:

Description: 
In our Tencent production environment, a while after starting Recon, we got 
warnings that all DNs became stale/dead on the SCM side. After killing Recon, 
all DNs became healthy again within a very short time.

 

*The root cause is:*

1) EndpointStateMachine for SCM and that for Recon share the thread pool 
created by DatanodeStateMachine, which is a fixed size thread pool:
{code:java}
executorService = Executors.newFixedThreadPool(
getEndPointTaskThreadPoolSize(),
new ThreadFactoryBuilder()
.setNameFormat("Datanode State Machine Task Thread - %d").build());

private int getEndPointTaskThreadPoolSize() {
  // TODO(runzhiwang): current only support one recon, if support multiple
  //  recon in future reconServerCount should be the real number of recon
  int reconServerCount = 1;
  int totalServerCount = reconServerCount;

  try {
totalServerCount += HddsUtils.getSCMAddresses(conf).size();
  } catch (Exception e) {
LOG.error("Fail to get scm addresses", e);
  }

  return totalServerCount;
}
{code}
Meanwhile, the current Recon has a performance issue: after running for hours it 
became slower and slower, and eventually crashed due to OOM.

2) The communication between the DN and Recon soon exhausts all the threads in 
DatanodeStateMachine.executorService, leaving no threads available for the DN to 
talk to SCM.

3) All DNs become stale/dead on the SCM side.

 

*The fix is quite straightforward:*

Each EndpointStateMachine uses its own thread pool to talk with SCM/Recon.

 

*P.S.*

The first version of DatanodeStateMachine.executorService was a cached thread 
pool; with a slow SCM/Recon, more and more threads get created, and the DN 
eventually OOMs because tens of thousands of threads end up being created.

  was:
In Tencent production environment, after start Recon for a while, we got 
warnings that all DNs become stale/dead at SCM side. After kill recon, all DNs 
become healthy in a very short time.

 

*The root cause is:*

1) EndpointStateMachine for SCM and that for Recon share the thread pool 
created by DatanodeStateMachine, which is a fixed size thread pool:
{code:java}
executorService = Executors.newFixedThreadPool(
getEndPointTaskThreadPoolSize(),
new ThreadFactoryBuilder()
.setNameFormat("Datanode State Machine Task Thread - %d").build());

private int getEndPointTaskThreadPoolSize() {
  // TODO(runzhiwang): current only support one recon, if support multiple
  //  recon in future reconServerCount should be the real number of recon
  int reconServerCount = 1;
  int totalServerCount = reconServerCount;

  try {
totalServerCount += HddsUtils.getSCMAddresses(conf).size();
  } catch (Exception e) {
LOG.error("Fail to get scm addresses", e);
  }

  return totalServerCount;
}
{code}
meanwhile, current Recon has some performance issue, after running for hours, 
it became slower and slower, and crashed due to OOM. 

2) The communication between DN and Recon will soon exhaust all the threads in 
DatanodeStateMachine.executorService, there will be no available threads for DN 
to talk SCM. 

3) all DNs become stale/dead at SCM side.

 

*The fix is quite straightforward:*

Each EndpointStateMachine uses its own thread pool to talk with SCM/Recon.

 

*P.S.*

The first edition for DatanodeStateMachine.executorService is a cached thread 
pool, if there exists a slow SCM/Recon, more and more threads will be created, 
and DN will OOM eventually.


> Each EndpointStateMachine uses its own thread pool to talk with SCM/Recon
> -
>
> Key: HDDS-4386
> URL: https://issues.apache.org/jira/browse/HDDS-4386
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>Affects Versions: 1.1.0
>Reporter: Glen Geng
>Assignee: Glen Geng
>Priority: Blocker
>
> In Tencent production environment, after start Recon for a while, we got 
> warnings that all DNs become stale/dead at SCM side. After kill recon, all 
> DNs become healthy in a very short time.
>  
> *The root cause is:*
> 1) EndpointStateMachine for SCM and that for Recon share the thread pool 
> created by DatanodeStateMachine, which is a fixed size thread pool:
> {code:java}
> executorService = Executors.newFixedThreadPool(
> getEndPointTaskThreadPoolSize(),
> new ThreadFactoryBuilder()
> .setNameFormat("Datanode State Machine Task Thread - %d").build());
> private int getEndPointTaskThreadPoolSize() {
>   // TODO(runzhiwang): current only support one recon, if support multiple
>   //  recon in future reconServerCount should be the real number of recon
>   int reconServerCount = 1;
>   int totalServerCount = reconServerCount;
>   try {
> totalServerCount += HddsUt

[jira] [Updated] (HDDS-4386) Each EndpointStateMachine uses its own thread pool to talk with SCM/Recon

2020-10-22 Thread Glen Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-4386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Glen Geng updated HDDS-4386:

Description: 
In our Tencent production environment, a while after starting Recon, we got 
warnings that all DNs became stale/dead on the SCM side. After killing Recon, 
all DNs became healthy again within a very short time.

 

*The root cause is:*

1) EndpointStateMachine for SCM and that for Recon share the thread pool 
created by DatanodeStateMachine, which is a fixed size thread pool:
{code:java}
executorService = Executors.newFixedThreadPool(
getEndPointTaskThreadPoolSize(),
new ThreadFactoryBuilder()
.setNameFormat("Datanode State Machine Task Thread - %d").build());

private int getEndPointTaskThreadPoolSize() {
  // TODO(runzhiwang): current only support one recon, if support multiple
  //  recon in future reconServerCount should be the real number of recon
  int reconServerCount = 1;
  int totalServerCount = reconServerCount;

  try {
totalServerCount += HddsUtils.getSCMAddresses(conf).size();
  } catch (Exception e) {
LOG.error("Fail to get scm addresses", e);
  }

  return totalServerCount;
}
{code}
Meanwhile, the current Recon has a performance issue: after running for hours it 
became slower and slower, and eventually crashed due to OOM.

2) The communication between the DN and Recon soon exhausts all the threads in 
DatanodeStateMachine.executorService, leaving no threads available for the DN to 
talk to SCM.

3) All DNs become stale/dead on the SCM side.

 

*The fix is quite straightforward:*

Each EndpointStateMachine uses its own thread pool to talk with SCM/Recon.

 

*P.S.*

The first edition for DatanodeStateMachine.executorService is a cached thread 
pool, if there exists a slow SCM/Recon, more and more threads will be created, 
and DN will OOM eventually.

  was:
In Tencent production environment, after start Recon for a while, we got 
warnings that all DNs become stale/dead at SCM side. After kill recon, all DNs 
become healthy in a very short time.

 

*The root cause is:*

1) EndpointStateMachine for SCM and that for Recon share the thread pool 
created by DatanodeStateMachine, which is a fixed size thread pool:
{code:java}
executorService = Executors.newFixedThreadPool(
getEndPointTaskThreadPoolSize(),
new ThreadFactoryBuilder()
.setNameFormat("Datanode State Machine Task Thread - %d").build());

private int getEndPointTaskThreadPoolSize() {
  // TODO(runzhiwang): current only support one recon, if support multiple
  //  recon in future reconServerCount should be the real number of recon
  int reconServerCount = 1;
  int totalServerCount = reconServerCount;

  try {
totalServerCount += HddsUtils.getSCMAddresses(conf).size();
  } catch (Exception e) {
LOG.error("Fail to get scm addresses", e);
  }

  return totalServerCount;
}
{code}
meanwhile, current Recon has some performance issue, after running for hours, 
it became slower and slower, and crashed due to OOM. 

2) The communication between DN and Recon will soon exhaust all the threads in 
DatanodeStateMachine.executorService, there will be no available threads for DN 
to talk SCM. 

3) all DNs become stale/dead at SCM side.

 

*The fix is quite straightforward:*

Each EndpointStateMachine uses its own thread pool to talk with SCM/Recon.

 

*BTW:*

The first edition for DatanodeStateMachine.executorService is a cached thread 
pool, if there exists a slow SCM/Recon, more and more threads will be created, 
and DN will OOM eventually.


> Each EndpointStateMachine uses its own thread pool to talk with SCM/Recon
> -
>
> Key: HDDS-4386
> URL: https://issues.apache.org/jira/browse/HDDS-4386
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>Affects Versions: 1.1.0
>Reporter: Glen Geng
>Assignee: Glen Geng
>Priority: Blocker
>
> In Tencent production environment, after start Recon for a while, we got 
> warnings that all DNs become stale/dead at SCM side. After kill recon, all 
> DNs become healthy in a very short time.
>  
> *The root cause is:*
> 1) EndpointStateMachine for SCM and that for Recon share the thread pool 
> created by DatanodeStateMachine, which is a fixed size thread pool:
> {code:java}
> executorService = Executors.newFixedThreadPool(
> getEndPointTaskThreadPoolSize(),
> new ThreadFactoryBuilder()
> .setNameFormat("Datanode State Machine Task Thread - %d").build());
> private int getEndPointTaskThreadPoolSize() {
>   // TODO(runzhiwang): current only support one recon, if support multiple
>   //  recon in future reconServerCount should be the real number of recon
>   int reconServerCount = 1;
>   int totalServerCount = reconServerCount;
>   try {
> totalServerCount += HddsUtils.getSCMAddresses(conf).size();
>   } catch (Ex

[jira] [Updated] (HDDS-4386) Each EndpointStateMachine uses its own thread pool to talk with SCM/Recon

2020-10-22 Thread Glen Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-4386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Glen Geng updated HDDS-4386:

Description: 
In our Tencent production environment, a while after starting Recon, we got 
warnings that all DNs became stale/dead on the SCM side. After killing Recon, 
all DNs became healthy again within a very short time.

 

*The root cause is:*

1) EndpointStateMachine for SCM and that for Recon share the thread pool 
created by DatanodeStateMachine, which is a fixed size thread pool:
{code:java}
executorService = Executors.newFixedThreadPool(
getEndPointTaskThreadPoolSize(),
new ThreadFactoryBuilder()
.setNameFormat("Datanode State Machine Task Thread - %d").build());

private int getEndPointTaskThreadPoolSize() {
  // TODO(runzhiwang): current only support one recon, if support multiple
  //  recon in future reconServerCount should be the real number of recon
  int reconServerCount = 1;
  int totalServerCount = reconServerCount;

  try {
totalServerCount += HddsUtils.getSCMAddresses(conf).size();
  } catch (Exception e) {
LOG.error("Fail to get scm addresses", e);
  }

  return totalServerCount;
}
{code}
Meanwhile, the current Recon has a performance issue: after running for hours it 
became slower and slower, and eventually crashed due to OOM.

2) The communication between the DN and Recon soon exhausts all the threads in 
DatanodeStateMachine.executorService, leaving no threads available for the DN to 
talk to SCM.

3) All DNs become stale/dead on the SCM side.

 

*The fix is quite straightforward:*

Each EndpointStateMachine uses its own thread pool to talk with SCM/Recon.

 

*BTW:*

The first edition for DatanodeStateMachine.executorService is a cached thread 
pool, if there exists a slow SCM/Recon, more and more threads will be created, 
and DN will OOM eventually.

  was:
In Tencent production environment, after start Recon for a while, we got 
warnings that all DNs become stale/dead at SCM side. After kill recon, all DNs 
become healthy in a very short time.

 

*The root cause is:*

1) EndpointStateMachine for SCM and that for Recon share the thread pool 
created by DatanodeStateMachine, which is a fixed size thread pool:
{code:java}
executorService = Executors.newFixedThreadPool(
getEndPointTaskThreadPoolSize(),
new ThreadFactoryBuilder()
.setNameFormat("Datanode State Machine Task Thread - %d").build());

private int getEndPointTaskThreadPoolSize() {
  // TODO(runzhiwang): current only support one recon, if support multiple
  //  recon in future reconServerCount should be the real number of recon
  int reconServerCount = 1;
  int totalServerCount = reconServerCount;

  try {
totalServerCount += HddsUtils.getSCMAddresses(conf).size();
  } catch (Exception e) {
LOG.error("Fail to get scm addresses", e);
  }

  return totalServerCount;
}
{code}
meanwhile, current Recon has some performance issue, after running for hours, 
it became slower and slower, and crashed due to OOM. 

2) The communication between DN and Recon will soon exhaust all the threads in 
DatanodeStateMachine.executorService, there will be no available threads for DN 
to talk SCM. 

3) all DNs become stale/dead at SCM side.

 

*The fix is quite straightforward:*

Each EndpointStateMachine uses its own thread pool to talk with SCM/Recon.

 

*BTW:*

The first edition for DatanodeStateMachine.executorService is 


> Each EndpointStateMachine uses its own thread pool to talk with SCM/Recon
> -
>
> Key: HDDS-4386
> URL: https://issues.apache.org/jira/browse/HDDS-4386
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>Affects Versions: 1.1.0
>Reporter: Glen Geng
>Assignee: Glen Geng
>Priority: Blocker
>
> In Tencent production environment, after start Recon for a while, we got 
> warnings that all DNs become stale/dead at SCM side. After kill recon, all 
> DNs become healthy in a very short time.
>  
> *The root cause is:*
> 1) EndpointStateMachine for SCM and that for Recon share the thread pool 
> created by DatanodeStateMachine, which is a fixed size thread pool:
> {code:java}
> executorService = Executors.newFixedThreadPool(
> getEndPointTaskThreadPoolSize(),
> new ThreadFactoryBuilder()
> .setNameFormat("Datanode State Machine Task Thread - %d").build());
> private int getEndPointTaskThreadPoolSize() {
>   // TODO(runzhiwang): current only support one recon, if support multiple
>   //  recon in future reconServerCount should be the real number of recon
>   int reconServerCount = 1;
>   int totalServerCount = reconServerCount;
>   try {
> totalServerCount += HddsUtils.getSCMAddresses(conf).size();
>   } catch (Exception e) {
> LOG.error("Fail to get scm addresses", e);
>   }
>   return totalServerCount;
> }
> {code}
> meanwhile, c

[jira] [Updated] (HDDS-4386) Each EndpointStateMachine uses its own thread pool to talk with SCM/Recon

2020-10-22 Thread Glen Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-4386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Glen Geng updated HDDS-4386:

Description: 
In our Tencent production environment, a while after starting Recon, we got 
warnings that all DNs became stale/dead on the SCM side. After killing Recon, 
all DNs became healthy again within a very short time.

 

*The root cause is:*

1) EndpointStateMachine for SCM and that for Recon share the thread pool 
created by DatanodeStateMachine, which is a fixed size thread pool:
{code:java}
executorService = Executors.newFixedThreadPool(
getEndPointTaskThreadPoolSize(),
new ThreadFactoryBuilder()
.setNameFormat("Datanode State Machine Task Thread - %d").build());

private int getEndPointTaskThreadPoolSize() {
  // TODO(runzhiwang): current only support one recon, if support multiple
  //  recon in future reconServerCount should be the real number of recon
  int reconServerCount = 1;
  int totalServerCount = reconServerCount;

  try {
totalServerCount += HddsUtils.getSCMAddresses(conf).size();
  } catch (Exception e) {
LOG.error("Fail to get scm addresses", e);
  }

  return totalServerCount;
}
{code}
Meanwhile, the current Recon has a performance issue: after running for hours it 
became slower and slower, and eventually crashed due to OOM.

2) The communication between the DN and Recon soon exhausts all the threads in 
DatanodeStateMachine.executorService, leaving no threads available for the DN to 
talk to SCM.

3) All DNs become stale/dead on the SCM side.

 

*The fix is quite straightforward:*

Each EndpointStateMachine uses its own thread pool to talk with SCM/Recon.

 

*BTW:*

The first edition for DatanodeStateMachine.executorService is 

  was:
In Tencent production environment, after start Recon for a while, we got 
warnings that all DNs become stale/dead at SCM side. After kill recon, all DNs 
become healthy in a very short time.

 

*The root cause is:*

1) EndpointStateMachine for SCM and that for Recon share the thread pool 
created by DatanodeStateMachine, which is a fixed size thread pool:
{code:java}
executorService = Executors.newFixedThreadPool(
getEndPointTaskThreadPoolSize(),
new ThreadFactoryBuilder()
.setNameFormat("Datanode State Machine Task Thread - %d").build());
{code}
meanwhile, current Recon has some performance issue, after running for hours, 
it became slower and slower, and crashed due to OOM. 

2) The communication between DN and Recon will soon exhaust all the threads in 
DatanodeStateMachine.executorService, there will be no available threads for DN 
to talk SCM. 

3) all DNs become stale/dead at SCM side.

 

*The fix is quite straightforward:*

Each EndpointStateMachine uses its own thread pool to talk with SCM/Recon.

 

*BTW:*

The first edition for DatanodeStateMachine.executorService is 


> Each EndpointStateMachine uses its own thread pool to talk with SCM/Recon
> -
>
> Key: HDDS-4386
> URL: https://issues.apache.org/jira/browse/HDDS-4386
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>Affects Versions: 1.1.0
>Reporter: Glen Geng
>Assignee: Glen Geng
>Priority: Blocker
>
> In Tencent production environment, after start Recon for a while, we got 
> warnings that all DNs become stale/dead at SCM side. After kill recon, all 
> DNs become healthy in a very short time.
>  
> *The root cause is:*
> 1) EndpointStateMachine for SCM and that for Recon share the thread pool 
> created by DatanodeStateMachine, which is a fixed size thread pool:
> {code:java}
> executorService = Executors.newFixedThreadPool(
> getEndPointTaskThreadPoolSize(),
> new ThreadFactoryBuilder()
> .setNameFormat("Datanode State Machine Task Thread - %d").build());
> private int getEndPointTaskThreadPoolSize() {
>   // TODO(runzhiwang): current only support one recon, if support multiple
>   //  recon in future reconServerCount should be the real number of recon
>   int reconServerCount = 1;
>   int totalServerCount = reconServerCount;
>   try {
> totalServerCount += HddsUtils.getSCMAddresses(conf).size();
>   } catch (Exception e) {
> LOG.error("Fail to get scm addresses", e);
>   }
>   return totalServerCount;
> }
> {code}
> meanwhile, current Recon has some performance issue, after running for hours, 
> it became slower and slower, and crashed due to OOM. 
> 2) The communication between DN and Recon will soon exhaust all the threads 
> in DatanodeStateMachine.executorService, there will be no available threads 
> for DN to talk SCM. 
> 3) all DNs become stale/dead at SCM side.
>  
> *The fix is quite straightforward:*
> Each EndpointStateMachine uses its own thread pool to talk with SCM/Recon.
>  
> *BTW:*
> The first edition for DatanodeStateMachine.executorService is 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Updated] (HDDS-4386) Each EndpointStateMachine uses its own thread pool to talk with SCM/Recon

2020-10-22 Thread Glen Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-4386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Glen Geng updated HDDS-4386:

Description: 
In our Tencent production environment, a while after starting Recon, we got 
warnings that all DNs became stale/dead on the SCM side. After killing Recon, 
all DNs became healthy again within a very short time.

 

*The root cause is:*

1) EndpointStateMachine for SCM and that for Recon share the thread pool 
created by DatanodeStateMachine, which is a fixed size thread pool:
{code:java}
executorService = Executors.newFixedThreadPool(
getEndPointTaskThreadPoolSize(),
new ThreadFactoryBuilder()
.setNameFormat("Datanode State Machine Task Thread - %d").build());
{code}
Meanwhile, the current Recon has a performance issue: after running for hours it 
became slower and slower, and eventually crashed due to OOM.

2) The communication between the DN and Recon soon exhausts all the threads in 
DatanodeStateMachine.executorService, leaving no threads available for the DN to 
talk to SCM.

3) All DNs become stale/dead on the SCM side.

 

*The fix is quite straightforward:*

Each EndpointStateMachine uses its own thread pool to talk with SCM/Recon.

 

*BTW:*

The first edition for DatanodeStateMachine.executorService is 

  was:
In Tencent production environment, after start Recon for a while, we got 
warnings that all DNs become stale/dead at SCM side. After kill recon, all DNs 
become healthy in a very short time.

 

*The root cause is:*

1) EndpointStateMachine for SCM and that for Recon share the thread pool 
created by DatanodeStateMachine, which is a fixed size thread pool:
{code:java}
executorService = Executors.newFixedThreadPool(
getEndPointTaskThreadPoolSize(),
new ThreadFactoryBuilder()
.setNameFormat("Datanode State Machine Task Thread - %d").build());
{code}
meanwhile, current Recon has some performance issue, after running for hours, 
it stuck and crashed due to OOM. 

2) The communication between DN and Recon will soon exhaust all the threads in 
DatanodeStateMachine.executorService, there will be no available threads for DN 
to talk SCM. 

3) all DNs become stale/dead at SCM side.

 

*The fix is quite straightforward:*

Each EndpointStateMachine uses its own thread pool to talk with SCM/Recon.

 

*BTW:*

The first edition for DatanodeStateMachine.executorService is 


> Each EndpointStateMachine uses its own thread pool to talk with SCM/Recon
> -
>
> Key: HDDS-4386
> URL: https://issues.apache.org/jira/browse/HDDS-4386
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>Affects Versions: 1.1.0
>Reporter: Glen Geng
>Assignee: Glen Geng
>Priority: Blocker
>
> In Tencent production environment, after start Recon for a while, we got 
> warnings that all DNs become stale/dead at SCM side. After kill recon, all 
> DNs become healthy in a very short time.
>  
> *The root cause is:*
> 1) EndpointStateMachine for SCM and that for Recon share the thread pool 
> created by DatanodeStateMachine, which is a fixed size thread pool:
> {code:java}
> executorService = Executors.newFixedThreadPool(
> getEndPointTaskThreadPoolSize(),
> new ThreadFactoryBuilder()
> .setNameFormat("Datanode State Machine Task Thread - %d").build());
> {code}
> meanwhile, current Recon has some performance issue, after running for hours, 
> it became slower and slower, and crashed due to OOM. 
> 2) The communication between DN and Recon will soon exhaust all the threads 
> in DatanodeStateMachine.executorService, there will be no available threads 
> for DN to talk SCM. 
> 3) all DNs become stale/dead at SCM side.
>  
> *The fix is quite straightforward:*
> Each EndpointStateMachine uses its own thread pool to talk with SCM/Recon.
>  
> *BTW:*
> The first edition for DatanodeStateMachine.executorService is 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-4386) Each EndpointStateMachine uses its own thread pool to talk with SCM/Recon

2020-10-22 Thread Glen Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-4386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Glen Geng updated HDDS-4386:

Description: 
In our Tencent production environment, a while after starting Recon, we got 
warnings that all DNs became stale/dead on the SCM side. After killing Recon, 
all DNs became healthy again within a very short time.

 

*The root cause is:*

1) EndpointStateMachine for SCM and that for Recon share the thread pool 
created by DatanodeStateMachine, which is a fixed size thread pool:
{code:java}
executorService = Executors.newFixedThreadPool(
getEndPointTaskThreadPoolSize(),
new ThreadFactoryBuilder()
.setNameFormat("Datanode State Machine Task Thread - %d").build());
{code}
 

Meanwhile, the current Recon has a performance issue: after running for hours it 
got stuck and crashed due to OOM.

2) The communication between the DN and Recon soon exhausts all the threads in 
DatanodeStateMachine.executorService, leaving no threads available for the DN to 
talk to SCM.

3) All DNs become stale/dead on the SCM side.

 

*The fix is quite straightforward:*

Each EndpointStateMachine uses its own thread pool to talk with SCM/Recon.

 

*BTW:*

The first edition for DatanodeStateMachine.executorService is 

  was:
In Tencent production environment, after start Recon for a while, we got 
warnings that all DNs become stale/dead at SCM side. After kill recon, all DNs 
become healthy in a very short time.

 

*The root cause is:*

1) EndpointStateMachine for SCM and that for Recon share the thread pool 
created by DatanodeStateMachine, which is a fixed size thread pool:

 
{code:java}
executorService = Executors.newFixedThreadPool(
getEndPointTaskThreadPoolSize(),
new ThreadFactoryBuilder()
.setNameFormat("Datanode State Machine Task Thread - %d").build());
{code}
 

meanwhile, current Recon has some performance issue, after running for hours, 
it stuck and crashed due to OOM. 

2) The communication between DN and Recon will soon exhaust all the threads in 
DatanodeStateMachine.executorService, there will be no available threads for DN 
to talk SCM. 

3) all DNs become stale/dead at SCM side.

 

*The fix is quite straightforward:*

Each EndpointStateMachine uses its own thread pool to talk with SCM/Recon.

 

*BTW:*

The first edition for DatanodeStateMachine.executorService is 


> Each EndpointStateMachine uses its own thread pool to talk with SCM/Recon
> -
>
> Key: HDDS-4386
> URL: https://issues.apache.org/jira/browse/HDDS-4386
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>Affects Versions: 1.1.0
>Reporter: Glen Geng
>Assignee: Glen Geng
>Priority: Blocker
>
> In Tencent production environment, after start Recon for a while, we got 
> warnings that all DNs become stale/dead at SCM side. After kill recon, all 
> DNs become healthy in a very short time.
>  
> *The root cause is:*
> 1) EndpointStateMachine for SCM and that for Recon share the thread pool 
> created by DatanodeStateMachine, which is a fixed size thread pool:
> {code:java}
> executorService = Executors.newFixedThreadPool(
> getEndPointTaskThreadPoolSize(),
> new ThreadFactoryBuilder()
> .setNameFormat("Datanode State Machine Task Thread - %d").build());
> {code}
>  
> meanwhile, current Recon has some performance issue, after running for hours, 
> it stuck and crashed due to OOM. 
> 2) The communication between DN and Recon will soon exhaust all the threads 
> in DatanodeStateMachine.executorService, there will be no available threads 
> for DN to talk SCM. 
> 3) all DNs become stale/dead at SCM side.
>  
> *The fix is quite straightforward:*
> Each EndpointStateMachine uses its own thread pool to talk with SCM/Recon.
>  
> *BTW:*
> The first edition for DatanodeStateMachine.executorService is 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-4386) Each EndpointStateMachine uses its own thread pool to talk with SCM/Recon

2020-10-22 Thread Glen Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-4386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Glen Geng updated HDDS-4386:

Description: 
In our Tencent production environment, a while after starting Recon, we got 
warnings that all DNs became stale/dead on the SCM side. After killing Recon, 
all DNs became healthy again within a very short time.

 

*The root cause is:*

1) EndpointStateMachine for SCM and that for Recon share the thread pool 
created by DatanodeStateMachine, which is a fixed size thread pool:

 
{code:java}
executorService = Executors.newFixedThreadPool(
getEndPointTaskThreadPoolSize(),
new ThreadFactoryBuilder()
.setNameFormat("Datanode State Machine Task Thread - %d").build());
{code}
 

Meanwhile, the current Recon has a performance issue: after running for hours it 
got stuck and crashed due to OOM.

2) The communication between the DN and Recon soon exhausts all the threads in 
DatanodeStateMachine.executorService, leaving no threads available for the DN to 
talk to SCM.

3) All DNs become stale/dead on the SCM side.

 

*The fix is quite straightforward:*

Each EndpointStateMachine uses its own thread pool to talk with SCM/Recon.

 

*BTW:*

The first edition for DatanodeStateMachine.executorService is 

  was:
In Tencent production environment, after start Recon for a while, we got 
warnings that all DNs become stale/dead at SCM side. After kill recon, all DNs 
become healthy in a very short time.

 

*The root cause is:*

1) EndpointStateMachine for SCM and that for Recon share the thread pool 
created by DatanodeStateMachine, which is a fixed size thread pool:

 
{code:java}
executorService = Executors.newFixedThreadPool(
getEndPointTaskThreadPoolSize(),
new ThreadFactoryBuilder()
.setNameFormat("Datanode State Machine Task Thread - %d").build());
{code}
 

meanwhile, current Recon has some performance issue, after running for hours, 
it stuck and crashed due to OOM. 

2) The communication between DN and Recon will soon exhaust all the threads in 
DatanodeStateMachine.executorService, there will be no available threads for DN 
to talk SCM. 

3) all DNs become stale/dead at SCM side.

 

*The fix is quite straightforward:*

Each EndpointStateMachine uses its own thread pool to talk with SCM/Recon.

 

 


> Each EndpointStateMachine uses its own thread pool to talk with SCM/Recon
> -
>
> Key: HDDS-4386
> URL: https://issues.apache.org/jira/browse/HDDS-4386
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>Affects Versions: 1.0.0
>Reporter: Glen Geng
>Assignee: Glen Geng
>Priority: Blocker
>
> In Tencent production environment, after start Recon for a while, we got 
> warnings that all DNs become stale/dead at SCM side. After kill recon, all 
> DNs become healthy in a very short time.
>  
> *The root cause is:*
> 1) EndpointStateMachine for SCM and that for Recon share the thread pool 
> created by DatanodeStateMachine, which is a fixed size thread pool:
>  
> {code:java}
> executorService = Executors.newFixedThreadPool(
> getEndPointTaskThreadPoolSize(),
> new ThreadFactoryBuilder()
> .setNameFormat("Datanode State Machine Task Thread - %d").build());
> {code}
>  
> meanwhile, current Recon has some performance issue, after running for hours, 
> it stuck and crashed due to OOM. 
> 2) The communication between DN and Recon will soon exhaust all the threads 
> in DatanodeStateMachine.executorService, there will be no available threads 
> for DN to talk SCM. 
> 3) all DNs become stale/dead at SCM side.
>  
> *The fix is quite straightforward:*
> Each EndpointStateMachine uses its own thread pool to talk with SCM/Recon.
>  
> *BTW:*
> The first edition for DatanodeStateMachine.executorService is 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-4386) Each EndpointStateMachine uses its own thread pool to talk with SCM/Recon

2020-10-22 Thread Glen Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-4386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Glen Geng updated HDDS-4386:

Description: 
In our Tencent production environment, a while after starting Recon, we got 
warnings that all DNs became stale/dead on the SCM side. After killing Recon, 
all DNs became healthy again within a very short time.

 

*The root cause is:*

1) EndpointStateMachine for SCM and that for Recon share the thread pool 
created by DatanodeStateMachine, which is a fixed size thread pool:
{code:java}
executorService = Executors.newFixedThreadPool(
getEndPointTaskThreadPoolSize(),
new ThreadFactoryBuilder()
.setNameFormat("Datanode State Machine Task Thread - %d").build());
{code}
Meanwhile, the current Recon has a performance issue: after running for hours it 
got stuck and crashed due to OOM.

2) The communication between the DN and Recon soon exhausts all the threads in 
DatanodeStateMachine.executorService, leaving no threads available for the DN to 
talk to SCM.

3) All DNs become stale/dead on the SCM side.

 

*The fix is quite straightforward:*

Each EndpointStateMachine uses its own thread pool to talk with SCM/Recon.

 

*BTW:*

The first edition for DatanodeStateMachine.executorService is 

  was:
In Tencent production environment, after start Recon for a while, we got 
warnings that all DNs become stale/dead at SCM side. After kill recon, all DNs 
become healthy in a very short time.

 

*The root cause is:*

1) EndpointStateMachine for SCM and that for Recon share the thread pool 
created by DatanodeStateMachine, which is a fixed size thread pool:
{code:java}
executorService = Executors.newFixedThreadPool(
getEndPointTaskThreadPoolSize(),
new ThreadFactoryBuilder()
.setNameFormat("Datanode State Machine Task Thread - %d").build());
{code}
 

meanwhile, current Recon has some performance issue, after running for hours, 
it stuck and crashed due to OOM. 

2) The communication between DN and Recon will soon exhaust all the threads in 
DatanodeStateMachine.executorService, there will be no available threads for DN 
to talk SCM. 

3) all DNs become stale/dead at SCM side.

 

*The fix is quite straightforward:*

Each EndpointStateMachine uses its own thread pool to talk with SCM/Recon.

 

*BTW:*

The first edition for DatanodeStateMachine.executorService is 


> Each EndpointStateMachine uses its own thread pool to talk with SCM/Recon
> -
>
> Key: HDDS-4386
> URL: https://issues.apache.org/jira/browse/HDDS-4386
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>Affects Versions: 1.1.0
>Reporter: Glen Geng
>Assignee: Glen Geng
>Priority: Blocker
>
> In Tencent production environment, after start Recon for a while, we got 
> warnings that all DNs become stale/dead at SCM side. After kill recon, all 
> DNs become healthy in a very short time.
>  
> *The root cause is:*
> 1) EndpointStateMachine for SCM and that for Recon share the thread pool 
> created by DatanodeStateMachine, which is a fixed size thread pool:
> {code:java}
> executorService = Executors.newFixedThreadPool(
> getEndPointTaskThreadPoolSize(),
> new ThreadFactoryBuilder()
> .setNameFormat("Datanode State Machine Task Thread - %d").build());
> {code}
> meanwhile, current Recon has some performance issue, after running for hours, 
> it stuck and crashed due to OOM. 
> 2) The communication between DN and Recon will soon exhaust all the threads 
> in DatanodeStateMachine.executorService, there will be no available threads 
> for DN to talk SCM. 
> 3) all DNs become stale/dead at SCM side.
>  
> *The fix is quite straightforward:*
> Each EndpointStateMachine uses its own thread pool to talk with SCM/Recon.
>  
> *BTW:*
> The first edition for DatanodeStateMachine.executorService is 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-4386) Each EndpointStateMachine uses its own thread pool to talk with SCM/Recon

2020-10-22 Thread Glen Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-4386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Glen Geng updated HDDS-4386:

Affects Version/s: (was: 1.0.0)
   1.1.0

> Each EndpointStateMachine uses its own thread pool to talk with SCM/Recon
> -
>
> Key: HDDS-4386
> URL: https://issues.apache.org/jira/browse/HDDS-4386
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>Affects Versions: 1.1.0
>Reporter: Glen Geng
>Assignee: Glen Geng
>Priority: Blocker
>
> In Tencent production environment, after start Recon for a while, we got 
> warnings that all DNs become stale/dead at SCM side. After kill recon, all 
> DNs become healthy in a very short time.
>  
> *The root cause is:*
> 1) EndpointStateMachine for SCM and that for Recon share the thread pool 
> created by DatanodeStateMachine, which is a fixed size thread pool:
>  
> {code:java}
> executorService = Executors.newFixedThreadPool(
> getEndPointTaskThreadPoolSize(),
> new ThreadFactoryBuilder()
> .setNameFormat("Datanode State Machine Task Thread - %d").build());
> {code}
>  
> meanwhile, current Recon has some performance issue, after running for hours, 
> it stuck and crashed due to OOM. 
> 2) The communication between DN and Recon will soon exhaust all the threads 
> in DatanodeStateMachine.executorService, there will be no available threads 
> for DN to talk SCM. 
> 3) all DNs become stale/dead at SCM side.
>  
> *The fix is quite straightforward:*
> Each EndpointStateMachine uses its own thread pool to talk with SCM/Recon.
>  
> *BTW:*
> The first edition for DatanodeStateMachine.executorService is 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-4386) Each EndpointStateMachine uses its own thread pool to talk with SCM/Recon

2020-10-22 Thread Glen Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-4386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Glen Geng updated HDDS-4386:

Description: 
In Tencent production environment, after start Recon for a while, we got 
warnings that all DNs become stale/dead at SCM side. After kill recon, all DNs 
become healthy in a very short time.

 

*The root cause is:*

1) EndpointStateMachine for SCM and that for Recon share the thread pool 
created by DatanodeStateMachine, which is a fixed size thread pool:

 
{code:java}
executorService = Executors.newFixedThreadPool(
getEndPointTaskThreadPoolSize(),
new ThreadFactoryBuilder()
.setNameFormat("Datanode State Machine Task Thread - %d").build());
{code}
 

Meanwhile, the current Recon has a performance issue: after running for hours it 
got stuck and crashed due to OOM.

2) The communication between the DN and Recon soon exhausts all the threads in 
DatanodeStateMachine.executorService, leaving no threads available for the DN to 
talk to SCM.

3) All DNs become stale/dead on the SCM side.

 

*The fix is quite straightforward:*

Each EndpointStateMachine uses its own thread pool to talk with SCM/Recon.

 

 

  was:
In Tencent production environment, after start Recon for a while, we got 
warnings that all DNs become stale/dead at SCM side. After kill recon, all DNs 
become healthy in a very short time.

 

*The root cause is:*

1) EndpointStateMachine for SCM and that for Recon share the thread pool 
created by DatanodeStateMachine, which is a fixed size thread pool:

 
{code:java}
executorService = Executors.newFixedThreadPool(
getEndPointTaskThreadPoolSize(),
new ThreadFactoryBuilder()
.setNameFormat("Datanode State Machine Task Thread - %d").build());
{code}
 

meanwhile, current Recon has some performance issue, after running for hours, 
it stuck and crashed due to OOM. 

2) The communication between DN and Recon will soon exhaust all the threads in 
DatanodeStateMachine.executorService, there will be no available threads for DN 
to talk SCM. 

3) all DNs become stale/dead at SCM side.

 

*The fix is quite straightforward:*

 

 

 


> Each EndpointStateMachine uses its own thread pool to talk with SCM/Recon
> -
>
> Key: HDDS-4386
> URL: https://issues.apache.org/jira/browse/HDDS-4386
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>Affects Versions: 1.0.0
>Reporter: Glen Geng
>Assignee: Glen Geng
>Priority: Blocker
>
> In Tencent production environment, after start Recon for a while, we got 
> warnings that all DNs become stale/dead at SCM side. After kill recon, all 
> DNs become healthy in a very short time.
>  
> *The root cause is:*
> 1) EndpointStateMachine for SCM and that for Recon share the thread pool 
> created by DatanodeStateMachine, which is a fixed size thread pool:
>  
> {code:java}
> executorService = Executors.newFixedThreadPool(
> getEndPointTaskThreadPoolSize(),
> new ThreadFactoryBuilder()
> .setNameFormat("Datanode State Machine Task Thread - %d").build());
> {code}
>  
> meanwhile, current Recon has some performance issue, after running for hours, 
> it stuck and crashed due to OOM. 
> 2) The communication between DN and Recon will soon exhaust all the threads 
> in DatanodeStateMachine.executorService, there will be no available threads 
> for DN to talk SCM. 
> 3) all DNs become stale/dead at SCM side.
>  
> *The fix is quite straightforward:*
> Each EndpointStateMachine uses its own thread pool to talk with SCM/Recon.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-4386) Each EndpointStateMachine uses its own thread pool to talk with SCM/Recon

2020-10-22 Thread Glen Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-4386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Glen Geng updated HDDS-4386:

Description: 
In our Tencent production environment, a while after starting Recon, we got 
warnings that all DNs became stale/dead on the SCM side. After killing Recon, 
all DNs became healthy again within a very short time.

 

*The root cause is:*

1) EndpointStateMachine for SCM and that for Recon share the thread pool 
created by DatanodeStateMachine, which is a fixed size thread pool:

 
{code:java}
executorService = Executors.newFixedThreadPool(
getEndPointTaskThreadPoolSize(),
new ThreadFactoryBuilder()
.setNameFormat("Datanode State Machine Task Thread - %d").build());
{code}
 

Meanwhile, the current Recon has a performance issue: after running for hours it 
got stuck and crashed due to OOM.

2) The communication between the DN and Recon soon exhausts all the threads in 
DatanodeStateMachine.executorService, leaving no threads available for the DN to 
talk to SCM.

3) All DNs become stale/dead on the SCM side.

 

*The fix is quite straightforward:*

 

 

 

> Each EndpointStateMachine uses its own thread pool to talk with SCM/Recon
> -
>
> Key: HDDS-4386
> URL: https://issues.apache.org/jira/browse/HDDS-4386
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>Affects Versions: 1.0.0
>Reporter: Glen Geng
>Assignee: Glen Geng
>Priority: Blocker
>
> In Tencent production environment, after start Recon for a while, we got 
> warnings that all DNs become stale/dead at SCM side. After kill recon, all 
> DNs become healthy in a very short time.
>  
> *The root cause is:*
> 1) EndpointStateMachine for SCM and that for Recon share the thread pool 
> created by DatanodeStateMachine, which is a fixed size thread pool:
>  
> {code:java}
> executorService = Executors.newFixedThreadPool(
> getEndPointTaskThreadPoolSize(),
> new ThreadFactoryBuilder()
> .setNameFormat("Datanode State Machine Task Thread - %d").build());
> {code}
>  
> meanwhile, current Recon has some performance issue, after running for hours, 
> it stuck and crashed due to OOM. 
> 2) The communication between DN and Recon will soon exhaust all the threads 
> in DatanodeStateMachine.executorService, there will be no available threads 
> for DN to talk SCM. 
> 3) all DNs become stale/dead at SCM side.
>  
> *The fix is quite straightforward:*
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-4386) Each EndpointStateMachine uses its own thread pool to talk with SCM/Recon

2020-10-22 Thread Glen Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-4386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Glen Geng updated HDDS-4386:

Description: (was: Configured goofys and s3g on different hosts and 
Fiotest writes files on the goofys mount point. Export AWS secrets on the s3g 
host. See a bunch of NPE in s3g logs.
 # Looks like a missing AWS auth header could cause the NPE. It appears that 
AWSSignatureProcessor.init() doesn't handle a missing header, which causes the NPE.
 # Why the AWS auth header is missing is also unknown.

Note that some files have been successfully written into Ozone via goofys, though 
not all of them succeeded.

 

2020-10-13 11:18:43,425 [qtp1686100174-1238] ERROR 
org.apache.hadoop.ozone.s3.OzoneClientProducer: Error: 
org.jboss.weld.exceptions.WeldException: WELD-49: Unable to invoke public 
void org.apache.hadoop.ozone.s3.AWSSignatureProcessor.init() throws 
java.lang.Exception on org.apache.hadoop.ozone.s3.AWSSignatureProcessor@5535155b
 at 
org.jboss.weld.injection.producer.DefaultLifecycleCallbackInvoker.invokeMethods(DefaultLifecycleCallbackInvoker.java:99)
 at 
org.jboss.weld.injection.producer.DefaultLifecycleCallbackInvoker.postConstruct(DefaultLifecycleCallbackInvoker.java:80)
 at 
org.jboss.weld.injection.producer.BasicInjectionTarget.postConstruct(BasicInjectionTarget.java:122)
 at 
org.glassfish.jersey.ext.cdi1x.internal.CdiComponentProvider$InjectionManagerInjectedCdiTarget.postConstruct(CdiComponentProvider.java:887)
 at org.jboss.weld.bean.ManagedBean.create(ManagedBean.java:162)
 at org.jboss.weld.context.AbstractContext.get(AbstractContext.java:96)
 at 
org.jboss.weld.bean.ContextualInstanceStrategy$DefaultContextualInstanceStrategy.get(ContextualInstanceStrategy.java:100)
 at 
org.jboss.weld.bean.ContextualInstanceStrategy$CachingContextualInstanceStrategy.get(ContextualInstanceStrategy.java:177)
 at org.jboss.weld.bean.ContextualInstance.get(ContextualInstance.java:50)
 at 
org.jboss.weld.bean.proxy.ContextBeanInstance.getInstance(ContextBeanInstance.java:99)
 at 
org.jboss.weld.bean.proxy.ProxyMethodHandler.getInstance(ProxyMethodHandler.java:125)
 at 
org.apache.hadoop.ozone.s3.AWSSignatureProcessor$Proxy$_$$_WeldClientProxy.getAwsAccessId(Unknown
 Source)
 at 
org.apache.hadoop.ozone.s3.OzoneClientProducer.getClient(OzoneClientProducer.java:79)
 at 
org.apache.hadoop.ozone.s3.OzoneClientProducer.createClient(OzoneClientProducer.java:68)
 at sun.reflect.GeneratedMethodAccessor16.invoke(Unknown Source)
 at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:498)
 at 
org.jboss.weld.injection.StaticMethodInjectionPoint.invoke(StaticMethodInjectionPoint.java:88)
 at 
org.jboss.weld.injection.StaticMethodInjectionPoint.invoke(StaticMethodInjectionPoint.java:78)
 at 
org.jboss.weld.injection.producer.ProducerMethodProducer.produce(ProducerMethodProducer.java:100)
 at 
org.jboss.weld.injection.producer.AbstractMemberProducer.produce(AbstractMemberProducer.java:161)
 at 
org.jboss.weld.bean.AbstractProducerBean.create(AbstractProducerBean.java:180)
 at 
org.jboss.weld.context.unbound.DependentContextImpl.get(DependentContextImpl.java:70)
 at 
org.jboss.weld.bean.ContextualInstanceStrategy$DefaultContextualInstanceStrategy.get(ContextualInstanceStrategy.java:100)
 at org.jboss.weld.bean.ContextualInstance.get(ContextualInstance.java:50)
 at 
org.jboss.weld.manager.BeanManagerImpl.getReference(BeanManagerImpl.java:785)
 at 
org.jboss.weld.manager.BeanManagerImpl.getInjectableReference(BeanManagerImpl.java:885)
 at 
org.jboss.weld.injection.FieldInjectionPoint.inject(FieldInjectionPoint.java:92)
 at org.jboss.weld.util.Beans.injectBoundFields(Beans.java:358)
 at org.jboss.weld.util.Beans.injectFieldsAndInitializers(Beans.java:369)
 at 
org.jboss.weld.injection.producer.ResourceInjector$1.proceed(ResourceInjector.java:70)
 at 
org.jboss.weld.injection.InjectionContextImpl.run(InjectionContextImpl.java:48)
 at 
org.jboss.weld.injection.producer.ResourceInjector.inject(ResourceInjector.java:72)
 at 
org.jboss.weld.injection.producer.BasicInjectionTarget.inject(BasicInjectionTarget.java:117)
 at 
org.glassfish.jersey.ext.cdi1x.internal.CdiComponentProvider$InjectionManagerInjectedCdiTarget.inject(CdiComponentProvider.java:873)
 at org.jboss.weld.bean.ManagedBean.create(ManagedBean.java:159)
 at 
org.jboss.weld.context.unbound.DependentContextImpl.get(DependentContextImpl.java:70)
 at 
org.jboss.weld.bean.ContextualInstanceStrategy$DefaultContextualInstanceStrategy.get(ContextualInstanceStrategy.java:100)
 at org.jboss.weld.bean.ContextualInstance.get(ContextualInstance.java:50)
 at 
org.jboss.weld.manager.BeanManagerImpl.getReference(BeanManagerImpl.java:785)
 at 
org.jboss.weld.manager.BeanManagerImpl.getReference(BeanManagerImpl.java:808)
 at 
org.jboss.weld.util.ForwardingBeanManager.getReference(ForwardingBeanManager.java:61

[jira] [Updated] (HDDS-4386) Each EndpointStateMachine uses its own thread pool to talk with SCM/Recon

2020-10-22 Thread Glen Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-4386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Glen Geng updated HDDS-4386:

Summary: Each EndpointStateMachine uses its own thread pool to talk with 
SCM/Recon  (was: EndpointStateMachine)

> Each EndpointStateMachine uses its own thread pool to talk with SCM/Recon
> -
>
> Key: HDDS-4386
> URL: https://issues.apache.org/jira/browse/HDDS-4386
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>Affects Versions: 1.0.0
>Reporter: Glen Geng
>Assignee: Glen Geng
>Priority: Blocker
>  Labels: pull-request-available
>
> Configured goofys and s3g on different hosts and Fiotest writes files on the 
> goofys mount point. Export AWS secrets on the s3g host. See a bunch of NPE in 
> s3g logs.
>  # Looks like a missing AWS auth header could cause the NPE. It appears that 
> AWSSignatureProcessor.init() doesn't handle a missing header, which causes the NPE.
>  # Why the AWS auth header is missing is also unknown.
> Note that some files have been successfully written into Ozone via goofys, 
> though not all of them succeeded.
>  
> 2020-10-13 11:18:43,425 [qtp1686100174-1238] ERROR 
> org.apache.hadoop.ozone.s3.OzoneClientProducer: Error: 
> org.jboss.weld.exceptions.WeldException: WELD-49: Unable to invoke public 
> void org.apache.hadoop.ozone.s3.AWSSignatureProcessor.init() throws 
> java.lang.Exception on 
> org.apache.hadoop.ozone.s3.AWSSignatureProcessor@5535155b
>  at 
> org.jboss.weld.injection.producer.DefaultLifecycleCallbackInvoker.invokeMethods(DefaultLifecycleCallbackInvoker.java:99)
>  at 
> org.jboss.weld.injection.producer.DefaultLifecycleCallbackInvoker.postConstruct(DefaultLifecycleCallbackInvoker.java:80)
>  at 
> org.jboss.weld.injection.producer.BasicInjectionTarget.postConstruct(BasicInjectionTarget.java:122)
>  at 
> org.glassfish.jersey.ext.cdi1x.internal.CdiComponentProvider$InjectionManagerInjectedCdiTarget.postConstruct(CdiComponentProvider.java:887)
>  at org.jboss.weld.bean.ManagedBean.create(ManagedBean.java:162)
>  at org.jboss.weld.context.AbstractContext.get(AbstractContext.java:96)
>  at 
> org.jboss.weld.bean.ContextualInstanceStrategy$DefaultContextualInstanceStrategy.get(ContextualInstanceStrategy.java:100)
>  at 
> org.jboss.weld.bean.ContextualInstanceStrategy$CachingContextualInstanceStrategy.get(ContextualInstanceStrategy.java:177)
>  at org.jboss.weld.bean.ContextualInstance.get(ContextualInstance.java:50)
>  at 
> org.jboss.weld.bean.proxy.ContextBeanInstance.getInstance(ContextBeanInstance.java:99)
>  at 
> org.jboss.weld.bean.proxy.ProxyMethodHandler.getInstance(ProxyMethodHandler.java:125)
>  at 
> org.apache.hadoop.ozone.s3.AWSSignatureProcessor$Proxy$_$$_WeldClientProxy.getAwsAccessId(Unknown
>  Source)
>  at 
> org.apache.hadoop.ozone.s3.OzoneClientProducer.getClient(OzoneClientProducer.java:79)
>  at 
> org.apache.hadoop.ozone.s3.OzoneClientProducer.createClient(OzoneClientProducer.java:68)
>  at sun.reflect.GeneratedMethodAccessor16.invoke(Unknown Source)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498)
>  at 
> org.jboss.weld.injection.StaticMethodInjectionPoint.invoke(StaticMethodInjectionPoint.java:88)
>  at 
> org.jboss.weld.injection.StaticMethodInjectionPoint.invoke(StaticMethodInjectionPoint.java:78)
>  at 
> org.jboss.weld.injection.producer.ProducerMethodProducer.produce(ProducerMethodProducer.java:100)
>  at 
> org.jboss.weld.injection.producer.AbstractMemberProducer.produce(AbstractMemberProducer.java:161)
>  at 
> org.jboss.weld.bean.AbstractProducerBean.create(AbstractProducerBean.java:180)
>  at 
> org.jboss.weld.context.unbound.DependentContextImpl.get(DependentContextImpl.java:70)
>  at 
> org.jboss.weld.bean.ContextualInstanceStrategy$DefaultContextualInstanceStrategy.get(ContextualInstanceStrategy.java:100)
>  at org.jboss.weld.bean.ContextualInstance.get(ContextualInstance.java:50)
>  at 
> org.jboss.weld.manager.BeanManagerImpl.getReference(BeanManagerImpl.java:785)
>  at 
> org.jboss.weld.manager.BeanManagerImpl.getInjectableReference(BeanManagerImpl.java:885)
>  at 
> org.jboss.weld.injection.FieldInjectionPoint.inject(FieldInjectionPoint.java:92)
>  at org.jboss.weld.util.Beans.injectBoundFields(Beans.java:358)
>  at org.jboss.weld.util.Beans.injectFieldsAndInitializers(Beans.java:369)
>  at 
> org.jboss.weld.injection.producer.ResourceInjector$1.proceed(ResourceInjector.java:70)
>  at 
> org.jboss.weld.injection.InjectionContextImpl.run(InjectionContextImpl.java:48)
>  at 
> org.jboss.weld.injection.producer.ResourceInjector.inject(ResourceInjector.java:72)
>  at 
> org.jboss.weld.injection.producer.BasicInjectionTarget.inject(BasicInjectionTarget.java:117

[jira] [Updated] (HDDS-4386) Each EndpointStateMachine uses its own thread pool to talk with SCM/Recon

2020-10-22 Thread Glen Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-4386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Glen Geng updated HDDS-4386:

Labels:   (was: pull-request-available)

> Each EndpointStateMachine uses its own thread pool to talk with SCM/Recon
> -
>
> Key: HDDS-4386
> URL: https://issues.apache.org/jira/browse/HDDS-4386
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>Affects Versions: 1.0.0
>Reporter: Glen Geng
>Assignee: Glen Geng
>Priority: Blocker
>
> Configured goofys and s3g on different hosts and Fiotest writes files on the 
> goofys mount point. Export AWS secrets on the s3g host. See a bunch of NPE in 
> s3g logs.
>  # Looks like a missing AWS auth header could cause the NPE. It appears that 
> AWSSignatureProcessor.init() doesn't handle a missing header, which causes the NPE.
>  # Why the AWS auth header is missing is also unknown.
> Note that some files have been successfully written into Ozone via goofys, 
> though not all of them succeeded.
>  
> 2020-10-13 11:18:43,425 [qtp1686100174-1238] ERROR 
> org.apache.hadoop.ozone.s3.OzoneClientProducer: Error: 
> org.jboss.weld.exceptions.WeldException: WELD-49: Unable to invoke public 
> void org.apache.hadoop.ozone.s3.AWSSignatureProcessor.init() throws 
> java.lang.Exception on 
> org.apache.hadoop.ozone.s3.AWSSignatureProcessor@5535155b
>  at 
> org.jboss.weld.injection.producer.DefaultLifecycleCallbackInvoker.invokeMethods(DefaultLifecycleCallbackInvoker.java:99)
>  at 
> org.jboss.weld.injection.producer.DefaultLifecycleCallbackInvoker.postConstruct(DefaultLifecycleCallbackInvoker.java:80)
>  at 
> org.jboss.weld.injection.producer.BasicInjectionTarget.postConstruct(BasicInjectionTarget.java:122)
>  at 
> org.glassfish.jersey.ext.cdi1x.internal.CdiComponentProvider$InjectionManagerInjectedCdiTarget.postConstruct(CdiComponentProvider.java:887)
>  at org.jboss.weld.bean.ManagedBean.create(ManagedBean.java:162)
>  at org.jboss.weld.context.AbstractContext.get(AbstractContext.java:96)
>  at 
> org.jboss.weld.bean.ContextualInstanceStrategy$DefaultContextualInstanceStrategy.get(ContextualInstanceStrategy.java:100)
>  at 
> org.jboss.weld.bean.ContextualInstanceStrategy$CachingContextualInstanceStrategy.get(ContextualInstanceStrategy.java:177)
>  at org.jboss.weld.bean.ContextualInstance.get(ContextualInstance.java:50)
>  at 
> org.jboss.weld.bean.proxy.ContextBeanInstance.getInstance(ContextBeanInstance.java:99)
>  at 
> org.jboss.weld.bean.proxy.ProxyMethodHandler.getInstance(ProxyMethodHandler.java:125)
>  at 
> org.apache.hadoop.ozone.s3.AWSSignatureProcessor$Proxy$_$$_WeldClientProxy.getAwsAccessId(Unknown
>  Source)
>  at 
> org.apache.hadoop.ozone.s3.OzoneClientProducer.getClient(OzoneClientProducer.java:79)
>  at 
> org.apache.hadoop.ozone.s3.OzoneClientProducer.createClient(OzoneClientProducer.java:68)
>  at sun.reflect.GeneratedMethodAccessor16.invoke(Unknown Source)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498)
>  at 
> org.jboss.weld.injection.StaticMethodInjectionPoint.invoke(StaticMethodInjectionPoint.java:88)
>  at 
> org.jboss.weld.injection.StaticMethodInjectionPoint.invoke(StaticMethodInjectionPoint.java:78)
>  at 
> org.jboss.weld.injection.producer.ProducerMethodProducer.produce(ProducerMethodProducer.java:100)
>  at 
> org.jboss.weld.injection.producer.AbstractMemberProducer.produce(AbstractMemberProducer.java:161)
>  at 
> org.jboss.weld.bean.AbstractProducerBean.create(AbstractProducerBean.java:180)
>  at 
> org.jboss.weld.context.unbound.DependentContextImpl.get(DependentContextImpl.java:70)
>  at 
> org.jboss.weld.bean.ContextualInstanceStrategy$DefaultContextualInstanceStrategy.get(ContextualInstanceStrategy.java:100)
>  at org.jboss.weld.bean.ContextualInstance.get(ContextualInstance.java:50)
>  at 
> org.jboss.weld.manager.BeanManagerImpl.getReference(BeanManagerImpl.java:785)
>  at 
> org.jboss.weld.manager.BeanManagerImpl.getInjectableReference(BeanManagerImpl.java:885)
>  at 
> org.jboss.weld.injection.FieldInjectionPoint.inject(FieldInjectionPoint.java:92)
>  at org.jboss.weld.util.Beans.injectBoundFields(Beans.java:358)
>  at org.jboss.weld.util.Beans.injectFieldsAndInitializers(Beans.java:369)
>  at 
> org.jboss.weld.injection.producer.ResourceInjector$1.proceed(ResourceInjector.java:70)
>  at 
> org.jboss.weld.injection.InjectionContextImpl.run(InjectionContextImpl.java:48)
>  at 
> org.jboss.weld.injection.producer.ResourceInjector.inject(ResourceInjector.java:72)
>  at 
> org.jboss.weld.injection.producer.BasicInjectionTarget.inject(BasicInjectionTarget.java:117)
>  at 
> org.glassfish.jersey.ext.cdi1x.internal.CdiComponentProvider$InjectionManagerInjectedCdiTarget.inject(CdiCom

[jira] [Assigned] (HDDS-4386) CLONE - Ozone S3 gateway throws NPE with goofys

2020-10-22 Thread Glen Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-4386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Glen Geng reassigned HDDS-4386:
---

Assignee: Glen Geng  (was: Li Cheng)

> CLONE - Ozone S3 gateway throws NPE with goofys
> ---
>
> Key: HDDS-4386
> URL: https://issues.apache.org/jira/browse/HDDS-4386
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>Affects Versions: 1.0.0
>Reporter: Glen Geng
>Assignee: Glen Geng
>Priority: Blocker
>  Labels: pull-request-available
>
> Configured goofys and s3g on different hosts and Fiotest writes files on the 
> goofys mount point. Export AWS secrets on the s3g host. See a bunch of NPE in 
> s3g logs.
>  # Looks like a missing AWS auth header could cause the NPE. It appears that 
> AWSSignatureProcessor.init() doesn't handle a missing header, which causes the NPE.
>  # Why the AWS auth header is missing is also unknown.
> Note that some files have been successfully written into Ozone via goofys, 
> though not all of them succeeded.
>  
> 2020-10-13 11:18:43,425 [qtp1686100174-1238] ERROR 
> org.apache.hadoop.ozone.s3.OzoneClientProducer: Error: 
> org.jboss.weld.exceptions.WeldException: WELD-49: Unable to invoke public 
> void org.apache.hadoop.ozone.s3.AWSSignatureProcessor.init() throws 
> java.lang.Exception on 
> org.apache.hadoop.ozone.s3.AWSSignatureProcessor@5535155b
>  at 
> org.jboss.weld.injection.producer.DefaultLifecycleCallbackInvoker.invokeMethods(DefaultLifecycleCallbackInvoker.java:99)
>  at 
> org.jboss.weld.injection.producer.DefaultLifecycleCallbackInvoker.postConstruct(DefaultLifecycleCallbackInvoker.java:80)
>  at 
> org.jboss.weld.injection.producer.BasicInjectionTarget.postConstruct(BasicInjectionTarget.java:122)
>  at 
> org.glassfish.jersey.ext.cdi1x.internal.CdiComponentProvider$InjectionManagerInjectedCdiTarget.postConstruct(CdiComponentProvider.java:887)
>  at org.jboss.weld.bean.ManagedBean.create(ManagedBean.java:162)
>  at org.jboss.weld.context.AbstractContext.get(AbstractContext.java:96)
>  at 
> org.jboss.weld.bean.ContextualInstanceStrategy$DefaultContextualInstanceStrategy.get(ContextualInstanceStrategy.java:100)
>  at 
> org.jboss.weld.bean.ContextualInstanceStrategy$CachingContextualInstanceStrategy.get(ContextualInstanceStrategy.java:177)
>  at org.jboss.weld.bean.ContextualInstance.get(ContextualInstance.java:50)
>  at 
> org.jboss.weld.bean.proxy.ContextBeanInstance.getInstance(ContextBeanInstance.java:99)
>  at 
> org.jboss.weld.bean.proxy.ProxyMethodHandler.getInstance(ProxyMethodHandler.java:125)
>  at 
> org.apache.hadoop.ozone.s3.AWSSignatureProcessor$Proxy$_$$_WeldClientProxy.getAwsAccessId(Unknown
>  Source)
>  at 
> org.apache.hadoop.ozone.s3.OzoneClientProducer.getClient(OzoneClientProducer.java:79)
>  at 
> org.apache.hadoop.ozone.s3.OzoneClientProducer.createClient(OzoneClientProducer.java:68)
>  at sun.reflect.GeneratedMethodAccessor16.invoke(Unknown Source)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498)
>  at 
> org.jboss.weld.injection.StaticMethodInjectionPoint.invoke(StaticMethodInjectionPoint.java:88)
>  at 
> org.jboss.weld.injection.StaticMethodInjectionPoint.invoke(StaticMethodInjectionPoint.java:78)
>  at 
> org.jboss.weld.injection.producer.ProducerMethodProducer.produce(ProducerMethodProducer.java:100)
>  at 
> org.jboss.weld.injection.producer.AbstractMemberProducer.produce(AbstractMemberProducer.java:161)
>  at 
> org.jboss.weld.bean.AbstractProducerBean.create(AbstractProducerBean.java:180)
>  at 
> org.jboss.weld.context.unbound.DependentContextImpl.get(DependentContextImpl.java:70)
>  at 
> org.jboss.weld.bean.ContextualInstanceStrategy$DefaultContextualInstanceStrategy.get(ContextualInstanceStrategy.java:100)
>  at org.jboss.weld.bean.ContextualInstance.get(ContextualInstance.java:50)
>  at 
> org.jboss.weld.manager.BeanManagerImpl.getReference(BeanManagerImpl.java:785)
>  at 
> org.jboss.weld.manager.BeanManagerImpl.getInjectableReference(BeanManagerImpl.java:885)
>  at 
> org.jboss.weld.injection.FieldInjectionPoint.inject(FieldInjectionPoint.java:92)
>  at org.jboss.weld.util.Beans.injectBoundFields(Beans.java:358)
>  at org.jboss.weld.util.Beans.injectFieldsAndInitializers(Beans.java:369)
>  at 
> org.jboss.weld.injection.producer.ResourceInjector$1.proceed(ResourceInjector.java:70)
>  at 
> org.jboss.weld.injection.InjectionContextImpl.run(InjectionContextImpl.java:48)
>  at 
> org.jboss.weld.injection.producer.ResourceInjector.inject(ResourceInjector.java:72)
>  at 
> org.jboss.weld.injection.producer.BasicInjectionTarget.inject(BasicInjectionTarget.java:117)
>  at 
> org.glassfish.jersey.ext.cdi1x.internal.CdiComponentProvider$InjectionManagerInjectedCdiTarget.inject(CdiCompo

[jira] [Updated] (HDDS-4386) EndpointStateMachine

2020-10-22 Thread Glen Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-4386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Glen Geng updated HDDS-4386:

Summary: EndpointStateMachine  (was: CLONE - Ozone S3 gateway throws NPE 
with goofys)

> EndpointStateMachine
> 
>
> Key: HDDS-4386
> URL: https://issues.apache.org/jira/browse/HDDS-4386
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>Affects Versions: 1.0.0
>Reporter: Glen Geng
>Assignee: Glen Geng
>Priority: Blocker
>  Labels: pull-request-available
>
> Configured goofys and s3g on different hosts and Fiotest writes files on the 
> goofys mount point. Export AWS secrets on the s3g host. See a bunch of NPE in 
> s3g logs.
>  # Looks like a missing AWS auth header could cause the NPE. It appears that 
> AWSSignatureProcessor.init() doesn't handle a missing header, which causes the NPE.
>  # Why the AWS auth header is missing is also unknown.
> Note that some files have been successfully written into Ozone via goofys, 
> though not all of them succeeded.
>  
> 2020-10-13 11:18:43,425 [qtp1686100174-1238] ERROR 
> org.apache.hadoop.ozone.s3.OzoneClientProducer: Error: 
> org.jboss.weld.exceptions.WeldException: WELD-49: Unable to invoke public 
> void org.apache.hadoop.ozone.s3.AWSSignatureProcessor.init() throws 
> java.lang.Exception on 
> org.apache.hadoop.ozone.s3.AWSSignatureProcessor@5535155b
>  at 
> org.jboss.weld.injection.producer.DefaultLifecycleCallbackInvoker.invokeMethods(DefaultLifecycleCallbackInvoker.java:99)
>  at 
> org.jboss.weld.injection.producer.DefaultLifecycleCallbackInvoker.postConstruct(DefaultLifecycleCallbackInvoker.java:80)
>  at 
> org.jboss.weld.injection.producer.BasicInjectionTarget.postConstruct(BasicInjectionTarget.java:122)
>  at 
> org.glassfish.jersey.ext.cdi1x.internal.CdiComponentProvider$InjectionManagerInjectedCdiTarget.postConstruct(CdiComponentProvider.java:887)
>  at org.jboss.weld.bean.ManagedBean.create(ManagedBean.java:162)
>  at org.jboss.weld.context.AbstractContext.get(AbstractContext.java:96)
>  at 
> org.jboss.weld.bean.ContextualInstanceStrategy$DefaultContextualInstanceStrategy.get(ContextualInstanceStrategy.java:100)
>  at 
> org.jboss.weld.bean.ContextualInstanceStrategy$CachingContextualInstanceStrategy.get(ContextualInstanceStrategy.java:177)
>  at org.jboss.weld.bean.ContextualInstance.get(ContextualInstance.java:50)
>  at 
> org.jboss.weld.bean.proxy.ContextBeanInstance.getInstance(ContextBeanInstance.java:99)
>  at 
> org.jboss.weld.bean.proxy.ProxyMethodHandler.getInstance(ProxyMethodHandler.java:125)
>  at 
> org.apache.hadoop.ozone.s3.AWSSignatureProcessor$Proxy$_$$_WeldClientProxy.getAwsAccessId(Unknown
>  Source)
>  at 
> org.apache.hadoop.ozone.s3.OzoneClientProducer.getClient(OzoneClientProducer.java:79)
>  at 
> org.apache.hadoop.ozone.s3.OzoneClientProducer.createClient(OzoneClientProducer.java:68)
>  at sun.reflect.GeneratedMethodAccessor16.invoke(Unknown Source)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498)
>  at 
> org.jboss.weld.injection.StaticMethodInjectionPoint.invoke(StaticMethodInjectionPoint.java:88)
>  at 
> org.jboss.weld.injection.StaticMethodInjectionPoint.invoke(StaticMethodInjectionPoint.java:78)
>  at 
> org.jboss.weld.injection.producer.ProducerMethodProducer.produce(ProducerMethodProducer.java:100)
>  at 
> org.jboss.weld.injection.producer.AbstractMemberProducer.produce(AbstractMemberProducer.java:161)
>  at 
> org.jboss.weld.bean.AbstractProducerBean.create(AbstractProducerBean.java:180)
>  at 
> org.jboss.weld.context.unbound.DependentContextImpl.get(DependentContextImpl.java:70)
>  at 
> org.jboss.weld.bean.ContextualInstanceStrategy$DefaultContextualInstanceStrategy.get(ContextualInstanceStrategy.java:100)
>  at org.jboss.weld.bean.ContextualInstance.get(ContextualInstance.java:50)
>  at 
> org.jboss.weld.manager.BeanManagerImpl.getReference(BeanManagerImpl.java:785)
>  at 
> org.jboss.weld.manager.BeanManagerImpl.getInjectableReference(BeanManagerImpl.java:885)
>  at 
> org.jboss.weld.injection.FieldInjectionPoint.inject(FieldInjectionPoint.java:92)
>  at org.jboss.weld.util.Beans.injectBoundFields(Beans.java:358)
>  at org.jboss.weld.util.Beans.injectFieldsAndInitializers(Beans.java:369)
>  at 
> org.jboss.weld.injection.producer.ResourceInjector$1.proceed(ResourceInjector.java:70)
>  at 
> org.jboss.weld.injection.InjectionContextImpl.run(InjectionContextImpl.java:48)
>  at 
> org.jboss.weld.injection.producer.ResourceInjector.inject(ResourceInjector.java:72)
>  at 
> org.jboss.weld.injection.producer.BasicInjectionTarget.inject(BasicInjectionTarget.java:117)
>  at 
> org.glassfish.jersey.ext.cdi1x.internal.CdiComponentProvider$InjectionManagerInjectedCdiTarget.inject(CdiComponentProvide

[jira] [Created] (HDDS-4386) CLONE - Ozone S3 gateway throws NPE with goofys

2020-10-22 Thread Glen Geng (Jira)
Glen Geng created HDDS-4386:
---

 Summary: CLONE - Ozone S3 gateway throws NPE with goofys
 Key: HDDS-4386
 URL: https://issues.apache.org/jira/browse/HDDS-4386
 Project: Hadoop Distributed Data Store
  Issue Type: Bug
Affects Versions: 1.0.0
Reporter: Glen Geng
Assignee: Li Cheng


Configured goofys and s3g on different hosts and Fiotest writes files on the 
goofys mount point. Export AWS secrets on the s3g host. See a bunch of NPE in 
s3g logs.
 # Looks like a missing AWS auth header could cause the NPE. It appears that 
AWSSignatureProcessor.init() doesn't handle a missing header, which causes the NPE.
 # Why the AWS auth header is missing is also unknown.

Note that some files have been successfully written into Ozone via goofys, though 
not all of them succeeded.

 

2020-10-13 11:18:43,425 [qtp1686100174-1238] ERROR 
org.apache.hadoop.ozone.s3.OzoneClientProducer: Error: 
org.jboss.weld.exceptions.WeldException: WELD-49: Unable to invoke public 
void org.apache.hadoop.ozone.s3.AWSSignatureProcessor.init() throws 
java.lang.Exception on org.apache.hadoop.ozone.s3.AWSSignatureProcessor@5535155b
 at 
org.jboss.weld.injection.producer.DefaultLifecycleCallbackInvoker.invokeMethods(DefaultLifecycleCallbackInvoker.java:99)
 at 
org.jboss.weld.injection.producer.DefaultLifecycleCallbackInvoker.postConstruct(DefaultLifecycleCallbackInvoker.java:80)
 at 
org.jboss.weld.injection.producer.BasicInjectionTarget.postConstruct(BasicInjectionTarget.java:122)
 at 
org.glassfish.jersey.ext.cdi1x.internal.CdiComponentProvider$InjectionManagerInjectedCdiTarget.postConstruct(CdiComponentProvider.java:887)
 at org.jboss.weld.bean.ManagedBean.create(ManagedBean.java:162)
 at org.jboss.weld.context.AbstractContext.get(AbstractContext.java:96)
 at 
org.jboss.weld.bean.ContextualInstanceStrategy$DefaultContextualInstanceStrategy.get(ContextualInstanceStrategy.java:100)
 at 
org.jboss.weld.bean.ContextualInstanceStrategy$CachingContextualInstanceStrategy.get(ContextualInstanceStrategy.java:177)
 at org.jboss.weld.bean.ContextualInstance.get(ContextualInstance.java:50)
 at 
org.jboss.weld.bean.proxy.ContextBeanInstance.getInstance(ContextBeanInstance.java:99)
 at 
org.jboss.weld.bean.proxy.ProxyMethodHandler.getInstance(ProxyMethodHandler.java:125)
 at 
org.apache.hadoop.ozone.s3.AWSSignatureProcessor$Proxy$_$$_WeldClientProxy.getAwsAccessId(Unknown
 Source)
 at 
org.apache.hadoop.ozone.s3.OzoneClientProducer.getClient(OzoneClientProducer.java:79)
 at 
org.apache.hadoop.ozone.s3.OzoneClientProducer.createClient(OzoneClientProducer.java:68)
 at sun.reflect.GeneratedMethodAccessor16.invoke(Unknown Source)
 at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:498)
 at 
org.jboss.weld.injection.StaticMethodInjectionPoint.invoke(StaticMethodInjectionPoint.java:88)
 at 
org.jboss.weld.injection.StaticMethodInjectionPoint.invoke(StaticMethodInjectionPoint.java:78)
 at 
org.jboss.weld.injection.producer.ProducerMethodProducer.produce(ProducerMethodProducer.java:100)
 at 
org.jboss.weld.injection.producer.AbstractMemberProducer.produce(AbstractMemberProducer.java:161)
 at 
org.jboss.weld.bean.AbstractProducerBean.create(AbstractProducerBean.java:180)
 at 
org.jboss.weld.context.unbound.DependentContextImpl.get(DependentContextImpl.java:70)
 at 
org.jboss.weld.bean.ContextualInstanceStrategy$DefaultContextualInstanceStrategy.get(ContextualInstanceStrategy.java:100)
 at org.jboss.weld.bean.ContextualInstance.get(ContextualInstance.java:50)
 at 
org.jboss.weld.manager.BeanManagerImpl.getReference(BeanManagerImpl.java:785)
 at 
org.jboss.weld.manager.BeanManagerImpl.getInjectableReference(BeanManagerImpl.java:885)
 at 
org.jboss.weld.injection.FieldInjectionPoint.inject(FieldInjectionPoint.java:92)
 at org.jboss.weld.util.Beans.injectBoundFields(Beans.java:358)
 at org.jboss.weld.util.Beans.injectFieldsAndInitializers(Beans.java:369)
 at 
org.jboss.weld.injection.producer.ResourceInjector$1.proceed(ResourceInjector.java:70)
 at 
org.jboss.weld.injection.InjectionContextImpl.run(InjectionContextImpl.java:48)
 at 
org.jboss.weld.injection.producer.ResourceInjector.inject(ResourceInjector.java:72)
 at 
org.jboss.weld.injection.producer.BasicInjectionTarget.inject(BasicInjectionTarget.java:117)
 at 
org.glassfish.jersey.ext.cdi1x.internal.CdiComponentProvider$InjectionManagerInjectedCdiTarget.inject(CdiComponentProvider.java:873)
 at org.jboss.weld.bean.ManagedBean.create(ManagedBean.java:159)
 at 
org.jboss.weld.context.unbound.DependentContextImpl.get(DependentContextImpl.java:70)
 at 
org.jboss.weld.bean.ContextualInstanceStrategy$DefaultContextualInstanceStrategy.get(ContextualInstanceStrategy.java:100)
 at org.jboss.weld.bean.ContextualInstance.get(ContextualInstance.java:50)
 at 
org.jboss.weld.manager.BeanManagerImpl.getReference(BeanMa

[jira] [Updated] (HDDS-4365) SCMBlockLocationFailoverProxyProvider should use ScmBlockLocationProtocolPB.class in RPC.setProtocolEngine

2020-10-21 Thread Glen Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-4365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Glen Geng updated HDDS-4365:

Description: 
In SCMBlockLocationFailoverProxyProvider, it is currently:
{code:java}
private ScmBlockLocationProtocolPB createSCMProxy(
InetSocketAddress scmAddress) throws IOException {
  ...
  RPC.setProtocolEngine(hadoopConf, ScmBlockLocationProtocol.class,
  ProtobufRpcEngine.class);
  ...{code}
 it should be 
{code:java}
private ScmBlockLocationProtocolPB createSCMProxy(
InetSocketAddress scmAddress) throws IOException {
  ...
  RPC.setProtocolEngine(hadoopConf, ScmBlockLocationProtocolPB.class,
  ProtobufRpcEngine.class);
  ...{code}
 

FYI, per the non-HA version:
{code:java}
private static ScmBlockLocationProtocol getScmBlockClient(
OzoneConfiguration conf) throws IOException {
  RPC.setProtocolEngine(conf, ScmBlockLocationProtocolPB.class,
  ProtobufRpcEngine.class);
  long scmVersion =
  RPC.getProtocolVersion(ScmBlockLocationProtocolPB.class);
  InetSocketAddress scmBlockAddress =
  getScmAddressForBlockClients(conf);
  ScmBlockLocationProtocolClientSideTranslatorPB scmBlockLocationClient =
  new ScmBlockLocationProtocolClientSideTranslatorPB(
  RPC.getProxy(ScmBlockLocationProtocolPB.class, scmVersion,
  scmBlockAddress, UserGroupInformation.getCurrentUser(), conf,
  NetUtils.getDefaultSocketFactory(conf),
  Client.getRpcTimeout(conf)));
  return TracingUtil
  .createProxy(scmBlockLocationClient, ScmBlockLocationProtocol.class,
  conf);
}
{code}

  was:
in SCMBlockLocationFailoverProxyProvider, currently it is
{code:java}
private ScmBlockLocationProtocolPB createSCMProxy(
InetSocketAddress scmAddress) throws IOException {
  ...
  RPC.setProtocolEngine(hadoopConf, ScmBlockLocationProtocol.class,
  ProtobufRpcEngine.class);
  ...{code}
 it should be 
{code:java}
private ScmBlockLocationProtocolPB createSCMProxy(
InetSocketAddress scmAddress) throws IOException {
  ...
  RPC.setProtocolEngine(hadoopConf, ScmBlockLocationProtocolPB.class,
  ProtobufRpcEngine.class);
  ...{code}
 


> SCMBlockLocationFailoverProxyProvider should use 
> ScmBlockLocationProtocolPB.class in RPC.setProtocolEngine
> --
>
> Key: HDDS-4365
> URL: https://issues.apache.org/jira/browse/HDDS-4365
> Project: Hadoop Distributed Data Store
>  Issue Type: Sub-task
>  Components: SCM
>Reporter: Glen Geng
>Assignee: Glen Geng
>Priority: Minor
>
> In SCMBlockLocationFailoverProxyProvider, it is currently:
> {code:java}
> private ScmBlockLocationProtocolPB createSCMProxy(
> InetSocketAddress scmAddress) throws IOException {
>   ...
>   RPC.setProtocolEngine(hadoopConf, ScmBlockLocationProtocol.class,
>   ProtobufRpcEngine.class);
>   ...{code}
>  it should be 
> {code:java}
> private ScmBlockLocationProtocolPB createSCMProxy(
> InetSocketAddress scmAddress) throws IOException {
>   ...
>   RPC.setProtocolEngine(hadoopConf, ScmBlockLocationProtocolPB.class,
>   ProtobufRpcEngine.class);
>   ...{code}
>  
> FYI, per the non-HA version:
> {code:java}
> private static ScmBlockLocationProtocol getScmBlockClient(
> OzoneConfiguration conf) throws IOException {
>   RPC.setProtocolEngine(conf, ScmBlockLocationProtocolPB.class,
>   ProtobufRpcEngine.class);
>   long scmVersion =
>   RPC.getProtocolVersion(ScmBlockLocationProtocolPB.class);
>   InetSocketAddress scmBlockAddress =
>   getScmAddressForBlockClients(conf);
>   ScmBlockLocationProtocolClientSideTranslatorPB scmBlockLocationClient =
>   new ScmBlockLocationProtocolClientSideTranslatorPB(
>   RPC.getProxy(ScmBlockLocationProtocolPB.class, scmVersion,
>   scmBlockAddress, UserGroupInformation.getCurrentUser(), conf,
>   NetUtils.getDefaultSocketFactory(conf),
>   Client.getRpcTimeout(conf)));
>   return TracingUtil
>   .createProxy(scmBlockLocationClient, ScmBlockLocationProtocol.class,
>   conf);
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDDS-4365) SCMBlockLocationFailoverProxyProvider should use ScmBlockLocationProtocolPB.class in RPC.setProtocolEngine

2020-10-21 Thread Glen Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-4365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Glen Geng reassigned HDDS-4365:
---

Assignee: Glen Geng

> SCMBlockLocationFailoverProxyProvider should use 
> ScmBlockLocationProtocolPB.class in RPC.setProtocolEngine
> --
>
> Key: HDDS-4365
> URL: https://issues.apache.org/jira/browse/HDDS-4365
> Project: Hadoop Distributed Data Store
>  Issue Type: Sub-task
>  Components: SCM
>Reporter: Glen Geng
>Assignee: Glen Geng
>Priority: Minor
>
> in SCMBlockLocationFailoverProxyProvider, currently it is
> {code:java}
> private ScmBlockLocationProtocolPB createSCMProxy(
> InetSocketAddress scmAddress) throws IOException {
>   ...
>   RPC.setProtocolEngine(hadoopConf, ScmBlockLocationProtocol.class,
>   ProtobufRpcEngine.class);
>   ...{code}
>  it should be 
> {code:java}
> private ScmBlockLocationProtocolPB createSCMProxy(
> InetSocketAddress scmAddress) throws IOException {
>   ...
>   RPC.setProtocolEngine(hadoopConf, ScmBlockLocationProtocolPB.class,
>   ProtobufRpcEngine.class);
>   ...{code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-4365) SCMBlockLocationFailoverProxyProvider should use ScmBlockLocationProtocolPB.class in RPC.setProtocolEngine

2020-10-21 Thread Glen Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-4365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Glen Geng updated HDDS-4365:

Description: 
in SCMBlockLocationFailoverProxyProvider, currently it is
{code:java}
private ScmBlockLocationProtocolPB createSCMProxy(
InetSocketAddress scmAddress) throws IOException {
  ...
  RPC.setProtocolEngine(hadoopConf, ScmBlockLocationProtocol.class,
  ProtobufRpcEngine.class);
  ...{code}
 it should be 
{code:java}
private ScmBlockLocationProtocolPB createSCMProxy(
InetSocketAddress scmAddress) throws IOException {
  ...
  RPC.setProtocolEngine(hadoopConf, ScmBlockLocationProtocolPB.class,
  ProtobufRpcEngine.class);
  ...{code}
 

  was:
in SCMBlockLocationFailoverProxyProvider, it should be 
{code:java}
private ScmBlockLocationProtocolPB createSCMProxy(
InetSocketAddress scmAddress) throws IOException {
  ...
  RPC.setProtocolEngine(hadoopConf, ScmBlockLocationProtocolPB.class,
  ProtobufRpcEngine.class);
  ...{code}
 


> SCMBlockLocationFailoverProxyProvider should use 
> ScmBlockLocationProtocolPB.class in RPC.setProtocolEngine
> --
>
> Key: HDDS-4365
> URL: https://issues.apache.org/jira/browse/HDDS-4365
> Project: Hadoop Distributed Data Store
>  Issue Type: Sub-task
>  Components: SCM
>Reporter: Glen Geng
>Priority: Minor
>
> in SCMBlockLocationFailoverProxyProvider, currently it is
> {code:java}
> private ScmBlockLocationProtocolPB createSCMProxy(
> InetSocketAddress scmAddress) throws IOException {
>   ...
>   RPC.setProtocolEngine(hadoopConf, ScmBlockLocationProtocol.class,
>   ProtobufRpcEngine.class);
>   ...{code}
>  it should be 
> {code:java}
> private ScmBlockLocationProtocolPB createSCMProxy(
> InetSocketAddress scmAddress) throws IOException {
>   ...
>   RPC.setProtocolEngine(hadoopConf, ScmBlockLocationProtocolPB.class,
>   ProtobufRpcEngine.class);
>   ...{code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-4365) SCMBlockLocationFailoverProxyProvider should use ScmBlockLocationProtocolPB.class in RPC.setProtocolEngine

2020-10-21 Thread Glen Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-4365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Glen Geng updated HDDS-4365:

Description: 
in SCMBlockLocationFailoverProxyProvider, it should be 
{code:java}
private ScmBlockLocationProtocolPB createSCMProxy(
InetSocketAddress scmAddress) throws IOException {
  ...
  RPC.setProtocolEngine(hadoopConf, ScmBlockLocationProtocolPB.class,
  ProtobufRpcEngine.class);
  ...{code}
 

  was:
SCM ServiceManager is going to control all the SCM background services so that 
they serve only while SCM is the leader.

ServiceManager would also bootstrap all the background services and protocol 
servers.

It also needs to perform validation steps when the SCM comes up as the leader.


> SCMBlockLocationFailoverProxyProvider should use 
> ScmBlockLocationProtocolPB.class in RPC.setProtocolEngine
> --
>
> Key: HDDS-4365
> URL: https://issues.apache.org/jira/browse/HDDS-4365
> Project: Hadoop Distributed Data Store
>  Issue Type: Sub-task
>  Components: SCM
>Reporter: Glen Geng
>Priority: Minor
>
> in SCMBlockLocationFailoverProxyProvider, it should be 
> {code:java}
> private ScmBlockLocationProtocolPB createSCMProxy(
> InetSocketAddress scmAddress) throws IOException {
>   ...
>   RPC.setProtocolEngine(hadoopConf, ScmBlockLocationProtocolPB.class,
>   ProtobufRpcEngine.class);
>   ...{code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-4365) SCMBlockLocationFailoverProxyProvider should use ScmBlockLocationProtocolPB.class in RPC.setProtocolEngine

2020-10-21 Thread Glen Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-4365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Glen Geng updated HDDS-4365:

Priority: Minor  (was: Major)

> SCMBlockLocationFailoverProxyProvider should use 
> ScmBlockLocationProtocolPB.class in RPC.setProtocolEngine
> --
>
> Key: HDDS-4365
> URL: https://issues.apache.org/jira/browse/HDDS-4365
> Project: Hadoop Distributed Data Store
>  Issue Type: Sub-task
>  Components: SCM
>Reporter: Glen Geng
>Priority: Minor
>
> SCM ServiceManager is going to control all the SCM background services so that 
> they serve only while SCM is the leader.
> ServiceManager would also bootstrap all the background services and protocol 
> servers.
> It also needs to perform validation steps when the SCM comes up as the leader.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org



[jira] [Created] (HDDS-4365) SCMBlockLocationFailoverProxyProvider should use ScmBlockLocationProtocolPB.class in RPC.setProtocolEngine

2020-10-21 Thread Glen Geng (Jira)
Glen Geng created HDDS-4365:
---

 Summary: SCMBlockLocationFailoverProxyProvider should use 
ScmBlockLocationProtocolPB.class in RPC.setProtocolEngine
 Key: HDDS-4365
 URL: https://issues.apache.org/jira/browse/HDDS-4365
 Project: Hadoop Distributed Data Store
  Issue Type: Sub-task
  Components: SCM
Reporter: Glen Geng


SCM ServiceManager is going to control all the SCM background services so that 
they serve only while SCM is the leader.

ServiceManager would also bootstrap all the background services and protocol 
servers.

It also needs to perform validation steps when the SCM comes up as the leader.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org



[jira] [Commented] (HDDS-4351) DN crash while RatisApplyTransactionExecutor tries to putBlock to rocksDB

2020-10-15 Thread Glen Geng (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-4351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17215219#comment-17215219
 ] 

Glen Geng commented on HDDS-4351:
-

Hello [~erose] [~arp] [~bharat]

As requested by Ethan, I scheduled a long-run test on the latest master with 
HDDS-4327; the good news is that there was no DN crash during the whole test.

In effect, the try-with-resources on BatchOperation fixed the crash in RocksDB. I 
suggest closing this Jira and re-opening it if we see a similar crash again in the 
future.
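
For reference, here is a minimal sketch of the pattern (the class, method and key 
names are made up for illustration, and it assumes the DBStore/Table/BatchOperation 
batch APIs from org.apache.hadoop.hdds.utils.db): the try-with-resources closes the 
BatchOperation, and with it the underlying native RocksDB WriteBatch, on every 
path, so a failed commit can no longer leave a half-released batch behind for 
RocksDB to trip over later.
{code:java}
import java.io.IOException;

import org.apache.hadoop.hdds.utils.db.BatchOperation;
import org.apache.hadoop.hdds.utils.db.DBStore;
import org.apache.hadoop.hdds.utils.db.Table;

public final class PutBlockBatchSketch {

  private PutBlockBatchSketch() {
  }

  // Writes the block data in a single batch; the try-with-resources releases
  // the BatchOperation whether or not commitBatchOperation() throws.
  public static void putBlockBatched(DBStore store,
      Table<String, byte[]> blockDataTable,
      String blockKey, byte[] blockData) throws IOException {
    try (BatchOperation batch = store.initBatchOperation()) {
      blockDataTable.putWithBatch(batch, blockKey, blockData);
      store.commitBatchOperation(batch);
    }
  }
}
{code}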

> DN crash while RatisApplyTransactionExecutor tries to putBlock to rocksDB
> -
>
> Key: HDDS-4351
> URL: https://issues.apache.org/jira/browse/HDDS-4351
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Datanode
>Affects Versions: 1.1.0
>Reporter: Glen Geng
>Assignee: Ethan Rose
>Priority: Major
>
> In Tencent, we pick up the latest master monthly and deploy it to our 
> production environment.
> This time, we tested c956ce6 (HDDS-4262 [. Use ClientID and CallID from Rpc 
> Client to detect retry 
> re…|https://github.com/apache/hadoop-ozone/commit/c956ce6b7537a0286c01b15d496a7ffeba90]
>  ) and encountered frequent crashes in the datanode while applying putBlock.
>  
> *The setup* is 3 DNs, each engaged in 8 pipelines, plus 1 OM, 1 SCM and 1 Gateway.
> *The repro procedure* is simple: continually write 10GB files to s3g from 
> Python (the AWS boto3 library); after writing tens of files, a DN might crash 
> while applying putBlock operations.
> After running the test for 10 hours on the version that reverts HDDS-3869, no 
> DN crash occurred.
> We will schedule a long-run test on the latest master with HDDS-4327 to check 
> whether adding try-with-resources to BatchOperation fixes the crash.
>  
> *Example 1: segmentation fault while putBlock.*
> {code:java}
> Current thread (0x7eff34524000):  JavaThread 
> "RatisApplyTransactionExecutor 9" daemon [_thread_in_native, id=20401, 
> stack(0x7efef4a14000,0x7efef4b15000)]siginfo: si_signo: 11 (SIGSEGV), 
> si_code: 2 (SEGV_ACCERR), si_addr: 0x7eff37eb9000Registers:
> RAX=0x7efe8bbfb024, RBX=0x, RCX=0x, 
> RDX=0x007688e4
> RSP=0x7efef4b11e38, RBP=0x7efef4b11f60, RSI=0x7eff37eb8feb, 
> RDI=0x7efe8f892640
> R8 =0x7efe8bbfb024, R9 =0x0080, R10=0x0022, 
> R11=0x1000
> R12=0x7efef4b12100, R13=0x7eff340badc0, R14=0x7eff340bb7b0, 
> R15=0x0440
> RIP=0x7eff4fa04bae, EFLAGS=0x00010206, CSGSFS=0x0033, 
> ERR=0x0004
>   TRAPNO=0x000eStack: [0x7efef4a14000,0x7efef4b15000],  
> sp=0x7efef4b11e38,  free space=1015k
> Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native 
> code)
> C  [libc.so.6+0x151bae]  __memmove_ssse3_back+0x192e
> C  [librocksdbjni3701435679326554484.so+0x3b2263]  
> rocksdb::MemTableInserter::DeleteCF(unsigned int, rocksdb::Slice const&)+0x253
> C  [librocksdbjni3701435679326554484.so+0x3a889f]  
> rocksdb::WriteBatchInternal::Iterate(rocksdb::WriteBatch const*, 
> rocksdb::WriteBatch::Handler*, unsigned long, unsigned long)+0x75f
> C  [librocksdbjni3701435679326554484.so+0x3a8d44]  
> rocksdb::WriteBatch::Iterate(rocksdb::WriteBatch::Handler*) const+0x24
> C  [librocksdbjni3701435679326554484.so+0x3ac3f9]  
> rocksdb::WriteBatchInternal::InsertInto(rocksdb::WriteThread::WriteGroup&, 
> unsigned long, rocksdb::ColumnFamilyMemTables*, rocksdb::FlushScheduler*, 
> rocksdb::TrimHistoryScheduler*, bool, unsigned long, rocksdb::DB*, bool, 
> bool, bool)+0x249
> C  [librocksdbjni3701435679326554484.so+0x2f6308]  
> rocksdb::DBImpl::WriteImpl(rocksdb::WriteOptions const&, 
> rocksdb::WriteBatch*, rocksdb::WriteCallback*, unsigned long*, unsigned long, 
> bool, unsigned long*, unsigned long, rocksdb::PreReleaseCallback*)+0x1e98
> C  [librocksdbjni3701435679326554484.so+0x2f70c1]  
> rocksdb::DBImpl::Write(rocksdb::WriteOptions const&, 
> rocksdb::WriteBatch*)+0x21
> C  [librocksdbjni3701435679326554484.so+0x1dd0cc]  
> Java_org_rocksdb_RocksDB_write0+0xcc
> j  org.rocksdb.RocksDB.write0(JJJ)V+0
> J 8597 C1 
> org.apache.hadoop.ozone.container.keyvalue.impl.BlockManagerImpl.putBlock(Lorg/apache/hadoop/ozone/container/common/interfaces/Container;Lorg/apache/hadoop/ozone/container/common/helpers/BlockData;Z)J
>  (487 bytes) @ 0x7eff3a8dd84c [0x7eff3a8db8e0+0x1f6c]
> J 8700 C1 
> org.apache.hadoop.ozone.container.keyvalue.KeyValueHandler.handlePutBlock(Lorg/apache/hadoop/hdds/protocol/datanode/proto/ContainerProtos$ContainerCommandRequestProto;Lorg/apache/hadoop/ozone/container/keyvalue/KeyValueContainer;Lorg/apache/hadoop/ozone/container/common/transport/server/ratis/DispatcherContext;)Lorg

[jira] [Created] (HDDS-4351) DN crash while RatisApplyTransactionExecutor tries to putBlock to rocksDB

2020-10-14 Thread Glen Geng (Jira)
Glen Geng created HDDS-4351:
---

 Summary: DN crash while RatisApplyTransactionExecutor tries to 
putBlock to rocksDB
 Key: HDDS-4351
 URL: https://issues.apache.org/jira/browse/HDDS-4351
 Project: Hadoop Distributed Data Store
  Issue Type: Bug
  Components: Ozone Datanode
Affects Versions: 1.1.0
Reporter: Glen Geng
Assignee: Ethan Rose


In Tencent, we pick up the latest master monthly and deploy it to our 
production environment.

This time, we tested c956ce6 (HDDS-4262 [. Use ClientID and CallID from Rpc 
Client to detect retry 
re…|https://github.com/apache/hadoop-ozone/commit/c956ce6b7537a0286c01b15d496a7ffeba90]
 ) and encountered frequent crashes in the datanode while applying putBlock.

 

*The setup* is 3 DNs, each engaged in 8 pipelines, plus 1 OM, 1 SCM and 1 Gateway.

*The repro procedure* is simple: continually write 10GB files to s3g from 
Python (the AWS boto3 library); after writing tens of files, a DN might crash 
while applying putBlock operations.

After running the test for 10 hours on the version that reverts HDDS-3869, no 
DN crash occurred.

We will schedule a long-run test on the latest master with HDDS-4327 to check 
whether adding try-with-resources to BatchOperation fixes the crash.

 

*Example 1: segmentation fault while putBlock.*
{code:java}
Current thread (0x7eff34524000):  JavaThread "RatisApplyTransactionExecutor 
9" daemon [_thread_in_native, id=20401, 
stack(0x7efef4a14000,0x7efef4b15000)]siginfo: si_signo: 11 (SIGSEGV), 
si_code: 2 (SEGV_ACCERR), si_addr: 0x7eff37eb9000Registers:
RAX=0x7efe8bbfb024, RBX=0x, RCX=0x, 
RDX=0x007688e4
RSP=0x7efef4b11e38, RBP=0x7efef4b11f60, RSI=0x7eff37eb8feb, 
RDI=0x7efe8f892640
R8 =0x7efe8bbfb024, R9 =0x0080, R10=0x0022, 
R11=0x1000
R12=0x7efef4b12100, R13=0x7eff340badc0, R14=0x7eff340bb7b0, 
R15=0x0440
RIP=0x7eff4fa04bae, EFLAGS=0x00010206, CSGSFS=0x0033, 
ERR=0x0004
  TRAPNO=0x000eStack: [0x7efef4a14000,0x7efef4b15000],  
sp=0x7efef4b11e38,  free space=1015k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
C  [libc.so.6+0x151bae]  __memmove_ssse3_back+0x192e
C  [librocksdbjni3701435679326554484.so+0x3b2263]  
rocksdb::MemTableInserter::DeleteCF(unsigned int, rocksdb::Slice const&)+0x253
C  [librocksdbjni3701435679326554484.so+0x3a889f]  
rocksdb::WriteBatchInternal::Iterate(rocksdb::WriteBatch const*, 
rocksdb::WriteBatch::Handler*, unsigned long, unsigned long)+0x75f
C  [librocksdbjni3701435679326554484.so+0x3a8d44]  
rocksdb::WriteBatch::Iterate(rocksdb::WriteBatch::Handler*) const+0x24
C  [librocksdbjni3701435679326554484.so+0x3ac3f9]  
rocksdb::WriteBatchInternal::InsertInto(rocksdb::WriteThread::WriteGroup&, 
unsigned long, rocksdb::ColumnFamilyMemTables*, rocksdb::FlushScheduler*, 
rocksdb::TrimHistoryScheduler*, bool, unsigned long, rocksdb::DB*, bool, bool, 
bool)+0x249
C  [librocksdbjni3701435679326554484.so+0x2f6308]  
rocksdb::DBImpl::WriteImpl(rocksdb::WriteOptions const&, rocksdb::WriteBatch*, 
rocksdb::WriteCallback*, unsigned long*, unsigned long, bool, unsigned long*, 
unsigned long, rocksdb::PreReleaseCallback*)+0x1e98
C  [librocksdbjni3701435679326554484.so+0x2f70c1]  
rocksdb::DBImpl::Write(rocksdb::WriteOptions const&, rocksdb::WriteBatch*)+0x21
C  [librocksdbjni3701435679326554484.so+0x1dd0cc]  
Java_org_rocksdb_RocksDB_write0+0xcc
j  org.rocksdb.RocksDB.write0(JJJ)V+0
J 8597 C1 
org.apache.hadoop.ozone.container.keyvalue.impl.BlockManagerImpl.putBlock(Lorg/apache/hadoop/ozone/container/common/interfaces/Container;Lorg/apache/hadoop/ozone/container/common/helpers/BlockData;Z)J
 (487 bytes) @ 0x7eff3a8dd84c [0x7eff3a8db8e0+0x1f6c]
J 8700 C1 
org.apache.hadoop.ozone.container.keyvalue.KeyValueHandler.handlePutBlock(Lorg/apache/hadoop/hdds/protocol/datanode/proto/ContainerProtos$ContainerCommandRequestProto;Lorg/apache/hadoop/ozone/container/keyvalue/KeyValueContainer;Lorg/apache/hadoop/ozone/container/common/transport/server/ratis/DispatcherContext;)Lorg/apache/hadoop/hdds/protocol/datanode/proto/ContainerProtos$ContainerCommandResponseProto;
 (211 bytes) @ 0x7eff3a927ebc [0x7eff3a926220+0x1c9c]
J 6685 C1 
org.apache.hadoop.ozone.container.keyvalue.KeyValueHandler.dispatchRequest(Lorg/apache/hadoop/ozone/container/keyvalue/KeyValueHandler;Lorg/apache/hadoop/hdds/protocol/datanode/proto/ContainerProtos$ContainerCommandRequestProto;Lorg/apache/hadoop/ozone/container/keyvalue/KeyValueContainer;Lorg/apache/hadoop/ozone/container/common/transport/server/ratis/DispatcherContext;)Lorg/apache/hadoop/hdds/protocol/datanode/proto/ContainerProtos$ContainerCommandResponseProto;
 (228 bytes) @ 0x7eff3a2ba2c4 [0x7eff3a2b7640+0x2c84]
J 6684 C1 
org.apache.hadoop.ozone.container.

[jira] [Updated] (HDDS-4343) ReplicationManager.handleOverReplicatedContainer() does not handle unhealthyReplicas properly.

2020-10-14 Thread Glen Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-4343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Glen Geng updated HDDS-4343:

Priority: Minor  (was: Blocker)

> ReplicationManager.handleOverReplicatedContainer() does not handle 
> unhealthyReplicas properly.
> --
>
> Key: HDDS-4343
> URL: https://issues.apache.org/jira/browse/HDDS-4343
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: SCM
>Reporter: Glen Geng
>Assignee: Glen Geng
>Priority: Minor
>  Labels: pull-request-available
>
> {code:java}
>   // If there are unhealthy replicas, then we should remove them even if 
> it
>   // makes the container violate the placement policy, as excess unhealthy
>   // containers are not really useful. It will be corrected later as a
>   // mis-replicated container will be seen as under-replicated.
>   for (ContainerReplica r : unhealthyReplicas) {
> if (excess > 0) {
>   sendDeleteCommand(container, r.getDatanodeDetails(), true);
>   excess -= 1;
> }
> break;
>   }
>   // After removing all unhealthy replicas, if the container is still over
>   // replicated then we need to check if it is already mis-replicated.
>   // If it is, we do no harm by removing excess replicas. However, if it 
> is
>   // not mis-replicated, then we can only remove replicas if they don't
>   // make the container become mis-replicated.
> {code}
> From the comment, the intent is to remove unhealthy replicas until excess 
> reaches 0. It should be:
> {code:java}
>   for (ContainerReplica r : unhealthyReplicas) {
> if (excess > 0) {
>   sendDeleteCommand(container, r.getDatanodeDetails(), true);
>   excess -= 1;
> } else {
>   break;
> }
>   }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-4343) ReplicationManager.handleOverReplicatedContainer() does not handle unhealthyReplicas properly.

2020-10-14 Thread Glen Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-4343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Glen Geng updated HDDS-4343:

Summary: ReplicationManager.handleOverReplicatedContainer() does not handle 
unhealthyReplicas properly.  (was: 
ReplicationManager.handleOverReplicatedContainer does not handle )

> ReplicationManager.handleOverReplicatedContainer() does not handle 
> unhealthyReplicas properly.
> --
>
> Key: HDDS-4343
> URL: https://issues.apache.org/jira/browse/HDDS-4343
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: SCM
>Reporter: Glen Geng
>Assignee: Glen Geng
>Priority: Blocker
>
> {code:java}
>   // If there are unhealthy replicas, then we should remove them even if 
> it
>   // makes the container violate the placement policy, as excess unhealthy
>   // containers are not really useful. It will be corrected later as a
>   // mis-replicated container will be seen as under-replicated.
>   for (ContainerReplica r : unhealthyReplicas) {
> if (excess > 0) {
>   sendDeleteCommand(container, r.getDatanodeDetails(), true);
>   excess -= 1;
> }
> break;
>   }
>   // After removing all unhealthy replicas, if the container is still over
>   // replicated then we need to check if it is already mis-replicated.
>   // If it is, we do no harm by removing excess replicas. However, if it 
> is
>   // not mis-replicated, then we can only remove replicas if they don't
>   // make the container become mis-replicated.
> {code}
> From the comment, the intent is to remove unhealthy replicas until excess 
> reaches 0. It should be:
> {code:java}
>   for (ContainerReplica r : unhealthyReplicas) {
> if (excess > 0) {
>   sendDeleteCommand(container, r.getDatanodeDetails(), true);
>   excess -= 1;
> } else {
>   break;
> }
>   }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-4343) CLONE - OM client request fails with "failed to commit as key is not found in OpenKey table"

2020-10-14 Thread Glen Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-4343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Glen Geng updated HDDS-4343:

Description: 
{code:java}
  // If there are unhealthy replicas, then we should remove them even if it
  // makes the container violate the placement policy, as excess unhealthy
  // containers are not really useful. It will be corrected later as a
  // mis-replicated container will be seen as under-replicated.
  for (ContainerReplica r : unhealthyReplicas) {
if (excess > 0) {
  sendDeleteCommand(container, r.getDatanodeDetails(), true);
  excess -= 1;
}
break;
  }
  // After removing all unhealthy replicas, if the container is still over
  // replicated then we need to check if it is already mis-replicated.
  // If it is, we do no harm by removing excess replicas. However, if it is
  // not mis-replicated, then we can only remove replicas if they don't
  // make the container become mis-replicated.
{code}

From the comment, the intent is to remove unhealthy replicas until excess 
reaches 0. It should be:
{code:java}
  for (ContainerReplica r : unhealthyReplicas) {
if (excess > 0) {
  sendDeleteCommand(container, r.getDatanodeDetails(), true);
  excess -= 1;
} else {
  break;
}
  }
{code}

  was:
{code:java}
  // If there are unhealthy replicas, then we should remove them even if it
  // makes the container violate the placement policy, as excess unhealthy
  // containers are not really useful. It will be corrected later as a
  // mis-replicated container will be seen as under-replicated.
  for (ContainerReplica r : unhealthyReplicas) {
if (excess > 0) {
  sendDeleteCommand(container, r.getDatanodeDetails(), true);
  excess -= 1;
}
break;
  }
  // After removing all unhealthy replicas, if the container is still over
  // replicated then we need to check if it is already mis-replicated.
  // If it is, we do no harm by removing excess replicas. However, if it is
  // not mis-replicated, then we can only remove replicas if they don't
  // make the container become mis-replicated.
It seems that the comment wants to remove all unhealthy replicas until excess reaches 0? I guess it should be:
  for (ContainerReplica r : unhealthyReplicas) {
if (excess > 0) {
  sendDeleteCommand(container, r.getDatanodeDetails(), true);
  excess -= 1;
} else {
  break;
}
  }
{code}


> CLONE - OM client request fails with "failed to commit as key is not found in 
> OpenKey table"
> 
>
> Key: HDDS-4343
> URL: https://issues.apache.org/jira/browse/HDDS-4343
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: SCM
>Reporter: Glen Geng
>Assignee: Glen Geng
>Priority: Blocker
>
> {code:java}
>   // If there are unhealthy replicas, then we should remove them even if 
> it
>   // makes the container violate the placement policy, as excess unhealthy
>   // containers are not really useful. It will be corrected later as a
>   // mis-replicated container will be seen as under-replicated.
>   for (ContainerReplica r : unhealthyReplicas) {
> if (excess > 0) {
>   sendDeleteCommand(container, r.getDatanodeDetails(), true);
>   excess -= 1;
> }
> break;
>   }
>   // After removing all unhealthy replicas, if the container is still over
>   // replicated then we need to check if it is already mis-replicated.
>   // If it is, we do no harm by removing excess replicas. However, if it 
> is
>   // not mis-replicated, then we can only remove replicas if they don't
>   // make the container become mis-replicated.
> {code}
> From the comment, the intent is to remove all unhealthy replicas until excess 
> reaches 0? If so, the loop should be:
> {code:java}
>   for (ContainerReplica r : unhealthyReplicas) {
> if (excess > 0) {
>   sendDeleteCommand(container, r.getDatanodeDetails(), true);
>   excess -= 1;
> } else {
>   break;
> }
>   }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-4343) ReplicationManager.handleOverReplicatedContainer does not handle

2020-10-14 Thread Glen Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-4343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Glen Geng updated HDDS-4343:

Summary: ReplicationManager.handleOverReplicatedContainer does not handle   
(was: CLONE - OM client request fails with "failed to commit as key is not 
found in OpenKey table")

> ReplicationManager.handleOverReplicatedContainer does not handle 
> -
>
> Key: HDDS-4343
> URL: https://issues.apache.org/jira/browse/HDDS-4343
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: SCM
>Reporter: Glen Geng
>Assignee: Glen Geng
>Priority: Blocker
>
> {code:java}
>   // If there are unhealthy replicas, then we should remove them even if 
> it
>   // makes the container violate the placement policy, as excess unhealthy
>   // containers are not really useful. It will be corrected later as a
>   // mis-replicated container will be seen as under-replicated.
>   for (ContainerReplica r : unhealthyReplicas) {
> if (excess > 0) {
>   sendDeleteCommand(container, r.getDatanodeDetails(), true);
>   excess -= 1;
> }
> break;
>   }
>   // After removing all unhealthy replicas, if the container is still over
>   // replicated then we need to check if it is already mis-replicated.
>   // If it is, we do no harm by removing excess replicas. However, if it 
> is
>   // not mis-replicated, then we can only remove replicas if they don't
>   // make the container become mis-replicated.
> {code}
> From the comment, the intent is to remove all unhealthy replicas until excess 
> reaches 0? If so, the loop should be:
> {code:java}
>   for (ContainerReplica r : unhealthyReplicas) {
> if (excess > 0) {
>   sendDeleteCommand(container, r.getDatanodeDetails(), true);
>   excess -= 1;
> } else {
>   break;
> }
>   }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-4343) CLONE - OM client request fails with "failed to commit as key is not found in OpenKey table"

2020-10-14 Thread Glen Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-4343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Glen Geng updated HDDS-4343:

Description: 
{code:java}
  // If there are unhealthy replicas, then we should remove them even if it
  // makes the container violate the placement policy, as excess unhealthy
  // containers are not really useful. It will be corrected later as a
  // mis-replicated container will be seen as under-replicated.
  for (ContainerReplica r : unhealthyReplicas) {
if (excess > 0) {
  sendDeleteCommand(container, r.getDatanodeDetails(), true);
  excess -= 1;
}
break;
  }
  // After removing all unhealthy replicas, if the container is still over
  // replicated then we need to check if it is already mis-replicated.
  // If it is, we do no harm by removing excess replicas. However, if it is
  // not mis-replicated, then we can only remove replicas if they don't
  // make the container become mis-replicated.
It seems that the comment wants to remove all unhealthy replicas until excess reaches 0? I guess it should be:
  for (ContainerReplica r : unhealthyReplicas) {
if (excess > 0) {
  sendDeleteCommand(container, r.getDatanodeDetails(), true);
  excess -= 1;
} else {
  break;
}
  }
{code}

  was:
{code:java}

20/08/28 03:21:53 WARN retry.RetryInvocationHandler: A failover has occurred 
since the start of call #28868 $Proxy17.submitRequest over 
nodeId=om3,nodeAddress=vc1330.halxg.cloudera.com:9862
20/08/28 03:21:53 WARN retry.RetryInvocationHandler: A failover has occurred 
since the start of call #28870 $Proxy17.submitRequest over 
nodeId=om1,nodeAddress=vc1325.halxg.cloudera.com:9862
20/08/28 03:21:53 WARN retry.RetryInvocationHandler: A failover has occurred 
since the start of call #28869 $Proxy17.submitRequest over 
nodeId=om1,nodeAddress=vc1325.halxg.cloudera.com:9862
20/08/28 03:21:54 WARN retry.RetryInvocationHandler: A failover has occurred 
since the start of call #28871 $Proxy17.submitRequest over 
nodeId=om1,nodeAddress=vc1325.halxg.cloudera.com:9862
20/08/28 03:21:54 WARN retry.RetryInvocationHandler: A failover has occurred 
since the start of call #28872 $Proxy17.submitRequest over 
nodeId=om1,nodeAddress=vc1325.halxg.cloudera.com:9862
20/08/28 03:21:54 WARN retry.RetryInvocationHandler: A failover has occurred 
since the start of call #28866 $Proxy17.submitRequest over 
nodeId=om1,nodeAddress=vc1325.halxg.cloudera.com:9862
20/08/28 03:21:54 WARN retry.RetryInvocationHandler: A failover has occurred 
since the start of call #28867 $Proxy17.submitRequest over 
nodeId=om1,nodeAddress=vc1325.halxg.cloudera.com:9862
20/08/28 03:21:54 WARN retry.RetryInvocationHandler: A failover has occurred 
since the start of call #28874 $Proxy17.submitRequest over 
nodeId=om1,nodeAddress=vc1325.halxg.cloudera.com:9862
20/08/28 03:21:54 WARN retry.RetryInvocationHandler: A failover has occurred 
since the start of call #28875 $Proxy17.submitRequest over 
nodeId=om1,nodeAddress=vc1325.halxg.cloudera.com:9862
20/08/28 03:21:54 ERROR freon.BaseFreonGenerator: Error on executing task 14424
KEY_NOT_FOUND org.apache.hadoop.ozone.om.exceptions.OMException: Failed to 
commit key, as /vol1/bucket1/akjkdz4hoj/14424/104766512182520809entry is not 
found in the OpenKey table
at 
org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.handleError(OzoneManagerProtocolClientSideTranslatorPB.java:593)
at 
org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.commitKey(OzoneManagerProtocolClientSideTranslatorPB.java:650)
at 
org.apache.hadoop.ozone.client.io.BlockOutputStreamEntryPool.commitKey(BlockOutputStreamEntryPool.java:306)
at 
org.apache.hadoop.ozone.client.io.KeyOutputStream.close(KeyOutputStream.java:514)
at 
org.apache.hadoop.ozone.client.io.OzoneOutputStream.close(OzoneOutputStream.java:60)
at 
org.apache.hadoop.ozone.freon.OzoneClientKeyGenerator.lambda$createKey$0(OzoneClientKeyGenerator.java:118)
at com.codahale.metrics.Timer.time(Timer.java:101)
at 
org.apache.hadoop.ozone.freon.OzoneClientKeyGenerator.createKey(OzoneClientKeyGenerator.java:113)
at 
org.apache.hadoop.ozone.freon.BaseFreonGenerator.tryNextTask(BaseFreonGenerator.java:178)
at 
org.apache.hadoop.ozone.freon.BaseFreonGenerator.taskLoop(BaseFreonGenerator.java:167)
at 
org.apache.hadoop.ozone.freon.BaseFreonGenerator.lambda$startTaskRunners$0(BaseFreonGenerator.java:150)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
{code}


> CLONE - OM client request fails with "failed to commit as key is not found in 
> OpenKey tab

[jira] [Assigned] (HDDS-4343) CLONE - OM client request fails with "failed to commit as key is not found in OpenKey table"

2020-10-14 Thread Glen Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-4343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Glen Geng reassigned HDDS-4343:
---

Assignee: Glen Geng  (was: Bharat Viswanadham)

> CLONE - OM client request fails with "failed to commit as key is not found in 
> OpenKey table"
> 
>
> Key: HDDS-4343
> URL: https://issues.apache.org/jira/browse/HDDS-4343
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: OM HA
>Reporter: Glen Geng
>Assignee: Glen Geng
>Priority: Blocker
>
> OM client request fails with "failed to commit as key is not found in OpenKey 
> table"
> {code:java}
> 20/08/28 03:21:53 WARN retry.RetryInvocationHandler: A failover has occurred 
> since the start of call #28868 $Proxy17.submitRequest over 
> nodeId=om3,nodeAddress=vc1330.halxg.cloudera.com:9862
> 20/08/28 03:21:53 WARN retry.RetryInvocationHandler: A failover has occurred 
> since the start of call #28870 $Proxy17.submitRequest over 
> nodeId=om1,nodeAddress=vc1325.halxg.cloudera.com:9862
> 20/08/28 03:21:53 WARN retry.RetryInvocationHandler: A failover has occurred 
> since the start of call #28869 $Proxy17.submitRequest over 
> nodeId=om1,nodeAddress=vc1325.halxg.cloudera.com:9862
> 20/08/28 03:21:54 WARN retry.RetryInvocationHandler: A failover has occurred 
> since the start of call #28871 $Proxy17.submitRequest over 
> nodeId=om1,nodeAddress=vc1325.halxg.cloudera.com:9862
> 20/08/28 03:21:54 WARN retry.RetryInvocationHandler: A failover has occurred 
> since the start of call #28872 $Proxy17.submitRequest over 
> nodeId=om1,nodeAddress=vc1325.halxg.cloudera.com:9862
> 20/08/28 03:21:54 WARN retry.RetryInvocationHandler: A failover has occurred 
> since the start of call #28866 $Proxy17.submitRequest over 
> nodeId=om1,nodeAddress=vc1325.halxg.cloudera.com:9862
> 20/08/28 03:21:54 WARN retry.RetryInvocationHandler: A failover has occurred 
> since the start of call #28867 $Proxy17.submitRequest over 
> nodeId=om1,nodeAddress=vc1325.halxg.cloudera.com:9862
> 20/08/28 03:21:54 WARN retry.RetryInvocationHandler: A failover has occurred 
> since the start of call #28874 $Proxy17.submitRequest over 
> nodeId=om1,nodeAddress=vc1325.halxg.cloudera.com:9862
> 20/08/28 03:21:54 WARN retry.RetryInvocationHandler: A failover has occurred 
> since the start of call #28875 $Proxy17.submitRequest over 
> nodeId=om1,nodeAddress=vc1325.halxg.cloudera.com:9862
> 20/08/28 03:21:54 ERROR freon.BaseFreonGenerator: Error on executing task 
> 14424
> KEY_NOT_FOUND org.apache.hadoop.ozone.om.exceptions.OMException: Failed to 
> commit key, as /vol1/bucket1/akjkdz4hoj/14424/104766512182520809entry is not 
> found in the OpenKey table
> at 
> org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.handleError(OzoneManagerProtocolClientSideTranslatorPB.java:593)
> at 
> org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.commitKey(OzoneManagerProtocolClientSideTranslatorPB.java:650)
> at 
> org.apache.hadoop.ozone.client.io.BlockOutputStreamEntryPool.commitKey(BlockOutputStreamEntryPool.java:306)
> at 
> org.apache.hadoop.ozone.client.io.KeyOutputStream.close(KeyOutputStream.java:514)
> at 
> org.apache.hadoop.ozone.client.io.OzoneOutputStream.close(OzoneOutputStream.java:60)
> at 
> org.apache.hadoop.ozone.freon.OzoneClientKeyGenerator.lambda$createKey$0(OzoneClientKeyGenerator.java:118)
> at com.codahale.metrics.Timer.time(Timer.java:101)
> at 
> org.apache.hadoop.ozone.freon.OzoneClientKeyGenerator.createKey(OzoneClientKeyGenerator.java:113)
> at 
> org.apache.hadoop.ozone.freon.BaseFreonGenerator.tryNextTask(BaseFreonGenerator.java:178)
> at 
> org.apache.hadoop.ozone.freon.BaseFreonGenerator.taskLoop(BaseFreonGenerator.java:167)
> at 
> org.apache.hadoop.ozone.freon.BaseFreonGenerator.lambda$startTaskRunners$0(BaseFreonGenerator.java:150)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-4343) CLONE - OM client request fails with "failed to commit as key is not found in OpenKey table"

2020-10-14 Thread Glen Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-4343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Glen Geng updated HDDS-4343:

Component/s: (was: OM HA)
 SCM

> CLONE - OM client request fails with "failed to commit as key is not found in 
> OpenKey table"
> 
>
> Key: HDDS-4343
> URL: https://issues.apache.org/jira/browse/HDDS-4343
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: SCM
>Reporter: Glen Geng
>Assignee: Glen Geng
>Priority: Blocker
>
> OM client request fails with "failed to commit as key is not found in OpenKey 
> table"
> {code:java}
> 20/08/28 03:21:53 WARN retry.RetryInvocationHandler: A failover has occurred 
> since the start of call #28868 $Proxy17.submitRequest over 
> nodeId=om3,nodeAddress=vc1330.halxg.cloudera.com:9862
> 20/08/28 03:21:53 WARN retry.RetryInvocationHandler: A failover has occurred 
> since the start of call #28870 $Proxy17.submitRequest over 
> nodeId=om1,nodeAddress=vc1325.halxg.cloudera.com:9862
> 20/08/28 03:21:53 WARN retry.RetryInvocationHandler: A failover has occurred 
> since the start of call #28869 $Proxy17.submitRequest over 
> nodeId=om1,nodeAddress=vc1325.halxg.cloudera.com:9862
> 20/08/28 03:21:54 WARN retry.RetryInvocationHandler: A failover has occurred 
> since the start of call #28871 $Proxy17.submitRequest over 
> nodeId=om1,nodeAddress=vc1325.halxg.cloudera.com:9862
> 20/08/28 03:21:54 WARN retry.RetryInvocationHandler: A failover has occurred 
> since the start of call #28872 $Proxy17.submitRequest over 
> nodeId=om1,nodeAddress=vc1325.halxg.cloudera.com:9862
> 20/08/28 03:21:54 WARN retry.RetryInvocationHandler: A failover has occurred 
> since the start of call #28866 $Proxy17.submitRequest over 
> nodeId=om1,nodeAddress=vc1325.halxg.cloudera.com:9862
> 20/08/28 03:21:54 WARN retry.RetryInvocationHandler: A failover has occurred 
> since the start of call #28867 $Proxy17.submitRequest over 
> nodeId=om1,nodeAddress=vc1325.halxg.cloudera.com:9862
> 20/08/28 03:21:54 WARN retry.RetryInvocationHandler: A failover has occurred 
> since the start of call #28874 $Proxy17.submitRequest over 
> nodeId=om1,nodeAddress=vc1325.halxg.cloudera.com:9862
> 20/08/28 03:21:54 WARN retry.RetryInvocationHandler: A failover has occurred 
> since the start of call #28875 $Proxy17.submitRequest over 
> nodeId=om1,nodeAddress=vc1325.halxg.cloudera.com:9862
> 20/08/28 03:21:54 ERROR freon.BaseFreonGenerator: Error on executing task 
> 14424
> KEY_NOT_FOUND org.apache.hadoop.ozone.om.exceptions.OMException: Failed to 
> commit key, as /vol1/bucket1/akjkdz4hoj/14424/104766512182520809entry is not 
> found in the OpenKey table
> at 
> org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.handleError(OzoneManagerProtocolClientSideTranslatorPB.java:593)
> at 
> org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.commitKey(OzoneManagerProtocolClientSideTranslatorPB.java:650)
> at 
> org.apache.hadoop.ozone.client.io.BlockOutputStreamEntryPool.commitKey(BlockOutputStreamEntryPool.java:306)
> at 
> org.apache.hadoop.ozone.client.io.KeyOutputStream.close(KeyOutputStream.java:514)
> at 
> org.apache.hadoop.ozone.client.io.OzoneOutputStream.close(OzoneOutputStream.java:60)
> at 
> org.apache.hadoop.ozone.freon.OzoneClientKeyGenerator.lambda$createKey$0(OzoneClientKeyGenerator.java:118)
> at com.codahale.metrics.Timer.time(Timer.java:101)
> at 
> org.apache.hadoop.ozone.freon.OzoneClientKeyGenerator.createKey(OzoneClientKeyGenerator.java:113)
> at 
> org.apache.hadoop.ozone.freon.BaseFreonGenerator.tryNextTask(BaseFreonGenerator.java:178)
> at 
> org.apache.hadoop.ozone.freon.BaseFreonGenerator.taskLoop(BaseFreonGenerator.java:167)
> at 
> org.apache.hadoop.ozone.freon.BaseFreonGenerator.lambda$startTaskRunners$0(BaseFreonGenerator.java:150)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-4343) CLONE - OM client request fails with "failed to commit as key is not found in OpenKey table"

2020-10-14 Thread Glen Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-4343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Glen Geng updated HDDS-4343:

Description: 
{code:java}

20/08/28 03:21:53 WARN retry.RetryInvocationHandler: A failover has occurred 
since the start of call #28868 $Proxy17.submitRequest over 
nodeId=om3,nodeAddress=vc1330.halxg.cloudera.com:9862
20/08/28 03:21:53 WARN retry.RetryInvocationHandler: A failover has occurred 
since the start of call #28870 $Proxy17.submitRequest over 
nodeId=om1,nodeAddress=vc1325.halxg.cloudera.com:9862
20/08/28 03:21:53 WARN retry.RetryInvocationHandler: A failover has occurred 
since the start of call #28869 $Proxy17.submitRequest over 
nodeId=om1,nodeAddress=vc1325.halxg.cloudera.com:9862
20/08/28 03:21:54 WARN retry.RetryInvocationHandler: A failover has occurred 
since the start of call #28871 $Proxy17.submitRequest over 
nodeId=om1,nodeAddress=vc1325.halxg.cloudera.com:9862
20/08/28 03:21:54 WARN retry.RetryInvocationHandler: A failover has occurred 
since the start of call #28872 $Proxy17.submitRequest over 
nodeId=om1,nodeAddress=vc1325.halxg.cloudera.com:9862
20/08/28 03:21:54 WARN retry.RetryInvocationHandler: A failover has occurred 
since the start of call #28866 $Proxy17.submitRequest over 
nodeId=om1,nodeAddress=vc1325.halxg.cloudera.com:9862
20/08/28 03:21:54 WARN retry.RetryInvocationHandler: A failover has occurred 
since the start of call #28867 $Proxy17.submitRequest over 
nodeId=om1,nodeAddress=vc1325.halxg.cloudera.com:9862
20/08/28 03:21:54 WARN retry.RetryInvocationHandler: A failover has occurred 
since the start of call #28874 $Proxy17.submitRequest over 
nodeId=om1,nodeAddress=vc1325.halxg.cloudera.com:9862
20/08/28 03:21:54 WARN retry.RetryInvocationHandler: A failover has occurred 
since the start of call #28875 $Proxy17.submitRequest over 
nodeId=om1,nodeAddress=vc1325.halxg.cloudera.com:9862
20/08/28 03:21:54 ERROR freon.BaseFreonGenerator: Error on executing task 14424
KEY_NOT_FOUND org.apache.hadoop.ozone.om.exceptions.OMException: Failed to 
commit key, as /vol1/bucket1/akjkdz4hoj/14424/104766512182520809entry is not 
found in the OpenKey table
at 
org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.handleError(OzoneManagerProtocolClientSideTranslatorPB.java:593)
at 
org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.commitKey(OzoneManagerProtocolClientSideTranslatorPB.java:650)
at 
org.apache.hadoop.ozone.client.io.BlockOutputStreamEntryPool.commitKey(BlockOutputStreamEntryPool.java:306)
at 
org.apache.hadoop.ozone.client.io.KeyOutputStream.close(KeyOutputStream.java:514)
at 
org.apache.hadoop.ozone.client.io.OzoneOutputStream.close(OzoneOutputStream.java:60)
at 
org.apache.hadoop.ozone.freon.OzoneClientKeyGenerator.lambda$createKey$0(OzoneClientKeyGenerator.java:118)
at com.codahale.metrics.Timer.time(Timer.java:101)
at 
org.apache.hadoop.ozone.freon.OzoneClientKeyGenerator.createKey(OzoneClientKeyGenerator.java:113)
at 
org.apache.hadoop.ozone.freon.BaseFreonGenerator.tryNextTask(BaseFreonGenerator.java:178)
at 
org.apache.hadoop.ozone.freon.BaseFreonGenerator.taskLoop(BaseFreonGenerator.java:167)
at 
org.apache.hadoop.ozone.freon.BaseFreonGenerator.lambda$startTaskRunners$0(BaseFreonGenerator.java:150)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
{code}

  was:
OM client request fails with "failed to commit as key is not found in OpenKey 
table"

{code:java}
20/08/28 03:21:53 WARN retry.RetryInvocationHandler: A failover has occurred 
since the start of call #28868 $Proxy17.submitRequest over 
nodeId=om3,nodeAddress=vc1330.halxg.cloudera.com:9862
20/08/28 03:21:53 WARN retry.RetryInvocationHandler: A failover has occurred 
since the start of call #28870 $Proxy17.submitRequest over 
nodeId=om1,nodeAddress=vc1325.halxg.cloudera.com:9862
20/08/28 03:21:53 WARN retry.RetryInvocationHandler: A failover has occurred 
since the start of call #28869 $Proxy17.submitRequest over 
nodeId=om1,nodeAddress=vc1325.halxg.cloudera.com:9862
20/08/28 03:21:54 WARN retry.RetryInvocationHandler: A failover has occurred 
since the start of call #28871 $Proxy17.submitRequest over 
nodeId=om1,nodeAddress=vc1325.halxg.cloudera.com:9862
20/08/28 03:21:54 WARN retry.RetryInvocationHandler: A failover has occurred 
since the start of call #28872 $Proxy17.submitRequest over 
nodeId=om1,nodeAddress=vc1325.halxg.cloudera.com:9862
20/08/28 03:21:54 WARN retry.RetryInvocationHandler: A failover has occurred 
since the start of call #28866 $Proxy17.submitRequest over 
nodeId=om1,nodeAddress=vc1325.halxg.cloudera.com:9862
20/08/28 03:21:54 WARN retry.RetryInvocationHandler: A failover 

[jira] [Created] (HDDS-4343) CLONE - OM client request fails with "failed to commit as key is not found in OpenKey table"

2020-10-14 Thread Glen Geng (Jira)
Glen Geng created HDDS-4343:
---

 Summary: CLONE - OM client request fails with "failed to commit as 
key is not found in OpenKey table"
 Key: HDDS-4343
 URL: https://issues.apache.org/jira/browse/HDDS-4343
 Project: Hadoop Distributed Data Store
  Issue Type: Bug
  Components: OM HA
Reporter: Glen Geng
Assignee: Bharat Viswanadham


OM client request fails with "failed to commit as key is not found in OpenKey 
table"

{code:java}
20/08/28 03:21:53 WARN retry.RetryInvocationHandler: A failover has occurred 
since the start of call #28868 $Proxy17.submitRequest over 
nodeId=om3,nodeAddress=vc1330.halxg.cloudera.com:9862
20/08/28 03:21:53 WARN retry.RetryInvocationHandler: A failover has occurred 
since the start of call #28870 $Proxy17.submitRequest over 
nodeId=om1,nodeAddress=vc1325.halxg.cloudera.com:9862
20/08/28 03:21:53 WARN retry.RetryInvocationHandler: A failover has occurred 
since the start of call #28869 $Proxy17.submitRequest over 
nodeId=om1,nodeAddress=vc1325.halxg.cloudera.com:9862
20/08/28 03:21:54 WARN retry.RetryInvocationHandler: A failover has occurred 
since the start of call #28871 $Proxy17.submitRequest over 
nodeId=om1,nodeAddress=vc1325.halxg.cloudera.com:9862
20/08/28 03:21:54 WARN retry.RetryInvocationHandler: A failover has occurred 
since the start of call #28872 $Proxy17.submitRequest over 
nodeId=om1,nodeAddress=vc1325.halxg.cloudera.com:9862
20/08/28 03:21:54 WARN retry.RetryInvocationHandler: A failover has occurred 
since the start of call #28866 $Proxy17.submitRequest over 
nodeId=om1,nodeAddress=vc1325.halxg.cloudera.com:9862
20/08/28 03:21:54 WARN retry.RetryInvocationHandler: A failover has occurred 
since the start of call #28867 $Proxy17.submitRequest over 
nodeId=om1,nodeAddress=vc1325.halxg.cloudera.com:9862
20/08/28 03:21:54 WARN retry.RetryInvocationHandler: A failover has occurred 
since the start of call #28874 $Proxy17.submitRequest over 
nodeId=om1,nodeAddress=vc1325.halxg.cloudera.com:9862
20/08/28 03:21:54 WARN retry.RetryInvocationHandler: A failover has occurred 
since the start of call #28875 $Proxy17.submitRequest over 
nodeId=om1,nodeAddress=vc1325.halxg.cloudera.com:9862
20/08/28 03:21:54 ERROR freon.BaseFreonGenerator: Error on executing task 14424
KEY_NOT_FOUND org.apache.hadoop.ozone.om.exceptions.OMException: Failed to 
commit key, as /vol1/bucket1/akjkdz4hoj/14424/104766512182520809entry is not 
found in the OpenKey table
at 
org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.handleError(OzoneManagerProtocolClientSideTranslatorPB.java:593)
at 
org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.commitKey(OzoneManagerProtocolClientSideTranslatorPB.java:650)
at 
org.apache.hadoop.ozone.client.io.BlockOutputStreamEntryPool.commitKey(BlockOutputStreamEntryPool.java:306)
at 
org.apache.hadoop.ozone.client.io.KeyOutputStream.close(KeyOutputStream.java:514)
at 
org.apache.hadoop.ozone.client.io.OzoneOutputStream.close(OzoneOutputStream.java:60)
at 
org.apache.hadoop.ozone.freon.OzoneClientKeyGenerator.lambda$createKey$0(OzoneClientKeyGenerator.java:118)
at com.codahale.metrics.Timer.time(Timer.java:101)
at 
org.apache.hadoop.ozone.freon.OzoneClientKeyGenerator.createKey(OzoneClientKeyGenerator.java:113)
at 
org.apache.hadoop.ozone.freon.BaseFreonGenerator.tryNextTask(BaseFreonGenerator.java:178)
at 
org.apache.hadoop.ozone.freon.BaseFreonGenerator.taskLoop(BaseFreonGenerator.java:167)
at 
org.apache.hadoop.ozone.freon.BaseFreonGenerator.lambda$startTaskRunners$0(BaseFreonGenerator.java:150)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
{code}




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-4230) SCMBlockLocationFailoverProxyProvider should handle LeaderNotReadyException

2020-09-10 Thread Glen Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-4230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Glen Geng updated HDDS-4230:

Description: 
It is an enhancement for HDDS-3188.

Like OMFailoverProxyProvider, SCMBlockLocationFailoverProxyProvider should also 
handle LeaderNotReadyException.

If an SCM client (like OzoneManager) has reached the leader SCM while that leader 
is stuck replaying raft log entries (e.g., the SCM restarts and becomes leader, 
and needs time to recover its state machine by replaying all raft log entries), 
the SCM client should not round-robin to the next SCM; it should wait and retry 
the same SCM later.

  was:
like OMFailoverProxyProvider,  SCMBlockLocationFailoverProxyProvider should 
also handle LeaderNotReadyException.

If scm client (like OzoneManager) has touched leader SCM, meanwhile leader SCM 
is stuck in replaying raft log entries, scm client should not round robin to 
next SCM, It should wait and retry the same SCM later.


> SCMBlockLocationFailoverProxyProvider should handle LeaderNotReadyException
> ---
>
> Key: HDDS-4230
> URL: https://issues.apache.org/jira/browse/HDDS-4230
> Project: Hadoop Distributed Data Store
>  Issue Type: Sub-task
>  Components: SCM
>Reporter: Glen Geng
>Assignee: Li Cheng
>Priority: Major
>  Labels: pull-request-available
>
> It is an enhancement for HDDS-3188.
> Like OMFailoverProxyProvider, SCMBlockLocationFailoverProxyProvider should 
> also handle LeaderNotReadyException.
> If an SCM client (like OzoneManager) has reached the leader SCM while that 
> leader is stuck replaying raft log entries (e.g., the SCM restarts and becomes 
> leader, and needs time to recover its state machine by replaying all raft log 
> entries), the SCM client should not round-robin to the next SCM; it should 
> wait and retry the same SCM later.
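As a hedged illustration only (with hypothetical ScmProxy and 
LeaderNotReadyException stand-ins, not the real 
SCMBlockLocationFailoverProxyProvider API), the intended behavior is roughly: 
on LeaderNotReadyException wait and retry the same proxy, and only fail over 
round-robin on other errors.
{code:java}
import java.util.List;
import java.util.concurrent.TimeUnit;

class LeaderNotReadyException extends Exception { }

interface ScmProxy {
  String allocateBlock() throws Exception;
}

public class RetrySameLeaderSketch {
  // On LeaderNotReadyException, wait and retry the SAME proxy; only other
  // failures trigger a round-robin failover to the next SCM.
  static String submitWithRetry(List<ScmProxy> proxies) throws Exception {
    int current = 0;
    while (true) {
      try {
        return proxies.get(current).allocateBlock();
      } catch (LeaderNotReadyException e) {
        // The leader is still replaying raft log entries: do not fail over,
        // just wait and retry the same SCM.
        TimeUnit.SECONDS.sleep(1);
      } catch (Exception e) {
        // Any other failure: round-robin to the next SCM.
        current = (current + 1) % proxies.size();
      }
    }
  }
}
{code}
The design choice is that a not-yet-ready leader is still the leader, so moving 
on to a follower would only waste a failover round.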



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDDS-4230) SCMBlockLocationFailoverProxyProvider should handle LeaderNotReadyException

2020-09-10 Thread Glen Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-4230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Glen Geng reassigned HDDS-4230:
---

Assignee: (was: Li Cheng)

> SCMBlockLocationFailoverProxyProvider should handle LeaderNotReadyException
> ---
>
> Key: HDDS-4230
> URL: https://issues.apache.org/jira/browse/HDDS-4230
> Project: Hadoop Distributed Data Store
>  Issue Type: Sub-task
>  Components: SCM
>Reporter: Glen Geng
>Priority: Major
>  Labels: pull-request-available
>
> It is an enhancement for HDDS-3188.
> Like OMFailoverProxyProvider, SCMBlockLocationFailoverProxyProvider should 
> also handle LeaderNotReadyException.
> If an SCM client (like OzoneManager) has reached the leader SCM while that 
> leader is stuck replaying raft log entries (e.g., the SCM restarts and becomes 
> leader, and needs time to recover its state machine by replaying all raft log 
> entries), the SCM client should not round-robin to the next SCM; it should 
> wait and retry the same SCM later.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-4230) SCMBlockLocationFailoverProxyProvider should handle LeaderNotReadyException

2020-09-10 Thread Glen Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-4230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Glen Geng updated HDDS-4230:

Description: 
like OMFailoverProxyProvider,  SCMBlockLocationFailoverProxyProvider should 
also handle LeaderNotReadyException.

If scm client (like OzoneManager) has touched leader SCM, meanwhile leader SCM 
is stuck in replaying raft log entries, scm client should not round robin to 
next SCM, It should wait and retry the same SCM later.

  was:like OMFailoverProxyProvider, 


> SCMBlockLocationFailoverProxyProvider should handle LeaderNotReadyException
> ---
>
> Key: HDDS-4230
> URL: https://issues.apache.org/jira/browse/HDDS-4230
> Project: Hadoop Distributed Data Store
>  Issue Type: Sub-task
>  Components: SCM
>Reporter: Glen Geng
>Assignee: Li Cheng
>Priority: Major
>  Labels: pull-request-available
>
> like OMFailoverProxyProvider,  SCMBlockLocationFailoverProxyProvider should 
> also handle LeaderNotReadyException.
> If scm client (like OzoneManager) has touched leader SCM, meanwhile leader 
> SCM is stuck in replaying raft log entries, scm client should not round robin 
> to next SCM, It should wait and retry the same SCM later.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-4230) SCMBlockLocationFailoverProxyProvider should handle LeaderNotReadyException

2020-09-10 Thread Glen Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-4230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Glen Geng updated HDDS-4230:

Description: like OMFailoverProxyProvider,   (was: Need to supports 2N + 1 
SCMs. Add configs and logic to support multiple SCMs.)

> SCMBlockLocationFailoverProxyProvider should handle LeaderNotReadyException
> ---
>
> Key: HDDS-4230
> URL: https://issues.apache.org/jira/browse/HDDS-4230
> Project: Hadoop Distributed Data Store
>  Issue Type: Sub-task
>  Components: SCM
>Reporter: Glen Geng
>Assignee: Li Cheng
>Priority: Major
>  Labels: pull-request-available
>
> like OMFailoverProxyProvider, 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-4230) SCMBlockLocationFailoverProxyProvider should handle LeaderNotReadyException

2020-09-10 Thread Glen Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-4230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Glen Geng updated HDDS-4230:

Summary: SCMBlockLocationFailoverProxyProvider should handle 
LeaderNotReadyException  (was: CLONE - Add failover proxy to SCM block protocol)

> SCMBlockLocationFailoverProxyProvider should handle LeaderNotReadyException
> ---
>
> Key: HDDS-4230
> URL: https://issues.apache.org/jira/browse/HDDS-4230
> Project: Hadoop Distributed Data Store
>  Issue Type: Sub-task
>  Components: SCM
>Reporter: Glen Geng
>Assignee: Li Cheng
>Priority: Major
>  Labels: pull-request-available
>
> Need to support 2N + 1 SCMs. Add configs and logic to support multiple SCMs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org



[jira] [Created] (HDDS-4230) CLONE - Add failover proxy to SCM block protocol

2020-09-10 Thread Glen Geng (Jira)
Glen Geng created HDDS-4230:
---

 Summary: CLONE - Add failover proxy to SCM block protocol
 Key: HDDS-4230
 URL: https://issues.apache.org/jira/browse/HDDS-4230
 Project: Hadoop Distributed Data Store
  Issue Type: Sub-task
  Components: SCM
Reporter: Glen Geng
Assignee: Li Cheng


Need to support 2N + 1 SCMs. Add configs and logic to support multiple SCMs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-4228) add field 'num' to ALLOCATE_BLOCK of scm audit log.

2020-09-09 Thread Glen Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-4228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Glen Geng updated HDDS-4228:

Labels: pull-requests-available  (was: pull-request-available)

> add field 'num' to ALLOCATE_BLOCK of scm audit log.
> ---
>
> Key: HDDS-4228
> URL: https://issues.apache.org/jira/browse/HDDS-4228
> Project: Hadoop Distributed Data Store
>  Issue Type: Improvement
>Reporter: Glen Geng
>Assignee: Glen Geng
>Priority: Minor
>  Labels: pull-requests-available
>
>  
> The scm audit log for ALLOCATE_BLOCK is as follows:
> {code:java}
> 2020-09-10 03:42:08,196 | INFO | SCMAudit | user=root | ip=172.16.90.221 | 
> op=ALLOCATE_BLOCK {owner=7da0b4c4-d053-4fa0-8648-44ff0b8ba1bf, 
> size=268435456, type=RATIS, factor=THREE} | ret=SUCCESS |{code}
>  
> One might be interested in the number of blocks allocated, so it would be 
> better to add a 'num' field to the ALLOCATE_BLOCK entry of the SCM audit log.
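For illustration only, a minimal sketch of building the audit parameters with 
the extra field (a plain Map stand-in is used here rather than the actual SCM 
audit API):
{code:java}
import java.util.LinkedHashMap;
import java.util.Map;

public class AllocateBlockAuditSketch {
  // Builds the parameter map for an ALLOCATE_BLOCK audit entry, carrying the
  // number of blocks requested in addition to the existing fields.
  static Map<String, String> buildAuditParams(String owner, long size,
      String type, String factor, int num) {
    Map<String, String> auditParams = new LinkedHashMap<>();
    auditParams.put("owner", owner);
    auditParams.put("size", String.valueOf(size));
    auditParams.put("type", type);
    auditParams.put("factor", factor);
    // New field: number of blocks allocated by this request.
    auditParams.put("num", String.valueOf(num));
    return auditParams;
  }
}
{code}
With such a change the audit line above would also carry something like num=4 
next to the existing owner/size/type/factor fields.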



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-4228) add field 'num' to ALLOCATE_BLOCK of scm audit log.

2020-09-09 Thread Glen Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-4228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Glen Geng updated HDDS-4228:

Description: 
 

The scm audit log for ALLOCATE_BLOCK is as follows:
{code:java}
2020-09-10 03:42:08,196 | INFO | SCMAudit | user=root | ip=172.16.90.221 | 
op=ALLOCATE_BLOCK {owner=7da0b4c4-d053-4fa0-8648-44ff0b8ba1bf, size=268435456, 
type=RATIS, factor=THREE} | ret=SUCCESS |{code}
 

One might be interested in the number of blocks allocated, so it would be better 
to add a 'num' field to the ALLOCATE_BLOCK entry of the SCM audit log.

  was:
 

The scm audit log for ALLOCATE_BLOCK is as follows:
{code:java}
2020-09-10 03:42:08,196 | INFO | SCMAudit | user=root | ip=172.16.90.221 | 
op=ALLOCATE_BLOCK {owner=7da0b4c4-d053-4fa0-8648-44ff0b8ba1bf, size=268435456, 
type=RATIS, factor=THREE} | ret=SUCCESS |{code}
 

Better add num of blocks allocated into the audit log, one might be interested 
about the num of blocks allocated, better add field 'num' to 


> add field 'num' to ALLOCATE_BLOCK of scm audit log.
> ---
>
> Key: HDDS-4228
> URL: https://issues.apache.org/jira/browse/HDDS-4228
> Project: Hadoop Distributed Data Store
>  Issue Type: Improvement
>Reporter: Glen Geng
>Assignee: Glen Geng
>Priority: Minor
>
>  
> The scm audit log for ALLOCATE_BLOCK is as follows:
> {code:java}
> 2020-09-10 03:42:08,196 | INFO | SCMAudit | user=root | ip=172.16.90.221 | 
> op=ALLOCATE_BLOCK {owner=7da0b4c4-d053-4fa0-8648-44ff0b8ba1bf, 
> size=268435456, type=RATIS, factor=THREE} | ret=SUCCESS |{code}
>  
> One might be interested in the number of blocks allocated, so it would be 
> better to add a 'num' field to the ALLOCATE_BLOCK entry of the SCM audit log.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-4228) add field 'num' to ALLOCATE_BLOCK of scm audit log.

2020-09-09 Thread Glen Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-4228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Glen Geng updated HDDS-4228:

Description: 
 

The scm audit log for ALLOCATE_BLOCK is as follows:
{code:java}
2020-09-10 03:42:08,196 | INFO | SCMAudit | user=root | ip=172.16.90.221 | 
op=ALLOCATE_BLOCK {owner=7da0b4c4-d053-4fa0-8648-44ff0b8ba1bf, size=268435456, 
type=RATIS, factor=THREE} | ret=SUCCESS |{code}
 

Better add num of blocks allocated into the audit log, one might be interested 
about the num of blocks allocated, better add field 'num' to 

  was:
 

The sac
{code:java}
2020-09-10 03:42:08,196 | INFO | SCMAudit | user=root | ip=172.16.90.221 | 
op=ALLOCATE_BLOCK {owner=7da0b4c4-d053-4fa0-8648-44ff0b8ba1bf, size=268435456, 
type=RATIS, factor=THREE} | ret=SUCCESS |{code}


> add field 'num' to ALLOCATE_BLOCK of scm audit log.
> ---
>
> Key: HDDS-4228
> URL: https://issues.apache.org/jira/browse/HDDS-4228
> Project: Hadoop Distributed Data Store
>  Issue Type: Improvement
>Reporter: Glen Geng
>Assignee: Glen Geng
>Priority: Minor
>
>  
> The scm audit log for ALLOCATE_BLOCK is as follows:
> {code:java}
> 2020-09-10 03:42:08,196 | INFO | SCMAudit | user=root | ip=172.16.90.221 | 
> op=ALLOCATE_BLOCK {owner=7da0b4c4-d053-4fa0-8648-44ff0b8ba1bf, 
> size=268435456, type=RATIS, factor=THREE} | ret=SUCCESS |{code}
>  
> Better add num of blocks allocated into the audit log, one might be 
> interested about the num of blocks allocated, better add field 'num' to 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-4228) add field 'num' to ALLOCATE_BLOCK of scm audit log.

2020-09-09 Thread Glen Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-4228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Glen Geng updated HDDS-4228:

Summary: add field 'num' to ALLOCATE_BLOCK of scm audit log.  (was: 
ALLOCATE_BLOCK of scm audit log miss num)

> add field 'num' to ALLOCATE_BLOCK of scm audit log.
> ---
>
> Key: HDDS-4228
> URL: https://issues.apache.org/jira/browse/HDDS-4228
> Project: Hadoop Distributed Data Store
>  Issue Type: Improvement
>Reporter: Glen Geng
>Assignee: Glen Geng
>Priority: Minor
>
>  
> The sac
> {code:java}
> 2020-09-10 03:42:08,196 | INFO | SCMAudit | user=root | ip=172.16.90.221 | 
> op=ALLOCATE_BLOCK {owner=7da0b4c4-d053-4fa0-8648-44ff0b8ba1bf, 
> size=268435456, type=RATIS, factor=THREE} | ret=SUCCESS |{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-4228) ALLOCATE_BLOCK of scm audit log miss num

2020-09-09 Thread Glen Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-4228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Glen Geng updated HDDS-4228:

Description: 
 

The sac
{code:java}
2020-09-10 03:42:08,196 | INFO | SCMAudit | user=root | ip=172.16.90.221 | 
op=ALLOCATE_BLOCK {owner=7da0b4c4-d053-4fa0-8648-44ff0b8ba1bf, size=268435456, 
type=RATIS, factor=THREE} | ret=SUCCESS |{code}

  was:
Teragen reported to be slow with low number of mappers compared to HDFS.

In my test (one pipeline, 3 yarn nodes) 10 g teragen with HDFS was ~3 mins but 
with Ozone it was 6 mins. It could be fixed with using more mappers, but when I 
investigated the execution I found a few problems reagrding to the BufferPool 
management.

 1. IncrementalChunkBuffer is slow and it might not be required as BufferPool 
itself is incremental
 2. For each write operation the bufferPool.allocateBufferIfNeeded is called 
which can be a slow operation (positions should be calculated).
 3. There is no explicit support for write(byte) operations

In the flamegraph it's clearly visible that with low number of mappers the 
client is busy with buffer operations. After the patch the rpc call and the 
checksum calculation give the majority of the time. 


> ALLOCATE_BLOCK of scm audit log miss num
> 
>
> Key: HDDS-4228
> URL: https://issues.apache.org/jira/browse/HDDS-4228
> Project: Hadoop Distributed Data Store
>  Issue Type: Improvement
>Reporter: Glen Geng
>Assignee: Glen Geng
>Priority: Minor
>
>  
> The sac
> {code:java}
> 2020-09-10 03:42:08,196 | INFO | SCMAudit | user=root | ip=172.16.90.221 | 
> op=ALLOCATE_BLOCK {owner=7da0b4c4-d053-4fa0-8648-44ff0b8ba1bf, 
> size=268435456, type=RATIS, factor=THREE} | ret=SUCCESS |{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDDS-4228) ALLOCATE_BLOCK of scm audit log miss num

2020-09-09 Thread Glen Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-4228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Glen Geng reassigned HDDS-4228:
---

Assignee: Glen Geng  (was: Marton Elek)

> ALLOCATE_BLOCK of scm audit log miss num
> 
>
> Key: HDDS-4228
> URL: https://issues.apache.org/jira/browse/HDDS-4228
> Project: Hadoop Distributed Data Store
>  Issue Type: Improvement
>Reporter: Glen Geng
>Assignee: Glen Geng
>Priority: Minor
>  Labels: pull-request-available
>
> Teragen reported to be slow with low number of mappers compared to HDFS.
> In my test (one pipeline, 3 yarn nodes) 10 g teragen with HDFS was ~3 mins 
> but with Ozone it was 6 mins. It could be fixed with using more mappers, but 
> when I investigated the execution I found a few problems regarding the 
> BufferPool management.
>  1. IncrementalChunkBuffer is slow and it might not be required as BufferPool 
> itself is incremental
>  2. For each write operation the bufferPool.allocateBufferIfNeeded is called 
> which can be a slow operation (positions should be calculated).
>  3. There is no explicit support for write(byte) operations
> In the flamegraph it's clearly visible that with low number of mappers the 
> client is busy with buffer operations. After the patch the rpc call and the 
> checksum calculation give the majority of the time. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-4228) ALLOCATE_BLOCK of scm audit log miss num

2020-09-09 Thread Glen Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-4228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Glen Geng updated HDDS-4228:

Labels:   (was: pull-request-available)

> ALLOCATE_BLOCK of scm audit log miss num
> 
>
> Key: HDDS-4228
> URL: https://issues.apache.org/jira/browse/HDDS-4228
> Project: Hadoop Distributed Data Store
>  Issue Type: Improvement
>Reporter: Glen Geng
>Assignee: Glen Geng
>Priority: Minor
>
> Teragen reported to be slow with low number of mappers compared to HDFS.
> In my test (one pipeline, 3 yarn nodes) 10 g teragen with HDFS was ~3 mins 
> but with Ozone it was 6 mins. It could be fixed with using more mappers, but 
> when I investigated the execution I found a few problems regarding the 
> BufferPool management.
>  1. IncrementalChunkBuffer is slow and it might not be required as BufferPool 
> itself is incremental
>  2. For each write operation the bufferPool.allocateBufferIfNeeded is called 
> which can be a slow operation (positions should be calculated).
>  3. There is no explicit support for write(byte) operations
> In the flamegraph it's clearly visible that with low number of mappers the 
> client is busy with buffer operations. After the patch the rpc call and the 
> checksum calculation give the majority of the time. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-4228) ALLOCATE_BLOCK of scm audit log miss num

2020-09-09 Thread Glen Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-4228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Glen Geng updated HDDS-4228:

Priority: Minor  (was: Blocker)

> ALLOCATE_BLOCK of scm audit log miss num
> 
>
> Key: HDDS-4228
> URL: https://issues.apache.org/jira/browse/HDDS-4228
> Project: Hadoop Distributed Data Store
>  Issue Type: Improvement
>Reporter: Glen Geng
>Assignee: Marton Elek
>Priority: Minor
>  Labels: pull-request-available
>
> Teragen reported to be slow with low number of mappers compared to HDFS.
> In my test (one pipeline, 3 yarn nodes) 10 g teragen with HDFS was ~3 mins 
> but with Ozone it was 6 mins. It could be fixed with using more mappers, but 
> when I investigated the execution I found a few problems regarding the 
> BufferPool management.
>  1. IncrementalChunkBuffer is slow and it might not be required as BufferPool 
> itself is incremental
>  2. For each write operation the bufferPool.allocateBufferIfNeeded is called 
> which can be a slow operation (positions should be calculated).
>  3. There is no explicit support for write(byte) operations
> In the flamegraph it's clearly visible that with low number of mappers the 
> client is busy with buffer operations. After the patch the rpc call and the 
> checksum calculation give the majority of the time. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org



[jira] [Created] (HDDS-4228) ALLOCATE_BLOCK of scm audit log miss num

2020-09-09 Thread Glen Geng (Jira)
Glen Geng created HDDS-4228:
---

 Summary: ALLOCATE_BLOCK of scm audit log miss num
 Key: HDDS-4228
 URL: https://issues.apache.org/jira/browse/HDDS-4228
 Project: Hadoop Distributed Data Store
  Issue Type: Improvement
Reporter: Glen Geng
Assignee: Marton Elek


Teragen reported to be slow with low number of mappers compared to HDFS.

In my test (one pipeline, 3 yarn nodes) 10 g teragen with HDFS was ~3 mins but 
with Ozone it was 6 mins. It could be fixed with using more mappers, but when I 
investigated the execution I found a few problems regarding the BufferPool 
management.

 1. IncrementalChunkBuffer is slow and it might not be required as BufferPool 
itself is incremental
 2. For each write operation the bufferPool.allocateBufferIfNeeded is called 
which can be a slow operation (positions should be calculated).
 3. There is no explicit support for write(byte) operations

In the flamegraph it's clearly visible that with low number of mappers the 
client is busy with buffer operations. After the patch the rpc call and the 
checksum calculation give the majority of the time. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDDS-4107) replace scmID with clusterID for container and volume at Datanode side

2020-09-04 Thread Glen Geng (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-4107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17190638#comment-17190638
 ] 

Glen Geng edited comment on HDDS-4107 at 9/4/20, 9:06 AM:
--

I am working on the upgrade issues for this PR and found that renaming the 
volume dir is not enough.

The scmId is not only in the volume dir but also in the container metadata, 
which will make the upgrade procedure for a huge cluster impossible.

 

Here is an example from the misc/upgrade acceptance test case.

*Version file as follows:*
{code:java}
[hadoop@9 
~/glengeng/hadoop-ozone/hadoop-ozone/dist/target/ozone-0.6.0-SNAPSHOT/compose/upgrade/data]$
 cat scm/metadata/scm/current/VERSION 
#Fri Sep 04 07:08:47 UTC 2020
cTime=1599203327270
clusterID=CID-a1d16168-8a93-4ab2-b276-a68e0bf4dbb5
nodeType=SCM
scmUuid=9176f875-c8f2-4dcd-8d1d-5b988ad25914
{code}
 

*The layout after renaming the volume dir from scmId to clusterId:*
{code:java}
[hadoop@9 
~/glengeng/hadoop-ozone/hadoop-ozone/dist/target/ozone-0.6.0-SNAPSHOT/compose/upgrade/data/dn2/hdds]$
 tree .
.
|-- hdds
|   |-- CID-a1d16168-8a93-4ab2-b276-a68e0bf4dbb5
|   |   `-- current
|   |   `-- containerDir0
|   |   `-- 1
|   |   |-- chunks
|   |   |   `-- 104805390775943168_chunk_1
|   |   `-- metadata
|   |   |-- 1-dn-container.db
|   |   |   |-- 06.log
|   |   |   |-- CURRENT
|   |   |   |-- IDENTITY
|   |   |   |-- LOCK
|   |   |   |-- LOG
|   |   |   |-- LOG.old.1599203351819700
|   |   |   |-- MANIFEST-05
|   |   |   |-- OPTIONS-05
|   |   |   `-- OPTIONS-08
|   |   `-- 1.container
|   `-- VERSION
`-- scmUsed

8 directories, 13 files
{code}
 

*Grepping for the scmId shows it embedded in the container metadata:*
{code:java}
[hadoop@9 
~/glengeng/hadoop-ozone/hadoop-ozone/dist/target/ozone-0.6.0-SNAPSHOT/compose/upgrade/data/dn2]$
 find . -type f | xargs grep 9176f875
./hdds/hdds/CID-a1d16168-8a93-4ab2-b276-a68e0bf4dbb5/current/containerDir0/1/metadata/1-dn-container.db/OPTIONS-08:
  
wal_dir=/data/hdds/hdds/9176f875-c8f2-4dcd-8d1d-5b988ad25914/current/containerDir0/1/metadata/1-dn-container.db
./hdds/hdds/CID-a1d16168-8a93-4ab2-b276-a68e0bf4dbb5/current/containerDir0/1/metadata/1-dn-container.db/LOG:2020/09/04-07:09:11.819945
 7fe6f187b700 SST files in 
/data/hdds/hdds/9176f875-c8f2-4dcd-8d1d-5b988ad25914/current/containerDir0/1/metadata/1-dn-container.db
 dir, Total Num: 0, files: 
./hdds/hdds/CID-a1d16168-8a93-4ab2-b276-a68e0bf4dbb5/current/containerDir0/1/metadata/1-dn-container.db/LOG:2020/09/04-07:09:11.819947
 7fe6f187b700 Write Ahead Log file in 
/data/hdds/hdds/9176f875-c8f2-4dcd-8d1d-5b988ad25914/current/containerDir0/1/metadata/1-dn-container.db:
 03.log size: 0 ; 
./hdds/hdds/CID-a1d16168-8a93-4ab2-b276-a68e0bf4dbb5/current/containerDir0/1/metadata/1-dn-container.db/LOG:2020/09/04-07:09:11.819972
 7fe6f187b700 Options.wal_dir: 
/data/hdds/hdds/9176f875-c8f2-4dcd-8d1d-5b988ad25914/current/containerDir0/1/metadata/1-dn-container.db
./hdds/hdds/CID-a1d16168-8a93-4ab2-b276-a68e0bf4dbb5/current/containerDir0/1/metadata/1-dn-container.db/LOG:2020/09/04-07:09:11.820954
 7fe6f187b700 [/version_set.cc:3731] Recovered from manifest 
file:/data/hdds/hdds/9176f875-c8f2-4dcd-8d1d-5b988ad25914/current/containerDir0/1/metadata/1-dn-container.db/MANIFEST-01
 succeeded,manifest_file_number is 1, next_file_number is 3, last_sequence is 
0, log_number is 0,prev_log_number is 0,max_column_family is 
0,min_log_number_to_keep is 0
./hdds/hdds/CID-a1d16168-8a93-4ab2-b276-a68e0bf4dbb5/current/containerDir0/1/metadata/1-dn-container.db/OPTIONS-05:
  
wal_dir=/data/hdds/hdds/9176f875-c8f2-4dcd-8d1d-5b988ad25914/current/containerDir0/1/metadata/1-dn-container.db
./hdds/hdds/CID-a1d16168-8a93-4ab2-b276-a68e0bf4dbb5/current/containerDir0/1/metadata/1-dn-container.db/LOG.old.1599203351819700:2020/09/04-07:09:11.716714
 7fe702591700 SST files in 
/data/hdds/hdds/9176f875-c8f2-4dcd-8d1d-5b988ad25914/current/containerDir0/1/metadata/1-dn-container.db
 dir, Total Num: 0, files: 
./hdds/hdds/CID-a1d16168-8a93-4ab2-b276-a68e0bf4dbb5/current/containerDir0/1/metadata/1-dn-container.db/LOG.old.1599203351819700:2020/09/04-07:09:11.716716
 7fe702591700 Write Ahead Log file in 
/data/hdds/hdds/9176f875-c8f2-4dcd-8d1d-5b988ad25914/current/containerDir0/1/metadata/1-dn-container.db:
 
./hdds/hdds/CID-a1d16168-8a93-4ab2-b276-a68e0bf4dbb5/current/containerDir0/1/metadata/1-dn-container.db/LOG.old.1599203351819700:2020/09/04-07:09:11.716737
 7fe702591700 Options.wal_dir: 
/data/hdds/hdds/9176f875-c8f2-4dcd-8d1d-5b988ad25914/current/containerDir0/1/metadata/1-dn-container.db
./hdds/hdds/CID-

[jira] [Commented] (HDDS-4107) replace scmID with clusterID for container and volume at Datanode side

2020-09-04 Thread Glen Geng (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-4107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17190638#comment-17190638
 ] 

Glen Geng commented on HDDS-4107:
-

I am working on the upgrade issues for this PR and found that renaming the 
volume dir is not enough.

The scmId is not only in the volume dir but also in the container metadata, 
which will make the upgrade procedure for a huge cluster impossible.

 

Here is an example from the misc/upgrade acceptance test case.

Version file as follows:

 
{code:java}
[hadoop@9 
~/glengeng/hadoop-ozone/hadoop-ozone/dist/target/ozone-0.6.0-SNAPSHOT/compose/upgrade/data]$
 cat scm/metadata/scm/current/VERSION 
#Fri Sep 04 07:08:47 UTC 2020
cTime=1599203327270
clusterID=CID-a1d16168-8a93-4ab2-b276-a68e0bf4dbb5
nodeType=SCM
scmUuid=9176f875-c8f2-4dcd-8d1d-5b988ad25914
{code}
 

 

The layout after renaming the volume dir from scmId to clusterId:

 
{code:java}
[hadoop@9 
~/glengeng/hadoop-ozone/hadoop-ozone/dist/target/ozone-0.6.0-SNAPSHOT/compose/upgrade/data/dn2/hdds]$
 tree .
.
|-- hdds
|   |-- CID-a1d16168-8a93-4ab2-b276-a68e0bf4dbb5
|   |   `-- current
|   |   `-- containerDir0
|   |   `-- 1
|   |   |-- chunks
|   |   |   `-- 104805390775943168_chunk_1
|   |   `-- metadata
|   |   |-- 1-dn-container.db
|   |   |   |-- 06.log
|   |   |   |-- CURRENT
|   |   |   |-- IDENTITY
|   |   |   |-- LOCK
|   |   |   |-- LOG
|   |   |   |-- LOG.old.1599203351819700
|   |   |   |-- MANIFEST-05
|   |   |   |-- OPTIONS-05
|   |   |   `-- OPTIONS-08
|   |   `-- 1.container
|   `-- VERSION
`-- scmUsed

8 directories, 13 files
{code}
 

Grepping for the scmId shows it embedded in the container metadata:

 
{code:java}
[hadoop@9 
~/glengeng/hadoop-ozone/hadoop-ozone/dist/target/ozone-0.6.0-SNAPSHOT/compose/upgrade/data/dn2]$
 find . -type f | xargs grep 9176f875
./hdds/hdds/CID-a1d16168-8a93-4ab2-b276-a68e0bf4dbb5/current/containerDir0/1/metadata/1-dn-container.db/OPTIONS-08:
  
wal_dir=/data/hdds/hdds/9176f875-c8f2-4dcd-8d1d-5b988ad25914/current/containerDir0/1/metadata/1-dn-container.db
./hdds/hdds/CID-a1d16168-8a93-4ab2-b276-a68e0bf4dbb5/current/containerDir0/1/metadata/1-dn-container.db/LOG:2020/09/04-07:09:11.819945
 7fe6f187b700 SST files in 
/data/hdds/hdds/9176f875-c8f2-4dcd-8d1d-5b988ad25914/current/containerDir0/1/metadata/1-dn-container.db
 dir, Total Num: 0, files: 
./hdds/hdds/CID-a1d16168-8a93-4ab2-b276-a68e0bf4dbb5/current/containerDir0/1/metadata/1-dn-container.db/LOG:2020/09/04-07:09:11.819947
 7fe6f187b700 Write Ahead Log file in 
/data/hdds/hdds/9176f875-c8f2-4dcd-8d1d-5b988ad25914/current/containerDir0/1/metadata/1-dn-container.db:
 03.log size: 0 ; 
./hdds/hdds/CID-a1d16168-8a93-4ab2-b276-a68e0bf4dbb5/current/containerDir0/1/metadata/1-dn-container.db/LOG:2020/09/04-07:09:11.819972
 7fe6f187b700 Options.wal_dir: 
/data/hdds/hdds/9176f875-c8f2-4dcd-8d1d-5b988ad25914/current/containerDir0/1/metadata/1-dn-container.db
./hdds/hdds/CID-a1d16168-8a93-4ab2-b276-a68e0bf4dbb5/current/containerDir0/1/metadata/1-dn-container.db/LOG:2020/09/04-07:09:11.820954
 7fe6f187b700 [/version_set.cc:3731] Recovered from manifest 
file:/data/hdds/hdds/9176f875-c8f2-4dcd-8d1d-5b988ad25914/current/containerDir0/1/metadata/1-dn-container.db/MANIFEST-01
 succeeded,manifest_file_number is 1, next_file_number is 3, last_sequence is 
0, log_number is 0,prev_log_number is 0,max_column_family is 
0,min_log_number_to_keep is 0
./hdds/hdds/CID-a1d16168-8a93-4ab2-b276-a68e0bf4dbb5/current/containerDir0/1/metadata/1-dn-container.db/OPTIONS-05:
  
wal_dir=/data/hdds/hdds/9176f875-c8f2-4dcd-8d1d-5b988ad25914/current/containerDir0/1/metadata/1-dn-container.db
./hdds/hdds/CID-a1d16168-8a93-4ab2-b276-a68e0bf4dbb5/current/containerDir0/1/metadata/1-dn-container.db/LOG.old.1599203351819700:2020/09/04-07:09:11.716714
 7fe702591700 SST files in 
/data/hdds/hdds/9176f875-c8f2-4dcd-8d1d-5b988ad25914/current/containerDir0/1/metadata/1-dn-container.db
 dir, Total Num: 0, files: 
./hdds/hdds/CID-a1d16168-8a93-4ab2-b276-a68e0bf4dbb5/current/containerDir0/1/metadata/1-dn-container.db/LOG.old.1599203351819700:2020/09/04-07:09:11.716716
 7fe702591700 Write Ahead Log file in 
/data/hdds/hdds/9176f875-c8f2-4dcd-8d1d-5b988ad25914/current/containerDir0/1/metadata/1-dn-container.db:
 
./hdds/hdds/CID-a1d16168-8a93-4ab2-b276-a68e0bf4dbb5/current/containerDir0/1/metadata/1-dn-container.db/LOG.old.1599203351819700:2020/09/04-07:09:11.716737
 7fe702591700 Options.wal_dir: 
/data/hdds/hdds/9176f875-c8f2-4dcd-8d1d-5b988ad25914/current/containerDir0/1/metadata/1-dn-container.db
./hdds/hdds/CID-a1d16168-8a93-4ab2-b276-a68e0bf4dbb5/current

[jira] [Resolved] (HDDS-4186) Adjust RetryPolicy of SCMConnectionManager for SCM/Recon

2020-09-04 Thread Glen Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-4186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Glen Geng resolved HDDS-4186.
-
Resolution: Fixed

> Adjust RetryPolicy of SCMConnectionManager for SCM/Recon
> 
>
> Key: HDDS-4186
> URL: https://issues.apache.org/jira/browse/HDDS-4186
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Datanode
>Reporter: Glen Geng
>Assignee: Glen Geng
>Priority: Critical
>  Labels: pull-request-available
>
> *The problem is:*
> If you set up one Recon and one SCM and then shut down the Recon server, all 
> Datanodes will become stale/dead very soon on the SCM side.
>  
> *The root cause is:*
> Current RetryPolicy of Datanode for SCM is retryForeverWithFixedSleep:
> {code:java}
> RetryPolicy retryPolicy =
>  RetryPolicies.retryForeverWithFixedSleep(
>  1000, TimeUnit.MILLISECONDS);
> StorageContainerDatanodeProtocolPB rpcProxy = RPC.getProtocolProxy(
> StorageContainerDatanodeProtocolPB.class, version,
> address, UserGroupInformation.getCurrentUser(), hadoopConfig,
> NetUtils.getDefaultSocketFactory(hadoopConfig), getRpcTimeout(),
> retryPolicy).getProxy();{code}
>  that for Recon is retryUpToMaximumCountWithFixedSleep:
> {code:java}
> RetryPolicy retryPolicy =
> RetryPolicies.retryUpToMaximumCountWithFixedSleep(10,
> 6, TimeUnit.MILLISECONDS);
> ReconDatanodeProtocolPB rpcProxy = RPC.getProtocolProxy(
> ReconDatanodeProtocolPB.class, version,
> address, UserGroupInformation.getCurrentUser(), hadoopConfig,
> NetUtils.getDefaultSocketFactory(hadoopConfig), getRpcTimeout(),
> retryPolicy).getProxy();
> {code}
>  
> The executorService in DatanodeStateMachine is 
> Executors.newFixedThreadPool(...), whose default pool size is 2, one for 
> Recon, another for SCM.
>  
> When an RPC failure is encountered, call() of RegisterEndpointTask, 
> VersionEndpointTask and HeartbeatEndpointTask will retry while holding 
> rpcEndpoint.lock(). For example:
> {code:java}
> public EndpointStateMachine.EndPointStates call() throws Exception {
>   rpcEndpoint.lock();
>   try {
> 
> SCMHeartbeatResponseProto reponse = rpcEndpoint.getEndPoint()
> .sendHeartbeat(request);
> 
>   } finally {
> rpcEndpoint.unlock();
>   }
>   return rpcEndpoint.getState();
> }
> {code}
>  
> If Recon is down, the thread running the Recon task keeps retrying on the RPC 
> failure while holding the lock of the EndpointStateMachine for Recon. When 
> DatanodeStateMachine schedules the next round of SCM/Recon tasks, the only 
> remaining thread is assigned to the Recon task and blocks waiting for the 
> lock of the EndpointStateMachine for Recon.
> {code:java}
> public EndpointStateMachine.EndPointStates call() throws Exception {
>   rpcEndpoint.lock();
>   ...{code}
>  
> *The solution is:*
> Since DatanodeStateMachine will periodically schedule SCM/Recon tasks, we may 
> adjust the RetryPolicy so that it won't retry for longer than 1 min.
>  
> *The change has no side effect:*
> 1) VersionEndpointTask.call() is fine
> 2) RegisterEndpointTask.call() will query containerReport, nodeReport, 
> pipelineReports from OzoneContainer, which is fine.
> 3) HeartbeatEndpointTask.call() will putBackReports(), which is fine.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-4192) enable SCM Raft Group based on config ozone.scm.names

2020-09-02 Thread Glen Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-4192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Glen Geng updated HDDS-4192:

Description: 
 

Say ozone.scm.names is "ip1,ip2,ip3", scm with ip1 identifies its RaftPeerId as 
scm1,  scm with ip2 identifies its RaftPeerId as scm2, scm with ip3 identifies 
its RaftPeerId as scm3. They will automatically become a raft group.
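
As a minimal illustration (plain Java, not the actual SCM HA code — the class and method names below are assumptions), each SCM can derive the same position-based RaftPeerIds from the shared config value:
{code:java}
// Hedged sketch: every SCM computes the identical "ip -> scmN" mapping from
// ozone.scm.names, so the peers can agree on one raft group without any
// extra coordination.
import java.util.ArrayList;
import java.util.List;

public final class ScmRaftGroupSketch {

  static List<String> peerIdsFromConfig(String ozoneScmNames) {
    String[] hosts = ozoneScmNames.split(",");
    List<String> peerIds = new ArrayList<>();
    for (int i = 0; i < hosts.length; i++) {
      // "ip1" -> "scm1", "ip2" -> "scm2", "ip3" -> "scm3" (position based)
      peerIds.add("scm" + (i + 1));
    }
    return peerIds;
  }

  public static void main(String[] args) {
    System.out.println(peerIdsFromConfig("ip1,ip2,ip3")); // [scm1, scm2, scm3]
  }
}
{code}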

  was:
 

Say ozone.scm.names is "ip1,ip2,ip3", scm with ip1 identifies its RaftPeerId as 
scm1,  scm with ip2 identifies its RaftPeerId as scm2, scm with ip3 identifies 
its RaftPeerId as scm3. 


> enable SCM Raft Group based on config ozone.scm.names
> -
>
> Key: HDDS-4192
> URL: https://issues.apache.org/jira/browse/HDDS-4192
> Project: Hadoop Distributed Data Store
>  Issue Type: Sub-task
>  Components: SCM
>Reporter: Glen Geng
>Assignee: Glen Geng
>Priority: Major
>
>  
> Say ozone.scm.names is "ip1,ip2,ip3", scm with ip1 identifies its RaftPeerId 
> as scm1,  scm with ip2 identifies its RaftPeerId as scm2, scm with ip3 
> identifies its RaftPeerId as scm3. They will automatically become a raft 
> group.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-4192) enable SCM Raft Group based on config ozone.scm.names

2020-09-02 Thread Glen Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-4192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Glen Geng updated HDDS-4192:

Description: 
 

Say ozone.scm.names is "ip1,ip2,ip3", scm with ip1 identifies its RaftPeerId as 
scm1,  scm with ip2 identifies its RaftPeerId as scm2, scm with ip3 identifies 
its RaftPeerId as scm3. 

> enable SCM Raft Group based on config ozone.scm.names
> -
>
> Key: HDDS-4192
> URL: https://issues.apache.org/jira/browse/HDDS-4192
> Project: Hadoop Distributed Data Store
>  Issue Type: Sub-task
>  Components: SCM
>Reporter: Glen Geng
>Assignee: Glen Geng
>Priority: Major
>
>  
> Say ozone.scm.names is "ip1,ip2,ip3", scm with ip1 identifies its RaftPeerId 
> as scm1,  scm with ip2 identifies its RaftPeerId as scm2, scm with ip3 
> identifies its RaftPeerId as scm3. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org



[jira] [Created] (HDDS-4192) enable SCM Raft Group based on config ozone.scm.names

2020-09-02 Thread Glen Geng (Jira)
Glen Geng created HDDS-4192:
---

 Summary: enable SCM Raft Group based on config ozone.scm.names
 Key: HDDS-4192
 URL: https://issues.apache.org/jira/browse/HDDS-4192
 Project: Hadoop Distributed Data Store
  Issue Type: Sub-task
  Components: SCM
Reporter: Glen Geng
Assignee: Glen Geng


 

Fix a bug in https://issues.apache.org/jira/browse/HDDS-3895 

In ContainerStateManagerV2, both disk state (column families in RocksDB) and 
memory state (container maps in memory) are protected by raft, and should keep 
their consistency upon each modification.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-4192) enable SCM Raft Group based on config ozone.scm.names

2020-09-02 Thread Glen Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-4192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Glen Geng updated HDDS-4192:

Labels:   (was: pull-request-available)

> enable SCM Raft Group based on config ozone.scm.names
> -
>
> Key: HDDS-4192
> URL: https://issues.apache.org/jira/browse/HDDS-4192
> Project: Hadoop Distributed Data Store
>  Issue Type: Sub-task
>  Components: SCM
>Reporter: Glen Geng
>Assignee: Glen Geng
>Priority: Major
>
>  
> Fix a bug in https://issues.apache.org/jira/browse/HDDS-3895 
> In ContainerStateManagerV2, both disk state (column families in RocksDB) and 
> memory state (container maps in memory) are protected by raft, and should 
> keep their consistency upon each modification.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-4192) enable SCM Raft Group based on config ozone.scm.names

2020-09-02 Thread Glen Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-4192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Glen Geng updated HDDS-4192:

Description: (was:  

Fix a bug in https://issues.apache.org/jira/browse/HDDS-3895 

In ContainerStateManagerV2, both disk state (column families in RocksDB) and 
memory state (container maps in memory) are protected by raft, and should keep 
their consistency upon each modification.)

> enable SCM Raft Group based on config ozone.scm.names
> -
>
> Key: HDDS-4192
> URL: https://issues.apache.org/jira/browse/HDDS-4192
> Project: Hadoop Distributed Data Store
>  Issue Type: Sub-task
>  Components: SCM
>Reporter: Glen Geng
>Assignee: Glen Geng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-4186) Adjust RetryPolicy of SCMConnectionManager for SCM/Recon

2020-09-01 Thread Glen Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-4186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Glen Geng updated HDDS-4186:

Description: 
*The problem is:*

If you set up one Recon and one SCM and then shut down the Recon server, all 
Datanodes will become stale/dead very soon on the SCM side.

 

*The root cause is:*

Current RetryPolicy of Datanode for SCM is retryForeverWithFixedSleep:
{code:java}
RetryPolicy retryPolicy =
 RetryPolicies.retryForeverWithFixedSleep(
 1000, TimeUnit.MILLISECONDS);

StorageContainerDatanodeProtocolPB rpcProxy = RPC.getProtocolProxy(
StorageContainerDatanodeProtocolPB.class, version,
address, UserGroupInformation.getCurrentUser(), hadoopConfig,
NetUtils.getDefaultSocketFactory(hadoopConfig), getRpcTimeout(),
retryPolicy).getProxy();{code}
 that for Recon is retryUpToMaximumCountWithFixedSleep:
{code:java}
RetryPolicy retryPolicy =
RetryPolicies.retryUpToMaximumCountWithFixedSleep(10,
6, TimeUnit.MILLISECONDS);

ReconDatanodeProtocolPB rpcProxy = RPC.getProtocolProxy(
ReconDatanodeProtocolPB.class, version,
address, UserGroupInformation.getCurrentUser(), hadoopConfig,
NetUtils.getDefaultSocketFactory(hadoopConfig), getRpcTimeout(),
retryPolicy).getProxy();
{code}
 

The executorService in DatanodeStateMachine is 
Executors.newFixedThreadPool(...), whose default pool size is 2, one for Recon, 
another for SCM.

 

When an RPC failure is encountered, call() of RegisterEndpointTask, 
VersionEndpointTask and HeartbeatEndpointTask will retry while holding 
rpcEndpoint.lock(). For example:
{code:java}
public EndpointStateMachine.EndPointStates call() throws Exception {
  rpcEndpoint.lock();

  try {


SCMHeartbeatResponseProto reponse = rpcEndpoint.getEndPoint()
.sendHeartbeat(request);


  } finally {
rpcEndpoint.unlock();
  }

  return rpcEndpoint.getState();
}
{code}
 

If Recon is down, the thread running the Recon task keeps retrying on the RPC 
failure while holding the lock of the EndpointStateMachine for Recon. When 
DatanodeStateMachine schedules the next round of SCM/Recon tasks, the only 
remaining thread is assigned to the Recon task and blocks waiting for the lock 
of the EndpointStateMachine for Recon.
{code:java}
public EndpointStateMachine.EndPointStates call() throws Exception {
  rpcEndpoint.lock();
  ...{code}
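
(The starvation can be reproduced with a small standalone sketch using only JDK classes — illustrative, not Ozone code:)
{code:java}
// Hedged sketch: a 2-thread pool where the "Recon" task holds its endpoint
// lock while endlessly retrying, so the next Recon task occupies the only
// remaining worker thread waiting for the same lock, and the SCM heartbeat
// task never gets to run.
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

public class EndpointStarvationDemo {

  private static void sleepSeconds(long s) {
    try {
      TimeUnit.SECONDS.sleep(s);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }

  public static void main(String[] args) {
    ExecutorService pool = Executors.newFixedThreadPool(2);
    ReentrantLock reconEndpointLock = new ReentrantLock();

    // Round 1: the Recon task takes its endpoint lock and keeps "retrying".
    pool.execute(() -> {
      reconEndpointLock.lock();
      try {
        while (true) {
          sleepSeconds(1);          // stands in for endless RPC retries
        }
      } finally {
        reconEndpointLock.unlock();
      }
    });

    sleepSeconds(1);

    // Round 2: the next Recon task blocks on the same lock and occupies the
    // only remaining worker thread ...
    pool.execute(() -> {
      reconEndpointLock.lock();
      try {
        System.out.println("recon round 2");   // never reached
      } finally {
        reconEndpointLock.unlock();
      }
    });

    // ... so the SCM heartbeat task queues behind it and never runs, which is
    // why SCM eventually marks the datanode stale/dead.
    pool.execute(() -> System.out.println("scm heartbeat"));
  }
}
{code}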
 

*The solution is:*

Since DatanodeStateMachine will periodically schedule SCM/Recon tasks, we may 
adjust the RetryPolicy so that it won't retry for longer than 1 min.
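
For example, a bounded policy along the following lines caps the retry window at roughly one minute (the count and interval here are illustrative, not necessarily the values chosen by the patch):
{code:java}
// Hedged sketch, same Hadoop classes as above (org.apache.hadoop.io.retry.*):
// at most 60 attempts with a 1 s sleep, i.e. about one minute in total.
RetryPolicy boundedRetry =
    RetryPolicies.retryUpToMaximumCountWithFixedSleep(
        60, 1000, TimeUnit.MILLISECONDS);
{code}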

 

*The change has no side effect:*

1) VersionEndpointTask.call() is fine

2) RegisterEndpointTask.call() will query containerReport, nodeReport, 
pipelineReports from OzoneContainer, which is fine.

3) HeartbeatEndpointTask.call() will putBackReports(), which is fine.

 

  was:
*The problem is:*

If setup one Recon and one SCM, then shutdown the Recon server, all Datanodes 
will be stale/dead very soon at SCM side.

 

*The root cause is:*

Current RetryPolicy of Datanode for SCM is retryForeverWithFixedSleep:
{code:java}
RetryPolicy retryPolicy =
 RetryPolicies.retryForeverWithFixedSleep(
 1000, TimeUnit.MILLISECONDS);

StorageContainerDatanodeProtocolPB rpcProxy = RPC.getProtocolProxy(
StorageContainerDatanodeProtocolPB.class, version,
address, UserGroupInformation.getCurrentUser(), hadoopConfig,
NetUtils.getDefaultSocketFactory(hadoopConfig), getRpcTimeout(),
retryPolicy).getProxy();{code}
 that for Recon is retryUpToMaximumCountWithFixedSleep:
{code:java}
RetryPolicy retryPolicy =
RetryPolicies.retryUpToMaximumCountWithFixedSleep(10,
6, TimeUnit.MILLISECONDS);

ReconDatanodeProtocolPB rpcProxy = RPC.getProtocolProxy(
ReconDatanodeProtocolPB.class, version,
address, UserGroupInformation.getCurrentUser(), hadoopConfig,
NetUtils.getDefaultSocketFactory(hadoopConfig), getRpcTimeout(),
retryPolicy).getProxy();
{code}
 

The executorService in DatanodeStateMachine is 
Executors.newFixedThreadPool(...), whose default pool size is 2, one for Recon, 
another for SCM.

 

When encounter rpc failure, call() of RegisterEndpointTask, 
VersionEndpointTask, HeartbeatEndpointTask will retry while holding the 
rpcEndpoint.lock(). For example:
{code:java}
public EndpointStateMachine.EndPointStates call() throws Exception {
  rpcEndpoint.lock();

  try {


SCMHeartbeatResponseProto reponse = rpcEndpoint.getEndPoint()
.sendHeartbeat(request);


  } finally {
rpcEndpoint.unlock();
  }

  return rpcEndpoint.getState();
}
{code}
 

The thread running Recon task will retry due to rpc failure, meanwhile holds 
the lock of EndpointStateMachine for Recon. When DatanodeStateMachine schedule 
the next round of SCM/Recon task, the only left thread will be assigned to run 
Recon task, and blocked at waiting for the lock of EndpointStateMachine for 
Recon.
{code:java}
public EndpointStateMachine.EndPointStates call() throws Except

[jira] [Updated] (HDDS-4186) Adjust RetryPolicy of SCMConnectionManager for SCM/Recon

2020-09-01 Thread Glen Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-4186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Glen Geng updated HDDS-4186:

Description: 
*The problem is:*

If you set up one Recon and one SCM and then shut down the Recon server, all 
Datanodes will become stale/dead very soon on the SCM side.

 

*The root cause is:*

Current RetryPolicy of Datanode for SCM is retryForeverWithFixedSleep:
{code:java}
RetryPolicy retryPolicy =
 RetryPolicies.retryForeverWithFixedSleep(
 1000, TimeUnit.MILLISECONDS);

StorageContainerDatanodeProtocolPB rpcProxy = RPC.getProtocolProxy(
StorageContainerDatanodeProtocolPB.class, version,
address, UserGroupInformation.getCurrentUser(), hadoopConfig,
NetUtils.getDefaultSocketFactory(hadoopConfig), getRpcTimeout(),
retryPolicy).getProxy();{code}
 that for Recon is retryUpToMaximumCountWithFixedSleep:
{code:java}
RetryPolicy retryPolicy =
RetryPolicies.retryUpToMaximumCountWithFixedSleep(10,
6, TimeUnit.MILLISECONDS);

ReconDatanodeProtocolPB rpcProxy = RPC.getProtocolProxy(
ReconDatanodeProtocolPB.class, version,
address, UserGroupInformation.getCurrentUser(), hadoopConfig,
NetUtils.getDefaultSocketFactory(hadoopConfig), getRpcTimeout(),
retryPolicy).getProxy();
{code}
 

The executorService in DatanodeStateMachine is 
Executors.newFixedThreadPool(...), whose default pool size is 2, one for Recon, 
another for SCM.

 

When an RPC failure is encountered, call() of RegisterEndpointTask, 
VersionEndpointTask and HeartbeatEndpointTask will retry while holding 
rpcEndpoint.lock(). For example:
{code:java}
public EndpointStateMachine.EndPointStates call() throws Exception {
  rpcEndpoint.lock();

  try {


SCMHeartbeatResponseProto reponse = rpcEndpoint.getEndPoint()
.sendHeartbeat(request);


  } finally {
rpcEndpoint.unlock();
  }

  return rpcEndpoint.getState();
}
{code}
 

The thread running the Recon task keeps retrying on the RPC failure while 
holding the lock of the EndpointStateMachine for Recon. When 
DatanodeStateMachine schedules the next round of SCM/Recon tasks, the only 
remaining thread is assigned to the Recon task and blocks waiting for the lock 
of the EndpointStateMachine for Recon.
{code:java}
public EndpointStateMachine.EndPointStates call() throws Exception {
  rpcEndpoint.lock();
  ...{code}
 

*The solution is:*

Since DatanodeStateMachine will periodically schedule SCM/Recon tasks, we may 
adjust the RetryPolicy so that it won't retry for longer than 1 min.

 

*The change has no side effect:*

1) VersionEndpointTask.call() is fine

2) RegisterEndpointTask.call() will query containerReport, nodeReport, 
pipelineReports from OzoneContainer, which is fine.

3) HeartbeatEndpointTask.call() will putBackReports(), which is fine.

 

  was:
Current RetryPolicy of Datanode for SCM is retryForeverWithFixedSleep:
{code:java}
RetryPolicy retryPolicy =
 RetryPolicies.retryForeverWithFixedSleep(
 1000, TimeUnit.MILLISECONDS);

StorageContainerDatanodeProtocolPB rpcProxy = RPC.getProtocolProxy(
StorageContainerDatanodeProtocolPB.class, version,
address, UserGroupInformation.getCurrentUser(), hadoopConfig,
NetUtils.getDefaultSocketFactory(hadoopConfig), getRpcTimeout(),
retryPolicy).getProxy();{code}
 that for Recon is retryUpToMaximumCountWithFixedSleep:
{code:java}
RetryPolicy retryPolicy =
RetryPolicies.retryUpToMaximumCountWithFixedSleep(10,
6, TimeUnit.MILLISECONDS);

ReconDatanodeProtocolPB rpcProxy = RPC.getProtocolProxy(
ReconDatanodeProtocolPB.class, version,
address, UserGroupInformation.getCurrentUser(), hadoopConfig,
NetUtils.getDefaultSocketFactory(hadoopConfig), getRpcTimeout(),
retryPolicy).getProxy();
{code}
 

The executorService in DatanodeStateMachine is 
Executors.newFixedThreadPool(...), whose default pool size is 2, one for Recon, 
another for SCM.

 

When encounter rpc failure, call() of RegisterEndpointTask, 
VersionEndpointTask, HeartbeatEndpointTask will retry while holding the 
rpcEndpoint.lock(). For example:
{code:java}
public EndpointStateMachine.EndPointStates call() throws Exception {
  rpcEndpoint.lock();

  try {


SCMHeartbeatResponseProto reponse = rpcEndpoint.getEndPoint()
.sendHeartbeat(request);


  } finally {
rpcEndpoint.unlock();
  }

  return rpcEndpoint.getState();
}
{code}
 

*The problem is:*

If setup one Recon and one SCM, then shutdown the Recon server, all Datanodes 
will be stale/dead very soon at SCM side.

 

*The root cause is:*

The thread running Recon task will retry due to rpc failure, meanwhile holds 
the lock of EndpointStateMachine for Recon. When DatanodeStateMachine schedule 
the next round of SCM/Recon task, the only left thread will be assigned to run 
Recon task, and blocked at waiting for the lock of EndpointStateMachine for 
Recon.
{code:java}
public EndpointStateMachine.EndPointStates call() throws Exception {
  rpcEndpoin

[jira] [Updated] (HDDS-4186) Adjust RetryPolicy of SCMConnectionManager for SCM/Recon

2020-09-01 Thread Glen Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-4186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Glen Geng updated HDDS-4186:

Component/s: Ozone Datanode

> Adjust RetryPolicy of SCMConnectionManager for SCM/Recon
> 
>
> Key: HDDS-4186
> URL: https://issues.apache.org/jira/browse/HDDS-4186
> Project: Hadoop Distributed Data Store
>  Issue Type: Improvement
>  Components: Ozone Datanode
>Reporter: Glen Geng
>Assignee: Glen Geng
>Priority: Blocker
>  Labels: pull-request-available
>
> Current RetryPolicy of Datanode for SCM is retryForeverWithFixedSleep:
> {code:java}
> RetryPolicy retryPolicy =
>  RetryPolicies.retryForeverWithFixedSleep(
>  1000, TimeUnit.MILLISECONDS);
> StorageContainerDatanodeProtocolPB rpcProxy = RPC.getProtocolProxy(
> StorageContainerDatanodeProtocolPB.class, version,
> address, UserGroupInformation.getCurrentUser(), hadoopConfig,
> NetUtils.getDefaultSocketFactory(hadoopConfig), getRpcTimeout(),
> retryPolicy).getProxy();{code}
>  that for Recon is retryUpToMaximumCountWithFixedSleep:
> {code:java}
> RetryPolicy retryPolicy =
> RetryPolicies.retryUpToMaximumCountWithFixedSleep(10,
> 6, TimeUnit.MILLISECONDS);
> ReconDatanodeProtocolPB rpcProxy = RPC.getProtocolProxy(
> ReconDatanodeProtocolPB.class, version,
> address, UserGroupInformation.getCurrentUser(), hadoopConfig,
> NetUtils.getDefaultSocketFactory(hadoopConfig), getRpcTimeout(),
> retryPolicy).getProxy();
> {code}
>  
> The executorService in DatanodeStateMachine is 
> Executors.newFixedThreadPool(...), whose default pool size is 2, one for 
> Recon, another for SCM.
>  
> When an RPC failure is encountered, call() of RegisterEndpointTask, 
> VersionEndpointTask and HeartbeatEndpointTask will retry while holding 
> rpcEndpoint.lock(). For example:
> {code:java}
> public EndpointStateMachine.EndPointStates call() throws Exception {
>   rpcEndpoint.lock();
>   try {
> 
> SCMHeartbeatResponseProto reponse = rpcEndpoint.getEndPoint()
> .sendHeartbeat(request);
> 
>   } finally {
> rpcEndpoint.unlock();
>   }
>   return rpcEndpoint.getState();
> }
> {code}
>  
> *The problem is:*
> If you set up one Recon and one SCM and then shut down the Recon server, all 
> Datanodes will become stale/dead very soon on the SCM side.
>  
> *The root cause is:*
> The thread running the Recon task keeps retrying on the RPC failure while 
> holding the lock of the EndpointStateMachine for Recon. When 
> DatanodeStateMachine schedules the next round of SCM/Recon tasks, the only 
> remaining thread is assigned to the Recon task and blocks waiting for the 
> lock of the EndpointStateMachine for Recon.
> {code:java}
> public EndpointStateMachine.EndPointStates call() throws Exception {
>   rpcEndpoint.lock();
>   ...{code}
>  
> *The solution is:*
> Since DatanodeStateMachine will periodically schedule SCM/Recon tasks, we may 
> adjust the RetryPolicy so that it won't retry for longer than 1 min.
> The change has no side effect:
> 1) VersionEndpointTask.call() is fine
> 2) RegisterEndpointTask.call() will query containerReport, nodeReport, 
> pipelineReports from OzoneContainer, which is fine.
> 3) HeartbeatEndpointTask.call() will putBackReports(), which is fine.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-4186) Adjust RetryPolicy of SCMConnectionManager for SCM/Recon

2020-09-01 Thread Glen Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-4186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Glen Geng updated HDDS-4186:

Summary: Adjust RetryPolicy of SCMConnectionManager for SCM/Recon  (was: 
Adjust RetryPolicy of SCMConnectionManager)

> Adjust RetryPolicy of SCMConnectionManager for SCM/Recon
> 
>
> Key: HDDS-4186
> URL: https://issues.apache.org/jira/browse/HDDS-4186
> Project: Hadoop Distributed Data Store
>  Issue Type: Improvement
>Reporter: Glen Geng
>Assignee: Glen Geng
>Priority: Blocker
>
> Current RetryPolicy of Datanode for SCM is retryForeverWithFixedSleep:
> {code:java}
> RetryPolicy retryPolicy =
>  RetryPolicies.retryForeverWithFixedSleep(
>  1000, TimeUnit.MILLISECONDS);
> StorageContainerDatanodeProtocolPB rpcProxy = RPC.getProtocolProxy(
> StorageContainerDatanodeProtocolPB.class, version,
> address, UserGroupInformation.getCurrentUser(), hadoopConfig,
> NetUtils.getDefaultSocketFactory(hadoopConfig), getRpcTimeout(),
> retryPolicy).getProxy();{code}
>  that for Recon is retryUpToMaximumCountWithFixedSleep:
> {code:java}
> RetryPolicy retryPolicy =
> RetryPolicies.retryUpToMaximumCountWithFixedSleep(10,
> 6, TimeUnit.MILLISECONDS);
> ReconDatanodeProtocolPB rpcProxy = RPC.getProtocolProxy(
> ReconDatanodeProtocolPB.class, version,
> address, UserGroupInformation.getCurrentUser(), hadoopConfig,
> NetUtils.getDefaultSocketFactory(hadoopConfig), getRpcTimeout(),
> retryPolicy).getProxy();
> {code}
>  
> The executorService in DatanodeStateMachine is 
> Executors.newFixedThreadPool(...), whose default pool size is 2, one for 
> Recon, another for SCM.
>  
> When an RPC failure is encountered, call() of RegisterEndpointTask, 
> VersionEndpointTask and HeartbeatEndpointTask will retry while holding 
> rpcEndpoint.lock(). For example:
> {code:java}
> public EndpointStateMachine.EndPointStates call() throws Exception {
>   rpcEndpoint.lock();
>   try {
> 
> SCMHeartbeatResponseProto reponse = rpcEndpoint.getEndPoint()
> .sendHeartbeat(request);
> 
>   } finally {
> rpcEndpoint.unlock();
>   }
>   return rpcEndpoint.getState();
> }
> {code}
>  
> *The problem is:*
> If you set up one Recon and one SCM and then shut down the Recon server, all 
> Datanodes will become stale/dead very soon on the SCM side.
>  
> *The root cause is:*
> The thread running the Recon task keeps retrying on the RPC failure while 
> holding the lock of the EndpointStateMachine for Recon. When 
> DatanodeStateMachine schedules the next round of SCM/Recon tasks, the only 
> remaining thread is assigned to the Recon task and blocks waiting for the 
> lock of the EndpointStateMachine for Recon.
> {code:java}
> public EndpointStateMachine.EndPointStates call() throws Exception {
>   rpcEndpoint.lock();
>   ...{code}
>  
> *The solution is:*
> Since DatanodeStateMachine will periodically schedule SCM/Recon tasks, we may 
> adjust the RetryPolicy so that it won't retry for longer than 1 min.
> The change has no side effect:
> 1) VersionEndpointTask.call() is fine
> 2) RegisterEndpointTask.call() will query containerReport, nodeReport, 
> pipelineReports from OzoneContainer, which is fine.
> 3) HeartbeatEndpointTask.call() will putBackReports(), which is fine.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-4186) Adjust RetryPolicy of SCMConnectionManager

2020-09-01 Thread Glen Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-4186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Glen Geng updated HDDS-4186:

Description: 
Current RetryPolicy of Datanode for SCM is retryForeverWithFixedSleep:
{code:java}
RetryPolicy retryPolicy =
 RetryPolicies.retryForeverWithFixedSleep(
 1000, TimeUnit.MILLISECONDS);

StorageContainerDatanodeProtocolPB rpcProxy = RPC.getProtocolProxy(
StorageContainerDatanodeProtocolPB.class, version,
address, UserGroupInformation.getCurrentUser(), hadoopConfig,
NetUtils.getDefaultSocketFactory(hadoopConfig), getRpcTimeout(),
retryPolicy).getProxy();{code}
 that for Recon is retryUpToMaximumCountWithFixedSleep:
{code:java}
RetryPolicy retryPolicy =
RetryPolicies.retryUpToMaximumCountWithFixedSleep(10,
6, TimeUnit.MILLISECONDS);

ReconDatanodeProtocolPB rpcProxy = RPC.getProtocolProxy(
ReconDatanodeProtocolPB.class, version,
address, UserGroupInformation.getCurrentUser(), hadoopConfig,
NetUtils.getDefaultSocketFactory(hadoopConfig), getRpcTimeout(),
retryPolicy).getProxy();
{code}
 

The executorService in DatanodeStateMachine is 
Executors.newFixedThreadPool(...), whose default pool size is 2, one for Recon, 
another for SCM.

 

When an RPC failure is encountered, call() of RegisterEndpointTask, 
VersionEndpointTask and HeartbeatEndpointTask will retry while holding 
rpcEndpoint.lock(). For example:
{code:java}
public EndpointStateMachine.EndPointStates call() throws Exception {
  rpcEndpoint.lock();

  try {


SCMHeartbeatResponseProto reponse = rpcEndpoint.getEndPoint()
.sendHeartbeat(request);


  } finally {
rpcEndpoint.unlock();
  }

  return rpcEndpoint.getState();
}
{code}
 

*The problem is:*

If you set up one Recon and one SCM and then shut down the Recon server, all 
Datanodes will become stale/dead very soon on the SCM side.

 

*The root cause is:*

The thread running the Recon task keeps retrying on the RPC failure while 
holding the lock of the EndpointStateMachine for Recon. When 
DatanodeStateMachine schedules the next round of SCM/Recon tasks, the only 
remaining thread is assigned to the Recon task and blocks waiting for the lock 
of the EndpointStateMachine for Recon.
{code:java}
public EndpointStateMachine.EndPointStates call() throws Exception {
  rpcEndpoint.lock();
  ...{code}
 

*The solution is:*

Since DatanodeStateMachine will periodically schedule SCM/Recon tasks, we may 
adjust the RetryPolicy so that it won't retry for longer than 1 min.

The change has no side effect:

1) VersionEndpointTask.call() is fine

2) RegisterEndpointTask.call() will query containerReport, nodeReport, 
pipelineReports from OzoneContainer, which is fine.

3) HeartbeatEndpointTask.call() will putBackReports(), which is fine.

 

  was:
Current RetryPolicy of Datanode for SCM is retryForeverWithFixedSleep:
{code:java}
RetryPolicy retryPolicy =
 RetryPolicies.retryForeverWithFixedSleep(
 1000, TimeUnit.MILLISECONDS);

StorageContainerDatanodeProtocolPB rpcProxy = RPC.getProtocolProxy(
StorageContainerDatanodeProtocolPB.class, version,
address, UserGroupInformation.getCurrentUser(), hadoopConfig,
NetUtils.getDefaultSocketFactory(hadoopConfig), getRpcTimeout(),
retryPolicy).getProxy();{code}
 that for Recon is retryUpToMaximumCountWithFixedSleep:
{code:java}
RetryPolicy retryPolicy =
RetryPolicies.retryUpToMaximumCountWithFixedSleep(10,
6, TimeUnit.MILLISECONDS);

ReconDatanodeProtocolPB rpcProxy = RPC.getProtocolProxy(
ReconDatanodeProtocolPB.class, version,
address, UserGroupInformation.getCurrentUser(), hadoopConfig,
NetUtils.getDefaultSocketFactory(hadoopConfig), getRpcTimeout(),
retryPolicy).getProxy();
{code}
 

The executorService in DatanodeStateMachine is 
Executors.newFixedThreadPool(...), whose default pool size is 2, one for Recon, 
another for SCM.

 

When encounter rpc failure, call() of RegisterEndpointTask, 
VersionEndpointTask, HeartbeatEndpointTask will retry while holding the 
rpcEndpoint.lock(). For example:

 
{code:java}
public EndpointStateMachine.EndPointStates call() throws Exception {
  rpcEndpoint.lock();

  try {


SCMHeartbeatResponseProto reponse = rpcEndpoint.getEndPoint()
.sendHeartbeat(request);


  } finally {
rpcEndpoint.unlock();
  }

  return rpcEndpoint.getState();
}
{code}
 

 

*The problem is:* 

If setup one Recon and one SCM, then shutdown the Recon server, all Datanodes 
will be stale/dead very soon at SCM side.

 

*The root cause is:*

The thread running Recon task will retry due to rpc failure, meanwhile holds 
the lock of EndpointStateMachine for Recon. When DatanodeStateMachine schedule 
the next round of SCM/Recon task, the only left thread will be assigned to run 
Recon task, and blocked at waiting for the lock of EndpointStateMachine for 
Recon.

 
{code:java}
public EndpointStateMachine.EndPointStates call() throws Exception {
  rpcEn

[jira] [Updated] (HDDS-4186) Adjust RetryPolicy of SCMConnectionManager

2020-09-01 Thread Glen Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-4186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Glen Geng updated HDDS-4186:

Description: 
Current RetryPolicy of Datanode for SCM is retryForeverWithFixedSleep:
{code:java}
RetryPolicy retryPolicy =
 RetryPolicies.retryForeverWithFixedSleep(
 1000, TimeUnit.MILLISECONDS);

StorageContainerDatanodeProtocolPB rpcProxy = RPC.getProtocolProxy(
StorageContainerDatanodeProtocolPB.class, version,
address, UserGroupInformation.getCurrentUser(), hadoopConfig,
NetUtils.getDefaultSocketFactory(hadoopConfig), getRpcTimeout(),
retryPolicy).getProxy();{code}
 that for Recon is retryUpToMaximumCountWithFixedSleep:
{code:java}
RetryPolicy retryPolicy =
RetryPolicies.retryUpToMaximumCountWithFixedSleep(10,
6, TimeUnit.MILLISECONDS);

ReconDatanodeProtocolPB rpcProxy = RPC.getProtocolProxy(
ReconDatanodeProtocolPB.class, version,
address, UserGroupInformation.getCurrentUser(), hadoopConfig,
NetUtils.getDefaultSocketFactory(hadoopConfig), getRpcTimeout(),
retryPolicy).getProxy();
{code}
 

The executorService in DatanodeStateMachine is 
Executors.newFixedThreadPool(...), whose default pool size is 2, one for Recon, 
another for SCM.

 

When an RPC failure is encountered, call() of RegisterEndpointTask, 
VersionEndpointTask and HeartbeatEndpointTask will retry while holding 
rpcEndpoint.lock(). For example:

 
{code:java}
public EndpointStateMachine.EndPointStates call() throws Exception {
  rpcEndpoint.lock();

  try {


SCMHeartbeatResponseProto reponse = rpcEndpoint.getEndPoint()
.sendHeartbeat(request);


  } finally {
rpcEndpoint.unlock();
  }

  return rpcEndpoint.getState();
}
{code}
 

 

*The problem is:* 

If you set up one Recon and one SCM and then shut down the Recon server, all 
Datanodes will become stale/dead very soon on the SCM side.

 

*The root cause is:*

The thread running the Recon task keeps retrying on the RPC failure while 
holding the lock of the EndpointStateMachine for Recon. When 
DatanodeStateMachine schedules the next round of SCM/Recon tasks, the only 
remaining thread is assigned to the Recon task and blocks waiting for the lock 
of the EndpointStateMachine for Recon.

 
{code:java}
public EndpointStateMachine.EndPointStates call() throws Exception {
  rpcEndpoint.lock();
  ...{code}
 

 

*The solution is:*

Since DatanodeStateMachine will periodically schedule SCM/Recon tasks, we may 
adjust the RetryPolicy so that it won't retry for longer than 1 min.

The change has no side effect:

1) VersionEndpointTask.call() is fine

2) RegisterEndpointTask.call() will query containerReport, nodeReport, 
pipelineReports from OzoneContainer, which is fine.

3) HeartbeatEndpointTask.call() will putBackReports(), which is fine.

 

  was:
Current RetryPolicy of Datanode for SCM is retryForeverWithFixedSleep:
{code:java}
RetryPolicy retryPolicy =
 RetryPolicies.retryForeverWithFixedSleep(
 1000, TimeUnit.MILLISECONDS);

StorageContainerDatanodeProtocolPB rpcProxy = RPC.getProtocolProxy(
StorageContainerDatanodeProtocolPB.class, version,
address, UserGroupInformation.getCurrentUser(), hadoopConfig,
NetUtils.getDefaultSocketFactory(hadoopConfig), getRpcTimeout(),
retryPolicy).getProxy();{code}
 

for Recon is retryUpToMaximumCountWithFixedSleep:
{code:java}
RetryPolicy retryPolicy =
RetryPolicies.retryUpToMaximumCountWithFixedSleep(10,
6, TimeUnit.MILLISECONDS);
ReconDatanodeProtocolPB rpcProxy = RPC.getProtocolProxy(
ReconDatanodeProtocolPB.class, version,
address, UserGroupInformation.getCurrentUser(), hadoopConfig,
NetUtils.getDefaultSocketFactory(hadoopConfig), getRpcTimeout(),
retryPolicy).getProxy();
{code}
The executorService in DatanodeStateMachine is now 
Executors.newFixedThreadPool(...), whose pool size is 2, one for Recon, another 
for SCM.

 

When encounter rpc failure, call() of RegisterEndpointTask, 
VersionEndpointTask, HeartbeatEndpointTask will retry while holding the 
rpcEndpoint.lock().

Here is the problem: if setup one Recon and one SCM, then shutdown the Recon 
server, all Datanodes will be stale/dead very soon. The root cause is that, the 
thread working for Recon will retry while holding the lock of 
EndpointStateMachine for Recon, when DatanodeStateMachine schedule the next 
round of task, the other thread is blocked by waiting for the lock of 
EndpointStateMachine for Recon.

 

Since DatanodeStateMachine will periodically schedule tasks, we may adjust 
RetryPolicy so that the execution of tasks no need to be longer than 1min.

 


> Adjust RetryPolicy of SCMConnectionManager
> --
>
> Key: HDDS-4186
> URL: https://issues.apache.org/jira/browse/HDDS-4186
> Project: Hadoop Distributed Data Store
>  Issue Type: Improvement
>Reporter: Glen Geng
>Assignee: Glen Geng
>

[jira] [Updated] (HDDS-4186) Adjust RetryPolicy of SCMConnectionManager

2020-09-01 Thread Glen Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-4186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Glen Geng updated HDDS-4186:

Description: 
Current RetryPolicy of Datanode for SCM is retryForeverWithFixedSleep:
{code:java}
RetryPolicy retryPolicy =
 RetryPolicies.retryForeverWithFixedSleep(
 1000, TimeUnit.MILLISECONDS);

StorageContainerDatanodeProtocolPB rpcProxy = RPC.getProtocolProxy(
StorageContainerDatanodeProtocolPB.class, version,
address, UserGroupInformation.getCurrentUser(), hadoopConfig,
NetUtils.getDefaultSocketFactory(hadoopConfig), getRpcTimeout(),
retryPolicy).getProxy();{code}
 

for Recon is retryUpToMaximumCountWithFixedSleep:
{code:java}
RetryPolicy retryPolicy =
RetryPolicies.retryUpToMaximumCountWithFixedSleep(10,
6, TimeUnit.MILLISECONDS);
ReconDatanodeProtocolPB rpcProxy = RPC.getProtocolProxy(
ReconDatanodeProtocolPB.class, version,
address, UserGroupInformation.getCurrentUser(), hadoopConfig,
NetUtils.getDefaultSocketFactory(hadoopConfig), getRpcTimeout(),
retryPolicy).getProxy();
{code}
The executorService in DatanodeStateMachine is now 
Executors.newFixedThreadPool(...), whose pool size is 2, one for Recon, another 
for SCM.

 

When an RPC failure is encountered, call() of RegisterEndpointTask, 
VersionEndpointTask and HeartbeatEndpointTask will retry while holding 
rpcEndpoint.lock().

Here is the problem: if you set up one Recon and one SCM and then shut down the 
Recon server, all Datanodes will become stale/dead very soon. The root cause is 
that the thread working for Recon keeps retrying while holding the lock of the 
EndpointStateMachine for Recon; when DatanodeStateMachine schedules the next 
round of tasks, the other thread is blocked waiting for the lock of the 
EndpointStateMachine for Recon.

 

Since DatanodeStateMachine will periodically schedule tasks, we may adjust the 
RetryPolicy so that the execution of a task need not take longer than 1 min.

 

  was:
Teragen reported to be slow with low number of mappers compared to HDFS.

In my test (one pipeline, 3 yarn nodes) 10 g teragen with HDFS was ~3 mins but 
with Ozone it was 6 mins. It could be fixed with using more mappers, but when I 
investigated the execution I found a few problems regarding the BufferPool 
management.

 1. IncrementalChunkBuffer is slow and it might not be required as BufferPool 
itself is incremental
 2. For each write operation the bufferPool.allocateBufferIfNeeded is called 
which can be a slow operation (positions should be calculated).
 3. There is no explicit support for write(byte) operations

In the flamegraph it's clearly visible that with low number of mappers the 
client is busy with buffer operations. After the patch the rpc call and the 
checksum calculation give the majority of the time. 


> Adjust RetryPolicy of SCMConnectionManager
> --
>
> Key: HDDS-4186
> URL: https://issues.apache.org/jira/browse/HDDS-4186
> Project: Hadoop Distributed Data Store
>  Issue Type: Improvement
>Reporter: Glen Geng
>Assignee: Glen Geng
>Priority: Blocker
>
> Current RetryPolicy of Datanode for SCM is retryForeverWithFixedSleep:
> {code:java}
> RetryPolicy retryPolicy =
>  RetryPolicies.retryForeverWithFixedSleep(
>  1000, TimeUnit.MILLISECONDS);
> StorageContainerDatanodeProtocolPB rpcProxy = RPC.getProtocolProxy(
> StorageContainerDatanodeProtocolPB.class, version,
> address, UserGroupInformation.getCurrentUser(), hadoopConfig,
> NetUtils.getDefaultSocketFactory(hadoopConfig), getRpcTimeout(),
> retryPolicy).getProxy();{code}
>  
> for Recon is retryUpToMaximumCountWithFixedSleep:
> {code:java}
> RetryPolicy retryPolicy =
> RetryPolicies.retryUpToMaximumCountWithFixedSleep(10,
> 6, TimeUnit.MILLISECONDS);
> ReconDatanodeProtocolPB rpcProxy = RPC.getProtocolProxy(
> ReconDatanodeProtocolPB.class, version,
> address, UserGroupInformation.getCurrentUser(), hadoopConfig,
> NetUtils.getDefaultSocketFactory(hadoopConfig), getRpcTimeout(),
> retryPolicy).getProxy();
> {code}
> The executorService in DatanodeStateMachine is now 
> Executors.newFixedThreadPool(...), whose pool size is 2, one for Recon, 
> another for SCM.
>  
> When an RPC failure is encountered, call() of RegisterEndpointTask, 
> VersionEndpointTask and HeartbeatEndpointTask will retry while holding 
> rpcEndpoint.lock().
> Here is the problem: if you set up one Recon and one SCM and then shut down 
> the Recon server, all Datanodes will become stale/dead very soon. The root 
> cause is that the thread working for Recon keeps retrying while holding the 
> lock of the EndpointStateMachine for Recon; when DatanodeStateMachine 
> schedules the next round of tasks, the other thread is blocked waiting for 
> the lock of the EndpointStateMachine for Recon.
>  
> Since DatanodeStateMachine will periodically schedule ta

[jira] [Updated] (HDDS-4186) Adjust RetryPolicy of SCMConnectionManager

2020-09-01 Thread Glen Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-4186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Glen Geng updated HDDS-4186:

Labels:   (was: pull-request-available)

> Adjust RetryPolicy of SCMConnectionManager
> --
>
> Key: HDDS-4186
> URL: https://issues.apache.org/jira/browse/HDDS-4186
> Project: Hadoop Distributed Data Store
>  Issue Type: Improvement
>Reporter: Glen Geng
>Assignee: Glen Geng
>Priority: Blocker
>
> Teragen reported to be slow with low number of mappers compared to HDFS.
> In my test (one pipeline, 3 yarn nodes) 10 g teragen with HDFS was ~3 mins 
> but with Ozone it was 6 mins. It could be fixed with using more mappers, but 
> when I investigated the execution I found a few problems regarding the 
> BufferPool management.
>  1. IncrementalChunkBuffer is slow and it might not be required as BufferPool 
> itself is incremental
>  2. For each write operation the bufferPool.allocateBufferIfNeeded is called 
> which can be a slow operation (positions should be calculated).
>  3. There is no explicit support for write(byte) operations
> In the flamegraph it's clearly visible that with low number of mappers the 
> client is busy with buffer operations. After the patch the rpc call and the 
> checksum calculation give the majority of the time. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-4186) Adjust RetryPolicy of SCMConnectionManager

2020-09-01 Thread Glen Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-4186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Glen Geng updated HDDS-4186:

Summary: Adjust RetryPolicy of SCMConnectionManager  (was: CLONE - Improve 
performance of the BufferPool management of Ozone client)

> Adjust RetryPolicy of SCMConnectionManager
> --
>
> Key: HDDS-4186
> URL: https://issues.apache.org/jira/browse/HDDS-4186
> Project: Hadoop Distributed Data Store
>  Issue Type: Improvement
>Reporter: Glen Geng
>Assignee: Glen Geng
>Priority: Blocker
>  Labels: pull-request-available
>
> Teragen reported to be slow with low number of mappers compared to HDFS.
> In my test (one pipeline, 3 yarn nodes) 10 g teragen with HDFS was ~3 mins 
> but with Ozone it was 6 mins. It could be fixed with using more mappers, but 
> when I investigated the execution I found a few problems regarding the 
> BufferPool management.
>  1. IncrementalChunkBuffer is slow and it might not be required as BufferPool 
> itself is incremental
>  2. For each write operation the bufferPool.allocateBufferIfNeeded is called 
> which can be a slow operation (positions should be calculated).
>  3. There is no explicit support for write(byte) operations
> In the flamegraph it's clearly visible that with low number of mappers the 
> client is busy with buffer operations. After the patch the rpc call and the 
> checksum calculation give the majority of the time. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-4186) Adjust RetryPolicy of SCMConnectionManager

2020-09-01 Thread Glen Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-4186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Glen Geng updated HDDS-4186:

Target Version/s:   (was: 0.7.0)

> Adjust RetryPolicy of SCMConnectionManager
> --
>
> Key: HDDS-4186
> URL: https://issues.apache.org/jira/browse/HDDS-4186
> Project: Hadoop Distributed Data Store
>  Issue Type: Improvement
>Reporter: Glen Geng
>Assignee: Glen Geng
>Priority: Blocker
>  Labels: pull-request-available
>
> Teragen reported to be slow with low number of mappers compared to HDFS.
> In my test (one pipeline, 3 yarn nodes) 10 g teragen with HDFS was ~3 mins 
> but with Ozone it was 6 mins. It could be fixed with using more mappers, but 
> when I investigated the execution I found a few problems regarding the 
> BufferPool management.
>  1. IncrementalChunkBuffer is slow and it might not be required as BufferPool 
> itself is incremental
>  2. For each write operation the bufferPool.allocateBufferIfNeeded is called 
> which can be a slow operation (positions should be calculated).
>  3. There is no explicit support for write(byte) operations
> In the flamegraph it's clearly visible that with low number of mappers the 
> client is busy with buffer operations. After the patch the rpc call and the 
> checksum calculation give the majority of the time. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDDS-4186) CLONE - Improve performance of the BufferPool management of Ozone client

2020-09-01 Thread Glen Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-4186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Glen Geng reassigned HDDS-4186:
---

Assignee: Glen Geng  (was: Marton Elek)

> CLONE - Improve performance of the BufferPool management of Ozone client
> 
>
> Key: HDDS-4186
> URL: https://issues.apache.org/jira/browse/HDDS-4186
> Project: Hadoop Distributed Data Store
>  Issue Type: Improvement
>Reporter: Glen Geng
>Assignee: Glen Geng
>Priority: Blocker
>  Labels: pull-request-available
>
> Teragen reported to be slow with low number of mappers compared to HDFS.
> In my test (one pipeline, 3 yarn nodes) 10 g teragen with HDFS was ~3 mins 
> but with Ozone it was 6 mins. It could be fixed with using more mappers, but 
> when I investigated the execution I found a few problems regarding the 
> BufferPool management.
>  1. IncrementalChunkBuffer is slow and it might not be required as BufferPool 
> itself is incremental
>  2. For each write operation the bufferPool.allocateBufferIfNeeded is called 
> which can be a slow operation (positions should be calculated).
>  3. There is no explicit support for write(byte) operations
> In the flamegraph it's clearly visible that with low number of mappers the 
> client is busy with buffer operations. After the patch the rpc call and the 
> checksum calculation give the majority of the time. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org



[jira] [Created] (HDDS-4186) CLONE - Improve performance of the BufferPool management of Ozone client

2020-09-01 Thread Glen Geng (Jira)
Glen Geng created HDDS-4186:
---

 Summary: CLONE - Improve performance of the BufferPool management 
of Ozone client
 Key: HDDS-4186
 URL: https://issues.apache.org/jira/browse/HDDS-4186
 Project: Hadoop Distributed Data Store
  Issue Type: Improvement
Reporter: Glen Geng
Assignee: Marton Elek


Teragen reported to be slow with low number of mappers compared to HDFS.

In my test (one pipeline, 3 yarn nodes) 10 g teragen with HDFS was ~3 mins but 
with Ozone it was 6 mins. It could be fixed with using more mappers, but when I 
investigated the execution I found a few problems regarding the BufferPool 
management.

 1. IncrementalChunkBuffer is slow and it might not be required as BufferPool 
itself is incremental
 2. For each write operation the bufferPool.allocateBufferIfNeeded is called 
which can be a slow operation (positions should be calculated).
 3. There is no explicit support for write(byte) operations

In the flamegraph it's clearly visible that with low number of mappers the 
client is busy with buffer operations. After the patch the rpc call and the 
checksum calculation give the majority of the time. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-4136) Design for Error/Exception handling in state update for container/pipeline V2

2020-08-24 Thread Glen Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-4136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Glen Geng updated HDDS-4136:

Description: 
I have a concern about how to handle exceptions that occur while writing to 
RocksDB in container V2, such as in allocateContainer, deleteContainer and 
updateContainerState.

In the non-HA case, allocateContainer reverts the memory state changes if it 
meets an IOException during DB operations. deleteContainer and 
updateContainerState just throw out the IOException and leave the memory state 
in an inconsistent state.

After we enable SCM-HA, if the leader SCM succeeds in the operation while some 
follower SCM fails due to a DB exception, what can we do to ensure that the 
states of the leader and followers won't diverge, i.e., to keep the 
StateMachine truly replicated across the leader and followers?

We have to ensure the Atomicity of ACID for state updates: if any exception 
occurs, the SCM (no matter leader or follower) should throw the exception and 
keep its states unchanged. No partial change is allowed, so that the leader SCM 
can safely revert the state change for the whole Raft group.

The above analysis also applies to pipeline V2, etc.
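
(A minimal sketch of the intended all-or-nothing update on each replica, assuming an atomic key-value commit such as a RocksDB write batch — the names below are illustrative, not the actual ContainerStateManagerV2 code:)
{code:java}
// Hedged sketch, not Ozone code: commit the DB change atomically first and
// touch the in-memory map only after the DB write succeeded, so a failing
// replica throws and leaves both states unchanged (no partial change).
import java.io.IOException;
import java.util.Map;

final class ContainerStateUpdateSketch {

  /** Illustrative store abstraction; stands in for an atomic RocksDB commit. */
  interface KeyValueStore {
    void putAtomically(String key, String value) throws IOException;
  }

  private final Map<Long, String> containerStates;   // in-memory container map
  private final KeyValueStore db;                     // column family in RocksDB

  ContainerStateUpdateSketch(Map<Long, String> states, KeyValueStore db) {
    this.containerStates = states;
    this.db = db;
  }

  void updateContainerState(long containerId, String newState) throws IOException {
    // 1) Atomic disk update; if this throws, nothing has been modified yet and
    //    the exception propagates so the leader can revert for the raft group.
    db.putAtomically(Long.toString(containerId), newState);
    // 2) Memory update only after the disk write succeeded, so memory and disk
    //    cannot diverge on this replica.
    containerStates.put(containerId, newState);
  }
}
{code}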

  was:
I have a concern about how to handle exceptions occurred in writing RocksDB for 
container V2, such as allocateContainer, deleteContainer and 
updateContainerState.

For non-HA case, allocateContainer reverts the memory state changes if meet 
IOException for db operations. deleteContainer and updateContainerState just 
throw out the IOException and leave the memory state in an inconsistency state.

After we enable SCM-HA, if leader SCM succeed the operation, meanwhile any 
follower SCM fails due to db exception, what can we do to ensure that states of 
leader and followers won't diverge, a.k.a. ensure the replicated StateMachine 
for leader and followers ?

We have to ensure Atomicity of ACID for state update. If any exception 
occurred, SCM (no matter leader or follower) should throw exception and keep 
states unchanged, so that leader SCM can safely revert the state change for the 
whole raft groups.

Above analysis also applies to pipeline V2 and etc.


> Design for Error/Exception handling in state update for container/pipeline V2
> -
>
> Key: HDDS-4136
> URL: https://issues.apache.org/jira/browse/HDDS-4136
> Project: Hadoop Distributed Data Store
>  Issue Type: Sub-task
>  Components: SCM
>Reporter: Glen Geng
>Assignee: Glen Geng
>Priority: Major
>
> I have a concern about how to handle exceptions that occur while writing to 
> RocksDB in container V2, such as in allocateContainer, deleteContainer and 
> updateContainerState.
> In the non-HA case, allocateContainer reverts the memory state changes if it 
> meets an IOException during DB operations. deleteContainer and 
> updateContainerState just throw out the IOException and leave the memory 
> state in an inconsistent state.
> After we enable SCM-HA, if the leader SCM succeeds in the operation while 
> some follower SCM fails due to a DB exception, what can we do to ensure that 
> the states of the leader and followers won't diverge, i.e., to keep the 
> StateMachine truly replicated across the leader and followers?
> We have to ensure the Atomicity of ACID for state updates: if any exception 
> occurs, the SCM (no matter leader or follower) should throw the exception and 
> keep its states unchanged. No partial change is allowed, so that the leader 
> SCM can safely revert the state change for the whole Raft group.
> The above analysis also applies to pipeline V2, etc.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-4136) Design for Error/Exception handling in state update for container/pipeline V2

2020-08-24 Thread Glen Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-4136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Glen Geng updated HDDS-4136:

Description: 
I have a concern about how to handle exceptions that occur while writing to 
RocksDB in container V2, such as in allocateContainer, deleteContainer and 
updateContainerState.

In the non-HA case, allocateContainer reverts the memory state changes if it 
meets an IOException during DB operations. deleteContainer and 
updateContainerState just throw out the IOException and leave the memory state 
in an inconsistent state.

After we enable SCM-HA, if the leader SCM succeeds in the operation while some 
follower SCM fails due to a DB exception, what can we do to ensure that the 
states of the leader and followers won't diverge, i.e., to keep the 
StateMachine truly replicated across the leader and followers?

We have to ensure the Atomicity of ACID for state updates: if any exception 
occurs, the SCM (no matter leader or follower) should throw the exception and 
keep its states unchanged. No partial change is allowed, so that the leader SCM 
can safely revert the state change for the whole Raft group.

The above analysis also applies to pipeline V2 and to other issues besides disk 
failure.

  was:
I have a concern about how to handle exceptions occurred in writing RocksDB for 
container V2, such as allocateContainer, deleteContainer and 
updateContainerState.

For non-HA case, allocateContainer reverts the memory state changes if meet 
IOException for db operations. deleteContainer and updateContainerState just 
throw out the IOException and leave the memory state in an inconsistency state.

After we enable SCM-HA, if leader SCM succeed the operation, meanwhile any 
follower SCM fails due to db exception, what can we do to ensure that states of 
leader and followers won't diverge, a.k.a. ensure the replicated StateMachine 
for leader and followers ?

We have to ensure Atomicity of ACID for state update: If any exception 
occurred, SCM (no matter leader or follower) should throw exception and keep 
states unchanged. No partial change is allowed so that leader SCM can safely 
revert the state change for the whole raft groups.

Above analysis also applies to pipeline V2 and etc.


> Design for Error/Exception handling in state update for container/pipeline V2
> -
>
> Key: HDDS-4136
> URL: https://issues.apache.org/jira/browse/HDDS-4136
> Project: Hadoop Distributed Data Store
>  Issue Type: Sub-task
>  Components: SCM
>Reporter: Glen Geng
>Assignee: Glen Geng
>Priority: Major
>
> I have a concern about how to handle exceptions occurred in writing RocksDB 
> for container V2, such as allocateContainer, deleteContainer and 
> updateContainerState.
> For non-HA case, allocateContainer reverts the memory state changes if meet 
> IOException for db operations. deleteContainer and updateContainerState just 
> throw out the IOException and leave the memory state in an inconsistency 
> state.
> After we enable SCM-HA, if leader SCM succeed the operation, meanwhile any 
> follower SCM fails due to db exception, what can we do to ensure that states 
> of leader and followers won't diverge, a.k.a. ensure the replicated 
> StateMachine for leader and followers ?
> We have to ensure Atomicity of ACID for state update: If any exception 
> occurred, SCM (no matter leader or follower) should throw exception and keep 
> states unchanged. No partial change is allowed so that leader SCM can safely 
> revert the state change for the whole raft groups.
> Above analysis also applies to pipeline V2 and other issues besides disk 
> failure.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-4136) Design for Error/Exception handling in state update for container/pipeline V2

2020-08-24 Thread Glen Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-4136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Glen Geng updated HDDS-4136:

Description: 
I have a concern about how to handle exceptions occurred in writing RocksDB for 
container V2, such as allocateContainer, deleteContainer and 
updateContainerState.

For non-HA case, allocateContainer reverts the memory state changes if meet 
IOException for db operations. deleteContainer and updateContainerState just 
throw out the IOException and leave the memory state in an inconsistency state.

After we enable SCM-HA, if leader SCM succeed the operation, meanwhile any 
follower SCM fails due to db exception, what can we do to ensure that states of 
leader and followers won't diverge, a.k.a. ensure the replicated StateMachine 
for leader and followers ?

We have to ensure Atomicity of ACID for state update. If any exception 
occurred, SCM (no matter leader or follower) should throw exception and keep 
states unchanged, so that leader SCM can safely revert the state change for the 
whole raft groups.

Above analysis also applies to pipeline V2 and etc.

  was:
I have a concern about how to handle exceptions occurred in writing RocksDB for 
container V2, such as allocateContainer, deleteContainer and 
updateContainerState.

For non-HA case, allocateContainer reverts the memory state changes if meet 
IOException for db operations. deleteContainer and updateContainerState just 
throw out the IOException and leave the memory state in an inconsistency state.

After we enable SCM-HA, if leader SCM succeed the operation, meanwhile any 
follower SCM fails due to db exception, what can we do to ensure that states of 
leader and followers won't diverge, a.k.a. ensure the replicated StateMachine 
for leader and followers.

We have to ensure Atomicity of ACID for state update. If any exception 
occurred, SCM (no matter leader or follower) should throw exception and keep 
states unchanged, so that leader SCM can safely revert the state change for the 
whole raft groups.

Above analysis also applies to pipeline V2 and etc.


> Design for Error/Exception handling in state update for container/pipeline V2
> -
>
> Key: HDDS-4136
> URL: https://issues.apache.org/jira/browse/HDDS-4136
> Project: Hadoop Distributed Data Store
>  Issue Type: Sub-task
>  Components: SCM
>Reporter: Glen Geng
>Assignee: Glen Geng
>Priority: Major
>
> I have a concern about how to handle exceptions occurred in writing RocksDB 
> for container V2, such as allocateContainer, deleteContainer and 
> updateContainerState.
> For non-HA case, allocateContainer reverts the memory state changes if meet 
> IOException for db operations. deleteContainer and updateContainerState just 
> throw out the IOException and leave the memory state in an inconsistency 
> state.
> After we enable SCM-HA, if leader SCM succeed the operation, meanwhile any 
> follower SCM fails due to db exception, what can we do to ensure that states 
> of leader and followers won't diverge, a.k.a. ensure the replicated 
> StateMachine for leader and followers ?
> We have to ensure Atomicity of ACID for state update. If any exception 
> occurred, SCM (no matter leader or follower) should throw exception and keep 
> states unchanged, so that leader SCM can safely revert the state change for 
> the whole raft groups.
> Above analysis also applies to pipeline V2 and etc.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-4136) Design for Error/Exception handling in state update for container/pipeline V2

2020-08-24 Thread Glen Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-4136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Glen Geng updated HDDS-4136:

Description: 
I have a concern about how to handle exceptions occurred in writing RocksDB for 
container V2, such as allocateContainer, deleteContainer and 
updateContainerState.

For non-HA case, allocateContainer reverts the memory state changes if meet 
IOException for db operations. deleteContainer and updateContainerState just 
throw out the IOException and leave the memory state in an inconsistency state.

After we enable SCM-HA, if leader SCM succeed the operation, meanwhile any 
follower SCM fails due to db exception, what can we do to ensure that states of 
leader and followers won't diverge, a.k.a. ensure the replicated StateMachine 
for leader and followers.

We have to ensure Atomicity of ACID for state update. If any exception 
occurred, SCM (no matter leader or follower) should throw exception and keep 
states unchanged, so that leader SCM can safely revert the state change for the 
whole raft groups.

Above analysis also applies to pipeline V2 and etc.

  was:
I have a concern about how to handle exceptions occurred in writing RocksDB for 
container V2, such as allocateContainer, deleteContainer and 
updateContainerState.

For non-HA case, allocateContainer reverts the memory state changes if meet 
IOException for db operations. deleteContainer and updateContainerState just 
throw out the IOException and leave the memory state in an inconsistency state.

After we enable SCM-HA, if Leader SCM succeed the operation, meanwhile any 
Follower SCM fails due to db exception, what can we do to ensure that states of 
leader and follower won't diverge, a.k.a., ensure the replicated state machine 
for leader and folowers.

We have to ensure Atomicity of ACID for state update. If any exception 
occurred, SCM (no matter leader or follower) should throw exception and keep 
states unchanged, so that leader SCM can safely revert the state change for the 
whole raft groups.

Above analysis also applies to pipeline V2 and etc.


> Design for Error/Exception handling in state update for container/pipeline V2
> -
>
> Key: HDDS-4136
> URL: https://issues.apache.org/jira/browse/HDDS-4136
> Project: Hadoop Distributed Data Store
>  Issue Type: Sub-task
>  Components: SCM
>Reporter: Glen Geng
>Assignee: Glen Geng
>Priority: Major
>
> I have a concern about how to handle exceptions occurred in writing RocksDB 
> for container V2, such as allocateContainer, deleteContainer and 
> updateContainerState.
> For non-HA case, allocateContainer reverts the memory state changes if meet 
> IOException for db operations. deleteContainer and updateContainerState just 
> throw out the IOException and leave the memory state in an inconsistency 
> state.
> After we enable SCM-HA, if leader SCM succeed the operation, meanwhile any 
> follower SCM fails due to db exception, what can we do to ensure that states 
> of leader and followers won't diverge, a.k.a. ensure the replicated 
> StateMachine for leader and followers.
> We have to ensure Atomicity of ACID for state update. If any exception 
> occurred, SCM (no matter leader or follower) should throw exception and keep 
> states unchanged, so that leader SCM can safely revert the state change for 
> the whole raft groups.
> Above analysis also applies to pipeline V2 and etc.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-4136) Design for Error/Exception handling in state update for container/pipeline V2

2020-08-24 Thread Glen Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-4136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Glen Geng updated HDDS-4136:

Description: 
I have a concern about how to handle exceptions occurred in writing RocksDB for 
container V2, such as allocateContainer, deleteContainer and 
updateContainerState.

For non-HA case, allocateContainer reverts the memory state changes if meet 
IOException for db operations. deleteContainer and updateContainerState just 
throw out the IOException and leave the memory state in an inconsistency state.

After we enable SCM-HA, if Leader SCM succeed the operation, meanwhile any 
Follower SCM fails due to db exception, what can we do to ensure that states of 
leader and follower won't diverge, a.k.a., ensure the replicated state machine 
for leader and folowers.

We have to ensure Atomicity of ACID for state update. If any exception 
occurred, SCM (no matter leader or follower) should throw exception and keep 
states unchanged, so that leader SCM can safely revert the state change for the 
whole raft groups.

Above analysis also applies to pipeline V2 and etc.

  was:
I have a concern about how to handle exceptions occurred in writing RocksDB for 
container V2, such as allocateContainer, deleteContainer and 
updateContainerState.

For non-HA case, allocateContainer reverts the memory state changes if meet 
IOException for db operations. deleteContainer and updateContainerState just 
throw out the IOException and leave the memory state in an inconsistency state.

After we enable SCM-HA, if Leader SCM succeed the operation, meanwhile any 
Follower SCM fails due to db exception, what can we do to ensure that states of 
leader and follower won't diverge, a.k.a., ensure the replicated state machine 
for leader and folowers.

We have to ensure Atomicity of ACID for state update. If any exception 
occurred, SCM (no matter leader or follower) should throw exception and keep 
states unchanged, so that leader SCM can safely revert the state change for the 
whole raft groups.

Above analysis also applies ot pipeline V2 and etc.


> Design for Error/Exception handling in state update for container/pipeline V2
> -
>
> Key: HDDS-4136
> URL: https://issues.apache.org/jira/browse/HDDS-4136
> Project: Hadoop Distributed Data Store
>  Issue Type: Sub-task
>  Components: SCM
>Reporter: Glen Geng
>Assignee: Glen Geng
>Priority: Major
>
> I have a concern about how to handle exceptions occurred in writing RocksDB 
> for container V2, such as allocateContainer, deleteContainer and 
> updateContainerState.
> For non-HA case, allocateContainer reverts the memory state changes if meet 
> IOException for db operations. deleteContainer and updateContainerState just 
> throw out the IOException and leave the memory state in an inconsistency 
> state.
> After we enable SCM-HA, if Leader SCM succeed the operation, meanwhile any 
> Follower SCM fails due to db exception, what can we do to ensure that states 
> of leader and follower won't diverge, a.k.a., ensure the replicated state 
> machine for leader and followers.
> We have to ensure Atomicity of ACID for state update. If any exception 
> occurred, SCM (no matter leader or follower) should throw exception and keep 
> states unchanged, so that leader SCM can safely revert the state change for 
> the whole raft groups.
> Above analysis also applies to pipeline V2 and etc.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-4136) Design for Error/Exception handling in state update for container/pipeline V2

2020-08-24 Thread Glen Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-4136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Glen Geng updated HDDS-4136:

Description: 
I have a concern about how to handle exceptions occurred in writing RocksDB for 
container V2, such as allocateContainer, deleteContainer and 
updateContainerState.

For non-HA case, allocateContainer reverts the memory state changes if meet 
IOException for db operations. deleteContainer and updateContainerState just 
throw out the IOException and leave the memory state in an inconsistency state.

After we enable SCM-HA, if Leader SCM succeed the operation, meanwhile any 
Follower SCM fails due to db exception, what can we do to ensure that states of 
leader and follower won't diverge, a.k.a., ensure the replicated state machine 
for leader and folowers.

We have to ensure Atomicity of ACID for state update. If any exception 
occurred, SCM (no matter leader or follower) should throw exception and keep 
states unchanged, so that leader SCM can safely revert the state change for the 
whole raft groups.

Above analysis also applies ot pipeline V2 and etc.

  was:
I have a concern about how to handle exceptions occurred in writing RocksDB for 
container V2, such as allocateContainer, deleteContainer and 
updateContainerState.

For non-HA case, allocateContainer reverts the memory state change if meet 
IOException for db operation. deleteContainer and updateContainerState just 
throw out the IOException and leave the memory state in an inconsistency state.

After we enable SCM-HA, if Leader SCM succeed the operation, meanwhile any 
Follower SCM fails due to db exception, what can we do to ensure that states of 
leader and follower won't diverge, a.k.a., ensure the replicated state machine 
for leader and folowers.

We have to ensure Atomicity of ACID for state update. If any exception 
occurred, SCM (no matter leader or follower) should throw exception and keep 
states unchanged, so that leader SCM can safely revert the state change for the 
whole raft groups.

Above analysis also applies ot pipeline V2 and etc.


> Design for Error/Exception handling in state update for container/pipeline V2
> -
>
> Key: HDDS-4136
> URL: https://issues.apache.org/jira/browse/HDDS-4136
> Project: Hadoop Distributed Data Store
>  Issue Type: Sub-task
>  Components: SCM
>Reporter: Glen Geng
>Assignee: Glen Geng
>Priority: Major
>
> I have a concern about how to handle exceptions occurred in writing RocksDB 
> for container V2, such as allocateContainer, deleteContainer and 
> updateContainerState.
> For non-HA case, allocateContainer reverts the memory state changes if meet 
> IOException for db operations. deleteContainer and updateContainerState just 
> throw out the IOException and leave the memory state in an inconsistency 
> state.
> After we enable SCM-HA, if Leader SCM succeed the operation, meanwhile any 
> Follower SCM fails due to db exception, what can we do to ensure that states 
> of leader and follower won't diverge, a.k.a., ensure the replicated state 
> machine for leader and followers.
> We have to ensure Atomicity of ACID for state update. If any exception 
> occurred, SCM (no matter leader or follower) should throw exception and keep 
> states unchanged, so that leader SCM can safely revert the state change for 
> the whole raft groups.
> Above analysis also applies to pipeline V2, etc.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-4136) Design for Error/Exception handling in state update for container/pipeline V2

2020-08-24 Thread Glen Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-4136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Glen Geng updated HDDS-4136:

Description: 
I have a concern about how to handle exceptions occurred in writing RocksDB for 
container V2, such as allocateContainer, deleteContainer and 
updateContainerState.

For non-HA case, allocateContainer reverts the memory state change if meet 
IOException for db operation. deleteContainer and updateContainerState just 
throw out the IOException and leave the memory state in an inconsistency state.

After we enable SCM-HA, if Leader SCM succeed the operation, meanwhile any 
Follower SCM fails due to db exception, what can we do to ensure that states of 
leader and follower won't diverge, a.k.a., ensure the replicated state machine 
for leader and folowers.

We have to ensure Atomicity of ACID for state update. If any exception 
occurred, SCM (no matter leader or follower) should throw exception and keep 
states unchanged, so that leader SCM can safely revert the state change for the 
whole raft groups.

Above analysis also applies ot pipeline V2 and etc.

  was:
I have a concern about how to handling exceptions occurred in writing RocksDB 
for container V2, such as allocateContainer, deleteContainer and 
updateContainerState.

For non-HA case, allocateContainer reverts the memory state change if meet 
IOException for db operation. deleteContainer and updateContainerState just 
throw out the IOException and leave the memory state in an inconsistency state.

After we enable SCM-HA, if Leader SCM succeed the operation, meanwhile any 
Follower SCM fails due to db exception, what can we do to ensure that states of 
leader and follower won't diverge, a.k.a., ensure the replicated state machine 
for leader and folowers.

We have to ensure Atomicity of ACID for state update. If any exception 
occurred, SCM (no matter leader or follower) should throw exception and keep 
states unchanged, so that leader SCM can safely revert the state change for the 
whole raft groups.

Above analysis also applies ot pipeline V2 and etc.


> Design for Error/Exception handling in state update for container/pipeline V2
> -
>
> Key: HDDS-4136
> URL: https://issues.apache.org/jira/browse/HDDS-4136
> Project: Hadoop Distributed Data Store
>  Issue Type: Sub-task
>  Components: SCM
>Reporter: Glen Geng
>Assignee: Glen Geng
>Priority: Major
>
> I have a concern about how to handle exceptions occurred in writing RocksDB 
> for container V2, such as allocateContainer, deleteContainer and 
> updateContainerState.
> For non-HA case, allocateContainer reverts the memory state change if meet 
> IOException for db operation. deleteContainer and updateContainerState just 
> throw out the IOException and leave the memory state in an inconsistency 
> state.
> After we enable SCM-HA, if Leader SCM succeed the operation, meanwhile any 
> Follower SCM fails due to db exception, what can we do to ensure that states 
> of leader and follower won't diverge, a.k.a., ensure the replicated state 
> machine for leader and followers.
> We have to ensure Atomicity of ACID for state update. If any exception 
> occurred, SCM (no matter leader or follower) should throw exception and keep 
> states unchanged, so that leader SCM can safely revert the state change for 
> the whole raft groups.
> Above analysis also applies to pipeline V2, etc.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-4136) Design for Error/Exception handling in state update for container/pipeline V2

2020-08-24 Thread Glen Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-4136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Glen Geng updated HDDS-4136:

Description: (was:  

Fix a bug in https://issues.apache.org/jira/browse/HDDS-3895 

In ContainerStateManagerV2, both disk state (column families in RocksDB) and 
memory state (container maps in memory) are protected by raft, and should keep 
their consistency upon each modification.)

> Design for Error/Exception handling in state update for container/pipeline V2
> -
>
> Key: HDDS-4136
> URL: https://issues.apache.org/jira/browse/HDDS-4136
> Project: Hadoop Distributed Data Store
>  Issue Type: Sub-task
>  Components: SCM
>Reporter: Glen Geng
>Assignee: Glen Geng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-4136) Design for Error/Exception handling in state update for container/pipeline V2

2020-08-24 Thread Glen Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-4136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Glen Geng updated HDDS-4136:

Description: 
I have a concern about how to handling exceptions occurred in writing RocksDB 
for container V2, such as allocateContainer, deleteContainer and 
updateContainerState.

For non-HA case, allocateContainer reverts the memory state change if meet 
IOException for db operation. deleteContainer and updateContainerState just 
throw out the IOException and leave the memory state in an inconsistency state.

After we enable SCM-HA, if Leader SCM succeed the operation, meanwhile any 
Follower SCM fails due to db exception, what can we do to ensure that states of 
leader and follower won't diverge, a.k.a., ensure the replicated state machine 
for leader and folowers.

We have to ensure Atomicity of ACID for state update. If any exception 
occurred, SCM (no matter leader or follower) should throw exception and keep 
states unchanged, so that leader SCM can safely revert the state change for the 
whole raft groups.

Above analysis also applies ot pipeline V2 and etc.

> Design for Error/Exception handling in state update for container/pipeline V2
> -
>
> Key: HDDS-4136
> URL: https://issues.apache.org/jira/browse/HDDS-4136
> Project: Hadoop Distributed Data Store
>  Issue Type: Sub-task
>  Components: SCM
>Reporter: Glen Geng
>Assignee: Glen Geng
>Priority: Major
>
> I have a concern about how to handle exceptions that occur when writing to RocksDB 
> for container V2, such as allocateContainer, deleteContainer and 
> updateContainerState.
> For non-HA case, allocateContainer reverts the memory state change if meet 
> IOException for db operation. deleteContainer and updateContainerState just 
> throw out the IOException and leave the memory state in an inconsistency 
> state.
> After we enable SCM-HA, if Leader SCM succeed the operation, meanwhile any 
> Follower SCM fails due to db exception, what can we do to ensure that states 
> of leader and follower won't diverge, a.k.a., ensure the replicated state 
> machine for leader and followers.
> We have to ensure Atomicity of ACID for state update. If any exception 
> occurred, SCM (no matter leader or follower) should throw exception and keep 
> states unchanged, so that leader SCM can safely revert the state change for 
> the whole raft groups.
> Above analysis also applies to pipeline V2, etc.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-4136) Design for Error/Exception handling in state update for container/pipeline V2

2020-08-24 Thread Glen Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-4136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Glen Geng updated HDDS-4136:

Summary: Design for Error/Exception handling in state update for 
container/pipeline V2  (was: Design for Error/Exception handling in state 
updates for container/pipeline V2)

> Design for Error/Exception handling in state update for container/pipeline V2
> -
>
> Key: HDDS-4136
> URL: https://issues.apache.org/jira/browse/HDDS-4136
> Project: Hadoop Distributed Data Store
>  Issue Type: Sub-task
>  Components: SCM
>Reporter: Glen Geng
>Assignee: Glen Geng
>Priority: Major
>
>  
> Fix a bug in https://issues.apache.org/jira/browse/HDDS-3895 
> In ContainerStateManagerV2, both disk state (column families in RocksDB) and 
> memory state (container maps in memory) are protected by raft, and should 
> keep their consistency upon each modification.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-4136) Design for Error/Exception handling in state updates for container/pipeline V2

2020-08-24 Thread Glen Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-4136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Glen Geng updated HDDS-4136:

Summary: Design for Error/Exception handling in state updates for 
container/pipeline V2  (was: CLONE - In ContainerStateManagerV2, modification 
of RocksDB should be consistent with that of memory state.)

> Design for Error/Exception handling in state updates for container/pipeline V2
> --
>
> Key: HDDS-4136
> URL: https://issues.apache.org/jira/browse/HDDS-4136
> Project: Hadoop Distributed Data Store
>  Issue Type: Sub-task
>  Components: SCM
>Reporter: Glen Geng
>Assignee: Glen Geng
>Priority: Major
>
>  
> Fix a bug in https://issues.apache.org/jira/browse/HDDS-3895 
> In ContainerStateManagerV2, both disk state (column families in RocksDB) and 
> memory state (container maps in memory) are protected by raft, and should 
> keep their consistency upon each modification.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org



[jira] [Created] (HDDS-4136) CLONE - In ContainerStateManagerV2, modification of RocksDB should be consistent with that of memory state.

2020-08-24 Thread Glen Geng (Jira)
Glen Geng created HDDS-4136:
---

 Summary: CLONE - In ContainerStateManagerV2, modification of 
RocksDB should be consistent with that of memory state.
 Key: HDDS-4136
 URL: https://issues.apache.org/jira/browse/HDDS-4136
 Project: Hadoop Distributed Data Store
  Issue Type: Sub-task
  Components: SCM
Reporter: Glen Geng
Assignee: Glen Geng


 

Fix a bug in https://issues.apache.org/jira/browse/HDDS-3895 

In ContainerStateManagerV2, both disk state (column families in RocksDB) and 
memory state (container maps in memory) are protected by raft, and should keep 
their consistency upon each modification.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-4135) In ContainerStateManagerV2, modification of RocksDB should be consistent with that of memory state.

2020-08-24 Thread Glen Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-4135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Glen Geng updated HDDS-4135:

Summary: In ContainerStateManagerV2, modification of RocksDB should be 
consistent with that of memory state.  (was: In ContainerStateManagerV2, 
modification of RocksDB should be in consistency with that of memory state.)

> In ContainerStateManagerV2, modification of RocksDB should be consistent with 
> that of memory state.
> ---
>
> Key: HDDS-4135
> URL: https://issues.apache.org/jira/browse/HDDS-4135
> Project: Hadoop Distributed Data Store
>  Issue Type: Sub-task
>  Components: SCM
>Reporter: Glen Geng
>Assignee: Glen Geng
>Priority: Major
>  Labels: pull-request-available
>
>  
> Fix a bug in https://issues.apache.org/jira/browse/HDDS-3895
>  
> In ContainerStateManagerV2, both disk state (column families in RocksDB) and 
> memory state (container maps in memory) are protected by raft, and should 
> keep their consistency upon each modification.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-4135) In ContainerStateManagerV2, modification of RocksDB should be in consistency with that of memory state.

2020-08-24 Thread Glen Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-4135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Glen Geng updated HDDS-4135:

Description: 
 

Fix a bug in https://issues.apache.org/jira/browse/HDDS-3895

 

In ContainerStateManagerV2, both disk state (column families in RocksDB) and 
memory state (container maps in memory) are protected by raft, and should keep 
their consistency upon each modification.

  was:
 

Fix a bug in https://issues.apache.org/jira/browse/HDDS-3895

 

 


> In ContainerStateManagerV2, modification of RocksDB should be in consistency 
> with that of memory state.
> ---
>
> Key: HDDS-4135
> URL: https://issues.apache.org/jira/browse/HDDS-4135
> Project: Hadoop Distributed Data Store
>  Issue Type: Sub-task
>  Components: SCM
>Reporter: Glen Geng
>Assignee: Glen Geng
>Priority: Major
>  Labels: pull-request-available
>
>  
> Fix a bug in https://issues.apache.org/jira/browse/HDDS-3895
>  
> In ContainerStateManagerV2, both disk state (column families in RocksDB) and 
> memory state (container maps in memory) are protected by raft, and should 
> keep their consistency upon each modification.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-4135) In ContainerStateManagerV2, modification of RocksDB should be consistent with that of memory state.

2020-08-24 Thread Glen Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-4135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Glen Geng updated HDDS-4135:

Labels:   (was: pull-request-available)

> In ContainerStateManagerV2, modification of RocksDB should be consistent with 
> that of memory state.
> ---
>
> Key: HDDS-4135
> URL: https://issues.apache.org/jira/browse/HDDS-4135
> Project: Hadoop Distributed Data Store
>  Issue Type: Sub-task
>  Components: SCM
>Reporter: Glen Geng
>Assignee: Glen Geng
>Priority: Major
>
>  
> Fix a bug in https://issues.apache.org/jira/browse/HDDS-3895
>  
> In ContainerStateManagerV2, both disk state (column families in RocksDB) and 
> memory state (container maps in memory) are protected by raft, and should 
> keep their consistency upon each modification.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-4135) In ContainerStateManagerV2, modification of RocksDB should be consistent with that of memory state.

2020-08-24 Thread Glen Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-4135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Glen Geng updated HDDS-4135:

Description: 
 

Fix a bug in https://issues.apache.org/jira/browse/HDDS-3895 

In ContainerStateManagerV2, both disk state (column families in RocksDB) and 
memory state (container maps in memory) are protected by raft, and should keep 
their consistency upon each modification.

  was:
 

Fix a bug in https://issues.apache.org/jira/browse/HDDS-3895

 

In ContainerStateManagerV2, both disk state (column families in RocksDB) and 
memory state (container maps in memory) are protected by raft, and should keep 
their consistency upon each modification.


> In ContainerStateManagerV2, modification of RocksDB should be consistent with 
> that of memory state.
> ---
>
> Key: HDDS-4135
> URL: https://issues.apache.org/jira/browse/HDDS-4135
> Project: Hadoop Distributed Data Store
>  Issue Type: Sub-task
>  Components: SCM
>Reporter: Glen Geng
>Assignee: Glen Geng
>Priority: Major
>
>  
> Fix a bug in https://issues.apache.org/jira/browse/HDDS-3895 
> In ContainerStateManagerV2, both disk state (column families in RocksDB) and 
> memory state (container maps in memory) are protected by raft, and should 
> keep their consistency upon each modification.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-4135) In ContainerStateManagerV2, modification of RocksDB should be in consistency with that of memory state.

2020-08-23 Thread Glen Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-4135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Glen Geng updated HDDS-4135:

Description: 
 

Fix a bug in https://issues.apache.org/jira/browse/HDDS-3895

 

 

  was:
The 1st edition of RatisServer of SCM HA is copied from OM HA.

This version is abandoned, since we finally choose the RatisServer that is 
implemented by InvocationHandler.

This Jira is used to remove the 1st edition RatisServer.


> In ContainerStateManagerV2, modification of RocksDB should be in consistency 
> with that of memory state.
> ---
>
> Key: HDDS-4135
> URL: https://issues.apache.org/jira/browse/HDDS-4135
> Project: Hadoop Distributed Data Store
>  Issue Type: Sub-task
>  Components: SCM
>Reporter: Glen Geng
>Assignee: Glen Geng
>Priority: Major
>  Labels: pull-request-available
>
>  
> Fix a bug in https://issues.apache.org/jira/browse/HDDS-3895
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org



[jira] [Created] (HDDS-4135) In ContainerStateManagerV2, modification of RocksDB should be in consistency with that of memory state.

2020-08-23 Thread Glen Geng (Jira)
Glen Geng created HDDS-4135:
---

 Summary: In ContainerStateManagerV2, modification of RocksDB 
should be in consistency with that of memory state.
 Key: HDDS-4135
 URL: https://issues.apache.org/jira/browse/HDDS-4135
 Project: Hadoop Distributed Data Store
  Issue Type: Sub-task
  Components: SCM
Reporter: Glen Geng
Assignee: Glen Geng


The 1st edition of RatisServer of SCM HA is copied from OM HA.

This version is abandoned, since we finally choose the RatisServer that is 
implemented by InvocationHandler.

This Jira is used to remove the 1st edition RatisServer.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org



[jira] [Resolved] (HDDS-4125) Pipeline is not removed when a datanode goes stale

2020-08-20 Thread Glen Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-4125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Glen Geng resolved HDDS-4125.
-
Resolution: Fixed

> Pipeline is not removed when a datanode goes stale
> --
>
> Key: HDDS-4125
> URL: https://issues.apache.org/jira/browse/HDDS-4125
> Project: Hadoop Distributed Data Store
>  Issue Type: Sub-task
>  Components: SCM HA
>Reporter: Nanda kumar
>Assignee: Glen Geng
>Priority: Major
>  Labels: pull-request-available
>
> When a node goes stale, the pipelines on that node have to be closed and 
> removed from {{PipelineManager}}. Currently, such a pipeline is only closed and 
> left in {{PipelineManager}}.
>  
> *Root Cause Analysis* 
> Since the Scheduler in SCMPipelineManager that used to invoke destroyPipeline has 
> been removed,
> {code:java}
> scheduler.schedule(() -> destroyPipeline(pipeline),
> pipelineDestroyTimeoutInMillis, TimeUnit.MILLISECONDS, LOG,
> String.format("Destroy pipeline failed for pipeline:%s", pipeline));{code}
> meanwhile the PipelineManagerV2Impl::scrubPipeline only handles and remove 
> RATIS THREE pipeline,
> {code:java}
> public void scrubPipeline(ReplicationType type, ReplicationFactor factor)
> throws IOException {
>   checkLeader();
>   if (type != ReplicationType.RATIS || factor != ReplicationFactor.THREE) {
> // Only scrub pipeline for RATIS THREE pipeline
> return;
>   }
> {code}
>  A RATIS ONE pipeline is closed but not removed when a datanode goes stale. The 
> solution is to let scrubPipeline handle all kinds of pipelines.
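A minimal sketch of that fix, under the assumption that checkLeader, 
getPipelines and removePipeline behave as the snippet above suggests (none of 
this is quoted from the actual patch), is simply to drop the RATIS/THREE guard 
so closed pipelines of every replication type and factor are scrubbed:

{code:java}
public void scrubPipeline(ReplicationType type, ReplicationFactor factor)
    throws IOException {
  checkLeader();
  // Early return removed: pipelines of every replication type and factor,
  // including RATIS ONE on a stale datanode, are now scrubbed.
  for (Pipeline pipeline : getPipelines(type, factor,
      Pipeline.PipelineState.CLOSED)) {
    LOG.info("Scrubbing closed pipeline {}", pipeline.getId());
    removePipeline(pipeline);  // assumed helper: drop it from PipelineManager
  }
}
{code}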



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org



[jira] [Resolved] (HDDS-4130) remove the 1st edition of RatisServer of SCM HA which is copied from OM HA

2020-08-20 Thread Glen Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-4130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Glen Geng resolved HDDS-4130.
-
Resolution: Fixed

> remove the 1st edition of RatisServer of SCM HA which is copied from OM HA
> --
>
> Key: HDDS-4130
> URL: https://issues.apache.org/jira/browse/HDDS-4130
> Project: Hadoop Distributed Data Store
>  Issue Type: Sub-task
>  Components: SCM
>Reporter: Glen Geng
>Assignee: Glen Geng
>Priority: Major
>  Labels: pull-request-available
>
> The 1st edition of RatisServer of SCM HA is copied from OM HA.
> This version is abandoned, since we finally chose the RatisServer that is 
> implemented by InvocationHandler.
> This Jira is used to remove the 1st edition RatisServer.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-4130) remove the 1st edition of RatisServer of SCM HA which is copied from OM HA

2020-08-19 Thread Glen Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-4130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Glen Geng updated HDDS-4130:

Description: 
The 1st edition of RatisServer of SCM HA is copied from OM HA.

This version is abandoned, since we finally choose the RatisServer that is 
implemented by InvocationHandler.

This Jira is used to remove the 1st edition RatisServer.

  was:
The 1st edition of RatisServer of SCM HA is copied from OM HA.

This version abandoned, since we finally choose the RatisServer that 
implemented by InvocationHandler.

This Jira is used to remove the 1st edition RatisServer.


> remove the 1st edition of RatisServer of SCM HA which is copied from OM HA
> --
>
> Key: HDDS-4130
> URL: https://issues.apache.org/jira/browse/HDDS-4130
> Project: Hadoop Distributed Data Store
>  Issue Type: Sub-task
>  Components: SCM
>Reporter: Glen Geng
>Assignee: Glen Geng
>Priority: Major
>
> The 1st edition of RatisServer of SCM HA is copied from OM HA.
> This version is abandoned, since we finally chose the RatisServer that is 
> implemented by InvocationHandler.
> This Jira is used to remove the 1st edition RatisServer.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-4130) remove the 1st edition of RatisServer of SCM HA which is copied from OM HA

2020-08-19 Thread Glen Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-4130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Glen Geng updated HDDS-4130:

Description: 
The 1st edition of RatisServer of SCM HA is copied from OM HA.

This version abandoned, since we finally choose the RatisServer that 
implemented by InvocationHandler.

This Jira is used to remove the 1st edition RatisServer.

  was:
The 1st edition of RatisServer of SCM HA is copied from OM HA.

This version abandoned, since we finally choose the RatisServer that 
implemented by InvocationHandler.


> remove the 1st edition of RatisServer of SCM HA which is copied from OM HA
> --
>
> Key: HDDS-4130
> URL: https://issues.apache.org/jira/browse/HDDS-4130
> Project: Hadoop Distributed Data Store
>  Issue Type: Sub-task
>  Components: SCM
>Reporter: Glen Geng
>Assignee: Glen Geng
>Priority: Major
>  Labels: backward-incompatible, pull-request-available, upgrade
>
> The 1st edition of RatisServer of SCM HA is copied from OM HA.
> This version abandoned, since we finally choose the RatisServer that 
> implemented by InvocationHandler.
> This Jira is used to remove the 1st edition RatisServer.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-4130) remove the 1st edition of RatisServer of SCM HA which is copied from OM HA

2020-08-19 Thread Glen Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-4130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Glen Geng updated HDDS-4130:

Labels:   (was: backward-incompatible pull-request-available upgrade)

> remove the 1st edition of RatisServer of SCM HA which is copied from OM HA
> --
>
> Key: HDDS-4130
> URL: https://issues.apache.org/jira/browse/HDDS-4130
> Project: Hadoop Distributed Data Store
>  Issue Type: Sub-task
>  Components: SCM
>Reporter: Glen Geng
>Assignee: Glen Geng
>Priority: Major
>
> The 1st edition of RatisServer of SCM HA is copied from OM HA.
> This version abandoned, since we finally choose the RatisServer that 
> implemented by InvocationHandler.
> This Jira is used to remove the 1st edition RatisServer.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-4130) remove the 1st edition of RatisServer of SCM HA which is copied from OM HA

2020-08-19 Thread Glen Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-4130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Glen Geng updated HDDS-4130:

Description: 
The 1st edition of RatisServer of SCM HA is copied from OM HA.

This version abandoned, since we finally choose the RatisServer that 
implemented by InvocationHandler.

  was:The 1st edition RatisServer of SCM HA is copied 


> remove the 1st edition of RatisServer of SCM HA which is copied from OM HA
> --
>
> Key: HDDS-4130
> URL: https://issues.apache.org/jira/browse/HDDS-4130
> Project: Hadoop Distributed Data Store
>  Issue Type: Sub-task
>  Components: SCM
>Reporter: Glen Geng
>Assignee: Glen Geng
>Priority: Major
>  Labels: backward-incompatible, pull-request-available, upgrade
>
> The 1st edition of RatisServer of SCM HA is copied from OM HA.
> This version abandoned, since we finally choose the RatisServer that 
> implemented by InvocationHandler.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-4130) remove the 1st edition of RatisServer of SCM HA which is copied from OM HA

2020-08-19 Thread Glen Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-4130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Glen Geng updated HDDS-4130:

Summary: remove the 1st edition of RatisServer of SCM HA which is copied 
from OM HA  (was: remove the 1st edition RatisServer of SCM HA which is copied 
from OM HA)

> remove the 1st edition of RatisServer of SCM HA which is copied from OM HA
> --
>
> Key: HDDS-4130
> URL: https://issues.apache.org/jira/browse/HDDS-4130
> Project: Hadoop Distributed Data Store
>  Issue Type: Sub-task
>  Components: SCM
>Reporter: Glen Geng
>Assignee: Glen Geng
>Priority: Major
>  Labels: backward-incompatible, pull-request-available, upgrade
>
> The 1st edition RatisServer of SCM HA is copied 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-4130) remove the 1st edition RatisServer of SCM HA which is copied from OM HA

2020-08-19 Thread Glen Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-4130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Glen Geng updated HDDS-4130:

Summary: remove the 1st edition RatisServer of SCM HA which is copied from 
OM HA  (was: remove the 1st edition RatisServer of SCM HA, which is copied from 
OM HA)

> remove the 1st edition RatisServer of SCM HA which is copied from OM HA
> ---
>
> Key: HDDS-4130
> URL: https://issues.apache.org/jira/browse/HDDS-4130
> Project: Hadoop Distributed Data Store
>  Issue Type: Sub-task
>  Components: SCM
>Reporter: Glen Geng
>Assignee: Glen Geng
>Priority: Major
>  Labels: backward-incompatible, pull-request-available, upgrade
>
> The 1st edition RatisServer of SCM HA is copied 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-4130) remove the 1st edition RatisServer of SCM HA, which is copied from OM HA

2020-08-19 Thread Glen Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-4130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Glen Geng updated HDDS-4130:

Description: The 1st edition RatisServer of SCM HA is copied   (was: The 
disk layout per volume is as follows:
{code:java}
../hdds/VERSION
../hdds/<>/current/<>/<>/metadata
../hdds/<>/current/<>/<>/<>{code}
However, after SCM-HA is enabled, a typical SCM group will consists of 3 SCMs, 
each of the SCMs has its own scmUuid, meanwhile share the same clusterID.

Since federation is not supported yet, only one cluster is supported now, this 
Jira will change scmID to clusterID for container and volume at Datanode side.

The disk layout after the change will be as follows:
{code:java}
../hdds/VERSION
../hdds/<>/current/<>/<>/metadata
../hdds/<>/current/<>/<>/<>{code})

> remove the 1st edition RatisServer of SCM HA, which is copied from OM HA
> 
>
> Key: HDDS-4130
> URL: https://issues.apache.org/jira/browse/HDDS-4130
> Project: Hadoop Distributed Data Store
>  Issue Type: Sub-task
>  Components: SCM
>Reporter: Glen Geng
>Assignee: Glen Geng
>Priority: Major
>  Labels: backward-incompatible, pull-request-available, upgrade
>
> The 1st edition RatisServer of SCM HA is copied 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org


