[jira] [Commented] (HDDS-3116) Datanode sometimes fails to start with NPE when starting Ratis xceiver server

2020-03-12 Thread Attila Doroszlai (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-3116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17057707#comment-17057707
 ] 

Attila Doroszlai commented on HDDS-3116:


Committed to master.  [~dineshchitlangia] please resolve as you see fit.

> Datanode sometimes fails to start with NPE when starting Ratis xceiver server
> -
>
> Key: HDDS-3116
> URL: https://issues.apache.org/jira/browse/HDDS-3116
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Datanode
>Affects Versions: 0.5.0
>Reporter: Stephen O'Donnell
>Assignee: Stephen O'Donnell
>Priority: Blocker
> Fix For: 0.6.0
>
> Attachments: full_logs.txt
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> While working on a network Topology test (HDDS-3084) which does the following:
> 1. Start a cluster with 6 DNs and 2 racks.
> 2. Create a volume, bucket and a single key.
> 3. Stop one rack of hosts using "docker-compose down"
> 4. Read the data from the single key
> 5. Start the 3 down hosts
> 6. Stop the other 3 hosts
> 7. Attempt to read the key again.
> At step 5 I sometimes see this stack trace in one of the DNs and it fails to 
> full come up:
> {code}
> 2020-03-02 13:01:31,887 [Datanode State Machine Thread - 0] INFO 
> ozoneimpl.OzoneContainer: Attempting to start container services.
> 2020-03-02 13:01:31,887 [Datanode State Machine Thread - 0] INFO 
> ozoneimpl.OzoneContainer: Background container scanner has been disabled.
> 2020-03-02 13:01:31,887 [Datanode State Machine Thread - 0] INFO 
> ratis.XceiverServerRatis: Starting XceiverServerRatis 
> 8c1178dd-c44d-49d1-b899-cc3e40ae8f23 at port 9858
> 2020-03-02 13:01:31,887 [Datanode State Machine Thread - 0] WARN 
> statemachine.EndpointStateMachine: Unable to communicate to SCM server at 
> scm:9861 for past 15000 seconds.
> java.io.IOException: java.lang.NullPointerException
>   at org.apache.ratis.util.IOUtils.asIOException(IOUtils.java:54)
>   at org.apache.ratis.util.IOUtils.toIOException(IOUtils.java:61)
>   at org.apache.ratis.util.IOUtils.getFromFuture(IOUtils.java:70)
>   at 
> org.apache.ratis.server.impl.RaftServerProxy.getImpls(RaftServerProxy.java:284)
>   at 
> org.apache.ratis.server.impl.RaftServerProxy.start(RaftServerProxy.java:296)
>   at 
> org.apache.hadoop.ozone.container.common.transport.server.ratis.XceiverServerRatis.start(XceiverServerRatis.java:418)
>   at 
> org.apache.hadoop.ozone.container.ozoneimpl.OzoneContainer.start(OzoneContainer.java:232)
>   at 
> org.apache.hadoop.ozone.container.common.states.endpoint.VersionEndpointTask.call(VersionEndpointTask.java:113)
>   at 
> org.apache.hadoop.ozone.container.common.states.endpoint.VersionEndpointTask.call(VersionEndpointTask.java:42)
>   at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
>   at 
> java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
>   at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
>   at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>   at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>   at java.base/java.lang.Thread.run(Thread.java:834)
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.hadoop.ozone.container.common.transport.server.ratis.XceiverServerRatis.sendPipelineReport(XceiverServerRatis.java:757)
>   at 
> org.apache.hadoop.ozone.container.common.transport.server.ratis.XceiverServerRatis.notifyGroupAdd(XceiverServerRatis.java:739)
>   at 
> org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine.initialize(ContainerStateMachine.java:218)
>   at 
> org.apache.ratis.server.impl.ServerState.initStatemachine(ServerState.java:160)
>   at org.apache.ratis.server.impl.ServerState.(ServerState.java:112)
>   at 
> org.apache.ratis.server.impl.RaftServerImpl.(RaftServerImpl.java:112)
>   at 
> org.apache.ratis.server.impl.RaftServerProxy.lambda$newRaftServerImpl$2(RaftServerProxy.java:208)
>   at 
> java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1700)
>   ... 3 more
> {code}
> The DN does not recover from this automatically, although I confirmed that a 
> full cluster restart fixed it (docker-compose stop; docker-compose start). I 
> will try to confirm if a restart of the stuck DN would fix it or not too.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: 

[jira] [Commented] (HDDS-3116) Datanode sometimes fails to start with NPE when starting Ratis xceiver server

2020-03-11 Thread Stephen O'Donnell (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-3116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17056781#comment-17056781
 ] 

Stephen O'Donnell commented on HDDS-3116:
-

I could reproduce this problem quite easily in docker before adding the 
synchronised fix, and afterwards I have not been able to reproduce it using the 
same technique.

I think long term this code should be refactored, but the synchronized blocks 
seem to address the immediate issue. It would be easily for another change to 
introduce more issues in this area due to these circular dependencies however.

The reason the sync blocks work, is because XceiverServerRatis ultimately tries 
to get the OzoneContainer instance via the DatanodeStateMachine instance in 
another thread while both those objects are constructing. It is the constructor 
of OzoneContainer which starts the XceiverServerRatis, so synchronising around 
the construction of OzoneContainer and its accessor in DatanodeStateMachine 
forces the XceiverServer to wait until construction is finished before it can 
get what it needs.


> Datanode sometimes fails to start with NPE when starting Ratis xceiver server
> -
>
> Key: HDDS-3116
> URL: https://issues.apache.org/jira/browse/HDDS-3116
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Datanode
>Affects Versions: 0.5.0
>Reporter: Stephen O'Donnell
>Assignee: Stephen O'Donnell
>Priority: Blocker
>  Labels: pull-request-available
> Attachments: full_logs.txt
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> While working on a network Topology test (HDDS-3084) which does the following:
> 1. Start a cluster with 6 DNs and 2 racks.
> 2. Create a volume, bucket and a single key.
> 3. Stop one rack of hosts using "docker-compose down"
> 4. Read the data from the single key
> 5. Start the 3 down hosts
> 6. Stop the other 3 hosts
> 7. Attempt to read the key again.
> At step 5 I sometimes see this stack trace in one of the DNs and it fails to 
> full come up:
> {code}
> 2020-03-02 13:01:31,887 [Datanode State Machine Thread - 0] INFO 
> ozoneimpl.OzoneContainer: Attempting to start container services.
> 2020-03-02 13:01:31,887 [Datanode State Machine Thread - 0] INFO 
> ozoneimpl.OzoneContainer: Background container scanner has been disabled.
> 2020-03-02 13:01:31,887 [Datanode State Machine Thread - 0] INFO 
> ratis.XceiverServerRatis: Starting XceiverServerRatis 
> 8c1178dd-c44d-49d1-b899-cc3e40ae8f23 at port 9858
> 2020-03-02 13:01:31,887 [Datanode State Machine Thread - 0] WARN 
> statemachine.EndpointStateMachine: Unable to communicate to SCM server at 
> scm:9861 for past 15000 seconds.
> java.io.IOException: java.lang.NullPointerException
>   at org.apache.ratis.util.IOUtils.asIOException(IOUtils.java:54)
>   at org.apache.ratis.util.IOUtils.toIOException(IOUtils.java:61)
>   at org.apache.ratis.util.IOUtils.getFromFuture(IOUtils.java:70)
>   at 
> org.apache.ratis.server.impl.RaftServerProxy.getImpls(RaftServerProxy.java:284)
>   at 
> org.apache.ratis.server.impl.RaftServerProxy.start(RaftServerProxy.java:296)
>   at 
> org.apache.hadoop.ozone.container.common.transport.server.ratis.XceiverServerRatis.start(XceiverServerRatis.java:418)
>   at 
> org.apache.hadoop.ozone.container.ozoneimpl.OzoneContainer.start(OzoneContainer.java:232)
>   at 
> org.apache.hadoop.ozone.container.common.states.endpoint.VersionEndpointTask.call(VersionEndpointTask.java:113)
>   at 
> org.apache.hadoop.ozone.container.common.states.endpoint.VersionEndpointTask.call(VersionEndpointTask.java:42)
>   at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
>   at 
> java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
>   at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
>   at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>   at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>   at java.base/java.lang.Thread.run(Thread.java:834)
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.hadoop.ozone.container.common.transport.server.ratis.XceiverServerRatis.sendPipelineReport(XceiverServerRatis.java:757)
>   at 
> org.apache.hadoop.ozone.container.common.transport.server.ratis.XceiverServerRatis.notifyGroupAdd(XceiverServerRatis.java:739)
>   at 
> org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine.initialize(ContainerStateMachine.java:218)
>   at 
> org.apache.ratis.server.impl.ServerState.initStatemachine(ServerState.java:160)
>   at 

[jira] [Commented] (HDDS-3116) Datanode sometimes fails to start with NPE when starting Ratis xceiver server

2020-03-10 Thread Hanisha Koneru (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-3116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17056541#comment-17056541
 ] 

Hanisha Koneru commented on HDDS-3116:
--

There is a lot of circular dependency between OzoneContainer, 
XceiverServerRatis and StateContext. AFAICS it might not be a trivial change to 
fix this.

Not sure if adding the synchronization will resolve the issue altogether. I 
tried reproing but couldn't. Will try reproing on a docker cluster.

> Datanode sometimes fails to start with NPE when starting Ratis xceiver server
> -
>
> Key: HDDS-3116
> URL: https://issues.apache.org/jira/browse/HDDS-3116
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Datanode
>Affects Versions: 0.5.0
>Reporter: Stephen O'Donnell
>Assignee: Stephen O'Donnell
>Priority: Blocker
>  Labels: pull-request-available
> Attachments: full_logs.txt
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> While working on a network Topology test (HDDS-3084) which does the following:
> 1. Start a cluster with 6 DNs and 2 racks.
> 2. Create a volume, bucket and a single key.
> 3. Stop one rack of hosts using "docker-compose down"
> 4. Read the data from the single key
> 5. Start the 3 down hosts
> 6. Stop the other 3 hosts
> 7. Attempt to read the key again.
> At step 5 I sometimes see this stack trace in one of the DNs and it fails to 
> full come up:
> {code}
> 2020-03-02 13:01:31,887 [Datanode State Machine Thread - 0] INFO 
> ozoneimpl.OzoneContainer: Attempting to start container services.
> 2020-03-02 13:01:31,887 [Datanode State Machine Thread - 0] INFO 
> ozoneimpl.OzoneContainer: Background container scanner has been disabled.
> 2020-03-02 13:01:31,887 [Datanode State Machine Thread - 0] INFO 
> ratis.XceiverServerRatis: Starting XceiverServerRatis 
> 8c1178dd-c44d-49d1-b899-cc3e40ae8f23 at port 9858
> 2020-03-02 13:01:31,887 [Datanode State Machine Thread - 0] WARN 
> statemachine.EndpointStateMachine: Unable to communicate to SCM server at 
> scm:9861 for past 15000 seconds.
> java.io.IOException: java.lang.NullPointerException
>   at org.apache.ratis.util.IOUtils.asIOException(IOUtils.java:54)
>   at org.apache.ratis.util.IOUtils.toIOException(IOUtils.java:61)
>   at org.apache.ratis.util.IOUtils.getFromFuture(IOUtils.java:70)
>   at 
> org.apache.ratis.server.impl.RaftServerProxy.getImpls(RaftServerProxy.java:284)
>   at 
> org.apache.ratis.server.impl.RaftServerProxy.start(RaftServerProxy.java:296)
>   at 
> org.apache.hadoop.ozone.container.common.transport.server.ratis.XceiverServerRatis.start(XceiverServerRatis.java:418)
>   at 
> org.apache.hadoop.ozone.container.ozoneimpl.OzoneContainer.start(OzoneContainer.java:232)
>   at 
> org.apache.hadoop.ozone.container.common.states.endpoint.VersionEndpointTask.call(VersionEndpointTask.java:113)
>   at 
> org.apache.hadoop.ozone.container.common.states.endpoint.VersionEndpointTask.call(VersionEndpointTask.java:42)
>   at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
>   at 
> java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
>   at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
>   at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>   at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>   at java.base/java.lang.Thread.run(Thread.java:834)
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.hadoop.ozone.container.common.transport.server.ratis.XceiverServerRatis.sendPipelineReport(XceiverServerRatis.java:757)
>   at 
> org.apache.hadoop.ozone.container.common.transport.server.ratis.XceiverServerRatis.notifyGroupAdd(XceiverServerRatis.java:739)
>   at 
> org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine.initialize(ContainerStateMachine.java:218)
>   at 
> org.apache.ratis.server.impl.ServerState.initStatemachine(ServerState.java:160)
>   at org.apache.ratis.server.impl.ServerState.(ServerState.java:112)
>   at 
> org.apache.ratis.server.impl.RaftServerImpl.(RaftServerImpl.java:112)
>   at 
> org.apache.ratis.server.impl.RaftServerProxy.lambda$newRaftServerImpl$2(RaftServerProxy.java:208)
>   at 
> java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1700)
>   ... 3 more
> {code}
> The DN does not recover from this automatically, although I confirmed that a 
> full cluster restart fixed it (docker-compose stop; docker-compose start). I 
> will try to confirm if a restart of the stuck DN would fix it or not too.



--
This message 

[jira] [Commented] (HDDS-3116) Datanode sometimes fails to start with NPE when starting Ratis xceiver server

2020-03-09 Thread Stephen O'Donnell (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-3116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17054831#comment-17054831
 ] 

Stephen O'Donnell commented on HDDS-3116:
-

Unsure if this should be a blocker for 0.5. In my tests (docker compose 
environment), this problem happens regularly on DN restart, so I am surprised 
it has not come up before. I don't think this is a newly introduced issue 
either - it looks to have been around for a long time and just discovered now.

The fix I have posted is pretty trivial, but we may want to consider a bigger 
refactor.

It would be best if we can fix it as part of the 0.5 release, so I will mark as 
a blocker and let [~dineshchitlangia] triage and downgrade if needed.

> Datanode sometimes fails to start with NPE when starting Ratis xceiver server
> -
>
> Key: HDDS-3116
> URL: https://issues.apache.org/jira/browse/HDDS-3116
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Datanode
>Affects Versions: 0.5.0
>Reporter: Stephen O'Donnell
>Assignee: Stephen O'Donnell
>Priority: Major
>  Labels: pull-request-available
> Attachments: full_logs.txt
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> While working on a network Topology test (HDDS-3084) which does the following:
> 1. Start a cluster with 6 DNs and 2 racks.
> 2. Create a volume, bucket and a single key.
> 3. Stop one rack of hosts using "docker-compose down"
> 4. Read the data from the single key
> 5. Start the 3 down hosts
> 6. Stop the other 3 hosts
> 7. Attempt to read the key again.
> At step 5 I sometimes see this stack trace in one of the DNs and it fails to 
> full come up:
> {code}
> 2020-03-02 13:01:31,887 [Datanode State Machine Thread - 0] INFO 
> ozoneimpl.OzoneContainer: Attempting to start container services.
> 2020-03-02 13:01:31,887 [Datanode State Machine Thread - 0] INFO 
> ozoneimpl.OzoneContainer: Background container scanner has been disabled.
> 2020-03-02 13:01:31,887 [Datanode State Machine Thread - 0] INFO 
> ratis.XceiverServerRatis: Starting XceiverServerRatis 
> 8c1178dd-c44d-49d1-b899-cc3e40ae8f23 at port 9858
> 2020-03-02 13:01:31,887 [Datanode State Machine Thread - 0] WARN 
> statemachine.EndpointStateMachine: Unable to communicate to SCM server at 
> scm:9861 for past 15000 seconds.
> java.io.IOException: java.lang.NullPointerException
>   at org.apache.ratis.util.IOUtils.asIOException(IOUtils.java:54)
>   at org.apache.ratis.util.IOUtils.toIOException(IOUtils.java:61)
>   at org.apache.ratis.util.IOUtils.getFromFuture(IOUtils.java:70)
>   at 
> org.apache.ratis.server.impl.RaftServerProxy.getImpls(RaftServerProxy.java:284)
>   at 
> org.apache.ratis.server.impl.RaftServerProxy.start(RaftServerProxy.java:296)
>   at 
> org.apache.hadoop.ozone.container.common.transport.server.ratis.XceiverServerRatis.start(XceiverServerRatis.java:418)
>   at 
> org.apache.hadoop.ozone.container.ozoneimpl.OzoneContainer.start(OzoneContainer.java:232)
>   at 
> org.apache.hadoop.ozone.container.common.states.endpoint.VersionEndpointTask.call(VersionEndpointTask.java:113)
>   at 
> org.apache.hadoop.ozone.container.common.states.endpoint.VersionEndpointTask.call(VersionEndpointTask.java:42)
>   at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
>   at 
> java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
>   at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
>   at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>   at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>   at java.base/java.lang.Thread.run(Thread.java:834)
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.hadoop.ozone.container.common.transport.server.ratis.XceiverServerRatis.sendPipelineReport(XceiverServerRatis.java:757)
>   at 
> org.apache.hadoop.ozone.container.common.transport.server.ratis.XceiverServerRatis.notifyGroupAdd(XceiverServerRatis.java:739)
>   at 
> org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine.initialize(ContainerStateMachine.java:218)
>   at 
> org.apache.ratis.server.impl.ServerState.initStatemachine(ServerState.java:160)
>   at org.apache.ratis.server.impl.ServerState.(ServerState.java:112)
>   at 
> org.apache.ratis.server.impl.RaftServerImpl.(RaftServerImpl.java:112)
>   at 
> org.apache.ratis.server.impl.RaftServerProxy.lambda$newRaftServerImpl$2(RaftServerProxy.java:208)
>   at 
> java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1700)
>   ... 3 more
> {code}
> The 

[jira] [Commented] (HDDS-3116) Datanode sometimes fails to start with NPE when starting Ratis xceiver server

2020-03-09 Thread Marton Elek (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-3116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17054812#comment-17054812
 ] 

Marton Elek commented on HDDS-3116:
---

Is it a blocker for release 0.5.0? If yes, can we plese use *Blocker* priority 
and *0.5.0* target version (and sync with [~dineshchitlangia]).

> Datanode sometimes fails to start with NPE when starting Ratis xceiver server
> -
>
> Key: HDDS-3116
> URL: https://issues.apache.org/jira/browse/HDDS-3116
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Datanode
>Affects Versions: 0.5.0
>Reporter: Stephen O'Donnell
>Assignee: Stephen O'Donnell
>Priority: Major
>  Labels: pull-request-available
> Attachments: full_logs.txt
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> While working on a network Topology test (HDDS-3084) which does the following:
> 1. Start a cluster with 6 DNs and 2 racks.
> 2. Create a volume, bucket and a single key.
> 3. Stop one rack of hosts using "docker-compose down"
> 4. Read the data from the single key
> 5. Start the 3 down hosts
> 6. Stop the other 3 hosts
> 7. Attempt to read the key again.
> At step 5 I sometimes see this stack trace in one of the DNs and it fails to 
> full come up:
> {code}
> 2020-03-02 13:01:31,887 [Datanode State Machine Thread - 0] INFO 
> ozoneimpl.OzoneContainer: Attempting to start container services.
> 2020-03-02 13:01:31,887 [Datanode State Machine Thread - 0] INFO 
> ozoneimpl.OzoneContainer: Background container scanner has been disabled.
> 2020-03-02 13:01:31,887 [Datanode State Machine Thread - 0] INFO 
> ratis.XceiverServerRatis: Starting XceiverServerRatis 
> 8c1178dd-c44d-49d1-b899-cc3e40ae8f23 at port 9858
> 2020-03-02 13:01:31,887 [Datanode State Machine Thread - 0] WARN 
> statemachine.EndpointStateMachine: Unable to communicate to SCM server at 
> scm:9861 for past 15000 seconds.
> java.io.IOException: java.lang.NullPointerException
>   at org.apache.ratis.util.IOUtils.asIOException(IOUtils.java:54)
>   at org.apache.ratis.util.IOUtils.toIOException(IOUtils.java:61)
>   at org.apache.ratis.util.IOUtils.getFromFuture(IOUtils.java:70)
>   at 
> org.apache.ratis.server.impl.RaftServerProxy.getImpls(RaftServerProxy.java:284)
>   at 
> org.apache.ratis.server.impl.RaftServerProxy.start(RaftServerProxy.java:296)
>   at 
> org.apache.hadoop.ozone.container.common.transport.server.ratis.XceiverServerRatis.start(XceiverServerRatis.java:418)
>   at 
> org.apache.hadoop.ozone.container.ozoneimpl.OzoneContainer.start(OzoneContainer.java:232)
>   at 
> org.apache.hadoop.ozone.container.common.states.endpoint.VersionEndpointTask.call(VersionEndpointTask.java:113)
>   at 
> org.apache.hadoop.ozone.container.common.states.endpoint.VersionEndpointTask.call(VersionEndpointTask.java:42)
>   at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
>   at 
> java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
>   at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
>   at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>   at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>   at java.base/java.lang.Thread.run(Thread.java:834)
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.hadoop.ozone.container.common.transport.server.ratis.XceiverServerRatis.sendPipelineReport(XceiverServerRatis.java:757)
>   at 
> org.apache.hadoop.ozone.container.common.transport.server.ratis.XceiverServerRatis.notifyGroupAdd(XceiverServerRatis.java:739)
>   at 
> org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine.initialize(ContainerStateMachine.java:218)
>   at 
> org.apache.ratis.server.impl.ServerState.initStatemachine(ServerState.java:160)
>   at org.apache.ratis.server.impl.ServerState.(ServerState.java:112)
>   at 
> org.apache.ratis.server.impl.RaftServerImpl.(RaftServerImpl.java:112)
>   at 
> org.apache.ratis.server.impl.RaftServerProxy.lambda$newRaftServerImpl$2(RaftServerProxy.java:208)
>   at 
> java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1700)
>   ... 3 more
> {code}
> The DN does not recover from this automatically, although I confirmed that a 
> full cluster restart fixed it (docker-compose stop; docker-compose start). I 
> will try to confirm if a restart of the stuck DN would fix it or not too.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: 

[jira] [Commented] (HDDS-3116) Datanode sometimes fails to start with NPE when starting Ratis xceiver server

2020-03-03 Thread Stephen O'Donnell (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-3116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17050463#comment-17050463
 ] 

Stephen O'Donnell commented on HDDS-3116:
-

The code gets somewhat hard to following when it gets into Ratis, but from what 
I gather Ratis attempts to create a new RaftServerImpl.

This is actually a CompleteableFuture and the future is stored into the ImplMap.

The future gets this exception when it is execute, which it stores away for 
later (capture with an extra debug message I added:

{code}
2020-03-03 16:12:53,098 [pool-10-thread-1] ERROR ratis.XceiverServerRatis: 
Caught a NPE: 
java.lang.NullPointerException
at 
org.apache.hadoop.ozone.container.common.transport.server.ratis.XceiverServerRatis.sendPipelineReport(XceiverServerRatis.java:766)
at 
org.apache.hadoop.ozone.container.common.transport.server.ratis.XceiverServerRatis.notifyGroupAdd(XceiverServerRatis.java:742)
at 
org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine.initialize(ContainerStateMachine.java:218)
at 
org.apache.ratis.server.impl.ServerState.initStatemachine(ServerState.java:160)
at org.apache.ratis.server.impl.ServerState.(ServerState.java:112)
at 
org.apache.ratis.server.impl.RaftServerImpl.(RaftServerImpl.java:112)
at 
org.apache.ratis.server.impl.RaftServerProxy.lambda$newRaftServerImpl$2(RaftServerProxy.java:208)
at 
java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1700)
at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:834)
{code}

Then the Datanode State Machine Thread keeps trying over and over to start the 
Impl, but as the future failed, it just repeats the same exception over and 
over. The impl is never recreated, which is why once it fails it can never 
recover even though the OzoneContainer inside the DataodeStateMachine is now 
populated:

{code}
2020-03-03 16:12:56,813 [Datanode State Machine Thread - 0] WARN 
statemachine.EndpointStateMachine: Unable to communicate to SCM server at 
scm:9861 for past 0 seconds.
java.io.IOException: java.lang.NullPointerException
at org.apache.ratis.util.IOUtils.asIOException(IOUtils.java:54)
at org.apache.ratis.util.IOUtils.toIOException(IOUtils.java:61)
at org.apache.ratis.util.IOUtils.getFromFuture(IOUtils.java:70)
at 
org.apache.ratis.server.impl.RaftServerProxy.getImpls(RaftServerProxy.java:284)
at 
org.apache.ratis.server.impl.RaftServerProxy.start(RaftServerProxy.java:296)
at 
org.apache.hadoop.ozone.container.common.transport.server.ratis.XceiverServerRatis.start(XceiverServerRatis.java:421)
at 
org.apache.hadoop.ozone.container.ozoneimpl.OzoneContainer.start(OzoneContainer.java:237)
at 
org.apache.hadoop.ozone.container.common.states.endpoint.VersionEndpointTask.call(VersionEndpointTask.java:113)
at 
org.apache.hadoop.ozone.container.common.states.endpoint.VersionEndpointTask.call(VersionEndpointTask.java:42)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at 
java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: java.lang.NullPointerException
at 
org.apache.hadoop.ozone.container.common.transport.server.ratis.XceiverServerRatis.sendPipelineReport(XceiverServerRatis.java:771)
at 
org.apache.hadoop.ozone.container.common.transport.server.ratis.XceiverServerRatis.notifyGroupAdd(XceiverServerRatis.java:742)
at 
org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine.initialize(ContainerStateMachine.java:218)
at 
org.apache.ratis.server.impl.ServerState.initStatemachine(ServerState.java:160)
at org.apache.ratis.server.impl.ServerState.(ServerState.java:112)
at 
org.apache.ratis.server.impl.RaftServerImpl.(RaftServerImpl.java:112)
at 
org.apache.ratis.server.impl.RaftServerProxy.lambda$newRaftServerImpl$2(RaftServerProxy.java:208)
at 
java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1700)
{code}

It feels like this is a seperate problem within ratis, but it all stems from 
the same race condition.

> Datanode sometimes fails to start with NPE when starting Ratis xceiver server
> 

[jira] [Commented] (HDDS-3116) Datanode sometimes fails to start with NPE when starting Ratis xceiver server

2020-03-03 Thread Stephen O'Donnell (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-3116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17050280#comment-17050280
 ] 

Stephen O'Donnell commented on HDDS-3116:
-

This problem is actually quite easy to reproduce, simply by starting and 
stopping a datanode docker container over and over until the problem happens. A 
futher restart makes the problem go away. That led me to believe this is a race 
condition in the initialisation code rather than some bad state stored on the 
DN.

Adding a bit of extra debug, I believe the problem is created in the 
constructor of the DatanodeStateMachine class.

The problem manifests in this line in XceiverServerRatis:

{code}
context.addReport(context.getParent().getContainer().getPipelineReport());
{code}

Adding some debug here, I proved the NPE is happening as getContainer() returns 
null.

This method is calling an instance of DatanodeStateMachine and in that class 
you can see the OzoneContainer variable is set in the constructor and never 
changed, except on shutdown.

Checking that constructor, I can see the following:

{code}
  public DatanodeStateMachine(DatanodeDetails datanodeDetails,
  Configuration conf, CertificateClient certClient,
  HddsDatanodeStopService hddsDatanodeStopService) throws IOException {
OzoneConfiguration ozoneConf = new OzoneConfiguration(conf);
DatanodeConfiguration dnConf =
ozoneConf.getObject(DatanodeConfiguration.class);

this.hddsDatanodeStopService = hddsDatanodeStopService;
this.conf = conf;
this.datanodeDetails = datanodeDetails;
executorService = HadoopExecutors.newCachedThreadPool(
new ThreadFactoryBuilder().setDaemon(true)
.setNameFormat("Datanode State Machine Thread - %d").build());
connectionManager = new SCMConnectionManager(conf);
context = new StateContext(this.conf, DatanodeStates.getInitState(), this);
container = new OzoneContainer(this.datanodeDetails,
   ozoneConf, context, certClient);
...
{code}

Note the second last line - the current object **which has not finished 
constructing** is passed to create a new StateContext. This context is then 
passed to the new OzoneContainer call on the last line.

Following the path further, in the constructor for OzoneContainer we can see it 
calls the following and passes the context to it:

{code}
this.writeChannel = XceiverServerRatis.newXceiverServerRatis(
datanodeDetails, config, hddsDispatcher, controller, certClient,
context);
{code}

Which creates the XceiverServerRatis, which has its own thread pool. If this 
Ratis server calls sendPipelineReport before the OzoneContainerConstructor and 
DatanodeStateMachine constructors complete, this bug occurs and the DN will 
never startup.

The question is how to fix this as ultimately the RatisServer startup code is 
accessing two other objects that have not completed construction. We might be 
able to put some synchronization in to help with a quick fix.


> Datanode sometimes fails to start with NPE when starting Ratis xceiver server
> -
>
> Key: HDDS-3116
> URL: https://issues.apache.org/jira/browse/HDDS-3116
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Datanode
>Affects Versions: 0.5.0
>Reporter: Stephen O'Donnell
>Assignee: Nanda kumar
>Priority: Major
> Attachments: full_logs.txt
>
>
> While working on a network Topology test (HDDS-3084) which does the following:
> 1. Start a cluster with 6 DNs and 2 racks.
> 2. Create a volume, bucket and a single key.
> 3. Stop one rack of hosts using "docker-compose down"
> 4. Read the data from the single key
> 5. Start the 3 down hosts
> 6. Stop the other 3 hosts
> 7. Attempt to read the key again.
> At step 5 I sometimes see this stack trace in one of the DNs and it fails to 
> full come up:
> {code}
> 2020-03-02 13:01:31,887 [Datanode State Machine Thread - 0] INFO 
> ozoneimpl.OzoneContainer: Attempting to start container services.
> 2020-03-02 13:01:31,887 [Datanode State Machine Thread - 0] INFO 
> ozoneimpl.OzoneContainer: Background container scanner has been disabled.
> 2020-03-02 13:01:31,887 [Datanode State Machine Thread - 0] INFO 
> ratis.XceiverServerRatis: Starting XceiverServerRatis 
> 8c1178dd-c44d-49d1-b899-cc3e40ae8f23 at port 9858
> 2020-03-02 13:01:31,887 [Datanode State Machine Thread - 0] WARN 
> statemachine.EndpointStateMachine: Unable to communicate to SCM server at 
> scm:9861 for past 15000 seconds.
> java.io.IOException: java.lang.NullPointerException
>   at org.apache.ratis.util.IOUtils.asIOException(IOUtils.java:54)
>   at org.apache.ratis.util.IOUtils.toIOException(IOUtils.java:61)
>   at