[jira] [Commented] (HDFS-12098) Ozone: Datanode is unable to register with scm if scm starts later

2017-09-11 Thread Weiwei Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16162330#comment-16162330
 ] 

Weiwei Yang commented on HDFS-12098:


Hi [~anu], [~vagarychen]

Thanks for revisiting this. I could not reproduce it on the latest code base 
either; it looks like it was fixed by some other patches. This no longer seems 
to be a valid issue, so I think we can close it. Thanks for spending time 
trying to reproduce it.

> Ozone: Datanode is unable to register with scm if scm starts later
> --
>
> Key: HDFS-12098
> URL: https://issues.apache.org/jira/browse/HDFS-12098
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Components: datanode, ozone, scm
> Affects Versions: HDFS-7240
> Reporter: Weiwei Yang
> Assignee: Weiwei Yang
> Priority: Critical
> Labels: ozoneMerge
> Fix For: HDFS-7240
>
> Attachments: disabled-scm-test.patch, HDFS-12098-HDFS-7240.001.patch, HDFS-12098-HDFS-7240.002.patch, HDFS-12098-HDFS-7240.testcase-1.patch, HDFS-12098-HDFS-7240.testcase.patch, Screen Shot 2017-07-11 at 4.58.08 PM.png, thread_dump.log
>
>
> Reproducing steps
> 1. Start namenode
> {{./bin/hdfs --daemon start namenode}}
> 2. Start datanode
> {{./bin/hdfs datanode}}
> The datanode log shows the following connection retries:
> {noformat}
> 17/07/13 21:16:48 INFO ipc.Client: Retrying connect to server: ozone1.fyre.ibm.com/172.16.165.133:9861. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
> 17/07/13 21:16:49 INFO ipc.Client: Retrying connect to server: ozone1.fyre.ibm.com/172.16.165.133:9861. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
> 17/07/13 21:16:50 INFO ipc.Client: Retrying connect to server: ozone1.fyre.ibm.com/172.16.165.133:9861. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
> 17/07/13 21:16:51 INFO ipc.Client: Retrying connect to server: ozone1.fyre.ibm.com/172.16.165.133:9861. Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
> {noformat}
> This is expected because scm is not started yet.
> 3. Start scm
> {{./bin/hdfs scm}}
> The datanode is expected to register with this scm, and the scm log is expected to show:
> {noformat}
> 17/07/13 21:22:30 INFO node.SCMNodeManager: Data node with ID: af22862d-aafa-4941-9073-53224ae43e2c Registered.
> {noformat}
> but this log did *NOT* appear. (_I debugged the code and found that the 
> datanode state transitioned to SHUTDOWN unexpectedly because of a thread leak: 
> each of the leaked threads counted toward moving to the next state, and 
> together they set the state to SHUTDOWN_)
> 4. Create a container from scm CLI
> {{./bin/hdfs scm -container -create -c 20170714c0}}
> This fails with the following exception:
> {noformat}
> Creating container : 20170714c0.
> Error executing command:org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ozone.scm.exceptions.SCMException): Unable to create container while in chill mode
>   at org.apache.hadoop.ozone.scm.container.ContainerMapping.allocateContainer(ContainerMapping.java:241)
>   at org.apache.hadoop.ozone.scm.StorageContainerManager.allocateContainer(StorageContainerManager.java:392)
>   at org.apache.hadoop.ozone.protocolPB.StorageContainerLocationProtocolServerSideTranslatorPB.allocateContainer(StorageContainerLocationProtocolServerSideTranslatorPB.java:73)
> {noformat}
> The datanode was not registered with scm, so scm is still in chill mode.
> *Note*: if we start scm first, there is no such issue, and I can create a 
> container from the CLI without any problem.
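
The state problem noted in step 3 of the description can be illustrated with a minimal, self-contained Java sketch. This is not the Ozone datanode state machine: the `State` enum, `next`, and `runTasks` are hypothetical names invented for illustration, and the only assumption carried over from the report is that every leaked thread advances a shared state once. With a single thread the state moves one step; with several threads it is pushed all the way to SHUTDOWN.

```java
import java.util.concurrent.atomic.AtomicReference;

public class LeakedThreadStateSketch {
    // Hypothetical states for illustration; not the actual Ozone datanode state enum.
    enum State { INIT, RUNNING, SHUTDOWN }

    // Move to the following state, staying at the last one once it is reached.
    static State next(State s) {
        State[] all = State.values();
        return all[Math.min(s.ordinal() + 1, all.length - 1)];
    }

    // Start taskCount threads that each advance the shared state once,
    // then return the state after all of them have finished.
    static State runTasks(int taskCount) {
        AtomicReference<State> state = new AtomicReference<>(State.INIT);
        Thread[] threads = new Thread[taskCount];
        for (int i = 0; i < taskCount; i++) {
            threads[i] = new Thread(() -> state.updateAndGet(LeakedThreadStateSketch::next));
            threads[i].start();
        }
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting", e);
            }
        }
        return state.get();
    }

    public static void main(String[] args) {
        System.out.println("1 task:  " + runTasks(1)); // prints "1 task:  RUNNING"
        System.out.println("3 tasks: " + runTasks(3)); // prints "3 tasks: SHUTDOWN"
    }
}
```

In this sketch the usual remedy would be to make sure only one task per cycle is allowed to advance the state; whether the real fix took that form is not stated in this thread.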



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-12098) Ozone: Datanode is unable to register with scm if scm starts later

2017-09-11 Thread Anu Engineer (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16161994#comment-16161994
 ] 

Anu Engineer commented on HDFS-12098:
-

@weiwei yang, I was talking to [~vagarychen] offline, and he thinks this works 
for him. Would you be able to cross-check whether this is still broken?



[jira] [Commented] (HDFS-12098) Ozone: Datanode is unable to register with scm if scm starts later

2017-07-17 Thread Weiwei Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16091052#comment-16091052
 ] 

Weiwei Yang commented on HDFS-12098:


Oh [~anu], no problem at all. Thanks for your quick reply.



[jira] [Commented] (HDFS-12098) Ozone: Datanode is unable to register with scm if scm starts later

2017-07-17 Thread Anu Engineer (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16091035#comment-16091035
 ] 

Anu Engineer commented on HDFS-12098:
-

[~cheersyang] Sorry, I have not gotten to this yet. I will take a look at it 
soon. I have been trying to clear the code review backlog.



[jira] [Commented] (HDFS-12098) Ozone: Datanode is unable to register with scm if scm starts later

2017-07-17 Thread Weiwei Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16091004#comment-16091004
 ] 

Weiwei Yang commented on HDFS-12098:


Hi [~anu]

Have you tried to reproduce this issue, or applied the test case patch I 
uploaded to take a look at it? Please let me know, thanks.



[jira] [Commented] (HDFS-12098) Ozone: Datanode is unable to register with scm if scm starts later

2017-07-17 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16089723#comment-16089723
 ] 

Hadoop QA commented on HDFS-12098:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  1m  3s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m  0s{color} | {color:green} The patch appears to include 2 new or modified test files. {color} |
|| || || || {color:brown} HDFS-7240 Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 15m 27s{color} | {color:green} HDFS-7240 passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 55s{color} | {color:green} HDFS-7240 passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 39s{color} | {color:green} HDFS-7240 passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m  2s{color} | {color:green} HDFS-7240 passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 58s{color} | {color:green} HDFS-7240 passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 52s{color} | {color:green} HDFS-7240 passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 56s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 55s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 55s{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  0m 39s{color} | {color:orange} hadoop-hdfs-project/hadoop-hdfs: The patch generated 1 new + 154 unchanged - 0 fixed = 155 total (was 154) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 58s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m  0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m  2s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 50s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 71m  8s{color} | {color:red} hadoop-hdfs in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 21s{color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black}101m  4s{color} | {color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | hadoop.hdfs.server.datanode.TestDataNodeVolumeFailureReporting |
|   | hadoop.ozone.TestMiniOzoneCluster |
|   | hadoop.hdfs.TestDFSStripedOutputStreamWithFailure080 |
|   | hadoop.ozone.container.replication.TestContainerReplicationManager |
|   | hadoop.ozone.container.ozoneimpl.TestOzoneContainer |
|   | hadoop.ozone.TestStorageContainerManager |
|   | hadoop.hdfs.server.datanode.TestDataNodeVolumeFailure |
| Timed out junit tests | org.apache.hadoop.ozone.container.ozoneimpl.TestRatisManager |
\\
\\
|| Subsystem || Report/Notes ||
| Docker |  Image:yetus/hadoop:14b5c93 |
| JIRA Issue | HDFS-12098 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12877553/HDFS-12098-HDFS-7240.testcase-1.patch |
| Optional Tests |  asflicense  compile  javac  javadoc  mvninstall  mvnsite  unit  findbugs  checkstyle  |
| uname | Linux 6ea999e772d2 3.13.0-119-generic #166-Ubuntu SMP Wed May 3 12:18:55 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/hadoop/patchprocess/precommit/personality/provided.sh |
| git revision | HDFS-7240 / 1bec6a1 |
| Default Java | 1.8.0_131 |
| findbugs | v3.1.0-RC1 |
| checkstyle | https://builds.apache.org/job/PreCommit-HDFS-Build/20304/artifact/patchprocess/diff-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt |
| unit | https://builds.apache.org/job/PreCommit-HDFS-Build/20304/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt |
|  Test Results | https://builds.apache.org/job/PreCommit-HDFS-Build/20304/testReport/ |
| modules 

[jira] [Commented] (HDFS-12098) Ozone: Datanode is unable to register with scm if scm starts later

2017-07-16 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16089336#comment-16089336
 ] 

Hadoop QA commented on HDFS-12098:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 19s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m  0s{color} | {color:green} The patch appears to include 2 new or modified test files. {color} |
|| || || || {color:brown} HDFS-7240 Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 14m 39s{color} | {color:green} HDFS-7240 passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 51s{color} | {color:green} HDFS-7240 passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 39s{color} | {color:green} HDFS-7240 passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 59s{color} | {color:green} HDFS-7240 passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 52s{color} | {color:green} HDFS-7240 passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 51s{color} | {color:green} HDFS-7240 passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 52s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 50s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 50s{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  0m 36s{color} | {color:orange} hadoop-hdfs-project/hadoop-hdfs: The patch generated 16 new + 154 unchanged - 0 fixed = 170 total (was 154) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 53s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m  0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:red}-1{color} | {color:red} findbugs {color} | {color:red}  2m  0s{color} | {color:red} hadoop-hdfs-project/hadoop-hdfs generated 1 new + 0 unchanged - 0 fixed = 1 total (was 0) {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 49s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 74m 23s{color} | {color:red} hadoop-hdfs in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 21s{color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black}102m 13s{color} | {color:black} {color} |
\\
\\
|| Reason || Tests ||
| FindBugs | module:hadoop-hdfs-project/hadoop-hdfs |
|  |  Inconsistent synchronization of org.apache.hadoop.hdfs.server.datanode.DataNode.datanodeStateMachine; locked 42% of time. Unsynchronized access at DataNode.java:[line 3228] |
| Failed junit tests | hadoop.ozone.container.replication.TestContainerReplicationManager |
|   | hadoop.ozone.TestMiniOzoneCluster |
|   | hadoop.hdfs.TestDFSStripedOutputStreamWithFailure080 |
|   | hadoop.ozone.TestStorageContainerManager |
| Timed out junit tests | org.apache.hadoop.hdfs.TestLeaseRecovery2 |
\\
\\
|| Subsystem || Report/Notes ||
| Docker |  Image:yetus/hadoop:14b5c93 |
| JIRA Issue | HDFS-12098 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12877520/HDFS-12098-HDFS-7240.testcase.patch |
| Optional Tests |  asflicense  compile  javac  javadoc  mvninstall  mvnsite  unit  findbugs  checkstyle  |
| uname | Linux a4fe1c2f42ae 3.13.0-116-generic #163-Ubuntu SMP Fri Mar 31 14:13:22 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/hadoop/patchprocess/precommit/personality/provided.sh |
| git revision | HDFS-7240 / 1bec6a1 |
| Default Java | 1.8.0_131 |
| findbugs | v3.1.0-RC1 |
| checkstyle | https://builds.apache.org/job/PreCommit-HDFS-Build/20299/artifact/patchprocess/diff-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt |
| findbugs | 

[jira] [Commented] (HDFS-12098) Ozone: Datanode is unable to register with scm if scm starts later

2017-07-16 Thread Weiwei Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16089283#comment-16089283
 ] 

Weiwei Yang commented on HDFS-12098:


I have attached a test case patch to reproduce this issue. Please take a look 
at [^HDFS-12098-HDFS-7240.testcase.patch]. This patch simulates the following 
scenario:

# Start mini ozone cluster without starting scm
# Datanode is unable to register to scm
# Start scm, waiting for datanode to register
# Wait a while but datanode is still unable to successfully register to scm

If you apply this patch, the test will fail. You might have noticed that the 
patch changes more code than just adding a test; that is because of the reason 
I mentioned earlier. I have also added a method to check whether a datanode is 
registered with scm, so that we can check the datanode state even when scm is 
not started.
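
The scenario in the list above can be sketched in plain Java as a rough illustration. This is not the attached test and uses no Ozone or MiniOzoneCluster APIs: `waitFor`, `datanodeRegisters`, and the two flags are hypothetical stand-ins, with threads simulating a datanode that starts first and an scm that becomes available later. It shows the behaviour the test expects, namely that a datanode which keeps retrying does register once scm is up, and that the check times out otherwise.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.BooleanSupplier;

public class LateScmRegistrationSketch {
    // Poll check every intervalMs until it returns true or timeoutMs elapses.
    static boolean waitFor(BooleanSupplier check, long intervalMs, long timeoutMs) {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (!check.getAsBoolean()) {
            if (System.nanoTime() >= deadline) {
                return false;
            }
            try {
                Thread.sleep(intervalMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return true;
    }

    // Simulated scenario (hypothetical, no real RPC): the "datanode" thread starts
    // first and retries; the "scm" flag becomes available after scmDelayMs.
    // Returns true if the datanode registers within timeoutMs.
    static boolean datanodeRegisters(long scmDelayMs, long timeoutMs) {
        AtomicBoolean scmUp = new AtomicBoolean(false);
        AtomicBoolean registered = new AtomicBoolean(false);

        Thread datanode = new Thread(() -> {
            // Keep retrying instead of giving up while scm is unreachable.
            while (!registered.get()) {
                if (scmUp.get()) {
                    registered.set(true);
                } else {
                    try {
                        Thread.sleep(10);
                    } catch (InterruptedException e) {
                        return; // stopped by the caller
                    }
                }
            }
        });
        Thread scm = new Thread(() -> {
            try {
                Thread.sleep(scmDelayMs);
                scmUp.set(true);
            } catch (InterruptedException e) {
                // stopped by the caller before scm came up
            }
        });
        datanode.setDaemon(true);
        scm.setDaemon(true);
        datanode.start(); // datanode first
        scm.start();      // scm later

        boolean ok = waitFor(registered::get, 10, timeoutMs);
        datanode.interrupt();
        scm.interrupt();
        return ok;
    }

    public static void main(String[] args) {
        System.out.println("scm up after 50 ms, wait 5 s: " + datanodeRegisters(50, 5000));
        System.out.println("scm up after 5 s, wait 100 ms: " + datanodeRegisters(5000, 100));
    }
}
```

A helper that reports registration from the datanode side, like the method mentioned above, is what lets such a check run before scm has started.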

I also have a patch that fixes this; with that patch applied, this test 
passes. I am ready to share it as well.

Thanks



[jira] [Commented] (HDFS-12098) Ozone: Datanode is unable to register with scm if scm starts later

2017-07-14 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16088417#comment-16088417
 ] 

Hadoop QA commented on HDFS-12098:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 13s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m  0s{color} | {color:green} The patch appears to include 2 new or modified test files. {color} |
|| || || || {color:brown} HDFS-7240 Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 14m 34s{color} | {color:green} HDFS-7240 passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 52s{color} | {color:green} HDFS-7240 passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 39s{color} | {color:green} HDFS-7240 passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 56s{color} | {color:green} HDFS-7240 passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 52s{color} | {color:green} HDFS-7240 passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 53s{color} | {color:green} HDFS-7240 passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 51s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 50s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 50s{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  0m 35s{color} | {color:orange} hadoop-hdfs-project/hadoop-hdfs: The patch generated 5 new + 154 unchanged - 0 fixed = 159 total (was 154) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 54s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m  0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:red}-1{color} | {color:red} findbugs {color} | {color:red}  1m 58s{color} | {color:red} hadoop-hdfs-project/hadoop-hdfs generated 1 new + 0 unchanged - 0 fixed = 1 total (was 0) {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 50s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 65m 35s{color} | {color:red} hadoop-hdfs in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 20s{color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 93m 10s{color} | {color:black} {color} |
\\
\\
|| Reason || Tests ||
| FindBugs | module:hadoop-hdfs-project/hadoop-hdfs |
|  |  Inconsistent synchronization of org.apache.hadoop.hdfs.server.datanode.DataNode.datanodeStateMachine; locked 42% of time. Unsynchronized access at DataNode.java:[line 3228] |
| Failed junit tests | hadoop.ozone.TestMiniOzoneCluster |
|   | hadoop.hdfs.qjournal.client.TestQuorumJournalManager |
|   | hadoop.ozone.container.replication.TestContainerReplicationManager |
|   | hadoop.hdfs.TestDFSStripedOutputStreamWithFailure010 |
|   | hadoop.ozone.TestOzoneConfigurationFields |
| Timed out junit tests | org.apache.hadoop.ozone.container.ozoneimpl.TestRatisManager |
\\
\\
|| Subsystem || Report/Notes ||
| Docker |  Image:yetus/hadoop:14b5c93 |
| JIRA Issue | HDFS-12098 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12877433/HDFS-12098-HDFS-7240.testcase.patch |
| Optional Tests |  asflicense  compile  javac  javadoc  mvninstall  mvnsite  unit  findbugs  checkstyle  |
| uname | Linux 719ca50388a4 3.13.0-119-generic #166-Ubuntu SMP Wed May 3 12:18:55 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/hadoop/patchprocess/precommit/personality/provided.sh |
| git revision | HDFS-7240 / 90f1d58 |
| Default Java | 1.8.0_131 |
| findbugs | v3.1.0-RC1 |
| checkstyle | https://builds.apache.org/job/PreCommit-HDFS-Build/20284/artifact/patchprocess/diff-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt |
| findbugs | 

[jira] [Commented] (HDFS-12098) Ozone: Datanode is unable to register with scm if scm starts later

2017-07-14 Thread Weiwei Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16088411#comment-16088411
 ] 

Weiwei Yang commented on HDFS-12098:


Please hold off on reviewing the test patch; it still has some problems. I am 
working on a new one :P




[jira] [Commented] (HDFS-12098) Ozone: Datanode is unable to register with scm if scm starts later

2017-07-14 Thread Weiwei Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16088382#comment-16088382
 ] 

Weiwei Yang commented on HDFS-12098:


Hi [~anu]

I just uploaded a test case patch that reproduces this problem from a unit 
test. I revised how scm is started in MiniOzoneCluster to ensure that the scm 
constructor is only called when scm is started. With this change, I could 
reproduce the same issue I was seeing on a real setup. Please take a look, and 
if you agree with the problem I described, we can then look at the fix.

Thank you. 




[jira] [Commented] (HDFS-12098) Ozone: Datanode is unable to register with scm if scm starts later

2017-07-13 Thread Anu Engineer (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16086834#comment-16086834
 ] 

Anu Engineer commented on HDFS-12098:
-

Thank you for the detailed repro steps; I will look at this tomorrow.





[jira] [Commented] (HDFS-12098) Ozone: Datanode is unable to register with scm if scm starts later

2017-07-13 Thread Weiwei Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16086828#comment-16086828
 ] 

Weiwei Yang commented on HDFS-12098:


Hi [~anu]

bq. How do you start SCM, I always do bin/hdfs start scm or --daemon start scm. 
Do you do it differently ?

No, same. I realized the reproducing steps in the description were not clear, 
sorry about that. I just added some more details about the issue itself and how 
to reproduce it; please take a look. I'll work on reproducing this from a unit 
test as well.

Thank you.




[jira] [Commented] (HDFS-12098) Ozone: Datanode is unable to register with scm if scm starts later

2017-07-13 Thread Anu Engineer (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16086814#comment-16086814
 ] 

Anu Engineer commented on HDFS-12098:
-

From the description of the problem:

bq. Start SCM, expecting datanode could connect to the scm and the state 
machine could transit to RUNNING. However in actual, its state transits to 
SHUTDOWN, datanode enters chill mode.

How do you start SCM? I always run bin/hdfs start scm or --daemon start scm. 
Do you do it differently?

Anyway, I will try to debug this in a cluster.



> Ozone: Datanode is unable to register with scm if scm starts later
> --
>
> Key: HDFS-12098
> URL: https://issues.apache.org/jira/browse/HDFS-12098
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: datanode, ozone, scm
>Reporter: Weiwei Yang
>Assignee: Weiwei Yang
>Priority: Critical
> Attachments: disabled-scm-test.patch, HDFS-12098-HDFS-7240.001.patch, 
> HDFS-12098-HDFS-7240.002.patch, Screen Shot 2017-07-11 at 4.58.08 PM.png, 
> thread_dump.log
>
>
> Reproducing steps
> # Start datanode
> # Wait and see datanode state, it has connection issues, this is expected
> # Start SCM, expecting datanode could connect to the scm and the state 
> machine could transit to RUNNING. However in actual, its state transits to 
> SHUTDOWN, datanode enters chill mode.






[jira] [Commented] (HDFS-12098) Ozone: Datanode is unable to register with scm if scm starts later

2017-07-13 Thread Weiwei Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16086658#comment-16086658
 ] 

Weiwei Yang commented on HDFS-12098:


Hi [~anu]

bq. Looks like the main does call, SCM constructor ...

The main method you pasted is only called if scm is started, in bin/hdfs

{code}
scm)
  HADOOP_CLASSNAME='org.apache.hadoop.ozone.scm.StorageContainerManager'
  ...
{code}

If I don't start scm (as in the reproducing steps in the description), main 
won't be called and the port will not be bound. That's what I meant.

Thanks




[jira] [Commented] (HDFS-12098) Ozone: Datanode is unable to register with scm if scm starts later

2017-07-13 Thread Anu Engineer (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16086064#comment-16086064
 ] 

Anu Engineer commented on HDFS-12098:
-

[~cheersyang] Sorry to be so dense. I am not sure I fully understand what this 
means:
bq. However, in a real cluster environment. Scm constructor will not be called, 
so the port will not be bound. 
It looks like main does call the SCM constructor:
 
{code}
  /**
   * Main entry point for starting StorageContainerManager.
   *
   * @param argv arguments
   * @throws IOException if startup fails due to I/O error
   */
  public static void main(String[] argv) throws IOException {
    StringUtils.startupShutdownMessage(StorageContainerManager.class,
        argv, LOG);
    try {
      StorageContainerManager scm = new StorageContainerManager(
          new OzoneConfiguration());
      scm.start();
      scm.join();
    } catch (Throwable t) {
      LOG.error("Failed to start the StorageContainerManager.", t);
      terminate(1, t);
    }
  }
{code}




[jira] [Commented] (HDFS-12098) Ozone: Datanode is unable to register with scm if scm starts later

2017-07-13 Thread Anu Engineer (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16086056#comment-16086056
 ] 

Anu Engineer commented on HDFS-12098:
-

[~cheersyang] Thanks for the analysis. I would love it if MiniOzoneCluster were 
able to simulate issues seen in a real cluster; being able to reproduce 
real-cluster issues with MiniOzoneCluster would be a real win for us. Let me 
take a look at this. I am hoping the SCM changes you are suggesting for 
simulating this in MiniOzoneCluster are not too complex.




[jira] [Commented] (HDFS-12098) Ozone: Datanode is unable to register with scm if scm starts later

2017-07-13 Thread Weiwei Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16085463#comment-16085463
 ] 

Weiwei Yang commented on HDFS-12098:


Ah, I found the difference after hours of debugging. It is not that easy to 
reproduce this from the mini cluster, because the behavior differs between the 
mini cluster and a real cluster setup. Let me explain.

*Mini Cluster*
In class {{MiniOzoneCluster}}, SCM is initialized like this:

{code}
StorageContainerManager scm = new StorageContainerManager(conf);
if (!disableSCM) {
  // start SCM if it is not disabled.
  scm.start();
}
{code}

The constructor of scm initializes the scm datanode and client RPC servers. 
During initialization, {{RPC.Builder(conf)...build()}} binds the RPC server to 
the specific port. Once the port is bound, subsequent client RPC calls, e.g.

{code}
SCMVersionResponseProto versionResponse =
    rpcEndPoint.getEndPoint().getVersion(null);
{code}

will try to connect to that port and read data. However, the service is not 
responding, so the call gets a {{SocketTimeout}}.

*Real Cluster*

However, in a real cluster environment, the scm constructor is not called until 
scm is started, so the port is not bound. When the RPC client tries to connect 
to that port, it gets a {{connection refused}} error. This error is caught and 
triggers the RetryPolicy; that is where I saw the 10 retries that cause this 
problem (thread leak).
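
To illustrate the two failure modes, here is a small sketch using plain 
{{java.net}} sockets rather than Hadoop RPC, so it only approximates what the 
RPC client sees. It assumes a typical TCP stack where a bound, listening port 
that never accepts still completes the connect, and a port with no listener 
refuses the connection. The class and method names are made up for this 
example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.ConnectException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Hypothetical demo class; it does not use any Hadoop code.
class PortStateDemo {

  /** Connects to the local port, waits for one byte, names the outcome. */
  static String probe(int port, int timeoutMs) {
    try (Socket socket = new Socket()) {
      socket.connect(new InetSocketAddress(
          InetAddress.getLoopbackAddress(), port), timeoutMs);
      socket.setSoTimeout(timeoutMs);
      int b = socket.getInputStream().read();
      return b < 0 ? "closed" : "responded";
    } catch (SocketTimeoutException e) {
      return "timeout";   // connected, but nobody answered
    } catch (ConnectException e) {
      return "refused";   // nothing is listening on the port
    } catch (IOException e) {
      return "other: " + e;
    }
  }

  /** Mini-cluster-like case: port is bound, but nothing serves it. */
  static String probeBoundButSilent(int timeoutMs) {
    try (ServerSocket server =
        new ServerSocket(0, 50, InetAddress.getLoopbackAddress())) {
      // accept() is never called, so the client gets no response.
      return probe(server.getLocalPort(), timeoutMs);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** Picks a port by binding and closing, so it is very likely unused. */
  static int freePort() {
    try (ServerSocket server =
        new ServerSocket(0, 50, InetAddress.getLoopbackAddress())) {
      return server.getLocalPort();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** Real-cluster-like case: the server process was never started. */
  static String probeUnbound(int timeoutMs) {
    return probe(freePort(), timeoutMs);
  }

  public static void main(String[] args) {
    System.out.println("bound but silent: " + probeBoundButSilent(1000));
    System.out.println("not bound:        " + probeUnbound(1000));
  }
}
```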

I am not sure whether it is worth fixing this in the mini cluster; that would 
probably require refactoring the SCM constructor to move the RPC init code out, 
while the issue can be reproduced simply in a cluster setup by following the 
steps in the description.
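
For what it is worth, the kind of refactoring mentioned above could look 
roughly like the sketch below, where the constructor does no binding and the 
port is only bound in {{start()}}. This is a generic illustration with a plain 
{{ServerSocket}}; {{DeferredBindServer}} is a made-up class and says nothing 
about how StorageContainerManager is actually structured.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;

// Hypothetical stand-in for a server; not the real StorageContainerManager.
class DeferredBindServer implements AutoCloseable {

  private ServerSocket socket; // stays null until start() is called

  DeferredBindServer() {
    // Only side-effect-free setup belongs here; no port is bound yet.
  }

  /** Binds an ephemeral loopback port; calling it again is a no-op. */
  synchronized void start() {
    if (socket != null) {
      return;
    }
    try {
      socket = new ServerSocket(0, 50, InetAddress.getLoopbackAddress());
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  synchronized boolean isBound() {
    return socket != null && socket.isBound() && !socket.isClosed();
  }

  /** Returns the bound port, or -1 if the server has not been started. */
  synchronized int getPort() {
    return socket == null ? -1 : socket.getLocalPort();
  }

  @Override
  public synchronized void close() {
    if (socket == null) {
      return;
    }
    try {
      socket.close();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    } finally {
      socket = null;
    }
  }

  public static void main(String[] args) {
    try (DeferredBindServer server = new DeferredBindServer()) {
      System.out.println("bound after constructor: " + server.isBound());
      server.start();
      System.out.println("bound after start():     " + server.isBound());
    }
  }
}
```

With this shape, a test cluster that constructs the server but leaves it 
disabled would not hold the port, which is closer to a real deployment where 
the process is simply not running.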

Please kindly advise. Thanks.




[jira] [Commented] (HDFS-12098) Ozone: Datanode is unable to register with scm if scm starts later

2017-07-11 Thread Weiwei Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16083334#comment-16083334
 ] 

Weiwei Yang commented on HDFS-12098:


Hi [~anu]

The difference I noticed is that in the mini cluster the RPC seems to time out 
directly without retrying; I am not sure why the retry policy was not applied. 
On my setup I saw the following retries in the getVersion call:

{noformat}
17/07/11 19:27:05 INFO ipc.Client: Retrying connect to server: 
ozone1.fyre.ibm.com/172.16.165.133:9861. Already tried 4 time(s); retry policy 
is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
MILLISECONDS)
17/07/11 19:27:06 INFO ipc.Client: Retrying connect to server: 
ozone1.fyre.ibm.com/172.16.165.133:9861. Already tried 5 time(s); retry policy 
is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
MILLISECONDS)
17/07/11 19:27:07 INFO ipc.Client: Retrying connect to server: 
ozone1.fyre.ibm.com/172.16.165.133:9861. Already tried 6 time(s); retry policy 
is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 
MILLISECONDS)
{noformat}

These retries keep the thread alive even after the task execution is done. I 
will try to reproduce this in a test case.

Thank you for looking at this.




[jira] [Commented] (HDFS-12098) Ozone: Datanode is unable to register with scm if scm starts later

2017-07-11 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16083237#comment-16083237
 ] 

Hadoop QA commented on HDFS-12098:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m  0s{color} | {color:blue} Docker mode activated. {color} |
| {color:red}-1{color} | {color:red} patch {color} | {color:red}  0m  5s{color} | {color:red} HDFS-12098 does not apply to trunk. Rebase required? Wrong Branch? See https://wiki.apache.org/hadoop/HowToContribute for help. {color} |
\\
\\
|| Subsystem || Report/Notes ||
| JIRA Issue | HDFS-12098 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12876725/disabled-scm-test.patch |
| Console output | https://builds.apache.org/job/PreCommit-HDFS-Build/20239/console |
| Powered by | Apache Yetus 0.6.0-SNAPSHOT   http://yetus.apache.org |


This message was automatically generated.






[jira] [Commented] (HDFS-12098) Ozone: Datanode is unable to register with scm if scm starts later

2017-07-11 Thread Anu Engineer (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16083232#comment-16083232
 ] 

Anu Engineer commented on HDFS-12098:
-

One big difference is the fact that I have a 1000 millisecond timeout for the 
socket calls in tests.


> Ozone: Datanode is unable to register with scm if scm starts later
> --
>
> Key: HDFS-12098
> URL: https://issues.apache.org/jira/browse/HDFS-12098
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: datanode, ozone, scm
>Reporter: Weiwei Yang
>Assignee: Weiwei Yang
>Priority: Critical
> Attachments: HDFS-12098-HDFS-7240.001.patch, 
> HDFS-12098-HDFS-7240.002.patch, Screen Shot 2017-07-11 at 4.58.08 PM.png, 
> thread_dump.log
>
>
> Reproducing steps
> # Start datanode
> # Wait and see datanode state, it has connection issues, this is expected
> # Start SCM, expecting datanode could connect to the scm and the state 
> machine could transit to RUNNING. However in actual, its state transits to 
> SHUTDOWN, datanode enters chill mode.






[jira] [Commented] (HDFS-12098) Ozone: Datanode is unable to register with scm if scm starts later

2017-07-11 Thread Anu Engineer (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16083229#comment-16083229
 ] 

Anu Engineer commented on HDFS-12098:
-

@Weiwei Yang, can you please share your repro steps once again, or look at 
the test patch that I have created?

I have added a disable-SCM call; when the tests run, I can see that we do not 
hit the SCM.
{code}
java.net.SocketTimeoutException: Call From hw11767.home/192.168.29.224 to 
0.0.0.0:58880 failed on socket timeout exception: 
java.net.SocketTimeoutException: 1000 millis timeout while waiting for channel 
to be ready for read. ch : java.nio.channels
{code}

However, I am not able to see many Datanode state machine threads. Please see 
the attached snapshot from my profiler.
I have also attached a test case that I developed to simulate and debug this 
case.

Thanks
Anu








[jira] [Commented] (HDFS-12098) Ozone: Datanode is unable to register with scm if scm starts later

2017-07-11 Thread Anu Engineer (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16082656#comment-16082656
 ] 

Anu Engineer commented on HDFS-12098:
-

[~cheersyang] Thanks for reporting this and posting a patch. Before commenting 
on this I would like to simulate this in our unit tests and then test with and 
without your patch.  I am going to modify MiniOzoneCluster  and build it with 
flags called *disableSCM* and *disableKSM*, so we can simulate SCM or KSM being 
down. I will be able to explore the behavior in greater detail with that.

Some thoughts on this patch: if my understanding is correct, isn't the root 
issue that we time out but forget to tell the running thread that we have 
already timed out? I was wondering whether we could add an AtomicBoolean to 
each task that indicates whether it has timed out; then, when the thread comes 
out, it can see that the caller has timed out and exit. Do you think that 
would address this issue?
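
A rough sketch of the flag idea, to make the suggestion concrete. Everything here ({{TimeoutAwareTaskDemo}}, {{TimeoutAwareTask}}, {{runWithTimeout}}, the sleep standing in for one connection attempt) is hypothetical and only models the pattern; it is not the actual {{VersionEndpointTask}} code.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicBoolean;

class TimeoutAwareTaskDemo {

  /** Hypothetical task that retries a slow call and checks the flag between attempts. */
  static final class TimeoutAwareTask implements Runnable {
    final AtomicBoolean timedOut = new AtomicBoolean(false);
    volatile int attempts = 0;
    private final int maxAttempts;
    private final long sleepMillis;

    TimeoutAwareTask(int maxAttempts, long sleepMillis) {
      this.maxAttempts = maxAttempts;
      this.sleepMillis = sleepMillis;
    }

    @Override
    public void run() {
      // Stop retrying as soon as the caller has signalled that it gave up.
      while (attempts < maxAttempts && !timedOut.get()) {
        attempts++;
        try {
          Thread.sleep(sleepMillis); // stands in for one failed connection attempt
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
          return;
        }
      }
    }
  }

  /** Runs the task, sets the flag if the caller times out, and returns the attempts made. */
  static int runWithTimeout(int maxAttempts, long sleepMillis, long timeoutMillis) {
    ExecutorService executor = Executors.newSingleThreadExecutor();
    TimeoutAwareTask task = new TimeoutAwareTask(maxAttempts, sleepMillis);
    Future<?> future = executor.submit(task);
    try {
      future.get(timeoutMillis, TimeUnit.MILLISECONDS);
    } catch (TimeoutException e) {
      task.timedOut.set(true); // tell the worker that the caller has timed out
    } catch (Exception e) {
      throw new IllegalStateException(e);
    }
    executor.shutdown();
    try {
      executor.awaitTermination(5, TimeUnit.SECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    return task.attempts;
  }

  public static void main(String[] args) {
    System.out.println("attempts before exit: " + runWithTimeout(10, 100, 250));
  }
}
```

Note that in this sketch the flag is only checked between attempts, so a call that is already blocked still runs until that attempt returns.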

The reason I am asking is that if we pursue the single-thread approach, we 
have to create many state machines for the various tasks, for example for 
many SCMs or for running some complex SCM commands.

I am fine with that approach too, but this is something that I wanted us to 
consider.


> Ozone: Datanode is unable to register with scm if scm starts later
> --
>
> Key: HDFS-12098
> URL: https://issues.apache.org/jira/browse/HDFS-12098
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: datanode, ozone, scm
>Reporter: Weiwei Yang
>Assignee: Weiwei Yang
>Priority: Critical
> Attachments: HDFS-12098-HDFS-7240.001.patch, 
> HDFS-12098-HDFS-7240.002.patch, thread_dump.log
>
>
> Reproducing steps
> # Start datanode
> # Wait and see datanode state, it has connection issues, this is expected
> # Start SCM, expecting datanode could connect to the scm and the state 
> machine could transit to RUNNING. However in actual, its state transits to 
> SHUTDOWN, datanode enters chill mode.






[jira] [Commented] (HDFS-12098) Ozone: Datanode is unable to register with scm if scm starts later

2017-07-09 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16079862#comment-16079862
 ] 

Hadoop QA commented on HDFS-12098:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 11s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m  0s{color} | {color:green} The patch appears to include 1 new or modified test files. {color} |
|| || || || {color:brown} HDFS-7240 Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 14m 40s{color} | {color:green} HDFS-7240 passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 51s{color} | {color:green} HDFS-7240 passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 37s{color} | {color:green} HDFS-7240 passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 59s{color} | {color:green} HDFS-7240 passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 51s{color} | {color:green} HDFS-7240 passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 53s{color} | {color:green} HDFS-7240 passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 52s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 49s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 49s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 35s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 53s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m  0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 57s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 50s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 65m 38s{color} | {color:red} hadoop-hdfs in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 20s{color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 93m 14s{color} | {color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | hadoop.hdfs.TestDFSStripedOutputStreamWithFailure080 |
| Timed out junit tests | org.apache.hadoop.ozone.container.ozoneimpl.TestRatisManager |
\\
\\
|| Subsystem || Report/Notes ||
| Docker |  Image:yetus/hadoop:14b5c93 |
| JIRA Issue | HDFS-12098 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12876355/HDFS-12098-HDFS-7240.002.patch |
| Optional Tests |  asflicense  compile  javac  javadoc  mvninstall  mvnsite  unit  findbugs  checkstyle  |
| uname | Linux 38d58a4eea69 3.13.0-119-generic #166-Ubuntu SMP Wed May 3 12:18:55 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/hadoop/patchprocess/precommit/personality/provided.sh |
| git revision | HDFS-7240 / 87154fc |
| Default Java | 1.8.0_131 |
| findbugs | v3.1.0-RC1 |
| unit | https://builds.apache.org/job/PreCommit-HDFS-Build/20206/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt |
|  Test Results | https://builds.apache.org/job/PreCommit-HDFS-Build/20206/testReport/ |
| modules | C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs |
| Console output | https://builds.apache.org/job/PreCommit-HDFS-Build/20206/console |
| Powered by | Apache Yetus 0.6.0-SNAPSHOT   http://yetus.apache.org |


This message was automatically generated.




[jira] [Commented] (HDFS-12098) Ozone: Datanode is unable to register with scm if scm starts later

2017-07-07 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16078409#comment-16078409
 ] 

Hadoop QA commented on HDFS-12098:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m  9s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m  0s{color} | {color:green} The patch appears to include 1 new or modified test files. {color} |
|| || || || {color:brown} HDFS-7240 Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 18m 19s{color} | {color:green} HDFS-7240 passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 51s{color} | {color:green} HDFS-7240 passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 38s{color} | {color:green} HDFS-7240 passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 58s{color} | {color:green} HDFS-7240 passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 57s{color} | {color:green} HDFS-7240 passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 51s{color} | {color:green} HDFS-7240 passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 52s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 49s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 49s{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  0m 35s{color} | {color:orange} hadoop-hdfs-project/hadoop-hdfs: The patch generated 4 new + 1 unchanged - 0 fixed = 5 total (was 1) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 54s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m  0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  1m 57s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 49s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 65m 17s{color} | {color:red} hadoop-hdfs in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 22s{color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 96m 33s{color} | {color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | hadoop.hdfs.TestDFSStripedInputStreamWithRandomECPolicy |
|   | hadoop.hdfs.TestDFSStripedOutputStreamWithFailure080 |
|   | hadoop.hdfs.server.namenode.TestNamenodeCapacityReport |
\\
\\
|| Subsystem || Report/Notes ||
| Docker |  Image:yetus/hadoop:14b5c93 |
| JIRA Issue | HDFS-12098 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12876099/HDFS-12098-HDFS-7240.001.patch |
| Optional Tests |  asflicense  compile  javac  javadoc  mvninstall  mvnsite  unit  findbugs  checkstyle  |
| uname | Linux de496575ec93 3.13.0-116-generic #163-Ubuntu SMP Fri Mar 31 14:13:22 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/hadoop/patchprocess/precommit/personality/provided.sh |
| git revision | HDFS-7240 / 5fd38a6 |
| Default Java | 1.8.0_131 |
| findbugs | v3.1.0-RC1 |
| checkstyle | https://builds.apache.org/job/PreCommit-HDFS-Build/20188/artifact/patchprocess/diff-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt |
| unit | https://builds.apache.org/job/PreCommit-HDFS-Build/20188/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt |
|  Test Results | https://builds.apache.org/job/PreCommit-HDFS-Build/20188/testReport/ |
| modules | C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs |
| Console output | https://builds.apache.org/job/PreCommit-HDFS-Build/20188/console |
| Powered by | Apache Yetus 0.5.0-SNAPSHOT   http://yetus.apache.org |


This message was automatically generated.




[jira] [Commented] (HDFS-12098) Ozone: Datanode is unable to register with scm if scm starts later

2017-07-07 Thread Weiwei Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16078213#comment-16078213
 ] 

Weiwei Yang commented on HDFS-12098:


This is because the datanode state machine leaks {{VersionEndpointTask}} 
threads. When the SCM is not yet started, more and more 
{{VersionEndpointTask}} threads keep retrying the connection to the SCM:

{noformat}
INIT - RUNNING 
 \
GETVERSION
   executor.execute(new VersionEndpointTask()) - retry on getVersion ...
   ... (HB interval)
   executor.execute(new VersionEndpointTask()) - retry on getVersion ...
   ... (HB interval)
   executor.execute(new VersionEndpointTask()) - retry on getVersion ...
   ...
{noformat}

The version endpoint tasks are launched once per HB interval (5s in my env), 
so a new task is submitted every 5s. The retry policy for each getVersion call 
is 10 * 1s = 10s, so a task can only finish every 10s. Tasks are therefore 
submitted faster than they finish, and every 10s ONE thread leaks.

When the SCM is up, all pending tasks are able to connect to it and their 
getVersion calls return, so each of them advances the state to the next one. 
Since the state is shared in {{EndpointStateMachine}}, it is incremented more 
than once, so when I review the state changes, they look like this:

{noformat}
REGISTER
HEARTBEAT
SHUTDOWN
SHUTDOWN
SHUTDOWN
... 
{noformat}
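
The overshoot above can be sketched with a small model. This is not the real {{EndpointStateMachine}}: the class {{SharedStateDemo}}, its enum and both methods are hypothetical, and the state order is only taken from the trace above (with GETVERSION as the starting state). The pending tasks are run one after another here to keep the result deterministic.

```java
import java.util.concurrent.atomic.AtomicReference;

class SharedStateDemo {

  /** Hypothetical endpoint states, ordered as in the trace above. */
  enum EndpointState {
    GETVERSION, REGISTER, HEARTBEAT, SHUTDOWN;

    /** Moves one step forward and stays at SHUTDOWN once it is reached. */
    EndpointState next() {
      return this == SHUTDOWN ? SHUTDOWN : values()[ordinal() + 1];
    }
  }

  /** Every pending task advances the shared state unconditionally. */
  static EndpointState advanceUnguarded(int pendingTasks) {
    AtomicReference<EndpointState> state =
        new AtomicReference<>(EndpointState.GETVERSION);
    for (int i = 0; i < pendingTasks; i++) {
      state.updateAndGet(EndpointState::next);
    }
    return state.get();
  }

  /** Each task advances only if the state is still the one it was started in. */
  static EndpointState advanceGuarded(int pendingTasks) {
    AtomicReference<EndpointState> state =
        new AtomicReference<>(EndpointState.GETVERSION);
    for (int i = 0; i < pendingTasks; i++) {
      // Only the first task succeeds; later tasks see REGISTER and do nothing.
      state.compareAndSet(EndpointState.GETVERSION, EndpointState.REGISTER);
    }
    return state.get();
  }

  public static void main(String[] args) {
    System.out.println("unguarded, 5 tasks: " + advanceUnguarded(5));
    System.out.println("guarded, 5 tasks:   " + advanceGuarded(5));
  }
}
```

The guarded variant is just one possible way to keep leftover tasks from pushing the state further; it is not necessarily what the attached patches do.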

> Ozone: Datanode is unable to register with scm if scm starts later
> --
>
> Key: HDFS-12098
> URL: https://issues.apache.org/jira/browse/HDFS-12098
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: datanode, ozone, scm
>Reporter: Weiwei Yang
>Assignee: Weiwei Yang
>Priority: Critical
>
> Reproducing steps
> # Start datanode
> # Wait and see datanode state, it has connection issues, this is expected
> # Start SCM, expecting datanode could connect to the scm and the state 
> machine could transit to RUNNING. However in actual, its state transits to 
> SHUTDOWN, datanode enters chill mode.


