[jira] [Commented] (HDFS-14908) LeaseManager should check parent-child relationship when filter open files.

2019-10-15 Thread Wei-Chiu Chuang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16952094#comment-16952094
 ] 

Wei-Chiu Chuang commented on HDFS-14908:


[~linyiqun] wanna take a look?

> LeaseManager should check parent-child relationship when filter open files.
> ---
>
> Key: HDFS-14908
> URL: https://issues.apache.org/jira/browse/HDFS-14908
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 3.1.0, 3.0.1
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Minor
> Attachments: HDFS-14908.001.patch
>
>
> Currently, when doing listOpenFiles(), LeaseManager only checks whether the 
> filter path is a string prefix of the open file paths. It should instead check 
> whether the filter path is the parent/ancestor directory of the open files.
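
For illustration, a minimal sketch of the difference between the two checks in plain Java (not the actual LeaseManager code; variable names are only illustrative):

{code:java}
// Prefix check: a filter of "/foo/bar" wrongly matches an open file "/foo/bar2/baz".
boolean prefixMatch = openFilePath.startsWith(filterPath);

// Parent/ancestor check: require a path-separator boundary after the filter path.
String filterDir = filterPath.endsWith("/") ? filterPath : filterPath + "/";
boolean ancestorMatch = openFilePath.equals(filterPath)
    || openFilePath.startsWith(filterDir);
{code}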



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14908) LeaseManager should check parent-child relationship when filter open files.

2019-10-15 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang updated HDFS-14908:
---
Affects Version/s: 3.1.0
   3.0.1

> LeaseManager should check parent-child relationship when filter open files.
> ---
>
> Key: HDFS-14908
> URL: https://issues.apache.org/jira/browse/HDFS-14908
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 3.1.0, 3.0.1
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Minor
> Attachments: HDFS-14908.001.patch
>
>
> Currently, when doing listOpenFiles(), LeaseManager only checks whether the 
> filter path is a string prefix of the open file paths. It should instead check 
> whether the filter path is the parent/ancestor directory of the open files.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14271) [SBN read] StandbyException is logged if Observer is the first NameNode

2019-10-13 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang updated HDFS-14271:
---
Labels: multi-sbnn  (was: )

> [SBN read] StandbyException is logged if Observer is the first NameNode
> ---
>
> Key: HDFS-14271
> URL: https://issues.apache.org/jira/browse/HDFS-14271
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs
>Affects Versions: 3.3.0
>Reporter: Wei-Chiu Chuang
>Assignee: Shen Yinjie
>Priority: Minor
>  Labels: multi-sbnn
> Attachments: HDFS-14271_1.patch
>
>
> If I transition the first NameNode into Observer state, and then I create a 
> file from command line, it prints the following StandbyException log message, 
> as if the command failed. But it actually completed successfully:
> {noformat}
> [root@weichiu-sbsr-1 ~]# hdfs dfs -touchz /tmp/abf
> 19/02/12 16:35:17 INFO retry.RetryInvocationHandler: 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException):
>  Operation category WRITE is not supported in state observer. Visit 
> https://s.apache.org/sbnn-error
>   at 
> org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.checkOperation(StandbyState.java:98)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.checkOperation(NameNode.java:1987)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkOperation(FSNamesystem.java:1424)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:762)
>   at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:458)
>   at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:530)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
>   at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:918)
>   at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:853)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2782)
> , while invoking $Proxy4.create over 
> [weichiu-sbsr-1.gce.cloudera.com/172.31.121.145:8020,weichiu-sbsr-2.gce.cloudera.com/172.31.121.140:8020].
>  Trying to failover immediately.
> {noformat}
> This is unlike the case when the first NameNode is the Standby, where this 
> StandbyException is suppressed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-6524) Choosing datanode retries times considering with block replica number

2019-10-13 Thread Wei-Chiu Chuang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-6524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16950369#comment-16950369
 ] 

Wei-Chiu Chuang commented on HDFS-6524:
---

Looks like a good improvement, thanks [~leosun08].
Additionally, you might find HDFS hedged reads useful. There doesn't appear to 
be a good reference in the Apache Hadoop docs, but you can find additional info 
in the HBase book: http://hbase.apache.org/book.html#hedged.reads
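
For reference, hedged reads are a client-side feature enabled through configuration. A minimal sketch (the values below are examples, not recommendations):

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class HedgedReadExample {
  public static FileSystem open() throws java.io.IOException {
    Configuration conf = new Configuration();
    // A thread pool size > 0 enables hedged reads on the DFS client.
    conf.setInt("dfs.client.hedged.read.threadpool.size", 20);
    // How long to wait on the first datanode before starting a hedged read.
    conf.setLong("dfs.client.hedged.read.threshold.millis", 500);
    return FileSystem.get(conf);
  }
}
{code}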

> Choosing datanode  retries times considering with block replica number
> --
>
> Key: HDFS-6524
> URL: https://issues.apache.org/jira/browse/HDFS-6524
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs-client
>Affects Versions: 3.0.0-alpha1
>Reporter: Liang Xie
>Assignee: Lisheng Sun
>Priority: Minor
>  Labels: BB2015-05-TBR
> Attachments: HDFS-6524.001.patch, HDFS-6524.002.patch, 
> HDFS-6524.003.patch, HDFS-6524.004.patch, HDFS-6524.005(2).patch, 
> HDFS-6524.005.patch, HDFS-6524.006.patch, HDFS-6524.007.patch, HDFS-6524.txt
>
>
> Currently chooseDataNode() retries according to the setting 
> dfsClientConf.maxBlockAcquireFailures, which by default is 3 
> (DFS_CLIENT_MAX_BLOCK_ACQUIRE_FAILURES_DEFAULT = 3). It would be better to have 
> another option that considers the block replication factor, e.g. a cluster 
> configured with only two block replicas, or a Reed-Solomon erasure coding 
> setup with a single replica. This helps to reduce the long tail latency.
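
A minimal sketch of the idea (illustrative only, not the DFSClient implementation): cap the retry count by the file's replication factor, so a file with one or two replicas does not spend all maxBlockAcquireFailures rounds waiting for datanodes that cannot exist.

{code:java}
int maxBlockAcquireFailures = 3;  // dfs.client.max.block.acquire.failures (default 3)
int replicationFactor = 2;        // replication factor of the file being read
int effectiveRetries = Math.min(maxBlockAcquireFailures, replicationFactor);
{code}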



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14190) Copying folders containing = - characters between hdfs (using webhdfs) does not work in distcp

2019-10-09 Thread Wei-Chiu Chuang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16947850#comment-16947850
 ] 

Wei-Chiu Chuang commented on HDFS-14190:


I suspect HDFS-14323 or HDFS-14423 fixed it.

> Copying folders containing = - characters between hdfs (using webhdfs) does 
> not work in distcp
> --
>
> Key: HDFS-14190
> URL: https://issues.apache.org/jira/browse/HDFS-14190
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: distcp
>Affects Versions: 3.1.1
>Reporter: yinsong
>Assignee: Aihua Xu
>Priority: Major
>
> Copying folders whose names contain '=' or '-' characters between HDFS 
> clusters (using webhdfs) does not work in distcp.
> For example:
> src: hadoop 2.7  target: hadoop 3.1.1
> (1)
> hadoop distcp \
> -pugp \
> -i \
> webhdfs://1.1.1.1:50070/sudiyi_datawarehouse 
> webhdfs://2.2.2.2:50070/sudiyi_datawarehouse
> ERROR tools.SimpleCopyListing: FileNotFoundException exception in listStatus: 
> File /sudiyi_datawarehouse/st_device_standard_ds/date_time%3D2018-10-10 does 
> not exist
>  
> (2)
> hadoop distcp \
> -Dmapreduce.framework.name=yarn \
> -pugp \
> -i \
> webhdfs://1.1.1.1:50070/druid webhdfs://2.2.2.2:50070/druid
> Error: java.io.IOException: File copy failed: 
> webhdfs://10.26.93.65:50070/druid/indexing-logs/kill_task-myapp_V1-2018-04-26T16_20_55+0800
>  --> 
> webhdfs://10.27.234.198:50070/druid/indexing-logs/kill_task-myapp_V1-2018-04-26T16_20_55+0800
>  at 
> org.apache.hadoop.tools.mapred.CopyMapper.copyFileWithRetry(CopyMapper.java:259)
>  at org.apache.hadoop.tools.mapred.CopyMapper.map(CopyMapper.java:217)
>  at org.apache.hadoop.tools.mapred.CopyMapper.map(CopyMapper.java:48)
>  at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
>  at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:799)
>  at org.apache.hadoop.mapred.MapTask.run(MapTask.java:347)
>  at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:174)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at javax.security.auth.Subject.doAs(Subject.java:422)
>  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
>  at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:168)
> Caused by: java.io.IOException: Couldn't run retriable-command: Copying 
> webhdfs://10.26.93.65:50070/druid/indexing-logs/kill_task-myapp_V1-2018-04-26T16_20_55+0800
>  to 
> webhdfs://10.27.234.198:50070/druid/indexing-logs/kill_task-myapp_V1-2018-04-26T16_20_55+0800
>  at 
> org.apache.hadoop.tools.util.RetriableCommand.execute(RetriableCommand.java:101)
>  at 
> org.apache.hadoop.tools.mapred.CopyMapper.copyFileWithRetry(CopyMapper.java:256)
>  ... 10 more
> Caused by: java.io.IOException: Failed to promote 
> tmp-file:webhdfs://10.27.234.198:50070/druid/.distcp.tmp.attempt_1545990837043_0016_m_15_2
>  to: 
> webhdfs://10.27.234.198:50070/druid/indexing-logs/kill_task-myapp_V1-2018-04-26T16_20_55+0800
>  at 
> org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.promoteTmpToTarget(RetriableFileCopyCommand.java:250)
>  at 
> org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.doCopy(RetriableFileCopyCommand.java:140)
>  at 
> org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.doExecute(RetriableFileCopyCommand.java:99)
>  at 
> org.apache.hadoop.tools.util.RetriableCommand.execute(RetriableCommand.java:87)
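
The "%3D" in the listStatus error is the percent-encoded form of '=', which suggests the encoded path is being used verbatim on the target. A small self-contained illustration of the encoding round trip (plain JDK code, not distcp internals):

{code:java}
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class EqualsSignEncoding {
  public static void main(String[] args) {
    String dir = "date_time=2018-10-10";
    // WebHDFS percent-encodes '=' in the request URL; the failure above shows the
    // encoded form ("date_time%3D2018-10-10") leaking into the path lookup.
    String encoded = URLEncoder.encode(dir, StandardCharsets.UTF_8);
    String decoded = URLDecoder.decode(encoded, StandardCharsets.UTF_8);
    System.out.println(encoded + " -> " + decoded);
  }
}
{code}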



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14902) RBF: NullPointer When Misconfigured

2019-10-08 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang updated HDFS-14902:
---
Summary: RBF: NullPointer When Misconfigured  (was: NullPointer When 
Misconfigured)

> RBF: NullPointer When Misconfigured
> ---
>
> Key: HDFS-14902
> URL: https://issues.apache.org/jira/browse/HDFS-14902
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: rbf
>Affects Versions: 3.2.0
>Reporter: David Mollitor
>Priority: Minor
>
> Admittedly the server was misconfigured, but this should be handled a bit more 
> elegantly.
> {code:none}
> 2019-10-08 11:19:52,505 ERROR router.NamenodeHeartbeatService: Unhandled 
> exception updating NN registration for null:null
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.hdfs.federation.protocol.proto.HdfsServerFederationProtos$NamenodeMembershipRecordProto$Builder.setServiceAddress(HdfsServerFederationProtos.java:3831)
>   at 
> org.apache.hadoop.hdfs.server.federation.store.records.impl.pb.MembershipStatePBImpl.setServiceAddress(MembershipStatePBImpl.java:119)
>   at 
> org.apache.hadoop.hdfs.server.federation.store.records.MembershipState.newInstance(MembershipState.java:108)
>   at 
> org.apache.hadoop.hdfs.server.federation.resolver.MembershipNamenodeResolver.registerNamenode(MembershipNamenodeResolver.java:259)
>   at 
> org.apache.hadoop.hdfs.server.federation.router.NamenodeHeartbeatService.updateState(NamenodeHeartbeatService.java:223)
>   at 
> org.apache.hadoop.hdfs.server.federation.router.NamenodeHeartbeatService.periodicInvoke(NamenodeHeartbeatService.java:159)
>   at 
> org.apache.hadoop.hdfs.server.federation.router.PeriodicService$1.run(PeriodicService.java:178)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
>   at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
>   at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> {code}
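
One possible direction, sketched here only for illustration (this is not the committed fix; the class and variable names are hypothetical): validate the addresses before building the membership record so a misconfiguration produces a clear error instead of an NPE inside the protobuf builder.

{code:java}
import java.io.IOException;

public final class RegistrationGuard {
  // Fail fast with a descriptive message when the heartbeat has no resolvable
  // RPC/service address, instead of passing null to setServiceAddress().
  public static void checkAddresses(String rpcAddress, String serviceAddress,
      String nameserviceId, String namenodeId) throws IOException {
    if (rpcAddress == null || serviceAddress == null) {
      throw new IOException("Cannot register namenode " + nameserviceId + ":"
          + namenodeId + ": RPC/service address is not configured");
    }
  }
}
{code}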



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-13806) EC: No error message for unsetting EC policy of the directory inherits the erasure coding policy from an ancestor directory

2019-10-04 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-13806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang updated HDFS-13806:
---
Fix Version/s: 3.1.4

> EC: No error message for unsetting EC policy of the directory inherits the 
> erasure coding policy from an ancestor directory
> ---
>
> Key: HDFS-13806
> URL: https://issues.apache.org/jira/browse/HDFS-13806
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: erasure-coding
>Affects Versions: 3.0.0
> Environment: 3 Node SUSE Linux cluster
>Reporter: Souryakanta Dwivedy
>Assignee: Ayush Saxena
>Priority: Minor
> Fix For: 3.2.0, 3.1.4
>
> Attachments: HDFS-13806-01.patch, HDFS-13806-02.patch, 
> HDFS-13806-03.patch, HDFS-13806-04.patch, HDFS-13806-05.patch, 
> HDFS-13806-06.patch, No_error_unset_ec_policy.png
>
>
> No error message is thrown when unsetting the EC policy of a directory that 
> inherits its erasure coding policy from an ancestor directory.
> Steps :-
> --
>  * Create a directory
>  - Set an EC policy for the directory
>  - Create a file inside that directory 
>  - Create a sub-directory inside the parent directory
>  - Check that both the file and the sub-directory inherit the EC policy from 
> the parent directory
>  - Try to unset the EC policy for the file and check that it throws an error: [ 
> Cannot unset an erasure coding policy on a file]
>  - Try to unset the EC policy for the sub-directory and check that it reports a 
> success message [Unset erasure coding policy from ] 
>  instead of throwing an error message, which is wrong behavior (see the 
> illustrative command sequence below)
> Actual output :-
> No proper error message is thrown when unsetting the EC policy of a directory 
> that inherits its erasure coding policy from an ancestor directory.
>  A success message is displayed instead of an error message.
>  Expected output :-
>  
>  A proper error message should be thrown when trying to unset the EC policy of 
> a directory that inherits its erasure coding policy from an ancestor directory, 
>  like the error message thrown when unsetting the EC policy of a file that 
> inherits the erasure coding policy from an ancestor directory.
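
An illustrative command sequence for the scenario above (paths are examples only):

{noformat}
hdfs dfs -mkdir /parent
hdfs ec -setPolicy -path /parent               # set an EC policy on the parent
hdfs dfs -mkdir /parent/child                  # child inherits the policy
hdfs ec -unsetPolicy -path /parent/child       # currently reports success;
                                               # expected: an error, as for files
{noformat}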



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14497) Write lock held by metasave impact following RPC processing

2019-10-04 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang updated HDFS-14497:
---
Fix Version/s: 3.2.2
   3.1.4

> Write lock held by metasave impact following RPC processing
> ---
>
> Key: HDFS-14497
> URL: https://issues.apache.org/jira/browse/HDFS-14497
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: He Xiaoqiao
>Assignee: He Xiaoqiao
>Priority: Major
> Fix For: 3.3.0, 3.1.4, 3.2.2
>
> Attachments: HDFS-14497-addendum.001.patch, HDFS-14497.001.patch
>
>
> NameNode metaSave currently holds the global write lock, so subsequent RPC 
> read/write requests or internal NameNode threads can be paused if they try to 
> acquire the global read/write lock and have to wait until metaSave releases it.
> I propose to change the write lock to a read lock and let some read requests 
> be processed normally. I think it does not change the information that 
> metaSave tries to collect if we allow read requests.
> Actually, we need to ensure that only one thread executes metaSave at a time; 
> otherwise, the output streams could hit exceptions, especially when both 
> streams hold the same file handle or otherwise share the same output stream.
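
A minimal sketch of the proposal (not the committed change; class and field names are illustrative): hold the namesystem read lock while dumping state, and serialize concurrent metaSave callers with a separate mutex so two threads never interleave writes into the same output stream.

{code:java}
import java.io.PrintWriter;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class MetaSaveSketch {
  private final ReentrantReadWriteLock fsLock = new ReentrantReadWriteLock();
  private final Object metaSaveLock = new Object();

  void metaSave(PrintWriter out) {
    synchronized (metaSaveLock) {     // only one metaSave runs at a time
      fsLock.readLock().lock();       // read RPCs can still proceed concurrently
      try {
        out.println("... dump of blocks, datanodes and leases ...");
      } finally {
        fsLock.readLock().unlock();
      }
    }
  }
}
{code}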



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-2470) NN should automatically set permissions on dfs.namenode.*.dir

2019-10-04 Thread Wei-Chiu Chuang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-2470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16944767#comment-16944767
 ] 

Wei-Chiu Chuang commented on HDFS-2470:
---

Thanks! Just in time!

> NN should automatically set permissions on dfs.namenode.*.dir
> -
>
> Key: HDFS-2470
> URL: https://issues.apache.org/jira/browse/HDFS-2470
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.0.0-alpha
>Reporter: Aaron Myers
>Assignee: Siddharth Wagle
>Priority: Major
> Fix For: 3.3.0, 3.2.1, 3.1.4
>
> Attachments: HDFS-2470.01.patch, HDFS-2470.02.patch, 
> HDFS-2470.03.patch, HDFS-2470.04.patch, HDFS-2470.05.patch, 
> HDFS-2470.06.patch, HDFS-2470.07.patch, HDFS-2470.08.patch, 
> HDFS-2470.09.patch, HDFS-2470.branch-3.1.patch
>
>
> Much as the DN currently sets the correct permissions for the 
> dfs.datanode.data.dir, the NN should do the same for the 
> dfs.namenode.(name|edit).dir.
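
A minimal sketch of the idea (not the committed patch; the permission value is an example): after creating a name/edits directory, apply restrictive permissions the same way the DN does for dfs.datanode.data.dir.

{code:java}
import java.io.File;
import java.io.IOException;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.permission.FsPermission;

public class NameDirPermissions {
  public static void secure(File storageDir) throws IOException {
    // chmod-style permission on the local storage directory, e.g. 700.
    FileUtil.setPermission(storageDir, new FsPermission((short) 0700));
  }
}
{code}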



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14890) Setting permissions on name directory fails on non posix compliant filesystems

2019-10-04 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang updated HDFS-14890:
---
Fix Version/s: 3.1.4

> Setting permissions on name directory fails on non posix compliant filesystems
> --
>
> Key: HDFS-14890
> URL: https://issues.apache.org/jira/browse/HDFS-14890
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 3.2.1
> Environment: Windows 10.
>Reporter: hirik
>Assignee: Siddharth Wagle
>Priority: Blocker
> Fix For: 3.3.0, 3.1.4, 3.2.2
>
> Attachments: HDFS-14890.01.patch
>
>
> Hi,
> HDFS NameNode and JournalNode are not starting on a Windows machine. Found the 
> below related exception in the logs:
> Caused by: java.lang.UnsupportedOperationException
> at java.base/java.nio.file.Files.setPosixFilePermissions(Files.java:2155)
> at 
> org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.clearDirectory(Storage.java:452)
> at org.apache.hadoop.hdfs.server.namenode.NNStorage.format(NNStorage.java:591)
> at org.apache.hadoop.hdfs.server.namenode.NNStorage.format(NNStorage.java:613)
> at org.apache.hadoop.hdfs.server.namenode.FSImage.format(FSImage.java:188)
> at org.apache.hadoop.hdfs.server.namenode.NameNode.format(NameNode.java:1206)
> at org.apache.hadoop.hdfs.server.namenode.NameNode.format(NameNode.java:422)
> at 
> com.slog.dfs.hdfs.nn.NameNodeServiceImpl.delayedStart(NameNodeServiceImpl.java:147)
>  
> Code changes related to this issue: 
> [https://github.com/apache/hadoop/commit/07e3cf952eac9e47e7bd5e195b0f9fc28c468313#diff-1a56e69d50f21b059637cfcbf1d23f11]
>  
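
A minimal sketch of a portable guard, assuming the goal is simply to skip the POSIX call on filesystems that do not support it (this is not the committed fix):

{code:java}
import java.io.File;
import java.io.IOException;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.attribute.PosixFilePermissions;

public class PortablePermissions {
  public static void secure(File dir) throws IOException {
    // Only call setPosixFilePermissions when the underlying filesystem supports
    // the "posix" attribute view; NTFS on Windows does not.
    if (FileSystems.getDefault().supportedFileAttributeViews().contains("posix")) {
      Files.setPosixFilePermissions(dir.toPath(),
          PosixFilePermissions.fromString("rwx------"));
    }
    // else: fall back to File#setReadable/setWritable/setExecutable, or skip.
  }
}
{code}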



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-2470) NN should automatically set permissions on dfs.namenode.*.dir

2019-10-04 Thread Wei-Chiu Chuang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-2470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16944750#comment-16944750
 ] 

Wei-Chiu Chuang commented on HDFS-2470:
---

Pushed to branch-3.1 with trivial conflicts. Attached  
[^HDFS-2470.branch-3.1.patch]  for posterity.

> NN should automatically set permissions on dfs.namenode.*.dir
> -
>
> Key: HDFS-2470
> URL: https://issues.apache.org/jira/browse/HDFS-2470
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.0.0-alpha
>Reporter: Aaron Myers
>Assignee: Siddharth Wagle
>Priority: Major
> Fix For: 3.3.0, 3.2.1, 3.1.4
>
> Attachments: HDFS-2470.01.patch, HDFS-2470.02.patch, 
> HDFS-2470.03.patch, HDFS-2470.04.patch, HDFS-2470.05.patch, 
> HDFS-2470.06.patch, HDFS-2470.07.patch, HDFS-2470.08.patch, 
> HDFS-2470.09.patch, HDFS-2470.branch-3.1.patch
>
>
> Much as the DN currently sets the correct permissions for the 
> dfs.datanode.data.dir, the NN should do the same for the 
> dfs.namenode.(name|edit).dir.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-2470) NN should automatically set permissions on dfs.namenode.*.dir

2019-10-04 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-2470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang updated HDFS-2470:
--
Attachment: HDFS-2470.branch-3.1.patch

> NN should automatically set permissions on dfs.namenode.*.dir
> -
>
> Key: HDFS-2470
> URL: https://issues.apache.org/jira/browse/HDFS-2470
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.0.0-alpha
>Reporter: Aaron Myers
>Assignee: Siddharth Wagle
>Priority: Major
> Fix For: 3.3.0, 3.2.1, 3.1.4
>
> Attachments: HDFS-2470.01.patch, HDFS-2470.02.patch, 
> HDFS-2470.03.patch, HDFS-2470.04.patch, HDFS-2470.05.patch, 
> HDFS-2470.06.patch, HDFS-2470.07.patch, HDFS-2470.08.patch, 
> HDFS-2470.09.patch, HDFS-2470.branch-3.1.patch
>
>
> Much as the DN currently sets the correct permissions for the 
> dfs.datanode.data.dir, the NN should do the same for the 
> dfs.namenode.(name|edit).dir.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-2470) NN should automatically set permissions on dfs.namenode.*.dir

2019-10-04 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-2470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang updated HDFS-2470:
--
Fix Version/s: 3.1.4

> NN should automatically set permissions on dfs.namenode.*.dir
> -
>
> Key: HDFS-2470
> URL: https://issues.apache.org/jira/browse/HDFS-2470
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.0.0-alpha
>Reporter: Aaron Myers
>Assignee: Siddharth Wagle
>Priority: Major
> Fix For: 3.3.0, 3.2.1, 3.1.4
>
> Attachments: HDFS-2470.01.patch, HDFS-2470.02.patch, 
> HDFS-2470.03.patch, HDFS-2470.04.patch, HDFS-2470.05.patch, 
> HDFS-2470.06.patch, HDFS-2470.07.patch, HDFS-2470.08.patch, HDFS-2470.09.patch
>
>
> Much as the DN currently sets the correct permissions for the 
> dfs.datanode.data.dir, the NN should do the same for the 
> dfs.namenode.(name|edit).dir.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14892) Close the output stream if createWrappedOutputStream() fails

2019-10-04 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang updated HDFS-14892:
---
Component/s: encryption

> Close the output stream if createWrappedOutputStream() fails
> 
>
> Key: HDFS-14892
> URL: https://issues.apache.org/jira/browse/HDFS-14892
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: encryption
>Reporter: Kihwal Lee
>Priority: Major
>
> create() in an encryption zone is a two-step process on the client. First, a 
> regular FSOutputStream is created and then it is wrapped with an encrypted 
> stream.  When there is a system issue or a KMS ACL-based denial, the second 
> phase will fail. If the client terminates right away, the shutdown hook 
> closes the output stream opened in the first phase.  But if the client lives 
> on, the output stream will leak.
> Datanode's WebHdfsHandler, DFSClient, DistributedFileSystem, Hdfs 
> (FileContext) and RpcProgramNfs3 do this.  
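
A minimal sketch of the suggested pattern (names are illustrative; this is not the actual DFSClient code): if wrapping the freshly created stream with the crypto stream fails, close the underlying stream instead of leaking it on a long-lived client.

{code:java}
import java.io.IOException;
import java.io.OutputStream;
import org.apache.hadoop.io.IOUtils;

public class SafeWrap {
  interface Wrapper {
    OutputStream wrap(OutputStream out) throws IOException;
  }

  static OutputStream createWrapped(OutputStream underlying, Wrapper wrapper)
      throws IOException {
    try {
      return wrapper.wrap(underlying);   // second phase: wrap with crypto stream
    } catch (IOException e) {
      IOUtils.closeStream(underlying);   // release the stream instead of leaking it
      throw e;
    }
  }
}
{code}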



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14686) HttpFS: HttpFSFileSystem#getErasureCodingPolicy always returns null

2019-10-04 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang updated HDFS-14686:
---
Fix Version/s: 3.2.2
   3.1.4

> HttpFS: HttpFSFileSystem#getErasureCodingPolicy always returns null
> ---
>
> Key: HDFS-14686
> URL: https://issues.apache.org/jira/browse/HDFS-14686
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: httpfs
>Affects Versions: 3.2.0
>Reporter: Siyao Meng
>Assignee: Siyao Meng
>Priority: Major
> Fix For: 3.3.0, 3.1.4, 3.2.2
>
>
> The root cause is that *FSOperations#contentSummaryToJSON* doesn't parse 
> *ContentSummary.erasureCodingPolicy* into the json.
> The expected behavior is that *HttpFSFileSystem#getErasureCodingPolicy* 
> should at least return "" (empty string, for directories or symlinks), or 
> "Replicated" (for non-EC files), "RS-6-3-1024k", etc.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14892) Close the output stream if createWrappedOutputStream() fails

2019-10-04 Thread Wei-Chiu Chuang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16944675#comment-16944675
 ] 

Wei-Chiu Chuang commented on HDFS-14892:


Good finding, Kihwal. I am aware the client creates an empty file in such a 
case but didn't realize the upstream leaks.

> Close the output stream if createWrappedOutputStream() fails
> 
>
> Key: HDFS-14892
> URL: https://issues.apache.org/jira/browse/HDFS-14892
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Kihwal Lee
>Priority: Major
>
> create() in an encryption zone is a two-step process on the client. First, a 
> regular FSOutputStream is created and then it is wrapped with an encrypted 
> stream.  When there is a system issue or a KMS ACL-based denial, the second 
> phase will fail. If the client terminates right away, the shutdown hook 
> closes the output stream opened in the first phase.  But if the client lives 
> on, the output stream will leak.
> Datanode's WebHdfsHandler, DFSClient, DistributedFileSystem, Hdfs 
> (FileContext) and RpcProgramNfs3 do this.  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-13693) Remove unnecessary search in INodeDirectory.addChild during image loading

2019-10-04 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-13693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang updated HDFS-13693:
---
Fix Version/s: 3.2.2
   3.1.4

> Remove unnecessary search in INodeDirectory.addChild during image loading
> -
>
> Key: HDFS-13693
> URL: https://issues.apache.org/jira/browse/HDFS-13693
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: zhouyingchao
>Assignee: Lisheng Sun
>Priority: Major
> Fix For: 3.3.0, 3.1.4, 3.2.2
>
> Attachments: HDFS-13693-001.patch, HDFS-13693-002.patch, 
> HDFS-13693-003.patch, HDFS-13693-004.patch, HDFS-13693-005.patch
>
>
> In FSImageFormatPBINode.loadINodeDirectorySection, all child INodes are added 
> to their parent INode's map one by one. The adding procedure will search a 
> position in the parent's map and then insert the child to the position. 
> However, during image loading, the search is unnecessary since the insert 
> position should always be at the end of the map given the sequence they are 
> serialized on disk.
> Testing this patch against an fsimage of a 70PB cluster (200 million files and 
> 300 million blocks), the image loading time was reduced from 1210 seconds to 
> 1138 seconds, so it can reduce up to about 10% of the loading time.
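
A toy analogy of the optimization (plain JDK collections, not INodeDirectory code): when the children arrive already sorted, appending is enough, whereas a binary-search insert pays an extra search per child for no benefit.

{code:java}
import java.util.Collections;
import java.util.List;

public class SortedInsertDemo {
  // General case: find the insertion point, then insert.
  public static void insertSearched(List<String> children, String child) {
    int pos = Collections.binarySearch(children, child);
    children.add(pos < 0 ? -(pos + 1) : pos, child);
  }

  // Image-loading case: input order is already the sorted order, so just append.
  public static void insertAppend(List<String> children, String child) {
    children.add(child);
  }
}
{code}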



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14637) Namenode may not replicate blocks to meet the policy after enabling upgradeDomain

2019-10-03 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang updated HDFS-14637:
---
Fix Version/s: 3.2.2
   3.1.4
   3.3.0
   Resolution: Fixed
   Status: Resolved  (was: Patch Available)

Pushed to trunk. Cherry picked the commit to branch-3.2 and branch-3.1 with 
trivial conflicts in test code.

Thanks [~sodonnell] and [~ayushtkn]!

> Namenode may not replicate blocks to meet the policy after enabling 
> upgradeDomain
> -
>
> Key: HDFS-14637
> URL: https://issues.apache.org/jira/browse/HDFS-14637
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.3.0
>Reporter: Stephen O'Donnell
>Assignee: Stephen O'Donnell
>Priority: Major
> Fix For: 3.3.0, 3.1.4, 3.2.2
>
> Attachments: HDFS-14637.001.patch, HDFS-14637.002.patch, 
> HDFS-14637.003.patch, HDFS-14637.004.patch, HDFS-14637.005.patch, 
> HDFS-14637.branch-3.1.patch, HDFS-14637.branch-3.2.patch
>
>
> After changing the network topology or placement policy on a cluster and 
> restarting the namenode, the namenode will scan all blocks on the cluster at 
> startup, and check if they meet the current placement policy. If they do not, 
> they are added to the replication queue and the namenode will arrange for 
> them to be replicated to ensure the placement policy is used.
> If you start with a cluster with no UpgradeDomain, and then enable 
> UpgradeDomain, then on restart the NN does notice that all the blocks violate 
> the placement policy and adds them to the replication queue. I believe there 
> are some issues in the logic that prevent the blocks from replicating, 
> depending on the setup:
> With UD enabled, but no racks configured, and possibly on a 2-rack cluster, 
> the queued replication work never makes any progress, because in 
> blockManager.validateReconstructionWork() it checks whether the new 
> replica increases the number of racks, and if it does not, it skips it and 
> tries again later.
> {code:java}
> DatanodeStorageInfo[] targets = rw.getTargets();
> if ((numReplicas.liveReplicas() >= requiredRedundancy) &&
> (!isPlacementPolicySatisfied(block)) ) {
>   if (!isInNewRack(rw.getSrcNodes(), targets[0].getDatanodeDescriptor())) {
> // No use continuing, unless a new rack in this case
> return false;
>   }
>   // mark that the reconstruction work is to replicate internal block to a
>   // new rack.
>   rw.setNotEnoughRack();
> }
> {code}
> Additionally, in blockManager.scheduleReconstruction() there is some logic 
> that sets the number of new replicas required to one, if the live replicas >= 
> requiredRedundancy:
> {code:java}
> int additionalReplRequired;
> if (numReplicas.liveReplicas() < requiredRedundancy) {
>   additionalReplRequired = requiredRedundancy - numReplicas.liveReplicas()
>   - pendingNum;
> } else {
>   additionalReplRequired = 1; // Needed on a new rack
> }{code}
> With UD, it is possible for 2 new replicas to be needed to meet the block 
> placement policy, if all existing replicas are on nodes with the same domain. 
> For traditional '2 rack redundancy', only 1 new replica would ever have been 
> needed in this scenario.
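
One possible direction, sketched only for illustration (the getAdditionalReplicasRequired() accessor is hypothetical here, not a confirmed API): let the placement policy report how many extra replicas are needed instead of assuming one, so an upgrade-domain violation that needs two new replicas can make progress.

{code:java}
int additionalReplRequired;
if (numReplicas.liveReplicas() < requiredRedundancy) {
  additionalReplRequired = requiredRedundancy - numReplicas.liveReplicas()
      - pendingNum;
} else {
  // Ask the placement policy how many replicas are still missing for the
  // current topology (racks and/or upgrade domains), rather than assuming 1.
  BlockPlacementStatus status =
      placementPolicy.verifyBlockPlacement(locs, requiredRedundancy);
  additionalReplRequired = Math.max(1, status.getAdditionalReplicasRequired());
}
{code}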



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14637) Namenode may not replicate blocks to meet the policy after enabling upgradeDomain

2019-10-03 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang updated HDFS-14637:
---
Attachment: HDFS-14637.branch-3.2.patch
HDFS-14637.branch-3.1.patch

> Namenode may not replicate blocks to meet the policy after enabling 
> upgradeDomain
> -
>
> Key: HDFS-14637
> URL: https://issues.apache.org/jira/browse/HDFS-14637
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.3.0
>Reporter: Stephen O'Donnell
>Assignee: Stephen O'Donnell
>Priority: Major
> Fix For: 3.3.0, 3.1.4, 3.2.2
>
> Attachments: HDFS-14637.001.patch, HDFS-14637.002.patch, 
> HDFS-14637.003.patch, HDFS-14637.004.patch, HDFS-14637.005.patch, 
> HDFS-14637.branch-3.1.patch, HDFS-14637.branch-3.2.patch
>
>
> After changing the network topology or placement policy on a cluster and 
> restarting the namenode, the namenode will scan all blocks on the cluster at 
> startup, and check if they meet the current placement policy. If they do not, 
> they are added to the replication queue and the namenode will arrange for 
> them to be replicated to ensure the placement policy is used.
> If you start with a cluster with no UpgradeDomain, and then enable 
> UpgradeDomain, then on restart the NN does notice that all the blocks violate 
> the placement policy and adds them to the replication queue. I believe there 
> are some issues in the logic that prevent the blocks from replicating, 
> depending on the setup:
> With UD enabled, but no racks configured, and possibly on a 2-rack cluster, 
> the queued replication work never makes any progress, because in 
> blockManager.validateReconstructionWork() it checks whether the new 
> replica increases the number of racks, and if it does not, it skips it and 
> tries again later.
> {code:java}
> DatanodeStorageInfo[] targets = rw.getTargets();
> if ((numReplicas.liveReplicas() >= requiredRedundancy) &&
> (!isPlacementPolicySatisfied(block)) ) {
>   if (!isInNewRack(rw.getSrcNodes(), targets[0].getDatanodeDescriptor())) {
> // No use continuing, unless a new rack in this case
> return false;
>   }
>   // mark that the reconstruction work is to replicate internal block to a
>   // new rack.
>   rw.setNotEnoughRack();
> }
> {code}
> Additionally, in blockManager.scheduleReconstruction() there is some logic 
> that sets the number of new replicas required to one, if the live replicas >= 
> requiredRedundancy:
> {code:java}
> int additionalReplRequired;
> if (numReplicas.liveReplicas() < requiredRedundancy) {
>   additionalReplRequired = requiredRedundancy - numReplicas.liveReplicas()
>   - pendingNum;
> } else {
>   additionalReplRequired = 1; // Needed on a new rack
> }{code}
> With UD, it is possible for 2 new replicas to be needed to meet the block 
> placement policy, if all existing replicas are on nodes with the same domain. 
> For traditional '2 rack redundancy', only 1 new replica would ever have been 
> needed in this scenario.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14850) Optimize FileSystemAccessService#getFileSystemConfiguration

2019-10-03 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang updated HDFS-14850:
---
Fix Version/s: 3.2.2
   3.1.4

> Optimize FileSystemAccessService#getFileSystemConfiguration
> ---
>
> Key: HDFS-14850
> URL: https://issues.apache.org/jira/browse/HDFS-14850
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: httpfs, performance
>Reporter: Lisheng Sun
>Assignee: Lisheng Sun
>Priority: Major
> Fix For: 3.3.0, 3.1.4, 3.2.2
>
> Attachments: HDFS-14850.001.patch, HDFS-14850.002.patch, 
> HDFS-14850.003.patch, HDFS-14850.004(2).patch, HDFS-14850.004.patch, 
> HDFS-14850.005.patch
>
>
> {code:java}
>  @Override
>   public Configuration getFileSystemConfiguration() {
> Configuration conf = new Configuration(true);
> ConfigurationUtils.copy(serviceHadoopConf, conf);
> conf.setBoolean(FILE_SYSTEM_SERVICE_CREATED, true);
> // Force-clear server-side umask to make HttpFS match WebHDFS behavior
> conf.set(FsPermission.UMASK_LABEL, "000");
> return conf;
>   }
> {code}
> As the above code shows, every call to 
> FileSystemAccessService#getFileSystemConfiguration creates a new 
> Configuration.  
> That is not necessary and affects performance. I think it only needs to create 
> the Configuration once in FileSystemAccessService#init, and  
> FileSystemAccessService#getFileSystemConfiguration can then return it.
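
A minimal sketch of the proposed change (not the committed patch): build the Configuration once during init() and return the cached instance from getFileSystemConfiguration().

{code:java}
private Configuration fileSystemConf;

@Override
protected void init() throws ServiceException {
  // ... existing init work ...
  Configuration conf = new Configuration(true);
  ConfigurationUtils.copy(serviceHadoopConf, conf);
  conf.setBoolean(FILE_SYSTEM_SERVICE_CREATED, true);
  // Force-clear server-side umask to make HttpFS match WebHDFS behavior
  conf.set(FsPermission.UMASK_LABEL, "000");
  this.fileSystemConf = conf;
}

@Override
public Configuration getFileSystemConfiguration() {
  return fileSystemConf;   // no per-call allocation
}
{code}

If callers may mutate the returned object, returning a copy via new Configuration(fileSystemConf) would keep most of the caching benefit while staying safe.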



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14527) Stop all DataNodes may result in NN terminate

2019-10-03 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang updated HDFS-14527:
---
Fix Version/s: 3.1.4

> Stop all DataNodes may result in NN terminate
> -
>
> Key: HDFS-14527
> URL: https://issues.apache.org/jira/browse/HDFS-14527
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Reporter: Xiaoqiao He
>Assignee: Xiaoqiao He
>Priority: Major
> Fix For: 3.3.0, 3.1.4, 3.2.2
>
> Attachments: HDFS-14527.001.patch, HDFS-14527.002.patch, 
> HDFS-14527.003.patch, HDFS-14527.004.patch, HDFS-14527.005.patch
>
>
> If we stop all datanodes of a cluster, BlockPlacementPolicyDefault#chooseTarget 
> may hit an ArithmeticException when calling #getMaxNodesPerRack, which throws 
> the runtime exception out to BlockManager's ReplicationMonitor thread and 
> then terminates the NN.
> The root cause is that BlockPlacementPolicyDefault#chooseTarget does not hold 
> the global lock, so if all DataNodes die between 
> {{clusterMap.getNumOfLeaves()}} and {{getMaxNodesPerRack}} then it hits an 
> {{ArithmeticException}} while invoking {{getMaxNodesPerRack}}.
> {code:java}
>   private DatanodeStorageInfo[] chooseTarget(int numOfReplicas,
> Node writer,
> List<DatanodeStorageInfo> chosenStorage,
> boolean returnChosenNodes,
> Set<Node> excludedNodes,
> long blocksize,
> final BlockStoragePolicy storagePolicy,
> EnumSet<AddBlockFlag> addBlockFlags,
> EnumMap<StorageType, Integer> sTypes) {
> if (numOfReplicas == 0 || clusterMap.getNumOfLeaves()==0) {
>   return DatanodeStorageInfo.EMPTY_ARRAY;
> }
> ..
> int[] result = getMaxNodesPerRack(chosenStorage.size(), numOfReplicas);
> ..
> }
> {code}
> Some detailed log show as following.
> {code:java}
> 2019-05-31 12:29:21,803 ERROR 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: 
> ReplicationMonitor thread received Runtime exception. 
> java.lang.ArithmeticException: / by zero
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.getMaxNodesPerRack(BlockPlacementPolicyDefault.java:282)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:228)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:132)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationWork.chooseTargets(BlockManager.java:4533)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationWork.access$1800(BlockManager.java:4493)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWorkForBlocks(BlockManager.java:1954)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWork(BlockManager.java:1830)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:4453)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationMonitor.run(BlockManager.java:4388)
> at java.lang.Thread.run(Thread.java:745)
> 2019-05-31 12:29:21,805 INFO org.apache.hadoop.util.ExitUtil: Exiting with 
> status 1
> {code}
> To be honest, this is not a serious bug and is not easy to reproduce, since if 
> we stop all DataNodes and only keep the NameNode alive, HDFS cannot offer 
> service normally and we could only retrieve the directory tree. It may be a 
> corner case.
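
An illustrative guard (not necessarily the committed fix): tolerate a cluster that lost its last datanode between the getNumOfLeaves() check and the per-rack computation, instead of letting the division by zero kill the ReplicationMonitor.

{code:java}
private int[] getMaxNodesPerRack(int numOfChosen, int numOfReplicas) {
  int clusterSize = clusterMap.getNumOfLeaves();
  int numOfRacks = clusterMap.getNumOfRacks();
  if (clusterSize == 0 || numOfRacks == 0) {
    // No datanodes (or racks) are available right now; nothing can be placed.
    return new int[] {0, 0};
  }
  // ... existing computation using clusterSize and numOfRacks ...
}
{code}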



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14624) When decommissioning a node, log remaining blocks to replicate periodically

2019-10-03 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang updated HDFS-14624:
---
Fix Version/s: 3.2.2
   3.1.4

> When decommissioning a node, log remaining blocks to replicate periodically
> ---
>
> Key: HDFS-14624
> URL: https://issues.apache.org/jira/browse/HDFS-14624
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Affects Versions: 3.3.0
>Reporter: Stephen O'Donnell
>Assignee: Stephen O'Donnell
>Priority: Major
> Fix For: 3.3.0, 3.1.4, 3.2.2
>
> Attachments: HDFS-14624.001.patch, HDFS-14624.002.patch, 
> HDFS-14624.003.patch
>
>
> When a node is marked for decommission, there is a monitor thread which runs 
> every 30 seconds by default, and checks if the node still has pending blocks 
> to be replicated before the node can complete replication.
> There are two existing debug level messages logged in the monitor thread, 
> DatanodeAdminManager$Monitor.check(), which log the correct information 
> already, first as the pending blocks are replicated:
> {code:java}
> LOG.debug("Node {} still has {} blocks to replicate "
> + "before it is a candidate to finish {}.",
> dn, blocks.size(), dn.getAdminState());{code}
> And then after the initial set of blocks has completed and a rescan happens:
> {code:java}
> LOG.debug("Node {} {} healthy."
> + " It needs to replicate {} more blocks."
> + " {} is still in progress.", dn,
> isHealthy ? "is": "isn't", blocks.size(), dn.getAdminState());{code}
> I would like to propose moving these messages to INFO level so it is easier 
> to monitor decommission progress over time from the Namenode log.
> Based on the default settings, this would result in at most 1 log message per 
> node being decommissioned every 30 seconds. This is an upper bound because the 
> monitor thread stops after checking 500K blocks, and 
> therefore in practice it could be as little as 1 log message per 30 seconds, 
> even if many DNs are being decommissioned at the same time.
> Note that the namenode webUI does display the above information, but having 
> this in the NN logs would allow progress to be tracked more easily.
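
The change itself is small; the proposal is essentially to raise the existing messages to INFO, for example:

{code:java}
LOG.info("Node {} still has {} blocks to replicate "
    + "before it is a candidate to finish {}.",
    dn, blocks.size(), dn.getAdminState());
{code}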



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14499) Misleading REM_QUOTA value with snapshot and trash feature enabled for a directory

2019-10-03 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang updated HDFS-14499:
---
Fix Version/s: 3.2.2
   3.1.4

> Misleading REM_QUOTA value with snapshot and trash feature enabled for a 
> directory
> --
>
> Key: HDFS-14499
> URL: https://issues.apache.org/jira/browse/HDFS-14499
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: snapshots
>Reporter: Shashikant Banerjee
>Assignee: Shashikant Banerjee
>Priority: Major
> Fix For: 3.3.0, 3.1.4, 3.2.2
>
> Attachments: HDFS-14499.000.patch, HDFS-14499.001.patch, 
> HDFS-14499.002.patch
>
>
> This is the flow of steps where we see a discrepancy between REM_QUOTA and 
> new file operation failure. REM_QUOTA shows a value of  1 but file creation 
> operation does not succeed.
> {code:java}
> hdfs@c3265-node3 root$ hdfs dfs -mkdir /dir1
> hdfs@c3265-node3 root$ hdfs dfsadmin -setQuota 2 /dir1
> hdfs@c3265-node3 root$ hdfs dfsadmin -allowSnapshot /dir1
> Allowing snaphot on /dir1 succeeded
> hdfs@c3265-node3 root$ hdfs dfs -touchz /dir1/file1
> hdfs@c3265-node3 root$ hdfs dfs -createSnapshot /dir1 snap1
> Created snapshot /dir1/.snapshot/snap1
> hdfs@c3265-node3 root$ hdfs dfs -count -v -q /dir1
> QUOTA REM_QUOTA SPACE_QUOTA REM_SPACE_QUOTA DIR_COUNT FILE_COUNT CONTENT_SIZE 
> PATHNAME
> 2 0 none inf 1 1 0 /dir1
> hdfs@c3265-node3 root$ hdfs dfs -rm /dir1/file1
> 19/03/26 11:20:25 INFO fs.TrashPolicyDefault: Moved: 
> 'hdfs://smajetinn/dir1/file1' to trash at: 
> hdfs://smajetinn/user/hdfs/.Trash/Current/dir1/file11553599225772
> hdfs@c3265-node3 root$ hdfs dfs -count -v -q /dir1
> QUOTA REM_QUOTA SPACE_QUOTA REM_SPACE_QUOTA DIR_COUNT FILE_COUNT CONTENT_SIZE 
> PATHNAME
> 2 1 none inf 1 0 0 /dir1
> hdfs@c3265-node3 root$ hdfs dfs -touchz /dir1/file1
> touchz: The NameSpace quota (directories and files) of directory /dir1 is 
> exceeded: quota=2 file count=3{code}
> The issue here is that the count command takes only files and directories 
> into account, not the inode references. When trash is enabled, the deletion of 
> files inside a directory actually does a rename operation, as a result of 
> which an inode reference is maintained in the deleted list of the snapshot 
> diff. That reference is taken into account while computing the namespace quota, 
> but the count command (getContentSummary()) takes into account just the files 
> and directories, not the referenced entity, when calculating REM_QUOTA. The 
> referenced entity is taken into account for the space quota only.
> InodeReference.java:
> ---
> {code:java}
>  @Override
> public final ContentSummaryComputationContext computeContentSummary(
> int snapshotId, ContentSummaryComputationContext summary) {
>   final int s = snapshotId < lastSnapshotId ? snapshotId : lastSnapshotId;
>   // only count storagespace for WithName
>   final QuotaCounts q = computeQuotaUsage(
>   summary.getBlockStoragePolicySuite(), getStoragePolicyID(), false, 
> s);
>   summary.getCounts().addContent(Content.DISKSPACE, q.getStorageSpace());
>   summary.getCounts().addTypeSpaces(q.getTypeSpaces());
>   return summary;
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14113) EC : Add Configuration to restrict UserDefined Policies

2019-10-03 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang updated HDFS-14113:
---
Fix Version/s: 3.2.2
   3.1.4

> EC : Add Configuration to restrict UserDefined Policies
> ---
>
> Key: HDFS-14113
> URL: https://issues.apache.org/jira/browse/HDFS-14113
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: erasure-coding
>Reporter: Ayush Saxena
>Assignee: Ayush Saxena
>Priority: Major
> Fix For: 3.3.0, 3.1.4, 3.2.2
>
> Attachments: HDFS-14113-01.patch, HDFS-14113-02.patch, 
> HDFS-14113-03.patch
>
>
> By default, addition of erasure coding policies is enabled for users. We need 
> to add a configuration controlling whether addition of new user-defined 
> policies is allowed, which can be configured as a Boolean value on the server 
> side.
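
A minimal sketch of the server-side check (the configuration key name here is hypothetical, not the one introduced by the patch):

{code:java}
// Reject addErasureCodingPolicies() unless user-defined policies are allowed.
boolean userDefinedAllowed = conf.getBoolean(
    "dfs.namenode.ec.userdefined.policy.allowed", true);   // hypothetical key
if (!userDefinedAllowed) {
  throw new HadoopIllegalArgumentException(
      "Addition of user-defined erasure coding policies is disabled.");
}
{code}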



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14124) EC : Support EC Commands (set/get/unset EcPolicy) via WebHdfs

2019-10-03 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang updated HDFS-14124:
---
Attachment: HDFS-14124.branch-3.1.patch

> EC : Support EC Commands (set/get/unset EcPolicy) via WebHdfs
> -
>
> Key: HDFS-14124
> URL: https://issues.apache.org/jira/browse/HDFS-14124
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: erasure-coding, httpfs, webhdfs
>Reporter: Souryakanta Dwivedy
>Assignee: Ayush Saxena
>Priority: Major
> Fix For: 3.3.0, 3.2.1, 3.1.4
>
> Attachments: HDFS-14124-01.patch, HDFS-14124-02.patch, 
> HDFS-14124-03.patch, HDFS-14124-04.patch, HDFS-14124-04.patch, 
> HDFS-14124.branch-3.1.patch
>
>
> EC : Support EC Commands (set/get/unset EcPolicy) via WebHdfs
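
For reference, illustrative WebHDFS requests for these operations (host, port, path and policy are examples only):

{noformat}
curl -i -X PUT  "http://<NN_HOST>:9870/webhdfs/v1/dir?op=SETECPOLICY&ecpolicy=RS-6-3-1024k"
curl -i         "http://<NN_HOST>:9870/webhdfs/v1/dir?op=GETECPOLICY"
curl -i -X POST "http://<NN_HOST>:9870/webhdfs/v1/dir?op=UNSETECPOLICY"
{noformat}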



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14124) EC : Support EC Commands (set/get/unset EcPolicy) via WebHdfs

2019-10-03 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang updated HDFS-14124:
---
Fix Version/s: 3.1.4

> EC : Support EC Commands (set/get/unset EcPolicy) via WebHdfs
> -
>
> Key: HDFS-14124
> URL: https://issues.apache.org/jira/browse/HDFS-14124
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: erasure-coding, httpfs, webhdfs
>Reporter: Souryakanta Dwivedy
>Assignee: Ayush Saxena
>Priority: Major
> Fix For: 3.3.0, 3.2.1, 3.1.4
>
> Attachments: HDFS-14124-01.patch, HDFS-14124-02.patch, 
> HDFS-14124-03.patch, HDFS-14124-04.patch, HDFS-14124-04.patch, 
> HDFS-14124.branch-3.1.patch
>
>
> EC : Support EC Commands (set/get/unset EcPolicy) via WebHdfs



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14124) EC : Support EC Commands (set/get/unset EcPolicy) via WebHdfs

2019-10-03 Thread Wei-Chiu Chuang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16944129#comment-16944129
 ] 

Wei-Chiu Chuang commented on HDFS-14124:


Pushed to branch-3.1. There's just a trivial conflict in the doc. Attached  
[^HDFS-14124.branch-3.1.patch]  for posterity.

> EC : Support EC Commands (set/get/unset EcPolicy) via WebHdfs
> -
>
> Key: HDFS-14124
> URL: https://issues.apache.org/jira/browse/HDFS-14124
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: erasure-coding, httpfs, webhdfs
>Reporter: Souryakanta Dwivedy
>Assignee: Ayush Saxena
>Priority: Major
> Fix For: 3.3.0, 3.2.1, 3.1.4
>
> Attachments: HDFS-14124-01.patch, HDFS-14124-02.patch, 
> HDFS-14124-03.patch, HDFS-14124-04.patch, HDFS-14124-04.patch, 
> HDFS-14124.branch-3.1.patch
>
>
> EC : Support EC Commands (set/get/unset EcPolicy) via WebHdfs



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14187) Make warning message more clear when there are not enough data nodes for EC write

2019-10-03 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang updated HDFS-14187:
---
Fix Version/s: 3.2.2
   3.1.4

> Make warning message more clear when there are not enough data nodes for EC 
> write
> -
>
> Key: HDFS-14187
> URL: https://issues.apache.org/jira/browse/HDFS-14187
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: erasure-coding
>Affects Versions: 3.1.1
>Reporter: Kitti Nanasi
>Assignee: Kitti Nanasi
>Priority: Major
> Fix For: 3.3.0, 3.1.4, 3.2.2
>
> Attachments: HDFS-14187.001.patch
>
>
> When setting an erasure coding policy for which there are not enough racks or 
> data nodes, write will fail with the following message:
> {code:java}
> [root@oks-upgrade6727-1 ~]# sudo -u systest hdfs dfs -mkdir 
> /user/systest/testdir
> [root@oks-upgrade6727-1 ~]# sudo -u hdfs hdfs ec -setPolicy -path 
> /user/systest/testdir
> Set default erasure coding policy on /user/systest/testdir
> [root@oks-upgrade6727-1 ~]# sudo -u systest hdfs dfs -put /tmp/file1 
> /user/systest/testdir
> 18/11/12 05:41:26 WARN hdfs.DFSOutputStream: Cannot allocate parity 
> block(index=3, policy=RS-3-2-1024k). Not enough datanodes? Exclude nodes=[]
> 18/11/12 05:41:26 WARN hdfs.DFSOutputStream: Cannot allocate parity 
> block(index=4, policy=RS-3-2-1024k). Not enough datanodes? Exclude nodes=[]
> 18/11/12 05:41:26 WARN hdfs.DFSOutputStream: Block group <1> failed to write 
> 2 blocks. It's at high risk of losing data.
> {code}
> I suggest logging a more descriptive message that suggests using the hdfs ec 
> -verifyCluster command to verify the cluster setup against the EC policies.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14064) WEBHDFS: Support Enable/Disable EC Policy

2019-10-03 Thread Wei-Chiu Chuang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16944111#comment-16944111
 ] 

Wei-Chiu Chuang commented on HDFS-14064:


Pushed to branch-3.1. There's just a trivial conflict. Attached  
[^HDFS-14064.branch-3.1.patch]  for posterity.

> WEBHDFS: Support Enable/Disable EC Policy
> -
>
> Key: HDFS-14064
> URL: https://issues.apache.org/jira/browse/HDFS-14064
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: erasure-coding, webhdfs
>Reporter: Ayush Saxena
>Assignee: Ayush Saxena
>Priority: Major
> Fix For: 3.3.0, 3.2.1, 3.1.4
>
> Attachments: HDFS-14064-01.patch, HDFS-14064-02.patch, 
> HDFS-14064-03.patch, HDFS-14064-04.patch, HDFS-14064-04.patch, 
> HDFS-14064-05.patch, HDFS-14064.branch-3.1.patch
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14064) WEBHDFS: Support Enable/Disable EC Policy

2019-10-03 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang updated HDFS-14064:
---
Fix Version/s: 3.1.4

> WEBHDFS: Support Enable/Disable EC Policy
> -
>
> Key: HDFS-14064
> URL: https://issues.apache.org/jira/browse/HDFS-14064
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: erasure-coding, webhdfs
>Reporter: Ayush Saxena
>Assignee: Ayush Saxena
>Priority: Major
> Fix For: 3.3.0, 3.2.1, 3.1.4
>
> Attachments: HDFS-14064-01.patch, HDFS-14064-02.patch, 
> HDFS-14064-03.patch, HDFS-14064-04.patch, HDFS-14064-04.patch, 
> HDFS-14064-05.patch, HDFS-14064.branch-3.1.patch
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14064) WEBHDFS: Support Enable/Disable EC Policy

2019-10-03 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang updated HDFS-14064:
---
Attachment: HDFS-14064.branch-3.1.patch

> WEBHDFS: Support Enable/Disable EC Policy
> -
>
> Key: HDFS-14064
> URL: https://issues.apache.org/jira/browse/HDFS-14064
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: erasure-coding, webhdfs
>Reporter: Ayush Saxena
>Assignee: Ayush Saxena
>Priority: Major
> Fix For: 3.3.0, 3.2.1
>
> Attachments: HDFS-14064-01.patch, HDFS-14064-02.patch, 
> HDFS-14064-03.patch, HDFS-14064-04.patch, HDFS-14064-04.patch, 
> HDFS-14064-05.patch, HDFS-14064.branch-3.1.patch
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14064) WEBHDFS: Support Enable/Disable EC Policy

2019-10-03 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang updated HDFS-14064:
---
Component/s: webhdfs
 erasure-coding

> WEBHDFS: Support Enable/Disable EC Policy
> -
>
> Key: HDFS-14064
> URL: https://issues.apache.org/jira/browse/HDFS-14064
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: erasure-coding, webhdfs
>Reporter: Ayush Saxena
>Assignee: Ayush Saxena
>Priority: Major
> Fix For: 3.3.0, 3.2.1
>
> Attachments: HDFS-14064-01.patch, HDFS-14064-02.patch, 
> HDFS-14064-03.patch, HDFS-14064-04.patch, HDFS-14064-04.patch, 
> HDFS-14064-05.patch
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14849) Erasure Coding: the internal block is replicated many times when datanode is decommissioning

2019-10-03 Thread Wei-Chiu Chuang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16944096#comment-16944096
 ] 

Wei-Chiu Chuang commented on HDFS-14849:


Cherrypicked the commit to branch-3.2 without conflicts.
There is a trivial conflict for branch-3.1. So attached a patch  
[^HDFS-14849.branch-3.1.patch] for posterity.

> Erasure Coding: the internal block is replicated many times when datanode is 
> decommissioning
> 
>
> Key: HDFS-14849
> URL: https://issues.apache.org/jira/browse/HDFS-14849
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ec, erasure-coding
>Affects Versions: 3.3.0
>Reporter: HuangTao
>Assignee: HuangTao
>Priority: Major
>  Labels: EC, HDFS, NameNode
> Fix For: 3.3.0, 3.1.4, 3.2.2
>
> Attachments: HDFS-14849.001.patch, HDFS-14849.002.patch, 
> HDFS-14849.branch-3.1.patch, fsck-file.png, liveBlockIndices.png, 
> scheduleReconstruction.png
>
>
> When the datanode stays in DECOMMISSION_INPROGRESS status, the EC internal 
> blocks on that datanode will be replicated many times.
> // added 2019/09/19
> I reproduced this scenario on a 163-node cluster by decommissioning 100 nodes 
> simultaneously. 
>  !scheduleReconstruction.png! 
>  !fsck-file.png! 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14849) Erasure Coding: the internal block is replicated many times when datanode is decommissioning

2019-10-03 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang updated HDFS-14849:
---
Fix Version/s: 3.1.4

> Erasure Coding: the internal block is replicated many times when datanode is 
> decommissioning
> 
>
> Key: HDFS-14849
> URL: https://issues.apache.org/jira/browse/HDFS-14849
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ec, erasure-coding
>Affects Versions: 3.3.0
>Reporter: HuangTao
>Assignee: HuangTao
>Priority: Major
>  Labels: EC, HDFS, NameNode
> Fix For: 3.3.0, 3.1.4, 3.2.2
>
> Attachments: HDFS-14849.001.patch, HDFS-14849.002.patch, 
> HDFS-14849.branch-3.1.patch, fsck-file.png, liveBlockIndices.png, 
> scheduleReconstruction.png
>
>
> When the datanode stays in DECOMMISSION_INPROGRESS status, the EC internal 
> blocks on that datanode will be replicated many times.
> // added 2019/09/19
> I reproduced this scenario on a 163-node cluster by decommissioning 100 nodes 
> simultaneously. 
>  !scheduleReconstruction.png! 
>  !fsck-file.png! 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14849) Erasure Coding: the internal block is replicated many times when datanode is decommissioning

2019-10-03 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang updated HDFS-14849:
---
Attachment: HDFS-14849.branch-3.1.patch

> Erasure Coding: the internal block is replicated many times when datanode is 
> decommissioning
> 
>
> Key: HDFS-14849
> URL: https://issues.apache.org/jira/browse/HDFS-14849
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ec, erasure-coding
>Affects Versions: 3.3.0
>Reporter: HuangTao
>Assignee: HuangTao
>Priority: Major
>  Labels: EC, HDFS, NameNode
> Fix For: 3.3.0, 3.1.4, 3.2.2
>
> Attachments: HDFS-14849.001.patch, HDFS-14849.002.patch, 
> HDFS-14849.branch-3.1.patch, fsck-file.png, liveBlockIndices.png, 
> scheduleReconstruction.png
>
>
> When the datanode stays in DECOMMISSION_INPROGRESS status, the EC internal 
> blocks on that datanode will be replicated many times.
> // added 2019/09/19
> I reproduced this scenario on a 163-node cluster by decommissioning 100 nodes 
> simultaneously. 
>  !scheduleReconstruction.png! 
>  !fsck-file.png! 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14849) Erasure Coding: the internal block is replicated many times when datanode is decommissioning

2019-10-03 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang updated HDFS-14849:
---
Fix Version/s: 3.2.2

> Erasure Coding: the internal block is replicated many times when datanode is 
> decommissioning
> 
>
> Key: HDFS-14849
> URL: https://issues.apache.org/jira/browse/HDFS-14849
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ec, erasure-coding
>Affects Versions: 3.3.0
>Reporter: HuangTao
>Assignee: HuangTao
>Priority: Major
>  Labels: EC, HDFS, NameNode
> Fix For: 3.3.0, 3.2.2
>
> Attachments: HDFS-14849.001.patch, HDFS-14849.002.patch, 
> fsck-file.png, liveBlockIndices.png, scheduleReconstruction.png
>
>
> When the datanode stays in DECOMMISSION_INPROGRESS status, the EC internal 
> blocks on that datanode will be replicated many times.
> // added 2019/09/19
> I reproduced this scenario on a 163-node cluster by decommissioning 100 nodes 
> simultaneously. 
>  !scheduleReconstruction.png! 
>  !fsck-file.png! 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14754) Erasure Coding : The number of Under-Replicated Blocks never reduced

2019-10-03 Thread Wei-Chiu Chuang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16943826#comment-16943826
 ] 

Wei-Chiu Chuang commented on HDFS-14754:


[~surendrasingh] [~hemanthboyina]
The test doesn't seem valid. If I remove the fix, the test still passes. Can 
you please recheck?

> Erasure Coding :  The number of Under-Replicated Blocks never reduced
> -
>
> Key: HDFS-14754
> URL: https://issues.apache.org/jira/browse/HDFS-14754
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ec
>Reporter: hemanthboyina
>Assignee: hemanthboyina
>Priority: Critical
> Fix For: 3.3.0, 3.1.4, 3.2.2
>
> Attachments: HDFS-14754-addendum.001.patch, HDFS-14754.001.patch, 
> HDFS-14754.002.patch, HDFS-14754.003.patch, HDFS-14754.004.patch, 
> HDFS-14754.005.patch, HDFS-14754.006.patch, HDFS-14754.007.patch, 
> HDFS-14754.008.patch, HDFS-14754.branch-3.1.patch
>
>
> Using EC RS-3-2, 6 DN 
> We came across a scenario where, among the 5 EC blocks, the same block was 
> replicated thrice and two blocks went missing.
> The replicated block was not being deleted and the missing blocks could not be reconstructed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14637) Namenode may not replicate blocks to meet the policy after enabling upgradeDomain

2019-10-03 Thread Wei-Chiu Chuang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16943740#comment-16943740
 ] 

Wei-Chiu Chuang commented on HDFS-14637:


+1

> Namenode may not replicate blocks to meet the policy after enabling 
> upgradeDomain
> -
>
> Key: HDFS-14637
> URL: https://issues.apache.org/jira/browse/HDFS-14637
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.3.0
>Reporter: Stephen O'Donnell
>Assignee: Stephen O'Donnell
>Priority: Major
> Attachments: HDFS-14637.001.patch, HDFS-14637.002.patch, 
> HDFS-14637.003.patch, HDFS-14637.004.patch, HDFS-14637.005.patch
>
>
> After changing the network topology or placement policy on a cluster and 
> restarting the namenode, the namenode will scan all blocks on the cluster at 
> startup, and check if they meet the current placement policy. If they do not, 
> they are added to the replication queue and the namenode will arrange for 
> them to be replicated to ensure the placement policy is used.
> If you start with a cluster with no UpgradeDomain, and then enable 
> UpgradeDomain, then on restart the NN does notice all the blocks violate the 
> placement policy and it adds them to the replication queue. I believe there 
> are some issues in the logic that prevent the blocks from replicating, 
> depending on the setup:
> With UD enabled, but no racks configured, and possibly on a 2-rack cluster, 
> the queued replication work never makes any progress, as in 
> blockManager.validateReconstructionWork(), it checks to see if the new 
> replica increases the number of racks, and if it does not, it skips it and 
> tries again later.
> {code:java}
> DatanodeStorageInfo[] targets = rw.getTargets();
> if ((numReplicas.liveReplicas() >= requiredRedundancy) &&
> (!isPlacementPolicySatisfied(block)) ) {
>   if (!isInNewRack(rw.getSrcNodes(), targets[0].getDatanodeDescriptor())) {
> // No use continuing, unless a new rack in this case
> return false;
>   }
>   // mark that the reconstruction work is to replicate internal block to a
>   // new rack.
>   rw.setNotEnoughRack();
> }
> {code}
> Additionally, in blockManager.scheduleReconstruction() there is some logic 
> that sets the number of new replicas required to one, if the live replicas >= 
> requiredRedundancy:
> {code:java}
> int additionalReplRequired;
> if (numReplicas.liveReplicas() < requiredRedundancy) {
>   additionalReplRequired = requiredRedundancy - numReplicas.liveReplicas()
>   - pendingNum;
> } else {
>   additionalReplRequired = 1; // Needed on a new rack
> }{code}
> With UD, it is possible for 2 new replicas to be needed to meet the block 
> placement policy, if all existing replicas are on nodes with the same domain. 
> For traditional '2 rack redundancy', only 1 new replica would ever have been 
> needed in this scenario.
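> To make that last point concrete, here is a tiny self-contained sketch of the 
> counting argument only (an illustration, not the attached patch and not 
> BlockManager code):
> {code:java}
> import java.util.Collections;
> import java.util.Set;
>
> // Illustration: how many extra replicas are needed to reach a target number
> // of distinct upgrade domains, given the domains of the current live replicas.
> public class UpgradeDomainCountExample {
>   static int additionalReplicasNeeded(Set<String> liveDomains, int requiredDomains) {
>     return Math.max(0, requiredDomains - liveDomains.size());
>   }
>
>   public static void main(String[] args) {
>     // All live replicas share the upgrade domain "ud1": two more replicas are
>     // needed to reach 3 distinct domains, while the rack-oriented logic above
>     // only schedules one.
>     System.out.println(additionalReplicasNeeded(Collections.singleton("ud1"), 3));
>   }
> }
> {code}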



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14216) NullPointerException happens in NamenodeWebHdfs

2019-10-02 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang updated HDFS-14216:
---
Resolution: Fixed
Status: Resolved  (was: Patch Available)

> NullPointerException happens in NamenodeWebHdfs
> ---
>
> Key: HDFS-14216
> URL: https://issues.apache.org/jira/browse/HDFS-14216
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: lujie
>Assignee: lujie
>Priority: Critical
> Fix For: 3.3.0, 3.1.4, 3.2.1
>
> Attachments: HDFS-14216.branch-3.1.patch, HDFS-14216_1.patch, 
> HDFS-14216_2.patch, HDFS-14216_3.patch, HDFS-14216_4.patch, 
> HDFS-14216_5.patch, HDFS-14216_6.patch, hadoop-hires-namenode-hadoop11.log
>
>
>  workload
> {code:java}
> curl -i -X PUT -T $HOMEPARH/test.txt 
> "http://hadoop1:9870/webhdfs/v1/input?op=CREATE=hadoop2;
> {code}
> the method
> {code:java}
> org.apache.hadoop.hdfs.server.namenode.web.resources.NamenodeWebHdfsMethods.chooseDatanode(String
>  excludeDatanodes){
>     HashSet excludes = new HashSet();
> if (excludeDatanodes != null) {
>for (String host : StringUtils
>  .getTrimmedStringCollection(excludeDatanodes)) {
>  int idx = host.indexOf(":");
>if (idx != -1) { 
> excludes.add(bm.getDatanodeManager().getDatanodeByXferAddr(
>host.substring(0, idx), Integer.parseInt(host.substring(idx + 
> 1;
>} else {
>   
> excludes.add(bm.getDatanodeManager().getDatanodeByHost(host));//line280
>}
>   }
> }
> }
> {code}
> when the datanode (e.g. hadoop2) is just wiped before line 280, or we give a 
> wrong DN name, then bm.getDatanodeManager().getDatanodeByHost(host) will 
> return null and *_excludes_* will contain null. When *_excludes_* is used 
> later, an NPE happens:
> {code:java}
> java.lang.NullPointerException
> at org.apache.hadoop.net.NodeBase.getPath(NodeBase.java:113)
> at 
> org.apache.hadoop.net.NetworkTopology.countNumOfAvailableNodes(NetworkTopology.java:672)
> at 
> org.apache.hadoop.net.NetworkTopology.chooseRandom(NetworkTopology.java:533)
> at 
> org.apache.hadoop.net.NetworkTopology.chooseRandom(NetworkTopology.java:491)
> at 
> org.apache.hadoop.hdfs.server.namenode.web.resources.NamenodeWebHdfsMethods.chooseDatanode(NamenodeWebHdfsMethods.java:323)
> at 
> org.apache.hadoop.hdfs.server.namenode.web.resources.NamenodeWebHdfsMethods.redirectURI(NamenodeWebHdfsMethods.java:384)
> at 
> org.apache.hadoop.hdfs.server.namenode.web.resources.NamenodeWebHdfsMethods.put(NamenodeWebHdfsMethods.java:652)
> at 
> org.apache.hadoop.hdfs.server.namenode.web.resources.NamenodeWebHdfsMethods$2.run(NamenodeWebHdfsMethods.java:600)
> at 
> org.apache.hadoop.hdfs.server.namenode.web.resources.NamenodeWebHdfsMethods$2.run(NamenodeWebHdfsMethods.java:597)
> at org.apache.hadoop.ipc.ExternalCall.run(ExternalCall.java:73)
> at org.apache.hadoop.ipc.ExternalCall.run(ExternalCall.java:30)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1876)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2830)
> {code}
>  
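> For reference, the shape of the defensive fix is sketched below. This is a 
> simplified illustration that continues the snippet above, not the exact 
> committed patch:
> {code:java}
> // Only add the lookup result to 'excludes' if the datanode is actually known;
> // a null entry is what later triggers the NullPointerException inside
> // NetworkTopology. (Illustration only.)
> DatanodeDescriptor dn = bm.getDatanodeManager().getDatanodeByHost(host);
> if (dn == null) {
>   LOG.debug("Excluded datanode " + host + " is not found in the cluster, ignoring it");
> } else {
>   excludes.add(dn);
> }
> {code}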



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14216) NullPointerException happens in NamenodeWebHdfs

2019-10-02 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang updated HDFS-14216:
---
Fix Version/s: 3.1.4

> NullPointerException happens in NamenodeWebHdfs
> ---
>
> Key: HDFS-14216
> URL: https://issues.apache.org/jira/browse/HDFS-14216
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: lujie
>Assignee: lujie
>Priority: Critical
> Fix For: 3.3.0, 3.2.1, 3.1.4
>
> Attachments: HDFS-14216.branch-3.1.patch, HDFS-14216_1.patch, 
> HDFS-14216_2.patch, HDFS-14216_3.patch, HDFS-14216_4.patch, 
> HDFS-14216_5.patch, HDFS-14216_6.patch, hadoop-hires-namenode-hadoop11.log
>
>
>  workload
> {code:java}
> curl -i -X PUT -T $HOMEPARH/test.txt 
> "http://hadoop1:9870/webhdfs/v1/input?op=CREATE=hadoop2;
> {code}
> the method
> {code:java}
> org.apache.hadoop.hdfs.server.namenode.web.resources.NamenodeWebHdfsMethods.chooseDatanode(String
>  excludeDatanodes){
>     HashSet excludes = new HashSet();
> if (excludeDatanodes != null) {
>for (String host : StringUtils
>  .getTrimmedStringCollection(excludeDatanodes)) {
>  int idx = host.indexOf(":");
>if (idx != -1) { 
> excludes.add(bm.getDatanodeManager().getDatanodeByXferAddr(
>host.substring(0, idx), Integer.parseInt(host.substring(idx + 
> 1;
>} else {
>   
> excludes.add(bm.getDatanodeManager().getDatanodeByHost(host));//line280
>}
>   }
> }
> }
> {code}
> when the datanode (e.g. hadoop2) is just wiped before line 280, or we give a 
> wrong DN name, then bm.getDatanodeManager().getDatanodeByHost(host) will 
> return null and *_excludes_* will contain null. When *_excludes_* is used 
> later, an NPE happens:
> {code:java}
> java.lang.NullPointerException
> at org.apache.hadoop.net.NodeBase.getPath(NodeBase.java:113)
> at 
> org.apache.hadoop.net.NetworkTopology.countNumOfAvailableNodes(NetworkTopology.java:672)
> at 
> org.apache.hadoop.net.NetworkTopology.chooseRandom(NetworkTopology.java:533)
> at 
> org.apache.hadoop.net.NetworkTopology.chooseRandom(NetworkTopology.java:491)
> at 
> org.apache.hadoop.hdfs.server.namenode.web.resources.NamenodeWebHdfsMethods.chooseDatanode(NamenodeWebHdfsMethods.java:323)
> at 
> org.apache.hadoop.hdfs.server.namenode.web.resources.NamenodeWebHdfsMethods.redirectURI(NamenodeWebHdfsMethods.java:384)
> at 
> org.apache.hadoop.hdfs.server.namenode.web.resources.NamenodeWebHdfsMethods.put(NamenodeWebHdfsMethods.java:652)
> at 
> org.apache.hadoop.hdfs.server.namenode.web.resources.NamenodeWebHdfsMethods$2.run(NamenodeWebHdfsMethods.java:600)
> at 
> org.apache.hadoop.hdfs.server.namenode.web.resources.NamenodeWebHdfsMethods$2.run(NamenodeWebHdfsMethods.java:597)
> at org.apache.hadoop.ipc.ExternalCall.run(ExternalCall.java:73)
> at org.apache.hadoop.ipc.ExternalCall.run(ExternalCall.java:30)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1876)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2830)
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14216) NullPointerException happens in NamenodeWebHdfs

2019-10-02 Thread Wei-Chiu Chuang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16943294#comment-16943294
 ] 

Wei-Chiu Chuang commented on HDFS-14216:


Failure doesn't look related. Pushing it to branch-3.1

> NullPointerException happens in NamenodeWebHdfs
> ---
>
> Key: HDFS-14216
> URL: https://issues.apache.org/jira/browse/HDFS-14216
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: lujie
>Assignee: lujie
>Priority: Critical
> Fix For: 3.3.0, 3.2.1
>
> Attachments: HDFS-14216.branch-3.1.patch, HDFS-14216_1.patch, 
> HDFS-14216_2.patch, HDFS-14216_3.patch, HDFS-14216_4.patch, 
> HDFS-14216_5.patch, HDFS-14216_6.patch, hadoop-hires-namenode-hadoop11.log
>
>
>  workload
> {code:java}
> curl -i -X PUT -T $HOMEPARH/test.txt 
> "http://hadoop1:9870/webhdfs/v1/input?op=CREATE=hadoop2;
> {code}
> the method
> {code:java}
> org.apache.hadoop.hdfs.server.namenode.web.resources.NamenodeWebHdfsMethods.chooseDatanode(String
>  excludeDatanodes){
>     HashSet excludes = new HashSet();
> if (excludeDatanodes != null) {
>for (String host : StringUtils
>  .getTrimmedStringCollection(excludeDatanodes)) {
>  int idx = host.indexOf(":");
>if (idx != -1) { 
> excludes.add(bm.getDatanodeManager().getDatanodeByXferAddr(
>host.substring(0, idx), Integer.parseInt(host.substring(idx + 
> 1;
>} else {
>   
> excludes.add(bm.getDatanodeManager().getDatanodeByHost(host));//line280
>}
>   }
> }
> }
> {code}
> when the datanode (e.g. hadoop2) is just wiped before line 280, or we give a 
> wrong DN name, then bm.getDatanodeManager().getDatanodeByHost(host) will 
> return null and *_excludes_* will contain null. When *_excludes_* is used 
> later, an NPE happens:
> {code:java}
> java.lang.NullPointerException
> at org.apache.hadoop.net.NodeBase.getPath(NodeBase.java:113)
> at 
> org.apache.hadoop.net.NetworkTopology.countNumOfAvailableNodes(NetworkTopology.java:672)
> at 
> org.apache.hadoop.net.NetworkTopology.chooseRandom(NetworkTopology.java:533)
> at 
> org.apache.hadoop.net.NetworkTopology.chooseRandom(NetworkTopology.java:491)
> at 
> org.apache.hadoop.hdfs.server.namenode.web.resources.NamenodeWebHdfsMethods.chooseDatanode(NamenodeWebHdfsMethods.java:323)
> at 
> org.apache.hadoop.hdfs.server.namenode.web.resources.NamenodeWebHdfsMethods.redirectURI(NamenodeWebHdfsMethods.java:384)
> at 
> org.apache.hadoop.hdfs.server.namenode.web.resources.NamenodeWebHdfsMethods.put(NamenodeWebHdfsMethods.java:652)
> at 
> org.apache.hadoop.hdfs.server.namenode.web.resources.NamenodeWebHdfsMethods$2.run(NamenodeWebHdfsMethods.java:600)
> at 
> org.apache.hadoop.hdfs.server.namenode.web.resources.NamenodeWebHdfsMethods$2.run(NamenodeWebHdfsMethods.java:597)
> at org.apache.hadoop.ipc.ExternalCall.run(ExternalCall.java:73)
> at org.apache.hadoop.ipc.ExternalCall.run(ExternalCall.java:30)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1876)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2830)
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14678) Allow triggerBlockReport to a specific namenode

2019-10-02 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang updated HDFS-14678:
---
Fix Version/s: 3.2.2
   3.1.4

> Allow triggerBlockReport to a specific namenode
> ---
>
> Key: HDFS-14678
> URL: https://issues.apache.org/jira/browse/HDFS-14678
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 2.8.2
>Reporter: Leon Gao
>Assignee: Leon Gao
>Priority: Major
> Fix For: 3.3.0, 3.1.4, 3.2.2
>
>
> In our largest prod cluster (running 2.8.2) we have >3k hosts. Every time 
> we rolling-restart the NNs we need to wait for block reports, which takes 
> >2.5 hours for each NN.
> One way to make it faster is to manually trigger a full block report from all 
> datanodes. [HDFS-7278|https://issues.apache.org/jira/browse/HDFS-7278]. 
> However, the current triggerBlockReport command will trigger a block report 
> on all NNs which will flood the active NN as well.
> A quick solution would be adding an option to specify the NN that the manually 
> triggered block report should go to, something like:
> *_hdfs dfsadmin [-triggerBlockReport [-incremental] <datanode_host:ipc_port>] 
> [-namenode <namenode_host:ipc_port>]_*
> So when doing a restart of standby NN or observer NN we can trigger an 
> aggressive block report to a specific NN to exit safemode faster without 
> risking active NN performance.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14678) Allow triggerBlockReport to a specific namenode

2019-10-02 Thread Wei-Chiu Chuang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16943259#comment-16943259
 ] 

Wei-Chiu Chuang commented on HDFS-14678:


Cherry picking the commit into branch-3.2 and branch-3.1.
There's just a trivial conflict in the test code due to HADOOP-14178. 

> Allow triggerBlockReport to a specific namenode
> ---
>
> Key: HDFS-14678
> URL: https://issues.apache.org/jira/browse/HDFS-14678
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 2.8.2
>Reporter: Leon Gao
>Assignee: Leon Gao
>Priority: Major
> Fix For: 3.3.0
>
>
> In our largest prod cluster (running 2.8.2) we have >3k hosts. Every time 
> we rolling-restart the NNs we need to wait for block reports, which takes 
> >2.5 hours for each NN.
> One way to make it faster is to manually trigger a full block report from all 
> datanodes. [HDFS-7278|https://issues.apache.org/jira/browse/HDFS-7278]. 
> However, the current triggerBlockReport command will trigger a block report 
> on all NNs which will flood the active NN as well.
> A quick solution would be adding an option to specify the NN that the manually 
> triggered block report should go to, something like:
> *_hdfs dfsadmin [-triggerBlockReport [-incremental] <datanode_host:ipc_port>] 
> [-namenode <namenode_host:ipc_port>]_*
> So when doing a restart of standby NN or observer NN we can trigger an 
> aggressive block report to a specific NN to exit safemode faster without 
> risking active NN performance.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-8881) Erasure Coding: internal blocks got missed and got over-replicated at the same time

2019-10-02 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-8881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang updated HDFS-8881:
--
Resolution: Duplicate
Status: Resolved  (was: Patch Available)

> Erasure Coding: internal blocks got missed and got over-replicated at the 
> same time
> ---
>
> Key: HDFS-8881
> URL: https://issues.apache.org/jira/browse/HDFS-8881
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: erasure-coding
>Reporter: Walter Su
>Assignee: Walter Su
>Priority: Major
> Attachments: HDFS-8881.00.patch
>
>
> We know the Repl checking depends on {{BlockManager#countNodes()}}, but 
> countNodes() has limitation for striped blockGroup.
> *One* missing internal block will be caught by Repl checking, and handled by 
> ReplicationMonitor.
> *One* over-replicated internal block will be caught by Repl checking, and 
> handled by processOverReplicatedBlocks.
> *One* missing internal block and *two* over-replicated internal blocks *at 
> the same time* will be caught by Repl checking, and handled by 
> processOverReplicatedBlocks, later by ReplicationMonitor.
> *One* missing internal block and *one* over-replicated internal block *at the 
> same time* will *NOT* be caught by Repl checking.
> "at the same time" means one missing internal block can't be recovered, and 
> one internal block got over-replicated anyway. For example:
> scenario A:
> step 1. block #0 and #1 are reported missing.
> 2. a new #1 got recovered.
> 3. the old #1 come back, and the recovery work for #0 failed.
> scenario B:
> 1. A DN that has #1 is decommissioned/dead.
> 2. Block #0 is reported missing.
> 3. The DN with #1 is recommissioned, and the recovery work for #0 failed.
> In the end, the blockGroup has \[1, 1, 2, 3, 4, 5, 6, 7, 8\], assume 6+3 
> schema. Client always needs to decode #0 if the blockGroup doesn't get 
> handled.
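> A tiny self-contained illustration of why a plain replica count misses this 
> case (just the counting argument from the example above, not BlockManager code):
> {code:java}
> import java.util.HashSet;
> import java.util.Set;
>
> public class StripedCountExample {
>   public static void main(String[] args) {
>     // Reported internal block indices for a 6+3 block group: index 1 is
>     // duplicated and index 0 is missing, yet the total count is still 9.
>     int[] reportedIndices = {1, 1, 2, 3, 4, 5, 6, 7, 8};
>     Set<Integer> distinct = new HashSet<>();
>     for (int i : reportedIndices) {
>       distinct.add(i);
>     }
>     System.out.println("total replicas   = " + reportedIndices.length); // 9
>     System.out.println("distinct indices = " + distinct.size());        // 8 -> #0 missing
>   }
> }
> {code}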



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14523) Remove excess read lock for NetworkToplogy

2019-10-02 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang updated HDFS-14523:
---
Fix Version/s: 3.2.2
   3.1.4

> Remove excess read lock for NetworkToplogy
> --
>
> Key: HDFS-14523
> URL: https://issues.apache.org/jira/browse/HDFS-14523
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Wu Weiwei
>Assignee: Wu Weiwei
>Priority: Major
> Fix For: 3.3.0, 3.1.4, 3.2.2
>
> Attachments: HDFS-14523.1.patch
>
>
> getNumOfRacks() and getNumOfLeaves() are two frequently called methods 
> for BlockPlacementPolicy. These two methods need to take the NetworkTopology read 
> lock, and taking the lock in such frequently called methods may impact NameNode 
> performance. 
> These two methods get the number of racks and the number of leaves just for 
> the chooseTarget calculation; the lock in these two methods cannot guarantee these two 
> values will not change in the subsequent calculations.
> I think it's safe to remove the read lock from these two methods.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14618) Incorrect synchronization of ArrayList field (ArrayList is thread-unsafe).

2019-10-02 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang updated HDFS-14618:
---
Fix Version/s: 3.2.2
   3.1.4

> Incorrect synchronization of ArrayList field (ArrayList is thread-unsafe).
> --
>
> Key: HDFS-14618
> URL: https://issues.apache.org/jira/browse/HDFS-14618
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Paul Ward
>Assignee: Paul Ward
>Priority: Critical
>  Labels: fix-provided, patch-available
> Fix For: 3.3.0, 3.1.4, 3.2.2
>
> Attachments: race.patch
>
>
> I submitted a  CR for this issue at:
> https://github.com/apache/hadoop/pull/1030
> The field {{timedOutItems}}  (an {{ArrayList}}, i.e., not thread safe):
> https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/PendingReconstructionBlocks.java#L70
> is protected by synchronization on itself ({{timedOutItems}}):
> https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/PendingReconstructionBlocks.java#L167-L168
> https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/PendingReconstructionBlocks.java#L267-L268
> https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/PendingReconstructionBlocks.java#L178
> However, in one place:
> https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/PendingReconstructionBlocks.java#L133-L135
> it is (trying to be) protected by synchronized using 
> {{pendingReconstructions}} --- but this cannot protect {{timedOutItems}}.
> Synchronizing on different objects does not ensure mutual exclusion with the 
> other locations.
> I.e., 2 code locations, one synchronized on {{pendingReconstructions}} and 
> the other on {{timedOutItems}}, can still execute concurrently.
> This CR adds the synchronized on {{timedOutItems}}.
> Note that this CR keeps the synchronized on {{pendingReconstructions}}, which 
> is needed for a different purpose (protect {{pendingReconstructions}})
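> As a minimal sketch of the locking pattern this CR establishes (the variable 
> names below are placeholders, not the file's exact code):
> {code:java}
> // Any access to the thread-unsafe ArrayList 'timedOutItems' must synchronize
> // on timedOutItems itself, even inside a block that already holds the
> // pendingReconstructions monitor for other state.
> synchronized (pendingReconstructions) {
>   // ... work that genuinely needs the pendingReconstructions lock ...
>   synchronized (timedOutItems) {
>     timedOutItems.add(timedOutBlock); // same lock as every other reader/writer
>   }
> }
> {code}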



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14216) NullPointerException happens in NamenodeWebHdfs

2019-10-02 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang updated HDFS-14216:
---
Attachment: HDFS-14216.branch-3.1.patch

> NullPointerException happens in NamenodeWebHdfs
> ---
>
> Key: HDFS-14216
> URL: https://issues.apache.org/jira/browse/HDFS-14216
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: lujie
>Assignee: lujie
>Priority: Critical
> Fix For: 3.3.0, 3.2.1
>
> Attachments: HDFS-14216.branch-3.1.patch, HDFS-14216_1.patch, 
> HDFS-14216_2.patch, HDFS-14216_3.patch, HDFS-14216_4.patch, 
> HDFS-14216_5.patch, HDFS-14216_6.patch, hadoop-hires-namenode-hadoop11.log
>
>
>  workload
> {code:java}
> curl -i -X PUT -T $HOMEPARH/test.txt 
> "http://hadoop1:9870/webhdfs/v1/input?op=CREATE=hadoop2;
> {code}
> the method
> {code:java}
> org.apache.hadoop.hdfs.server.namenode.web.resources.NamenodeWebHdfsMethods.chooseDatanode(String
>  excludeDatanodes){
>     HashSet excludes = new HashSet();
> if (excludeDatanodes != null) {
>for (String host : StringUtils
>  .getTrimmedStringCollection(excludeDatanodes)) {
>  int idx = host.indexOf(":");
>if (idx != -1) { 
> excludes.add(bm.getDatanodeManager().getDatanodeByXferAddr(
>host.substring(0, idx), Integer.parseInt(host.substring(idx + 
> 1;
>} else {
>   
> excludes.add(bm.getDatanodeManager().getDatanodeByHost(host));//line280
>}
>   }
> }
> }
> {code}
> when the datanode (e.g. hadoop2) is just wiped before line 280, or we give a 
> wrong DN name, then bm.getDatanodeManager().getDatanodeByHost(host) will 
> return null and *_excludes_* will contain null. When *_excludes_* is used 
> later, an NPE happens:
> {code:java}
> java.lang.NullPointerException
> at org.apache.hadoop.net.NodeBase.getPath(NodeBase.java:113)
> at 
> org.apache.hadoop.net.NetworkTopology.countNumOfAvailableNodes(NetworkTopology.java:672)
> at 
> org.apache.hadoop.net.NetworkTopology.chooseRandom(NetworkTopology.java:533)
> at 
> org.apache.hadoop.net.NetworkTopology.chooseRandom(NetworkTopology.java:491)
> at 
> org.apache.hadoop.hdfs.server.namenode.web.resources.NamenodeWebHdfsMethods.chooseDatanode(NamenodeWebHdfsMethods.java:323)
> at 
> org.apache.hadoop.hdfs.server.namenode.web.resources.NamenodeWebHdfsMethods.redirectURI(NamenodeWebHdfsMethods.java:384)
> at 
> org.apache.hadoop.hdfs.server.namenode.web.resources.NamenodeWebHdfsMethods.put(NamenodeWebHdfsMethods.java:652)
> at 
> org.apache.hadoop.hdfs.server.namenode.web.resources.NamenodeWebHdfsMethods$2.run(NamenodeWebHdfsMethods.java:600)
> at 
> org.apache.hadoop.hdfs.server.namenode.web.resources.NamenodeWebHdfsMethods$2.run(NamenodeWebHdfsMethods.java:597)
> at org.apache.hadoop.ipc.ExternalCall.run(ExternalCall.java:73)
> at org.apache.hadoop.ipc.ExternalCall.run(ExternalCall.java:30)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1876)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2830)
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14216) NullPointerException happens in NamenodeWebHdfs

2019-10-02 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang updated HDFS-14216:
---
Status: Patch Available  (was: Reopened)

> NullPointerException happens in NamenodeWebHdfs
> ---
>
> Key: HDFS-14216
> URL: https://issues.apache.org/jira/browse/HDFS-14216
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: lujie
>Assignee: lujie
>Priority: Critical
> Fix For: 3.3.0, 3.2.1
>
> Attachments: HDFS-14216.branch-3.1.patch, HDFS-14216_1.patch, 
> HDFS-14216_2.patch, HDFS-14216_3.patch, HDFS-14216_4.patch, 
> HDFS-14216_5.patch, HDFS-14216_6.patch, hadoop-hires-namenode-hadoop11.log
>
>
>  workload
> {code:java}
> curl -i -X PUT -T $HOMEPARH/test.txt 
> "http://hadoop1:9870/webhdfs/v1/input?op=CREATE=hadoop2;
> {code}
> the method
> {code:java}
> org.apache.hadoop.hdfs.server.namenode.web.resources.NamenodeWebHdfsMethods.chooseDatanode(String
>  excludeDatanodes){
>     HashSet excludes = new HashSet();
> if (excludeDatanodes != null) {
>for (String host : StringUtils
>  .getTrimmedStringCollection(excludeDatanodes)) {
>  int idx = host.indexOf(":");
>if (idx != -1) { 
> excludes.add(bm.getDatanodeManager().getDatanodeByXferAddr(
>host.substring(0, idx), Integer.parseInt(host.substring(idx + 
> 1;
>} else {
>   
> excludes.add(bm.getDatanodeManager().getDatanodeByHost(host));//line280
>}
>   }
> }
> }
> {code}
> when the datanode (e.g. hadoop2) is just wiped before line 280, or we give a 
> wrong DN name, then bm.getDatanodeManager().getDatanodeByHost(host) will 
> return null and *_excludes_* will contain null. When *_excludes_* is used 
> later, an NPE happens:
> {code:java}
> java.lang.NullPointerException
> at org.apache.hadoop.net.NodeBase.getPath(NodeBase.java:113)
> at 
> org.apache.hadoop.net.NetworkTopology.countNumOfAvailableNodes(NetworkTopology.java:672)
> at 
> org.apache.hadoop.net.NetworkTopology.chooseRandom(NetworkTopology.java:533)
> at 
> org.apache.hadoop.net.NetworkTopology.chooseRandom(NetworkTopology.java:491)
> at 
> org.apache.hadoop.hdfs.server.namenode.web.resources.NamenodeWebHdfsMethods.chooseDatanode(NamenodeWebHdfsMethods.java:323)
> at 
> org.apache.hadoop.hdfs.server.namenode.web.resources.NamenodeWebHdfsMethods.redirectURI(NamenodeWebHdfsMethods.java:384)
> at 
> org.apache.hadoop.hdfs.server.namenode.web.resources.NamenodeWebHdfsMethods.put(NamenodeWebHdfsMethods.java:652)
> at 
> org.apache.hadoop.hdfs.server.namenode.web.resources.NamenodeWebHdfsMethods$2.run(NamenodeWebHdfsMethods.java:600)
> at 
> org.apache.hadoop.hdfs.server.namenode.web.resources.NamenodeWebHdfsMethods$2.run(NamenodeWebHdfsMethods.java:597)
> at org.apache.hadoop.ipc.ExternalCall.run(ExternalCall.java:73)
> at org.apache.hadoop.ipc.ExternalCall.run(ExternalCall.java:30)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1876)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2830)
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Reopened] (HDFS-14216) NullPointerException happens in NamenodeWebHdfs

2019-10-02 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang reopened HDFS-14216:


Reopen for branch-3.1. The only thing different is the LOG class change. Can't 
use parameterized logging.

> NullPointerException happens in NamenodeWebHdfs
> ---
>
> Key: HDFS-14216
> URL: https://issues.apache.org/jira/browse/HDFS-14216
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: lujie
>Assignee: lujie
>Priority: Critical
> Fix For: 3.3.0, 3.2.1
>
> Attachments: HDFS-14216_1.patch, HDFS-14216_2.patch, 
> HDFS-14216_3.patch, HDFS-14216_4.patch, HDFS-14216_5.patch, 
> HDFS-14216_6.patch, hadoop-hires-namenode-hadoop11.log
>
>
>  workload
> {code:java}
> curl -i -X PUT -T $HOMEPARH/test.txt 
> "http://hadoop1:9870/webhdfs/v1/input?op=CREATE=hadoop2;
> {code}
> the method
> {code:java}
> org.apache.hadoop.hdfs.server.namenode.web.resources.NamenodeWebHdfsMethods.chooseDatanode(String
>  excludeDatanodes){
>     HashSet excludes = new HashSet();
> if (excludeDatanodes != null) {
>for (String host : StringUtils
>  .getTrimmedStringCollection(excludeDatanodes)) {
>  int idx = host.indexOf(":");
>if (idx != -1) { 
> excludes.add(bm.getDatanodeManager().getDatanodeByXferAddr(
>host.substring(0, idx), Integer.parseInt(host.substring(idx + 
> 1;
>} else {
>   
> excludes.add(bm.getDatanodeManager().getDatanodeByHost(host));//line280
>}
>   }
> }
> }
> {code}
> when the datanode (e.g. hadoop2) is just wiped before line 280, or we give a 
> wrong DN name, then bm.getDatanodeManager().getDatanodeByHost(host) will 
> return null and *_excludes_* will contain null. When *_excludes_* is used 
> later, an NPE happens:
> {code:java}
> java.lang.NullPointerException
> at org.apache.hadoop.net.NodeBase.getPath(NodeBase.java:113)
> at 
> org.apache.hadoop.net.NetworkTopology.countNumOfAvailableNodes(NetworkTopology.java:672)
> at 
> org.apache.hadoop.net.NetworkTopology.chooseRandom(NetworkTopology.java:533)
> at 
> org.apache.hadoop.net.NetworkTopology.chooseRandom(NetworkTopology.java:491)
> at 
> org.apache.hadoop.hdfs.server.namenode.web.resources.NamenodeWebHdfsMethods.chooseDatanode(NamenodeWebHdfsMethods.java:323)
> at 
> org.apache.hadoop.hdfs.server.namenode.web.resources.NamenodeWebHdfsMethods.redirectURI(NamenodeWebHdfsMethods.java:384)
> at 
> org.apache.hadoop.hdfs.server.namenode.web.resources.NamenodeWebHdfsMethods.put(NamenodeWebHdfsMethods.java:652)
> at 
> org.apache.hadoop.hdfs.server.namenode.web.resources.NamenodeWebHdfsMethods$2.run(NamenodeWebHdfsMethods.java:600)
> at 
> org.apache.hadoop.hdfs.server.namenode.web.resources.NamenodeWebHdfsMethods$2.run(NamenodeWebHdfsMethods.java:597)
> at org.apache.hadoop.ipc.ExternalCall.run(ExternalCall.java:73)
> at org.apache.hadoop.ipc.ExternalCall.run(ExternalCall.java:30)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1876)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2830)
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14610) HashMap is not thread safe. Field storageMap is typically synchronized by storageMap. However, in one place, field storageMap is not protected with synchronized.

2019-10-02 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang updated HDFS-14610:
---
Fix Version/s: 3.2.2
   3.1.4

> HashMap is not thread safe. Field storageMap is typically synchronized by 
> storageMap. However, in one place, field storageMap is not protected with 
> synchronized.
> -
>
> Key: HDFS-14610
> URL: https://issues.apache.org/jira/browse/HDFS-14610
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Paul Ward
>Assignee: Paul Ward
>Priority: Critical
>  Labels: fix-provided, patch-available
> Fix For: 3.3.0, 3.1.4, 3.2.2
>
> Attachments: addingSynchronization.patch
>
>
> I submitted a CR for this issue at:
> [https://github.com/apache/hadoop/pull/1015]
> The field *storageMap* (a *HashMap*)
> [https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeDescriptor.java#L155]
> is typically protected by synchronization on *storageMap*, e.g.,
> [https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeDescriptor.java#L294]
> [https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeDescriptor.java#L443]
> [https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeDescriptor.java#L484]
> For a total of 9 locations.
> The reason is because *HashMap* is not thread safe.
> However, here:
> [https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeDescriptor.java#L455]
> {{DatanodeStorageInfo storage =}}
> {{   storageMap.get(report.getStorage().getStorageID());}}
> It is not synchronized.
> Note that in the same method:
> [https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeDescriptor.java#L484]
> *storageMap* is again protected by synchronization:
> {{synchronized (storageMap) {}}
> {{   storageMapSize = storageMap.size();}}
> {{}}}
>  
> The CR I linked above protects the above instance (line 455) with 
> synchronization,
> like in line 484 and in all other occurrences.
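> A simplified sketch of that change (illustration only, not the exact diff):
> {code:java}
> // Wrap the line-455 lookup in the same monitor used by the other storageMap
> // access sites, so the HashMap is never read without synchronization.
> DatanodeStorageInfo storage;
> synchronized (storageMap) {
>   storage = storageMap.get(report.getStorage().getStorageID());
> }
> {code}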



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14527) Stop all DataNodes may result in NN terminate

2019-10-02 Thread Wei-Chiu Chuang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16943165#comment-16943165
 ] 

Wei-Chiu Chuang commented on HDFS-14527:


Patch applies cleanly in branch-3.2 also.
But it doesn't compile in branch-3.1. I'll provide a patch shortly.

> Stop all DataNodes may result in NN terminate
> -
>
> Key: HDFS-14527
> URL: https://issues.apache.org/jira/browse/HDFS-14527
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Reporter: Xiaoqiao He
>Assignee: Xiaoqiao He
>Priority: Major
> Fix For: 3.3.0, 3.2.2
>
> Attachments: HDFS-14527.001.patch, HDFS-14527.002.patch, 
> HDFS-14527.003.patch, HDFS-14527.004.patch, HDFS-14527.005.patch
>
>
> If we stop all datanodes of the cluster, BlockPlacementPolicyDefault#chooseTarget 
> may get an ArithmeticException when calling #getMaxNodesPerRack, which throws 
> the runtime exception out to BlockManager's ReplicationMonitor thread and 
> then terminates the NN.
> The root cause is that BlockPlacementPolicyDefault#chooseTarget does not hold the 
> global lock, so if all DataNodes die between 
> {{clusterMap.getNumberOfLeaves()}} and {{getMaxNodesPerRack}} then it meets an 
> {{ArithmeticException}} while invoking {{getMaxNodesPerRack}}.
> {code:java}
>   private DatanodeStorageInfo[] chooseTarget(int numOfReplicas,
> Node writer,
> List chosenStorage,
> boolean returnChosenNodes,
> Set excludedNodes,
> long blocksize,
> final BlockStoragePolicy storagePolicy,
> EnumSet addBlockFlags,
> EnumMap sTypes) {
> if (numOfReplicas == 0 || clusterMap.getNumOfLeaves()==0) {
>   return DatanodeStorageInfo.EMPTY_ARRAY;
> }
> ..
> int[] result = getMaxNodesPerRack(chosenStorage.size(), numOfReplicas);
> ..
> }
> {code}
> Some detailed logs are shown below.
> {code:java}
> 2019-05-31 12:29:21,803 ERROR 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: 
> ReplicationMonitor thread received Runtime exception. 
> java.lang.ArithmeticException: / by zero
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.getMaxNodesPerRack(BlockPlacementPolicyDefault.java:282)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:228)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:132)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationWork.chooseTargets(BlockManager.java:4533)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationWork.access$1800(BlockManager.java:4493)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWorkForBlocks(BlockManager.java:1954)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWork(BlockManager.java:1830)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:4453)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationMonitor.run(BlockManager.java:4388)
> at java.lang.Thread.run(Thread.java:745)
> 2019-05-31 12:29:21,805 INFO org.apache.hadoop.util.ExitUtil: Exiting with 
> status 1
> {code}
> To be honest, this is not a serious bug and is not easy to reproduce, since if we stop 
> all DataNodes and only the NameNode stays alive, HDFS cannot offer service 
> normally and we can only retrieve the directory tree. It may be one corner case.
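> For illustration, one defensive shape such a fix can take is sketched below; 
> this is a simplified stand-in for the real method, shown only to demonstrate 
> guarding the division, not the committed patch:
> {code:java}
> // Simplified: tolerate a cluster that lost all racks between the caller's
> // clusterMap.getNumOfLeaves() check and this division, instead of letting the
> // ArithmeticException terminate the ReplicationMonitor thread.
> private int[] getMaxNodesPerRack(int numOfChosen, int numOfReplicas) {
>   int totalNumOfReplicas = numOfChosen + numOfReplicas;
>   int numOfRacks = clusterMap.getNumOfRacks();
>   if (numOfRacks <= 0 || totalNumOfReplicas <= 0) {
>     return new int[] {0, 0}; // nothing can be placed on an empty cluster
>   }
>   int maxNodesPerRack = (totalNumOfReplicas - 1) / numOfRacks + 1;
>   return new int[] {totalNumOfReplicas, maxNodesPerRack};
> }
> {code}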



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14527) Stop all DataNodes may result in NN terminate

2019-10-02 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang updated HDFS-14527:
---
Fix Version/s: 3.2.2

> Stop all DataNodes may result in NN terminate
> -
>
> Key: HDFS-14527
> URL: https://issues.apache.org/jira/browse/HDFS-14527
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Reporter: Xiaoqiao He
>Assignee: Xiaoqiao He
>Priority: Major
> Fix For: 3.3.0, 3.2.2
>
> Attachments: HDFS-14527.001.patch, HDFS-14527.002.patch, 
> HDFS-14527.003.patch, HDFS-14527.004.patch, HDFS-14527.005.patch
>
>
> If we stop all datanodes of the cluster, BlockPlacementPolicyDefault#chooseTarget 
> may get an ArithmeticException when calling #getMaxNodesPerRack, which throws 
> the runtime exception out to BlockManager's ReplicationMonitor thread and 
> then terminates the NN.
> The root cause is that BlockPlacementPolicyDefault#chooseTarget does not hold the 
> global lock, so if all DataNodes die between 
> {{clusterMap.getNumberOfLeaves()}} and {{getMaxNodesPerRack}} then it meets an 
> {{ArithmeticException}} while invoking {{getMaxNodesPerRack}}.
> {code:java}
>   private DatanodeStorageInfo[] chooseTarget(int numOfReplicas,
> Node writer,
> List chosenStorage,
> boolean returnChosenNodes,
> Set excludedNodes,
> long blocksize,
> final BlockStoragePolicy storagePolicy,
> EnumSet addBlockFlags,
> EnumMap sTypes) {
> if (numOfReplicas == 0 || clusterMap.getNumOfLeaves()==0) {
>   return DatanodeStorageInfo.EMPTY_ARRAY;
> }
> ..
> int[] result = getMaxNodesPerRack(chosenStorage.size(), numOfReplicas);
> ..
> }
> {code}
> Some detailed logs are shown below.
> {code:java}
> 2019-05-31 12:29:21,803 ERROR 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: 
> ReplicationMonitor thread received Runtime exception. 
> java.lang.ArithmeticException: / by zero
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.getMaxNodesPerRack(BlockPlacementPolicyDefault.java:282)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:228)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:132)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationWork.chooseTargets(BlockManager.java:4533)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationWork.access$1800(BlockManager.java:4493)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWorkForBlocks(BlockManager.java:1954)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWork(BlockManager.java:1830)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:4453)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationMonitor.run(BlockManager.java:4388)
> at java.lang.Thread.run(Thread.java:745)
> 2019-05-31 12:29:21,805 INFO org.apache.hadoop.util.ExitUtil: Exiting with 
> status 1
> {code}
> To be honest, this is not a serious bug and is not easy to reproduce, since if we stop 
> all DataNodes and only the NameNode stays alive, HDFS cannot offer service 
> normally and we can only retrieve the directory tree. It may be one corner case.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14808) EC: Improper size values for corrupt ec block in LOG

2019-10-02 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang updated HDFS-14808:
---
Fix Version/s: 3.2.2
   3.1.4

> EC: Improper size values for corrupt ec block in LOG 
> -
>
> Key: HDFS-14808
> URL: https://issues.apache.org/jira/browse/HDFS-14808
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ec
>Reporter: Harshakiran Reddy
>Assignee: Ayush Saxena
>Priority: Major
> Fix For: 3.3.0, 3.1.4, 3.2.2
>
> Attachments: HDFS-14808-01.patch
>
>
> If the block corruption reason is size mismatch the log. The values shown and 
> compared are ambiguous.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14699) Erasure Coding: Storage not considered in live replica when replication streams hard limit reached to threshold

2019-10-02 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang updated HDFS-14699:
---
Fix Version/s: 3.1.4

> Erasure Coding: Storage not considered in live replica when replication 
> streams hard limit reached to threshold
> ---
>
> Key: HDFS-14699
> URL: https://issues.apache.org/jira/browse/HDFS-14699
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ec
>Affects Versions: 3.2.0, 3.1.1, 3.3.0
>Reporter: Zhao Yi Ming
>Assignee: Zhao Yi Ming
>Priority: Critical
>  Labels: patch
> Fix For: 3.3.0, 3.1.4, 3.2.2
>
> Attachments: HDFS-14699.00.patch, HDFS-14699.01.patch, 
> HDFS-14699.02.patch, HDFS-14699.03.patch, HDFS-14699.04.patch, 
> HDFS-14699.05.patch, image-2019-08-20-19-58-51-872.png, 
> image-2019-09-02-17-51-46-742.png
>
>
> We tried the EC function on an 80-node cluster with Hadoop 3.1.1 and hit the 
> same scenario as described in https://issues.apache.org/jira/browse/HDFS-8881. 
> Following are our testing steps; hope they are helpful. (The following DNs have the 
> testing internal blocks.)
>  # We customized a new 10-2-1024k policy and used it on a path; now we have 12 
> internal blocks (12 live blocks).
>  # Decommission one DN; after the decommission completes we have 13 
> internal blocks (12 live blocks and 1 decommissioning block).
>  # Then shut down one DN which does not have the same block id as the 
> decommissioning block; now we have 12 internal blocks (11 live blocks and 1 
> decommissioning block).
>  # After waiting for about 600s (before the heartbeat comes), recommission the 
> decommissioned DN; now we have 12 internal blocks (11 live blocks and 1 
> duplicate block).
>  # Then EC does not reconstruct the missing block.
> We think this is a critical issue for using the EC function in a production 
> env. Could you help? Thanks a lot!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14754) Erasure Coding : The number of Under-Replicated Blocks never reduced

2019-10-02 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang updated HDFS-14754:
---
Status: Patch Available  (was: Reopened)

Reopen & submit the branch-3.1 patch.
Branch-3.2 was cherry picked without conflict.

> Erasure Coding :  The number of Under-Replicated Blocks never reduced
> -
>
> Key: HDFS-14754
> URL: https://issues.apache.org/jira/browse/HDFS-14754
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ec
>Reporter: hemanthboyina
>Assignee: hemanthboyina
>Priority: Critical
> Fix For: 3.3.0, 3.2.2
>
> Attachments: HDFS-14754-addendum.001.patch, HDFS-14754.001.patch, 
> HDFS-14754.002.patch, HDFS-14754.003.patch, HDFS-14754.004.patch, 
> HDFS-14754.005.patch, HDFS-14754.006.patch, HDFS-14754.007.patch, 
> HDFS-14754.008.patch, HDFS-14754.branch-3.1.patch
>
>
> Using EC RS-3-2 with 6 DNs, 
> we came across a scenario where, among the 5 EC blocks, the same block was 
> replicated three times and two blocks went missing.
> The replicated block was not being deleted, and the missing block could not be 
> reconstructed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Reopened] (HDFS-14754) Erasure Coding : The number of Under-Replicated Blocks never reduced

2019-10-02 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang reopened HDFS-14754:


> Erasure Coding :  The number of Under-Replicated Blocks never reduced
> -
>
> Key: HDFS-14754
> URL: https://issues.apache.org/jira/browse/HDFS-14754
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ec
>Reporter: hemanthboyina
>Assignee: hemanthboyina
>Priority: Critical
> Fix For: 3.3.0, 3.2.2
>
> Attachments: HDFS-14754-addendum.001.patch, HDFS-14754.001.patch, 
> HDFS-14754.002.patch, HDFS-14754.003.patch, HDFS-14754.004.patch, 
> HDFS-14754.005.patch, HDFS-14754.006.patch, HDFS-14754.007.patch, 
> HDFS-14754.008.patch, HDFS-14754.branch-3.1.patch
>
>
> Using EC RS-3-2 with 6 DNs, 
> we came across a scenario where, among the 5 EC blocks, the same block was 
> replicated three times and two blocks went missing.
> The replicated block was not being deleted, and the missing block could not be 
> reconstructed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14754) Erasure Coding : The number of Under-Replicated Blocks never reduced

2019-10-02 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang updated HDFS-14754:
---
Fix Version/s: 3.2.2

> Erasure Coding :  The number of Under-Replicated Blocks never reduced
> -
>
> Key: HDFS-14754
> URL: https://issues.apache.org/jira/browse/HDFS-14754
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ec
>Reporter: hemanthboyina
>Assignee: hemanthboyina
>Priority: Critical
> Fix For: 3.3.0, 3.2.2
>
> Attachments: HDFS-14754-addendum.001.patch, HDFS-14754.001.patch, 
> HDFS-14754.002.patch, HDFS-14754.003.patch, HDFS-14754.004.patch, 
> HDFS-14754.005.patch, HDFS-14754.006.patch, HDFS-14754.007.patch, 
> HDFS-14754.008.patch, HDFS-14754.branch-3.1.patch
>
>
> Using EC RS-3-2 with 6 DNs, 
> we came across a scenario where, among the 5 EC blocks, the same block was 
> replicated three times and two blocks went missing.
> The replicated block was not being deleted, and the missing block could not be 
> reconstructed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14313) Get hdfs used space from FsDatasetImpl#volumeMap#ReplicaInfo in memory instead of df/du

2019-10-01 Thread Wei-Chiu Chuang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16942401#comment-16942401
 ] 

Wei-Chiu Chuang commented on HDFS-14313:


Conflicts are trivial so I pushed them into branch-3.2 and branch-3.1. Patch 
attached to the jira for posterity.

> Get hdfs used space from FsDatasetImpl#volumeMap#ReplicaInfo in memory  
> instead of df/du
> 
>
> Key: HDFS-14313
> URL: https://issues.apache.org/jira/browse/HDFS-14313
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode, performance
>Affects Versions: 2.6.0, 2.7.0, 2.8.0, 2.9.0, 3.0.0, 3.1.0
>Reporter: Lisheng Sun
>Assignee: Lisheng Sun
>Priority: Major
> Fix For: 2.10.0, 3.0.4, 3.3.0, 3.1.4, 3.2.2
>
> Attachments: HDFS-14313-branch-2.v1.patch, 
> HDFS-14313-branch-2.v2.patch, HDFS-14313.000.patch, HDFS-14313.001.patch, 
> HDFS-14313.002.patch, HDFS-14313.003.patch, HDFS-14313.004.patch, 
> HDFS-14313.005.patch, HDFS-14313.006.patch, HDFS-14313.007.patch, 
> HDFS-14313.008.patch, HDFS-14313.009.patch, HDFS-14313.010.patch, 
> HDFS-14313.011.patch, HDFS-14313.012.patch, HDFS-14313.013.patch, 
> HDFS-14313.014.patch, HDFS-14313.branch-3.0.v1.patch, 
> HDFS-14313.branch-3.0.v2.patch, HDFS-14313.branch-3.1.patch, 
> HDFS-14313.branch-3.2.patch, HDFS-14313.branch-3.v1.patch
>
>
> There are two ways of getting used space, DU and DF, and both are insufficient.
>  #  Running DU across lots of disks is very expensive, and running all of the 
> processes at the same time creates a noticeable IO spike.
>  #  Running DF is inaccurate when the disk is shared by multiple datanodes or 
> other servers.
>  Getting the hdfs used space from FsDatasetImpl#volumeMap#ReplicaInfos in memory 
> is very cheap and accurate.
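A minimal illustrative sketch of the idea described above: sum the lengths of the replicas already tracked in memory instead of shelling out to du/df. The types and method names below are simplified stand-ins, not the actual FsDatasetImpl API.

{code:java}
import java.util.Collection;
import java.util.Map;

// Hypothetical, simplified stand-in for the per-replica record kept in memory.
interface ReplicaRecord {
  long getNumBytes();        // length of the block data file
  long getMetaFileLength();  // length of the checksum (.meta) file
}

final class UsedSpaceSketch {
  // Used space is the sum, over all volumes, of the in-memory replica lengths;
  // no du/df process and no disk scan is needed.
  static long usedBytes(Map<String, Collection<ReplicaRecord>> replicasByVolume) {
    long used = 0L;
    for (Collection<ReplicaRecord> replicas : replicasByVolume.values()) {
      for (ReplicaRecord r : replicas) {
        used += r.getNumBytes() + r.getMetaFileLength();
      }
    }
    return used;
  }
}
{code}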



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14313) Get hdfs used space from FsDatasetImpl#volumeMap#ReplicaInfo in memory instead of df/du

2019-10-01 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang updated HDFS-14313:
---
Fix Version/s: 3.2.2
   3.1.4

> Get hdfs used space from FsDatasetImpl#volumeMap#ReplicaInfo in memory  
> instead of df/du
> 
>
> Key: HDFS-14313
> URL: https://issues.apache.org/jira/browse/HDFS-14313
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode, performance
>Affects Versions: 2.6.0, 2.7.0, 2.8.0, 2.9.0, 3.0.0, 3.1.0
>Reporter: Lisheng Sun
>Assignee: Lisheng Sun
>Priority: Major
> Fix For: 2.10.0, 3.0.4, 3.3.0, 3.1.4, 3.2.2
>
> Attachments: HDFS-14313-branch-2.v1.patch, 
> HDFS-14313-branch-2.v2.patch, HDFS-14313.000.patch, HDFS-14313.001.patch, 
> HDFS-14313.002.patch, HDFS-14313.003.patch, HDFS-14313.004.patch, 
> HDFS-14313.005.patch, HDFS-14313.006.patch, HDFS-14313.007.patch, 
> HDFS-14313.008.patch, HDFS-14313.009.patch, HDFS-14313.010.patch, 
> HDFS-14313.011.patch, HDFS-14313.012.patch, HDFS-14313.013.patch, 
> HDFS-14313.014.patch, HDFS-14313.branch-3.0.v1.patch, 
> HDFS-14313.branch-3.0.v2.patch, HDFS-14313.branch-3.1.patch, 
> HDFS-14313.branch-3.2.patch, HDFS-14313.branch-3.v1.patch
>
>
> There are two ways of getting used space, DU and DF, and both are insufficient.
>  #  Running DU across lots of disks is very expensive, and running all of the 
> processes at the same time creates a noticeable IO spike.
>  #  Running DF is inaccurate when the disk is shared by multiple datanodes or 
> other servers.
>  Getting the hdfs used space from FsDatasetImpl#volumeMap#ReplicaInfos in memory 
> is very cheap and accurate.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14313) Get hdfs used space from FsDatasetImpl#volumeMap#ReplicaInfo in memory instead of df/du

2019-10-01 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang updated HDFS-14313:
---
Attachment: HDFS-14313.branch-3.2.patch
HDFS-14313.branch-3.1.patch

> Get hdfs used space from FsDatasetImpl#volumeMap#ReplicaInfo in memory  
> instead of df/du
> 
>
> Key: HDFS-14313
> URL: https://issues.apache.org/jira/browse/HDFS-14313
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode, performance
>Affects Versions: 2.6.0, 2.7.0, 2.8.0, 2.9.0, 3.0.0, 3.1.0
>Reporter: Lisheng Sun
>Assignee: Lisheng Sun
>Priority: Major
> Fix For: 2.10.0, 3.0.4, 3.3.0
>
> Attachments: HDFS-14313-branch-2.v1.patch, 
> HDFS-14313-branch-2.v2.patch, HDFS-14313.000.patch, HDFS-14313.001.patch, 
> HDFS-14313.002.patch, HDFS-14313.003.patch, HDFS-14313.004.patch, 
> HDFS-14313.005.patch, HDFS-14313.006.patch, HDFS-14313.007.patch, 
> HDFS-14313.008.patch, HDFS-14313.009.patch, HDFS-14313.010.patch, 
> HDFS-14313.011.patch, HDFS-14313.012.patch, HDFS-14313.013.patch, 
> HDFS-14313.014.patch, HDFS-14313.branch-3.0.v1.patch, 
> HDFS-14313.branch-3.0.v2.patch, HDFS-14313.branch-3.1.patch, 
> HDFS-14313.branch-3.2.patch, HDFS-14313.branch-3.v1.patch
>
>
> There are two ways of getting used space, DU and DF, and both are insufficient.
>  #  Running DU across lots of disks is very expensive, and running all of the 
> processes at the same time creates a noticeable IO spike.
>  #  Running DF is inaccurate when the disk is shared by multiple datanodes or 
> other servers.
>  Getting the hdfs used space from FsDatasetImpl#volumeMap#ReplicaInfos in memory 
> is very cheap and accurate.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14313) Get hdfs used space from FsDatasetImpl#volumeMap#ReplicaInfo in memory instead of df/du

2019-10-01 Thread Wei-Chiu Chuang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16942389#comment-16942389
 ] 

Wei-Chiu Chuang commented on HDFS-14313:


This is missing from branch-3.2 and branch-3.1 unfortunately ...

> Get hdfs used space from FsDatasetImpl#volumeMap#ReplicaInfo in memory  
> instead of df/du
> 
>
> Key: HDFS-14313
> URL: https://issues.apache.org/jira/browse/HDFS-14313
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode, performance
>Affects Versions: 2.6.0, 2.7.0, 2.8.0, 2.9.0, 3.0.0, 3.1.0
>Reporter: Lisheng Sun
>Assignee: Lisheng Sun
>Priority: Major
> Fix For: 2.10.0, 3.0.4, 3.3.0
>
> Attachments: HDFS-14313-branch-2.v1.patch, 
> HDFS-14313-branch-2.v2.patch, HDFS-14313.000.patch, HDFS-14313.001.patch, 
> HDFS-14313.002.patch, HDFS-14313.003.patch, HDFS-14313.004.patch, 
> HDFS-14313.005.patch, HDFS-14313.006.patch, HDFS-14313.007.patch, 
> HDFS-14313.008.patch, HDFS-14313.009.patch, HDFS-14313.010.patch, 
> HDFS-14313.011.patch, HDFS-14313.012.patch, HDFS-14313.013.patch, 
> HDFS-14313.014.patch, HDFS-14313.branch-3.0.v1.patch, 
> HDFS-14313.branch-3.0.v2.patch, HDFS-14313.branch-3.v1.patch
>
>
> There are two ways of getting used space, DU and DF, and both are insufficient.
>  #  Running DU across lots of disks is very expensive, and running all of the 
> processes at the same time creates a noticeable IO spike.
>  #  Running DF is inaccurate when the disk is shared by multiple datanodes or 
> other servers.
>  Getting the hdfs used space from FsDatasetImpl#volumeMap#ReplicaInfos in memory 
> is very cheap and accurate.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14192) Track missing DFS operations in Statistics and StorageStatistics

2019-10-01 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang updated HDFS-14192:
---
Resolution: Fixed
Status: Resolved  (was: Patch Available)

Resolving. The failed tests don't appear related.
Cherry-picked the commit from trunk to branch-3.2, and pushed the patch to 
branch-3.1.

> Track missing DFS operations in Statistics and StorageStatistics
> 
>
> Key: HDFS-14192
> URL: https://issues.apache.org/jira/browse/HDFS-14192
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Ayush Saxena
>Assignee: Ayush Saxena
>Priority: Major
> Fix For: 3.3.0, 3.1.4, 3.2.2
>
> Attachments: HDFS-14192-01.patch, HDFS-14192-02.patch, 
> HDFS-14192.branch-3.1.patch
>
>
> Track Missing DFS Operations to oblige the Read/Write Statistics and the 
> StorageStatistics.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14192) Track missing DFS operations in Statistics and StorageStatistics

2019-10-01 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang updated HDFS-14192:
---
Fix Version/s: 3.2.2
   3.1.4

> Track missing DFS operations in Statistics and StorageStatistics
> 
>
> Key: HDFS-14192
> URL: https://issues.apache.org/jira/browse/HDFS-14192
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Ayush Saxena
>Assignee: Ayush Saxena
>Priority: Major
> Fix For: 3.3.0, 3.1.4, 3.2.2
>
> Attachments: HDFS-14192-01.patch, HDFS-14192-02.patch, 
> HDFS-14192.branch-3.1.patch
>
>
> Track Missing DFS Operations to oblige the Read/Write Statistics and the 
> StorageStatistics.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14349) Edit log may be rolled more frequently than necessary with multiple Standby nodes

2019-10-01 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang updated HDFS-14349:
---
Labels: multi-sbnn  (was: )

> Edit log may be rolled more frequently than necessary with multiple Standby 
> nodes
> -
>
> Key: HDFS-14349
> URL: https://issues.apache.org/jira/browse/HDFS-14349
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ha, hdfs, qjm
>Reporter: Erik Krogen
>Assignee: Ekanth Sethuramalingam
>Priority: Major
>  Labels: multi-sbnn
>
> When HDFS-14317 was fixed, we tackled the problem that in a cluster with 
> in-progress edit log tailing enabled, a Standby NameNode may _never_ roll the 
> edit logs, which can eventually cause data loss.
> Unfortunately, in the process, it was made so that if there are multiple 
> Standby NameNodes, they will all roll the edit logs at their specified 
> frequency, so the edit log will be rolled X times more frequently than they 
> should be (where X is the number of Standby NNs). This is not as bad as the 
> original bug since rolling frequently does not affect correctness or data 
> availability, but may degrade performance by creating more edit log segments 
> than necessary.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14192) Track missing DFS operations in Statistics and StorageStatistics

2019-10-01 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang updated HDFS-14192:
---
Attachment: HDFS-14192.branch-3.1.patch

> Track missing DFS operations in Statistics and StorageStatistics
> 
>
> Key: HDFS-14192
> URL: https://issues.apache.org/jira/browse/HDFS-14192
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Ayush Saxena
>Assignee: Ayush Saxena
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: HDFS-14192-01.patch, HDFS-14192-02.patch, 
> HDFS-14192.branch-3.1.patch
>
>
> Track Missing DFS Operations to oblige the Read/Write Statistics and the 
> StorageStatistics.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14192) Track missing DFS operations in Statistics and StorageStatistics

2019-10-01 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang updated HDFS-14192:
---
Status: Patch Available  (was: Reopened)

> Track missing DFS operations in Statistics and StorageStatistics
> 
>
> Key: HDFS-14192
> URL: https://issues.apache.org/jira/browse/HDFS-14192
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Ayush Saxena
>Assignee: Ayush Saxena
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: HDFS-14192-01.patch, HDFS-14192-02.patch, 
> HDFS-14192.branch-3.1.patch
>
>
> Track Missing DFS Operations to oblige the Read/Write Statistics and the 
> StorageStatistics.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Reopened] (HDFS-14192) Track missing DFS operations in Statistics and StorageStatistics

2019-10-01 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang reopened HDFS-14192:


Reopen for the branch-3.1 backport.

> Track missing DFS operations in Statistics and StorageStatistics
> 
>
> Key: HDFS-14192
> URL: https://issues.apache.org/jira/browse/HDFS-14192
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Ayush Saxena
>Assignee: Ayush Saxena
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: HDFS-14192-01.patch, HDFS-14192-02.patch, 
> HDFS-14192.branch-3.1.patch
>
>
> Track Missing DFS Operations to oblige the Read/Write Statistics and the 
> StorageStatistics.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14192) Track missing DFS operations in Statistics and StorageStatistics

2019-10-01 Thread Wei-Chiu Chuang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16942226#comment-16942226
 ] 

Wei-Chiu Chuang commented on HDFS-14192:


Commit applies cleanly in branch-3.2, but because storage policy satisfier 
isn't in 3.1, the patch has a conflict.

> Track missing DFS operations in Statistics and StorageStatistics
> 
>
> Key: HDFS-14192
> URL: https://issues.apache.org/jira/browse/HDFS-14192
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Ayush Saxena
>Assignee: Ayush Saxena
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: HDFS-14192-01.patch, HDFS-14192-02.patch
>
>
> Track Missing DFS Operations to oblige the Read/Write Statistics and the 
> StorageStatistics.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14460) DFSUtil#getNamenodeWebAddr should return HTTPS address based on policy configured

2019-10-01 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang updated HDFS-14460:
---
Fix Version/s: 3.2.2
   3.1.4

> DFSUtil#getNamenodeWebAddr should return HTTPS address based on policy 
> configured
> -
>
> Key: HDFS-14460
> URL: https://issues.apache.org/jira/browse/HDFS-14460
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: CR Hota
>Assignee: CR Hota
>Priority: Major
> Fix For: 3.3.0, 3.1.4, 3.2.2
>
> Attachments: HDFS-14460.001.patch, HDFS-14460.002.patch, 
> HDFS-14460.003.patch, HDFS-14460.004.patch
>
>
> DFSUtil#getNamenodeWebAddr does a look-up of HTTP address irrespective of 
> policy configured. It should instead look at the policy configured and return 
> appropriate web address.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14175) EC: Native XOR decoder should reset the output buffer before using it.

2019-10-01 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang updated HDFS-14175:
---
Fix Version/s: 3.2.2
   3.1.4

> EC: Native XOR decoder should reset the output buffer before using it.
> --
>
> Key: HDFS-14175
> URL: https://issues.apache.org/jira/browse/HDFS-14175
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ec, hdfs
>Affects Versions: 3.1.1
>Reporter: Surendra Singh Lilhore
>Assignee: Ayush Saxena
>Priority: Major
> Fix For: 3.3.0, 3.1.4, 3.2.2
>
> Attachments: HDFS-14175-01.patch, HDFS-14175-02.patch, 
> HDFS-14175-03.patch
>
>
> *jni_xor_decoder#xxx_decodeImpl()* should reset outputs[0] before 
> using it. Sometimes file decoding gives wrong data because of this. Please refer 
> to XORRawDecoder#doDecode() for the Java implementation of the XOR decoder.
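For readers unfamiliar with the XOR decoder, here is a minimal Java sketch of the point being made, mirroring what XORRawDecoder#doDecode does conceptually rather than the actual native/JNI code: the output buffer must be zeroed before the inputs are XOR-accumulated, otherwise stale bytes from a previous call corrupt the decoded data.

{code:java}
final class XorDecodeSketch {
  // Simplified byte[] version; buffer handling in the real decoders is more involved.
  static void xorDecode(byte[][] inputs, byte[] output) {
    java.util.Arrays.fill(output, (byte) 0);  // the reset this issue is about
    for (byte[] input : inputs) {
      for (int i = 0; i < output.length; i++) {
        output[i] ^= input[i];                // XOR of all surviving units
      }
    }
  }
}
{code}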



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14202) "dfs.disk.balancer.max.disk.throughputInMBperSec" property is not working as per set value.

2019-10-01 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang updated HDFS-14202:
---
Fix Version/s: 3.2.2
   3.1.4

> "dfs.disk.balancer.max.disk.throughputInMBperSec" property is not working as 
> per set value.
> ---
>
> Key: HDFS-14202
> URL: https://issues.apache.org/jira/browse/HDFS-14202
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: diskbalancer
>Affects Versions: 3.0.1
>Reporter: Ranith Sardar
>Assignee: Ranith Sardar
>Priority: Major
> Fix For: 3.3.0, 3.1.4, 3.2.2
>
> Attachments: HDFS-14202.001.patch, HDFS-14202.002.patch, 
> HDFS-14202.003.patch, HDFS-14202.004.patch, HDFS-14202.005.patch
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14418) Remove redundant super user priveledge checks from namenode.

2019-10-01 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang updated HDFS-14418:
---
Fix Version/s: 3.2.2
   3.1.4

> Remove redundant super user priveledge checks from namenode.
> 
>
> Key: HDFS-14418
> URL: https://issues.apache.org/jira/browse/HDFS-14418
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Ayush Saxena
>Assignee: Ayush Saxena
>Priority: Major
> Fix For: 3.3.0, 3.1.4, 3.2.2
>
> Attachments: HDFS-14418-01.patch, HDFS-14418-02.patch, 
> HDFS-14418.branch-3.1.001.patch
>
>
> There are a couple of methods that unnecessarily double-check the super user 
> privilege at the namenode, which can be reduced to a single check.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-14418) Remove redundant super user priveledge checks from namenode.

2019-10-01 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang resolved HDFS-14418.

Resolution: Fixed

> Remove redundant super user priveledge checks from namenode.
> 
>
> Key: HDFS-14418
> URL: https://issues.apache.org/jira/browse/HDFS-14418
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Ayush Saxena
>Assignee: Ayush Saxena
>Priority: Major
> Fix For: 3.3.0, 3.1.4, 3.2.2
>
> Attachments: HDFS-14418-01.patch, HDFS-14418-02.patch, 
> HDFS-14418.branch-3.1.001.patch
>
>
> There are a couple of methods that unnecessarily double-check the super user 
> privilege at the namenode, which can be reduced to a single check.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14418) Remove redundant super user priveledge checks from namenode.

2019-10-01 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang updated HDFS-14418:
---
Attachment: HDFS-14418.branch-3.1.001.patch

> Remove redundant super user priveledge checks from namenode.
> 
>
> Key: HDFS-14418
> URL: https://issues.apache.org/jira/browse/HDFS-14418
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Ayush Saxena
>Assignee: Ayush Saxena
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: HDFS-14418-01.patch, HDFS-14418-02.patch, 
> HDFS-14418.branch-3.1.001.patch
>
>
> There are a couple of methods that unnecessarily double-check the super user 
> privilege at the namenode, which can be reduced to a single check.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14418) Remove redundant super user priveledge checks from namenode.

2019-10-01 Thread Wei-Chiu Chuang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16942118#comment-16942118
 ] 

Wei-Chiu Chuang commented on HDFS-14418:


Patch applies cleanly in branch-3.2. There's a trivial conflict in branch-3.1.

> Remove redundant super user priveledge checks from namenode.
> 
>
> Key: HDFS-14418
> URL: https://issues.apache.org/jira/browse/HDFS-14418
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Ayush Saxena
>Assignee: Ayush Saxena
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: HDFS-14418-01.patch, HDFS-14418-02.patch, 
> HDFS-14418.branch-3.1.001.patch
>
>
> There are a couple of methods that unnecessarily double-check the super user 
> privilege at the namenode, which can be reduced to a single check.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Reopened] (HDFS-14418) Remove redundant super user priveledge checks from namenode.

2019-10-01 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang reopened HDFS-14418:


Reopen to add this in branch-3.1

> Remove redundant super user priveledge checks from namenode.
> 
>
> Key: HDFS-14418
> URL: https://issues.apache.org/jira/browse/HDFS-14418
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Ayush Saxena
>Assignee: Ayush Saxena
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: HDFS-14418-01.patch, HDFS-14418-02.patch, 
> HDFS-14418.branch-3.1.001.patch
>
>
> There are a couple of methods that unnecessarily double-check the super user 
> privilege at the namenode, which can be reduced to a single check.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14235) Handle ArrayIndexOutOfBoundsException in DataNodeDiskMetrics#slowDiskDetectionDaemon

2019-09-30 Thread Wei-Chiu Chuang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16941411#comment-16941411
 ] 

Wei-Chiu Chuang commented on HDFS-14235:


Commit applies cleanly in branch-3.1. Updated fix version

> Handle ArrayIndexOutOfBoundsException in 
> DataNodeDiskMetrics#slowDiskDetectionDaemon 
> -
>
> Key: HDFS-14235
> URL: https://issues.apache.org/jira/browse/HDFS-14235
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Surendra Singh Lilhore
>Assignee: Ranith Sardar
>Priority: Major
> Fix For: 3.3.0, 3.2.1, 3.1.4
>
> Attachments: HDFS-14235.000.patch, HDFS-14235.001.patch, 
> HDFS-14235.002.patch, HDFS-14235.003.patch, NPE.png, exception.png
>
>
> The code below throws an exception because {{volumeIterator.next()}} is called 
> twice without checking hasNext().
> {code:java}
> while (volumeIterator.hasNext()) {
>   FsVolumeSpi volume = volumeIterator.next();
>   DataNodeVolumeMetrics metrics = volumeIterator.next().getMetrics();
>   String volumeName = volume.getBaseURI().getPath();
>   metadataOpStats.put(volumeName,
>   metrics.getMetadataOperationMean());
>   readIoStats.put(volumeName, metrics.getReadIoMean());
>   writeIoStats.put(volumeName, metrics.getWriteIoMean());
> }{code}
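For reference, a corrected sketch of the loop above (the general shape of the fix, not necessarily the committed patch): each iteration consumes exactly one element from the iterator, so the metrics are read from the same volume whose name is recorded.

{code:java}
while (volumeIterator.hasNext()) {
  FsVolumeSpi volume = volumeIterator.next();
  DataNodeVolumeMetrics metrics = volume.getMetrics();  // reuse the same element
  String volumeName = volume.getBaseURI().getPath();
  metadataOpStats.put(volumeName, metrics.getMetadataOperationMean());
  readIoStats.put(volumeName, metrics.getReadIoMean());
  writeIoStats.put(volumeName, metrics.getWriteIoMean());
}
{code}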



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14235) Handle ArrayIndexOutOfBoundsException in DataNodeDiskMetrics#slowDiskDetectionDaemon

2019-09-30 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang updated HDFS-14235:
---
Fix Version/s: 3.1.4

> Handle ArrayIndexOutOfBoundsException in 
> DataNodeDiskMetrics#slowDiskDetectionDaemon 
> -
>
> Key: HDFS-14235
> URL: https://issues.apache.org/jira/browse/HDFS-14235
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Surendra Singh Lilhore
>Assignee: Ranith Sardar
>Priority: Major
> Fix For: 3.3.0, 3.2.1, 3.1.4
>
> Attachments: HDFS-14235.000.patch, HDFS-14235.001.patch, 
> HDFS-14235.002.patch, HDFS-14235.003.patch, NPE.png, exception.png
>
>
> The code below throws an exception because {{volumeIterator.next()}} is called 
> twice without checking hasNext().
> {code:java}
> while (volumeIterator.hasNext()) {
>   FsVolumeSpi volume = volumeIterator.next();
>   DataNodeVolumeMetrics metrics = volumeIterator.next().getMetrics();
>   String volumeName = volume.getBaseURI().getPath();
>   metadataOpStats.put(volumeName,
>   metrics.getMetadataOperationMean());
>   readIoStats.put(volumeName, metrics.getReadIoMean());
>   writeIoStats.put(volumeName, metrics.getWriteIoMean());
> }{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-10648) Expose Balancer metrics through Metrics2

2019-09-30 Thread Wei-Chiu Chuang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-10648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16941389#comment-16941389
 ] 

Wei-Chiu Chuang commented on HDFS-10648:


Thanks for doing this. Without metrics, HDFS-13783 isn't very useful.

> Expose Balancer metrics through Metrics2
> 
>
> Key: HDFS-10648
> URL: https://issues.apache.org/jira/browse/HDFS-10648
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: balancer & mover, metrics
>Reporter: Mark Wagner
>Assignee: Chen Zhang
>Priority: Major
>  Labels: metrics
>
> The Balancer currently prints progress information to the console. For 
> deployments that run the balancer frequently, it would be helpful to collect 
> those metrics for publishing to the available sinks. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14808) EC: Improper size values for corrupt ec block in LOG

2019-09-30 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang updated HDFS-14808:
---
Component/s: ec

> EC: Improper size values for corrupt ec block in LOG 
> -
>
> Key: HDFS-14808
> URL: https://issues.apache.org/jira/browse/HDFS-14808
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ec
>Reporter: Harshakiran Reddy
>Assignee: Ayush Saxena
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: HDFS-14808-01.patch
>
>
> If the block corruption reason is a size mismatch, the values shown and 
> compared in the log are ambiguous.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Reopened] (HDFS-7134) Replication count for a block should not update till the blocks have settled on Datanodes

2019-09-30 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-7134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang reopened HDFS-7134:
---

> Replication count for a block should not update till the blocks have settled 
> on Datanodes
> -
>
> Key: HDFS-7134
> URL: https://issues.apache.org/jira/browse/HDFS-7134
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, hdfs
>Affects Versions: 1.2.1, 2.6.0, 2.7.3
> Environment: Linux nn1.cluster1.com 2.6.32-431.20.3.el6.x86_64 #1 SMP 
> Thu Jun 19 21:14:45 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
> [hadoop@nn1 conf]$ cat /etc/redhat-release
> CentOS release 6.5 (Final)
>Reporter: gurmukh singh
>Priority: Critical
>  Labels: HDFS
> Fix For: 3.1.0
>
>
> The replica count for a block should not change until the 
> blocks have settled on the datanodes.
> Test Case:
> Hadoop Cluster with 1 namenode and 3 datanodes.
> nn1.cluster1.com(192.168.1.70)
> dn1.cluster1.com(192.168.1.72)
> dn2.cluster1.com(192.168.1.73)
> dn3.cluster1.com(192.168.1.74)
> Cluster up and running fine with replication set to "1" for the parameter 
> "dfs.replication" on all nodes:
> <property>
>   <name>dfs.replication</name>
>   <value>1</value>
> </property>
> To reduce the wait time, have reduced the dfs.heartbeat and recheck 
> parameters.
> on datanode2 (192.168.1.72)
> [hadoop@dn2 ~]$ hadoop fs -Ddfs.replication=2 -put from_dn2 /
> [hadoop@dn2 ~]$ hadoop fs -ls /from_dn2
> Found 1 items
> -rw-r--r--   2 hadoop supergroup 17 2014-09-23 13:33 /from_dn2
> On Namenode
> ===
> As expected, copy was done from datanode2, one copy will go locally.
> [hadoop@nn1 conf]$ hadoop fsck /from_dn2 -files -blocks -locations
> FSCK started by hadoop from /192.168.1.70 for path /from_dn2 at Tue Sep 23 
> 13:53:16 IST 2014
> /from_dn2 17 bytes, 1 block(s):  OK
> 0. blk_8132629811771280764_1175 len=17 repl=2 [192.168.1.74:50010, 
> 192.168.1.73:50010]
> Can see the blocks on the data nodes disks as well under the "current" 
> directory.
> Now, shutdown datanode2(192.168.1.73) and as expected block moves to another 
> datanode to maintain a replication of 2
> [hadoop@nn1 conf]$ hadoop fsck /from_dn2 -files -blocks -locations
> FSCK started by hadoop from /192.168.1.70 for path /from_dn2 at Tue Sep 23 
> 13:54:21 IST 2014
> /from_dn2 17 bytes, 1 block(s):  OK
> 0. blk_8132629811771280764_1175 len=17 repl=2 [192.168.1.74:50010, 
> 192.168.1.72:50010]
> But now if I bring back datanode2, although the namenode sees that 
> this block is in 3 places and fires an invalidate command for 
> datanode1(192.168.1.72), the replication count on the namenode is bumped to 3 
> immediately.
> [hadoop@nn1 conf]$ hadoop fsck /from_dn2 -files -blocks -locations
> FSCK started by hadoop from /192.168.1.70 for path /from_dn2 at Tue Sep 23 
> 13:56:12 IST 2014
> /from_dn2 17 bytes, 1 block(s):  OK
> 0. blk_8132629811771280764_1175 len=17 repl=3 [192.168.1.74:50010, 
> 192.168.1.72:50010, 192.168.1.73:50010]
> on Datanode1 - The invalidate command has been fired immediately and the 
> block deleted.
> =
> 2014-09-23 13:54:17,483 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Receiving blk_8132629811771280764_1175 src: /192.168.1.74:38099 dest: 
> /192.168.1.72:50010
> 2014-09-23 13:54:17,502 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Received blk_8132629811771280764_1175 src: /192.168.1.74:38099 dest: 
> /192.168.1.72:50010 size 17
> 2014-09-23 13:55:28,720 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Scheduling blk_8132629811771280764_1175 file 
> /space/disk1/current/blk_8132629811771280764 for deletion
> 2014-09-23 13:55:28,721 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Deleted blk_8132629811771280764_1175 at file 
> /space/disk1/current/blk_8132629811771280764
> The namenode still shows 3 replicas even though one has been deleted, even 
> after more than 30 mins.
> [hadoop@nn1 conf]$ hadoop fsck /from_dn2 -files -blocks -locations
> FSCK started by hadoop from /192.168.1.70 for path /from_dn2 at Tue Sep 23 
> 14:21:27 IST 2014
> /from_dn2 17 bytes, 1 block(s):  OK
> 0. blk_8132629811771280764_1175 len=17 repl=3 [192.168.1.74:50010, 
> 192.168.1.72:50010, 192.168.1.73:50010]
> This could be dangerous if someone removes a replica or the other 2 datanodes fail.
> On Datanode 1
> =
> Before, the datanode1 is brought back
> [hadoop@dn1 conf]$ ls -l /space/disk*/current
> /space/disk1/current:
> total 28
> -rw-rw-r-- 1 hadoop hadoop   13 Sep 21 09:09 blk_2278001646987517832
> -rw-rw-r-- 1 hadoop hadoop   11 Sep 21 09:09 blk_2278001646987517832_1171.meta
> -rw-rw-r-- 1 hadoop hadoop   17 Sep 23 13:54 blk_8132629811771280764
> -rw-rw-r-- 1 hadoop hadoop   11 Sep 23 13:54 blk_8132629811771280764_1175.meta
> 

[jira] [Resolved] (HDFS-7134) Replication count for a block should not update till the blocks have settled on Datanodes

2019-09-30 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-7134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang resolved HDFS-7134.
---
Resolution: Cannot Reproduce

Resolve as cannot reproduce.

> Replication count for a block should not update till the blocks have settled 
> on Datanodes
> -
>
> Key: HDFS-7134
> URL: https://issues.apache.org/jira/browse/HDFS-7134
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, hdfs
>Affects Versions: 1.2.1, 2.6.0, 2.7.3
> Environment: Linux nn1.cluster1.com 2.6.32-431.20.3.el6.x86_64 #1 SMP 
> Thu Jun 19 21:14:45 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
> [hadoop@nn1 conf]$ cat /etc/redhat-release
> CentOS release 6.5 (Final)
>Reporter: gurmukh singh
>Priority: Critical
>  Labels: HDFS
> Fix For: 3.1.0
>
>
> The replica count for a block should not change until the 
> blocks have settled on the datanodes.
> Test Case:
> Hadoop Cluster with 1 namenode and 3 datanodes.
> nn1.cluster1.com(192.168.1.70)
> dn1.cluster1.com(192.168.1.72)
> dn2.cluster1.com(192.168.1.73)
> dn3.cluster1.com(192.168.1.74)
> Cluster up and running fine with replication set to "1" for the parameter 
> "dfs.replication" on all nodes:
> <property>
>   <name>dfs.replication</name>
>   <value>1</value>
> </property>
> To reduce the wait time, have reduced the dfs.heartbeat and recheck 
> parameters.
> on datanode2 (192.168.1.72)
> [hadoop@dn2 ~]$ hadoop fs -Ddfs.replication=2 -put from_dn2 /
> [hadoop@dn2 ~]$ hadoop fs -ls /from_dn2
> Found 1 items
> -rw-r--r--   2 hadoop supergroup 17 2014-09-23 13:33 /from_dn2
> On Namenode
> ===
> As expected, copy was done from datanode2, one copy will go locally.
> [hadoop@nn1 conf]$ hadoop fsck /from_dn2 -files -blocks -locations
> FSCK started by hadoop from /192.168.1.70 for path /from_dn2 at Tue Sep 23 
> 13:53:16 IST 2014
> /from_dn2 17 bytes, 1 block(s):  OK
> 0. blk_8132629811771280764_1175 len=17 repl=2 [192.168.1.74:50010, 
> 192.168.1.73:50010]
> Can see the blocks on the data nodes disks as well under the "current" 
> directory.
> Now, shutdown datanode2(192.168.1.73) and as expected block moves to another 
> datanode to maintain a replication of 2
> [hadoop@nn1 conf]$ hadoop fsck /from_dn2 -files -blocks -locations
> FSCK started by hadoop from /192.168.1.70 for path /from_dn2 at Tue Sep 23 
> 13:54:21 IST 2014
> /from_dn2 17 bytes, 1 block(s):  OK
> 0. blk_8132629811771280764_1175 len=17 repl=2 [192.168.1.74:50010, 
> 192.168.1.72:50010]
> But now if I bring back datanode2, although the namenode sees that 
> this block is in 3 places and fires an invalidate command for 
> datanode1(192.168.1.72), the replication count on the namenode is bumped to 3 
> immediately.
> [hadoop@nn1 conf]$ hadoop fsck /from_dn2 -files -blocks -locations
> FSCK started by hadoop from /192.168.1.70 for path /from_dn2 at Tue Sep 23 
> 13:56:12 IST 2014
> /from_dn2 17 bytes, 1 block(s):  OK
> 0. blk_8132629811771280764_1175 len=17 repl=3 [192.168.1.74:50010, 
> 192.168.1.72:50010, 192.168.1.73:50010]
> on Datanode1 - The invalidate command has been fired immediately and the 
> block deleted.
> =
> 2014-09-23 13:54:17,483 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Receiving blk_8132629811771280764_1175 src: /192.168.1.74:38099 dest: 
> /192.168.1.72:50010
> 2014-09-23 13:54:17,502 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Received blk_8132629811771280764_1175 src: /192.168.1.74:38099 dest: 
> /192.168.1.72:50010 size 17
> 2014-09-23 13:55:28,720 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Scheduling blk_8132629811771280764_1175 file 
> /space/disk1/current/blk_8132629811771280764 for deletion
> 2014-09-23 13:55:28,721 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Deleted blk_8132629811771280764_1175 at file 
> /space/disk1/current/blk_8132629811771280764
> The namenode still shows 3 replicas even though one has been deleted, even 
> after more than 30 mins.
> [hadoop@nn1 conf]$ hadoop fsck /from_dn2 -files -blocks -locations
> FSCK started by hadoop from /192.168.1.70 for path /from_dn2 at Tue Sep 23 
> 14:21:27 IST 2014
> /from_dn2 17 bytes, 1 block(s):  OK
> 0. blk_8132629811771280764_1175 len=17 repl=3 [192.168.1.74:50010, 
> 192.168.1.72:50010, 192.168.1.73:50010]
> This could be dangerous if someone removes a replica or the other 2 datanodes fail.
> On Datanode 1
> =
> Before, the datanode1 is brought back
> [hadoop@dn1 conf]$ ls -l /space/disk*/current
> /space/disk1/current:
> total 28
> -rw-rw-r-- 1 hadoop hadoop   13 Sep 21 09:09 blk_2278001646987517832
> -rw-rw-r-- 1 hadoop hadoop   11 Sep 21 09:09 blk_2278001646987517832_1171.meta
> -rw-rw-r-- 1 hadoop hadoop   17 Sep 23 13:54 blk_8132629811771280764
> -rw-rw-r-- 1 hadoop hadoop  

[jira] [Commented] (HDFS-14754) Erasure Coding : The number of Under-Replicated Blocks never reduced

2019-09-30 Thread Wei-Chiu Chuang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16941376#comment-16941376
 ] 

Wei-Chiu Chuang commented on HDFS-14754:


Too bad this one didn't land in 3.2.1 and 3.1.3. Let's make sure they get added 
to lower releases.

> Erasure Coding :  The number of Under-Replicated Blocks never reduced
> -
>
> Key: HDFS-14754
> URL: https://issues.apache.org/jira/browse/HDFS-14754
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ec
>Reporter: hemanthboyina
>Assignee: hemanthboyina
>Priority: Critical
> Fix For: 3.3.0
>
> Attachments: HDFS-14754-addendum.001.patch, HDFS-14754.001.patch, 
> HDFS-14754.002.patch, HDFS-14754.003.patch, HDFS-14754.004.patch, 
> HDFS-14754.005.patch, HDFS-14754.006.patch, HDFS-14754.007.patch, 
> HDFS-14754.008.patch
>
>
> Using EC RS-3-2 with 6 DNs, 
> we came across a scenario where, among the 5 EC blocks, the same block was 
> replicated three times and two blocks went missing.
> The replicated block was not being deleted, and the missing block could not be 
> reconstructed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-14754) Erasure Coding : The number of Under-Replicated Blocks never reduced

2019-09-30 Thread Wei-Chiu Chuang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16941376#comment-16941376
 ] 

Wei-Chiu Chuang edited comment on HDFS-14754 at 9/30/19 10:20 PM:
--

Too bad this one didn't land in 3.2.1 and 3.1.3. Let's make sure it gets added 
to lower releases.


was (Author: jojochuang):
Too bad this one didn't land in 3.2.1 and 3.1.3. Let's make sure they get added 
to lower releases.

> Erasure Coding :  The number of Under-Replicated Blocks never reduced
> -
>
> Key: HDFS-14754
> URL: https://issues.apache.org/jira/browse/HDFS-14754
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ec
>Reporter: hemanthboyina
>Assignee: hemanthboyina
>Priority: Critical
> Fix For: 3.3.0
>
> Attachments: HDFS-14754-addendum.001.patch, HDFS-14754.001.patch, 
> HDFS-14754.002.patch, HDFS-14754.003.patch, HDFS-14754.004.patch, 
> HDFS-14754.005.patch, HDFS-14754.006.patch, HDFS-14754.007.patch, 
> HDFS-14754.008.patch
>
>
> Using EC RS-3-2 with 6 DNs, 
> we came across a scenario where, among the 5 EC blocks, the same block was 
> replicated three times and two blocks went missing.
> The replicated block was not being deleted, and the missing block could not be 
> reconstructed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14754) Erasure Coding : The number of Under-Replicated Blocks never reduced

2019-09-30 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang updated HDFS-14754:
---
Component/s: ec

> Erasure Coding :  The number of Under-Replicated Blocks never reduced
> -
>
> Key: HDFS-14754
> URL: https://issues.apache.org/jira/browse/HDFS-14754
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ec
>Reporter: hemanthboyina
>Assignee: hemanthboyina
>Priority: Critical
> Fix For: 3.3.0
>
> Attachments: HDFS-14754-addendum.001.patch, HDFS-14754.001.patch, 
> HDFS-14754.002.patch, HDFS-14754.003.patch, HDFS-14754.004.patch, 
> HDFS-14754.005.patch, HDFS-14754.006.patch, HDFS-14754.007.patch, 
> HDFS-14754.008.patch
>
>
> Using EC RS-3-2 with 6 DNs, 
> we came across a scenario where, among the 5 EC blocks, the same block was 
> replicated three times and two blocks went missing.
> The replicated block was not being deleted, and the missing block could not be 
> reconstructed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14528) Failover from Active to Standby Failed

2019-09-30 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang updated HDFS-14528:
---
Labels: multi-sbnn  (was: )

> Failover from Active to Standby Failed  
> 
>
> Key: HDFS-14528
> URL: https://issues.apache.org/jira/browse/HDFS-14528
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ha
>Reporter: Ravuri Sushma sree
>Assignee: Ravuri Sushma sree
>Priority: Major
>  Labels: multi-sbnn
> Attachments: HDFS-14528.003.patch, HDFS-14528.004.patch, 
> HDFS-14528.005.patch, HDFS-14528.2.Patch, ZKFC_issue.patch
>
>
>  *In a cluster with more than one Standby namenode, manual failover throws an 
> exception in some cases.*
> *When trying to execute the failover command from active to standby,* 
> *._/hdfs haadmin  -failover nn1 nn2, the below Exception is thrown_*
>   Operation failed: Call From X-X-X-X/X-X-X-X to Y-Y-Y-Y: failed on 
> connection exception: java.net.ConnectException: Connection refused
> This is encountered in the following cases :
>  Scenario 1 : 
> Namenodes - NN1(Active) , NN2(Standby), NN3(Standby)
> When trying to manually failover from NN1 to NN2 if NN3 is down, Exception is 
> thrown
> Scenario 2 :
>  Namenodes - NN1(Active) , NN2(Standby), NN3(Standby)
> ZKFC's -              ZKFC1,            ZKFC2,            ZKFC3
> When trying to manually failover using NN1 to NN3 if NN3's ZKFC (ZKFC3) is 
> down, Exception is thrown



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14855) client always print standbyexception info with multi standby namenode

2019-09-30 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang updated HDFS-14855:
---
Labels: multi-sbnn  (was: )

> client always print standbyexception info with multi standby namenode
> -
>
> Key: HDFS-14855
> URL: https://issues.apache.org/jira/browse/HDFS-14855
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Shen Yinjie
>Assignee: Shen Yinjie
>Priority: Major
>  Labels: multi-sbnn
> Attachments: image-2019-09-19-20-04-54-591.png
>
>
> When a cluster has more than two standby namenodes, client shell executions 
> print StandbyException info. May we change the log level from INFO to DEBUG?  
>  !image-2019-09-19-20-04-54-591.png! 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14201) Ability to disallow safemode NN to become active

2019-09-30 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang updated HDFS-14201:
---
Labels: multi-sbnn  (was: )

> Ability to disallow safemode NN to become active
> 
>
> Key: HDFS-14201
> URL: https://issues.apache.org/jira/browse/HDFS-14201
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: auto-failover
>Affects Versions: 3.1.1, 2.9.2
>Reporter: Xiao Liang
>Assignee: Xiao Liang
>Priority: Major
>  Labels: multi-sbnn
> Fix For: 3.3.0
>
> Attachments: HDFS-14201.001.patch, HDFS-14201.002.patch, 
> HDFS-14201.003.patch, HDFS-14201.004.patch, HDFS-14201.005.patch, 
> HDFS-14201.006.patch, HDFS-14201.007.patch, HDFS-14201.008.patch, 
> HDFS-14201.009.patch
>
>
> Currently with HA, a Namenode in safemode can possibly be selected as active; 
> for availability of both reads and writes, however, Namenodes not in safemode 
> are better choices to become active.
> It can take tens of minutes for a cold-started Namenode to get out of 
> safemode, especially when there are large numbers of files and blocks in HDFS. 
> That means if a Namenode in safemode becomes active, the cluster will not be 
> fully functioning for quite a while, even though it could be if a Namenode 
> not in safemode had become active instead.
> The proposal here is to add an option to allow a Namenode to report itself as 
> UNHEALTHY to ZKFC if it is in safemode, so that only a fully functioning 
> Namenode is allowed to become active, improving the general availability of the cluster.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14305) Serial number in BlockTokenSecretManager could overlap between different namenodes

2019-09-30 Thread Wei-Chiu Chuang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16941238#comment-16941238
 ] 

Wei-Chiu Chuang commented on HDFS-14305:


HDFS-6440 was backported into branch-2 by HDFS-14205. I'm assuming the issue 
under debate also impacts the 2.10 release?

> Serial number in BlockTokenSecretManager could overlap between different 
> namenodes
> --
>
> Key: HDFS-14305
> URL: https://issues.apache.org/jira/browse/HDFS-14305
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode, security
>Reporter: Chao Sun
>Assignee: Xiaoqiao He
>Priority: Major
>  Labels: multi-sbnn
> Fix For: 3.0.4, 3.3.0, 3.2.1, 3.1.3
>
> Attachments: HDFS-14305-007.patch, HDFS-14305.001.patch, 
> HDFS-14305.002.patch, HDFS-14305.003.patch, HDFS-14305.004.patch, 
> HDFS-14305.005.patch, HDFS-14305.006.patch
>
>
> Currently, a {{BlockTokenSecretManager}} starts with a random integer as the 
> initial serial number, and then use this formula to rotate it:
> {code:java}
> this.intRange = Integer.MAX_VALUE / numNNs;
> this.nnRangeStart = intRange * nnIndex;
> this.serialNo = (this.serialNo % intRange) + (nnRangeStart);
>  {code}
> while {{numNNs}} is the total number of NameNodes in the cluster, and 
> {{nnIndex}} is the index of the current NameNode specified in the 
> configuration {{dfs.ha.namenodes.}}.
> However, with this approach, different NameNode could have overlapping ranges 
> for serial number. For simplicity, let's assume {{Integer.MAX_VALUE}} is 100, 
> and we have 2 NameNodes {{nn1}} and {{nn2}} in configuration. Then the ranges 
> for these two are:
> {code}
> nn1 -> [-49, 49]
> nn2 -> [1, 99]
> {code}
> This is because the initial serial number could be any negative integer.
> Moreover, when the keys are updated, the serial number will again be updated 
> with the formula:
> {code}
> this.serialNo = (this.serialNo % intRange) + (nnRangeStart);
> {code}
> which means the new serial number could be updated to a range that belongs to 
> a different NameNode, thus increasing the chance of collision again.
> When the collision happens, DataNodes could overwrite an existing key which 
> will cause clients to fail because of {{InvalidToken}} error.
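A small self-contained sketch of the arithmetic described above, using MAX = 100 in place of Integer.MAX_VALUE as in the example; this is illustrative only, not the actual BlockTokenSecretManager code.

{code:java}
final class SerialRangeSketch {
  static final int MAX = 100;   // stand-in for Integer.MAX_VALUE, as in the example
  static final int NUM_NNS = 2;

  static int rotate(int nnIndex, int serialNo) {
    int intRange = MAX / NUM_NNS;            // 50
    int nnRangeStart = intRange * nnIndex;   // 0 for nn1, 50 for nn2
    // In Java a negative serialNo gives a negative remainder, so nn1 spans
    // [-49, 49] while nn2 spans [1, 99] -- the two ranges overlap as described.
    return (serialNo % intRange) + nnRangeStart;
  }

  public static void main(String[] args) {
    System.out.println(rotate(0, -23));  // nn1 -> -23, inside nn1's range
    System.out.println(rotate(1, -23));  // nn2 -> 27, which also lies in nn1's range
  }
}
{code}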



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14793) BlockTokenSecretManager should LOG block token range it operates on.

2019-09-30 Thread Wei-Chiu Chuang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16941235#comment-16941235
 ] 

Wei-Chiu Chuang commented on HDFS-14793:


Looks good, but it is superseded by HDFS-14305.

> BlockTokenSecretManager should LOG block token range it operates on.
> 
>
> Key: HDFS-14793
> URL: https://issues.apache.org/jira/browse/HDFS-14793
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.10.0
>Reporter: Konstantin Shvachko
>Assignee: hemanthboyina
>Priority: Major
> Attachments: HDFS-14793.001.patch
>
>
> At startup, log enough information to identify the range of block token keys 
> for the NameNode. This should make it easier to debug issues with block 
> tokens.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14378) Simplify the design of multiple NN and both logic of edit log roll and checkpoint

2019-09-30 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang updated HDFS-14378:
---
Labels: multi-sbnn  (was: )

> Simplify the design of multiple NN and both logic of edit log roll and 
> checkpoint
> -
>
> Key: HDFS-14378
> URL: https://issues.apache.org/jira/browse/HDFS-14378
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: ha, namenode
>Affects Versions: 3.1.2
>Reporter: star
>Assignee: star
>Priority: Major
>  Labels: multi-sbnn
> Attachments: HDFS-14378-trunk.001.patch, HDFS-14378-trunk.002.patch, 
> HDFS-14378-trunk.003.patch, HDFS-14378-trunk.004.patch, 
> HDFS-14378-trunk.005.patch, HDFS-14378-trunk.006.patch
>
>
> HDFS-6440 introduced a mechanism to support more than 2 NNs. It 
> implements a first-writer-wins policy to avoid duplicated fsimage downloading. 
> The variable 'isPrimaryCheckPointer' holds the first-writer state, with which 
> the SNN will provide the fsimage for the ANN next time. So we have three roles 
> in the NN cluster: the ANN, one primary SNN, and one or more normal SNNs.
> Since HDFS-12248, there may be more than two primary SNNs shortly after 
> an exception occurs. It handles a scenario in which the SNN will not upload the 
> fsimage on IOE and InterruptedException. Though it will not cause any further 
> functional issues, it is inconsistent.
> Furthermore, the edit log may be rolled more frequently than necessary with 
> multiple Standby NameNodes (HDFS-14349). (I'm not so sure about this; I will 
> verify with unit tests, or anyone could point it out.)
> Given the above, I'm wondering if we could make this simpler with the following 
> changes:
>  * There are only two roles: ANN and SNN.
>  * The ANN will roll its edit log every DFS_HA_LOGROLL_PERIOD_KEY period.
>  * The ANN will select an SNN to download the checkpoint from.
> The SNN will just do log tailing and checkpointing, and provide a servlet for 
> fsimage downloading as usual. The SNN will not try to roll the edit log or send 
> checkpoint requests to the ANN.
> In short, the ANN will be more active. Suggestions are welcome.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14646) Standby NameNode should not upload fsimage to an inappropriate NameNode.

2019-09-30 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang updated HDFS-14646:
---
Labels: multi-sbnn  (was: )

> Standby NameNode should not upload fsimage to an inappropriate NameNode.
> 
>
> Key: HDFS-14646
> URL: https://issues.apache.org/jira/browse/HDFS-14646
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 3.1.2
>Reporter: Xudong Cao
>Assignee: Xudong Cao
>Priority: Major
>  Labels: multi-sbnn
> Attachments: HDFS-14646.000.patch, HDFS-14646.001.patch, 
> HDFS-14646.002.patch, HDFS-14646.003.patch, HDFS-14646.004.patch
>
>
> *Problem Description:*
>  In the multi-NameNode scenario, when an SNN uploads an FsImage, it will put 
> the image to all other NNs (whether the peer NN is an ANN or not), and even 
> if the peer NN immediately replies with an error (such as 
> TransferResult.NOT_ACTIVE_NAMENODE_FAILURE, TransferResult 
> .OLD_TRANSACTION_ID_FAILURE, etc.), the local SNN will not terminate the put 
> process immediately, but will put the FsImage completely to the peer NN, and 
> will not read the peer NN's reply until the put is completed.
> Depending on the version of Jetty, this behavior can lead to different 
> consequences:
> *1. Under Hadoop 2.7.2 (with Jetty 6.1.26)*
>  After the peer NN calls HttpServletResponse.sendError(), the underlying TCP 
> connection will still be established, and the data the SNN sent will be read by 
> the Jetty framework itself on the peer NN side, so the SNN will needlessly keep 
> sending the FsImage to the peer NN, wasting time and bandwidth. In a relatively 
> large HDFS cluster, the size of the FsImage can often reach about 30GB, so this 
> is indeed a big waste.
> *2. Under the newest release-3.2.0-RC1 (with Jetty 9.3.24) and trunk (with 
> Jetty 9.3.27)*
>  After the peer NN calls HttpServletResponse.sendError(), the underlying TCP 
> connection will be auto-closed, and the SNN will directly get an "Error 
> writing request body to server" exception, as shown below. Note this test needs 
> a relatively big FsImage (e.g. at the 10MB level):
> {code:java}
> 2019-08-17 03:59:25,413 INFO namenode.TransferFsImage: Sending fileName: 
> /tmp/hadoop-root/dfs/name/current/fsimage_3364240, fileSize: 
> 9864721. Sent total: 524288 bytes. Size of last segment intended to send: 
> 4096 bytes.
>  java.io.IOException: Error writing request body to server
>  at 
> sun.net.www.protocol.http.HttpURLConnection$StreamingOutputStream.checkError(HttpURLConnection.java:3587)
>  at 
> sun.net.www.protocol.http.HttpURLConnection$StreamingOutputStream.write(HttpURLConnection.java:3570)
>  at 
> org.apache.hadoop.hdfs.server.namenode.TransferFsImage.copyFileToStream(TransferFsImage.java:396)
>  at 
> org.apache.hadoop.hdfs.server.namenode.TransferFsImage.writeFileToPutRequest(TransferFsImage.java:340)
>  at 
> org.apache.hadoop.hdfs.server.namenode.TransferFsImage.uploadImage(TransferFsImage.java:314)
>  at 
> org.apache.hadoop.hdfs.server.namenode.TransferFsImage.uploadImageFromStorage(TransferFsImage.java:249)
>  at 
> org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$1.call(StandbyCheckpointer.java:277)
>  at 
> org.apache.hadoop.hdfs.server.namenode.ha.StandbyCheckpointer$1.call(StandbyCheckpointer.java:272)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748)
>  2019-08-17 03:59:25,422 INFO namenode.TransferFsImage: Sending fileName: 
> /tmp/hadoop-root/dfs/name/current/fsimage_3364240, fileSize: 
> 9864721. Sent total: 851968 bytes. Size of last segment intended to send: 
> 4096 bytes.
>  java.io.IOException: Error writing request body to server
>  at 
> sun.net.www.protocol.http.HttpURLConnection$StreamingOutputStream.checkError(HttpURLConnection.java:3587)
>  at 
> sun.net.www.protocol.http.HttpURLConnection$StreamingOutputStream.write(HttpURLConnection.java:3570)
>  at 
> org.apache.hadoop.hdfs.server.namenode.TransferFsImage.copyFileToStream(TransferFsImage.java:396)
>  at 
> org.apache.hadoop.hdfs.server.namenode.TransferFsImage.writeFileToPutRequest(TransferFsImage.java:340)
>   {code}
>                   
> *Solution:*
>  A standby NameNode should not upload the fsimage to an inappropriate 
> NameNode: when it plans to put an FsImage to a peer NN, it needs to check 
> whether it really needs to put it at this time.
> In detail, the local SNN should establish an HTTP connection with the peer NN, 
> send the put request, and then immediately read the response (this is the key 
> point). If the peer 

[jira] [Updated] (HDFS-14305) Serial number in BlockTokenSecretManager could overlap between different namenodes

2019-09-30 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang updated HDFS-14305:
---
Labels: multi-sbnn  (was: )

> Serial number in BlockTokenSecretManager could overlap between different 
> namenodes
> --
>
> Key: HDFS-14305
> URL: https://issues.apache.org/jira/browse/HDFS-14305
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode, security
>Reporter: Chao Sun
>Assignee: Xiaoqiao He
>Priority: Major
>  Labels: multi-sbnn
> Fix For: 3.0.4, 3.3.0, 3.2.1, 3.1.3
>
> Attachments: HDFS-14305-007.patch, HDFS-14305.001.patch, 
> HDFS-14305.002.patch, HDFS-14305.003.patch, HDFS-14305.004.patch, 
> HDFS-14305.005.patch, HDFS-14305.006.patch
>
>
> Currently, a {{BlockTokenSecretManager}} starts with a random integer as the 
> initial serial number, and then uses this formula to rotate it:
> {code:java}
> this.intRange = Integer.MAX_VALUE / numNNs;
> this.nnRangeStart = intRange * nnIndex;
> this.serialNo = (this.serialNo % intRange) + (nnRangeStart);
>  {code}
> where {{numNNs}} is the total number of NameNodes in the cluster, and 
> {{nnIndex}} is the index of the current NameNode specified in the 
> configuration {{dfs.ha.namenodes.}}.
> However, with this approach, different NameNodes could have overlapping serial 
> number ranges. For simplicity, let's assume {{Integer.MAX_VALUE}} is 100, 
> and we have 2 NameNodes {{nn1}} and {{nn2}} in the configuration. Then the 
> ranges for these two are:
> {code}
> nn1 -> [-49, 49]
> nn2 -> [1, 99]
> {code}
> This is because the initial serial number can be a negative integer.
> Moreover, when the keys are updated, the serial number will again be updated 
> with the formula:
> {code}
> this.serialNo = (this.serialNo % intRange) + (nnRangeStart);
> {code}
> which means the new serial number could be updated to a range that belongs to 
> a different NameNode, thus increasing the chance of collision again.
> When a collision happens, DataNodes could overwrite an existing key, which 
> will cause clients to fail with an {{InvalidToken}} error.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14883) NPE when the second SNN is starting

2019-09-30 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang updated HDFS-14883:
---
Labels: multi-sbnn  (was: )

> NPE when the second SNN is starting
> ---
>
> Key: HDFS-14883
> URL: https://issues.apache.org/jira/browse/HDFS-14883
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Ranith Sardar
>Assignee: Ranith Sardar
>Priority: Major
>  Labels: multi-sbnn
>
>  
> {{| WARN | qtp79782883-47 | /imagetransfer | ServletHandler.java:632
>  java.io.IOException: PutImage failed. java.lang.NullPointerException
>  at 
> org.apache.hadoop.hdfs.server.namenode.ImageServlet.validateRequest(ImageServlet.java:198)
>  at 
> org.apache.hadoop.hdfs.server.namenode.ImageServlet.doPut(ImageServlet.java:485)
>  at javax.servlet.http.HttpServlet.service(HttpServlet.java:710)
>  at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
>  at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:848)
>  at 
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1772)
>  at 
> org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:644)}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14883) NPE when the second SNN is starting

2019-09-30 Thread Wei-Chiu Chuang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16941164#comment-16941164
 ] 

Wei-Chiu Chuang commented on HDFS-14883:


Assuming this is related to the multi-standby-NN feature, I have linked 
HDFS-6440 to this jira.

Could you add the affected version?

> NPE when the second SNN is starting
> ---
>
> Key: HDFS-14883
> URL: https://issues.apache.org/jira/browse/HDFS-14883
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Ranith Sardar
>Assignee: Ranith Sardar
>Priority: Major
>
>  
> {{| WARN | qtp79782883-47 | /imagetransfer | ServletHandler.java:632
>  java.io.IOException: PutImage failed. java.lang.NullPointerException
>  at 
> org.apache.hadoop.hdfs.server.namenode.ImageServlet.validateRequest(ImageServlet.java:198)
>  at 
> org.apache.hadoop.hdfs.server.namenode.ImageServlet.doPut(ImageServlet.java:485)
>  at javax.servlet.http.HttpServlet.service(HttpServlet.java:710)
>  at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
>  at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:848)
>  at 
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1772)
>  at 
> org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:644)}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-14564) Add libhdfs APIs for readFully; add readFully to ByteBufferPositionedReadable

2019-09-27 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang resolved HDFS-14564.

Fix Version/s: 3.3.0
   Resolution: Fixed

Thanks [~stakiar] for the patch and [~smeng] for review!

> Add libhdfs APIs for readFully; add readFully to ByteBufferPositionedReadable
> -
>
> Key: HDFS-14564
> URL: https://issues.apache.org/jira/browse/HDFS-14564
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs-client, libhdfs, native
>Reporter: Sahil Takiar
>Assignee: Sahil Takiar
>Priority: Major
> Fix For: 3.3.0
>
>
> Splitting this out from HDFS-14478
> The {{PositionedReadable#readFully}} APIs have existed for a while, but have 
> never been exposed via libhdfs.
> HDFS-3246 added a new interface called {{ByteBufferPositionedReadable}} that 
> provides a {{ByteBuffer}} version of {{PositionedReadable}}, but it does not 
> contain a {{readFully}} method.
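
For context, a hedged usage sketch of the existing Java-side 
{{PositionedReadable#readFully}} contract that this jira proposes surfacing 
through libhdfs (the file path below is hypothetical): unlike a plain 
positioned read, {{readFully}} either fills the requested range completely or 
throws.

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadFullySketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    byte[] buf = new byte[4096];
    // Hypothetical file; readFully reads exactly buf.length bytes at offset 128
    // or throws (e.g. EOFException) if the file is too short.
    try (FSDataInputStream in = fs.open(new Path("/tmp/example"))) {
      in.readFully(128L, buf, 0, buf.length);
    }
  }
}
{code}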



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-13901) INode access time is ignored because of race between open and rename

2019-09-24 Thread Wei-Chiu Chuang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-13901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16937047#comment-16937047
 ] 

Wei-Chiu Chuang commented on HDFS-13901:


Thanks for the update and the awesome benchmark result.

I think the fix itself is good, and easy to understand.

Should we remove the following comment? It looks to me like this is the exact 
same bug being fixed here.
{code}
   * XXX: Races can still occur even after resolving the path again.
   * For example:
   *
   * 
   *   Get the block location for "/a/b"
   *   Rename "/a/b" to "/c/b"
   *   The second resolution still points to "/a/b", which is
   *   wrong.
   * 
   *
   * The behavior is incorrect but consistent with the one before
   * HDFS-7463. A better fix is to change the edit log of SetTime to
   * use inode id instead of a path.
   */
{code}


Test:
Optional, but you could replace the following
{code}
  OutputStream out = hdfs.create(new Path(src));
  out.write("hello".getBytes());
  out.close();
{code}
with 
{code}
DFSTestUtil.createFile(hdfsfs, new Path(src), len,
  (short) 1, 0xFEED);
{code}
Can you use something other than a 1ms sleep, like a CountDownLatch or a 
semaphore? Using a 1ms sleep to control the order of threads is almost always 
going to create flaky tests.

> INode access time is ignored because of race between open and rename
> 
>
> Key: HDFS-13901
> URL: https://issues.apache.org/jira/browse/HDFS-13901
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Major
> Attachments: HDFS-13901.000.patch, HDFS-13901.001.patch, 
> HDFS-13901.002.patch, HDFS-13901.003.patch, HDFS-13901.004.patch
>
>
> That's because in getBlockLocations there is a gap between releasing the read 
> lock and re-fetching the write lock (to update the access time). If a rename 
> operation occurs in the gap, the update of the access time will be ignored. We 
> can calculate the new path from the inode and use the new path to update the 
> access time.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-14868) RBF: Fix typo in TestRouterQuota

2019-09-24 Thread Wei-Chiu Chuang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-14868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang updated HDFS-14868:
---
Fix Version/s: 3.3.0
   Resolution: Fixed
   Status: Resolved  (was: Patch Available)

Pushed 001 to trunk.
Thanks [~LiJinglun] for the patch and [~ayushtkn] for the review.

> RBF: Fix typo in TestRouterQuota
> 
>
> Key: HDFS-14868
> URL: https://issues.apache.org/jira/browse/HDFS-14868
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Trivial
> Fix For: 3.3.0
>
> Attachments: HDFS-14868.001.patch
>
>
> There is a typo in TestRouterQuota, see the patch for detail.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14778) BlockManager findAndMarkBlockAsCorrupt adds block to the map if the Storage state is failed

2019-09-23 Thread Wei-Chiu Chuang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16936272#comment-16936272
 ] 

Wei-Chiu Chuang commented on HDFS-14778:


I've definitely seen something similar happen, which was fixed by HDFS-9958, so 
I'm curious how this is different from HDFS-9958.
Also, the test doesn't work: if I remove the fix, the test still passes.

> BlockManager findAndMarkBlockAsCorrupt adds block to the map if the Storage 
> state is failed
> ---
>
> Key: HDFS-14778
> URL: https://issues.apache.org/jira/browse/HDFS-14778
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: hemanthboyina
>Assignee: hemanthboyina
>Priority: Major
> Attachments: HDFS-14778.001.patch, HDFS-14778.002.patch, 
> HDFS-14778.003.patch
>
>
> The block should not be marked as corrupt if the storage state is failed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-14868) RBF: Fix typo in TestRouterQuota

2019-09-23 Thread Wei-Chiu Chuang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16936260#comment-16936260
 ] 

Wei-Chiu Chuang commented on HDFS-14868:


+1

> RBF: Fix typo in TestRouterQuota
> 
>
> Key: HDFS-14868
> URL: https://issues.apache.org/jira/browse/HDFS-14868
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Trivial
> Attachments: HDFS-14868.001.patch
>
>
> There is a typo in TestRouterQuota, see the patch for detail.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org


