[jira] [Commented] (HDDS-2287) Move ozone source code to apache/hadoop-ozone from apache/hadoop

2019-10-23 Thread Vinod Kumar Vavilapalli (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-2287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16957761#comment-16957761
 ] 

Vinod Kumar Vavilapalli commented on HDDS-2287:
---

Is there a concrete proposal on which exact modules move and which don't? 
I scanned the wiki page and didn't find one.

Color me completely disconnected, but I was a bit surprised to see on 
HADOOP-16654 that HDDS also moves out. When it all started, didn't we talk 
about the possibility of multiple storage systems on top of HDDS, with Ozone 
happening to be one of them? If so, why do we tie them together this way? 
Shouldn't HDDS stay in the 'core'? Or are Ozone & HDDS tightly coupled now?

> Move ozone source code to apache/hadoop-ozone from apache/hadoop
> 
>
> Key: HDDS-2287
> URL: https://issues.apache.org/jira/browse/HDDS-2287
> Project: Hadoop Distributed Data Store
>  Issue Type: Task
>Reporter: Marton Elek
>Assignee: Marton Elek
>Priority: Major
>
> *This issue is created so that the assigned number can be used for any 
> technical commits, making it easy to trace the root reason of each commit...*
>  
> As discussed and voted on the mailing lists, the Apache Hadoop Ozone source 
> code will be removed from the Hadoop trunk and stored in a separate repository.
>  
> Original discussion is here:
> [https://lists.apache.org/thread.html/ef01b7def94ba58f746875999e419e10645437423ab9af19b32821e7@%3Chdfs-dev.hadoop.apache.org%3E]
> (It started as a discussion, but as everybody began to vote, it finished 
> with a call for a lazy consensus vote.)
>  
> The technical proposal is shared on the wiki: 
> [https://cwiki.apache.org/confluence/display/HADOOP/Hadoop+Ozone+source+tree+split]
>  
> Discussed at the community meeting: 
> [https://cwiki.apache.org/confluence/display/HADOOP/2019-09-30+Meeting+notes]
>  
> This was then shared on the mailing list to get more feedback: 
> [https://lists.apache.org/thread.html/ed608c708ea302675ae5e39636ed73613f47a93c2ddfbd3c9e24dbae@%3Chdfs-dev.hadoop.apache.org%3E]
>  






[jira] [Updated] (HDFS-13547) Add ingress port based sasl resolver

2018-11-30 Thread Vinod Kumar Vavilapalli (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-13547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-13547:
---
Fix Version/s: (was: 3.1.1)

> Add ingress port based sasl resolver
> 
>
> Key: HDFS-13547
> URL: https://issues.apache.org/jira/browse/HDFS-13547
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: security
>Reporter: Chen Liang
>Assignee: Chen Liang
>Priority: Major
> Fix For: 3.2.0
>
> Attachments: HDFS-13547.001.patch, HDFS-13547.002.patch, 
> HDFS-13547.003.patch, HDFS-13547.004.patch
>
>
> This Jira extends the SASL properties resolver interface to take an ingress 
> port parameter, and also adds an implementation based on this.
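A minimal sketch of what such a resolver could look like (illustrative only, 
not the committed patch; the class name and port value are assumptions), 
building on the existing {{SaslPropertiesResolver}} hook:

{code:java}
import java.net.InetAddress;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.security.SaslPropertiesResolver;

// Hypothetical resolver: choose SASL properties based on the port the
// client connected to, falling back to the address-based resolution.
public class IngressPortSaslResolver extends SaslPropertiesResolver {
  private static final int PRIVACY_PORT = 10019; // assumed privacy-only port

  public Map<String, String> getServerProperties(InetAddress clientAddress,
      int ingressPort) {
    Map<String, String> props =
        new HashMap<>(getServerProperties(clientAddress));
    if (ingressPort == PRIVACY_PORT) {
      // Require encryption (QOP "auth-conf") on the privacy port.
      props.put("javax.security.sasl.qop", "auth-conf");
    }
    return props;
  }
}
{code}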






[jira] [Commented] (HDFS-13547) Add ingress port based sasl resolver

2018-11-30 Thread Vinod Kumar Vavilapalli (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-13547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16705271#comment-16705271
 ] 

Vinod Kumar Vavilapalli commented on HDFS-13547:


[~vagarychen], 3.1.1 was already released a while ago, so this shouldn't land 
in branch-3.1.1, nor should this JIRA have the fix-version 3.1.1.

> Add ingress port based sasl resolver
> 
>
> Key: HDFS-13547
> URL: https://issues.apache.org/jira/browse/HDFS-13547
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: security
>Reporter: Chen Liang
>Assignee: Chen Liang
>Priority: Major
> Fix For: 3.2.0, 3.1.1
>
> Attachments: HDFS-13547.001.patch, HDFS-13547.002.patch, 
> HDFS-13547.003.patch, HDFS-13547.004.patch
>
>
> This Jira extends the SASL properties resolver interface to take an ingress 
> port parameter, and also adds an implementation based on this.






[jira] [Reopened] (HDFS-13547) Add ingress port based sasl resolver

2018-11-29 Thread Vinod Kumar Vavilapalli (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-13547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli reopened HDFS-13547:


Oh, and this never made it to branch-3 either. [~vagarychen], I am reopening 
this; please put it in branch-3 too.

> Add ingress port based sasl resolver
> 
>
> Key: HDFS-13547
> URL: https://issues.apache.org/jira/browse/HDFS-13547
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: security
>Reporter: Chen Liang
>Assignee: Chen Liang
>Priority: Major
> Fix For: 3.2.0
>
> Attachments: HDFS-13547.001.patch, HDFS-13547.002.patch, 
> HDFS-13547.003.patch, HDFS-13547.004.patch
>
>
> This Jira extends the SASL properties resolver interface to take an ingress 
> port parameter, and also adds an implementation based on this.






[jira] [Updated] (HDFS-13547) Add ingress port based sasl resolver

2018-11-29 Thread Vinod Kumar Vavilapalli (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-13547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-13547:
---
Fix Version/s: (was: 3.1.1)
   3.2.0

I just checked the branches; this never made it to 3.1.1 even though the 
fix-version says so.

It's only in branch-3.2, branch-3.2.0 and trunk.

Release-notes for 3.1.1 (which is already released) are broken, but it is what 
it is.

Editing the fix-version.

> Add ingress port based sasl resolver
> 
>
> Key: HDFS-13547
> URL: https://issues.apache.org/jira/browse/HDFS-13547
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: security
>Reporter: Chen Liang
>Assignee: Chen Liang
>Priority: Major
> Fix For: 3.2.0
>
> Attachments: HDFS-13547.001.patch, HDFS-13547.002.patch, 
> HDFS-13547.003.patch, HDFS-13547.004.patch
>
>
> This Jira extends the SASL properties resolver interface to take an ingress 
> port parameter, and also adds an implementation based on this.






[jira] [Commented] (HDDS-861) TestNodeManager unit tests are broken

2018-11-21 Thread Vinod Kumar Vavilapalli (JIRA)


[ 
https://issues.apache.org/jira/browse/HDDS-861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16694967#comment-16694967
 ] 

Vinod Kumar Vavilapalli commented on HDDS-861:
--

While you are at it, can you rename the test to be {{TestSCMNodeManager}}? We 
have a {{TestNodeManager}} on the YARN side :)

> TestNodeManager unit tests are broken
> -
>
> Key: HDDS-861
> URL: https://issues.apache.org/jira/browse/HDDS-861
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: SCM
>Affects Versions: 0.4.0
>Reporter: Shashikant Banerjee
>Priority: Major
> Fix For: 0.4.0
>
>
> Many of the tests are failing with a NullPointerException:
> {code:java}
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hdds.scm.node.SCMNodeManager.updateNodeStat(SCMNodeManager.java:195)
> at 
> org.apache.hadoop.hdds.scm.node.SCMNodeManager.register(SCMNodeManager.java:276)
> at 
> org.apache.hadoop.hdds.scm.TestUtils.createRandomDatanodeAndRegister(TestUtils.java:147)
> at 
> org.apache.hadoop.hdds.scm.node.TestNodeManager.testScmHeartbeat(TestNodeManager.java:152)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
> at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
> at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
> at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
> at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
> at 
> org.junit.rules.ExpectedException$ExpectedExceptionStatement.evaluate(ExpectedException.java:168)
> at org.junit.rules.RunRules.evaluate(RunRules.java:20)
> at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271)
> at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70)
> at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
> at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
> at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
> at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
> at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
> at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
> at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
> at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
> at org.junit.runner.JUnitCore.run(JUnitCore.java:160)
> at 
> com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:68)
> at 
> com.intellij.rt.execution.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:47)
> at 
> com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:242)
> at com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:70)
> {code}






[jira] [Commented] (HDFS-13596) NN restart fails after RollingUpgrade from 2.x to 3.x

2018-05-29 Thread Vinod Kumar Vavilapalli (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-13596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16494368#comment-16494368
 ] 

Vinod Kumar Vavilapalli commented on HDFS-13596:


[~hanishakoneru], how does this relate to HDFS-11096? Can you leave a comment 
on that JIRA? Thanks!

> NN restart fails after RollingUpgrade from 2.x to 3.x
> -
>
> Key: HDFS-13596
> URL: https://issues.apache.org/jira/browse/HDFS-13596
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Reporter: Hanisha Koneru
>Priority: Critical
>
> After a rolling upgrade of the NN from 2.x to 3.x, if the NN is restarted, 
> it fails while replaying edit logs.
>  * After NN is started with rollingUpgrade, the layoutVersion written to 
> editLogs (before finalizing the upgrade) is the pre-upgrade layout version 
> (so as to support downgrade).
>  * When writing transactions to log, NN writes as per the current layout 
> version. In 3.x, erasureCoding bits are added to the editLog transactions.
>  * So any edit log written after the upgrade and before finalizing the 
> upgrade will have the old layout version but the new format of transactions.
>  * When NN is restarted and the edit logs are replayed, the NN reads the old 
> layout version from the editLog file. When parsing the transactions, it 
> assumes that the transactions are also from the previous layout and hence 
> skips parsing the erasureCoding bits.
>  * This cascades into reading the wrong set of bits for other fields and 
> leads to the NN shutting down.
> Sample error output:
> {code:java}
> java.lang.IllegalArgumentException: Invalid clientId - length is 0 expected 
> length 16
>  at com.google.common.base.Preconditions.checkArgument(Preconditions.java:88)
>  at org.apache.hadoop.ipc.RetryCache$CacheEntry.<init>(RetryCache.java:74)
>  at org.apache.hadoop.ipc.RetryCache$CacheEntry.<init>(RetryCache.java:86)
>  at 
> org.apache.hadoop.ipc.RetryCache$CacheEntryWithPayload.<init>(RetryCache.java:163)
>  at 
> org.apache.hadoop.ipc.RetryCache.addCacheEntryWithPayload(RetryCache.java:322)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.addCacheEntryWithPayload(FSNamesystem.java:960)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:397)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:249)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:158)
>  at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:888)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:745)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:323)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1086)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:714)
>  at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:632)
>  at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:694)
>  at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:937)
>  at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:910)
>  at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1643)
>  at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1710)
> 2018-05-17 19:10:06,522 WARN 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Encountered exception 
> loading fsimage
> java.io.IOException: java.lang.IllegalStateException: Cannot skip to less 
> than the current value (=16389), where newValue=16388
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.resetLastInodeId(FSDirectory.java:1945)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:298)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:158)
>  at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:888)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:745)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:323)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1086)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:714)
>  at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:632)
>  at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:694)
>  at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:937)
>  at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:910)
>  at 
> 
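A hedged sketch of the parsing mismatch described in the list above 
(illustrative only; the real {{FSEditLogOp}} readers are more involved and 
the names here are assumptions):

{code:java}
// The reader gates EC fields on the edit log file's layout version, while
// the upgraded writer emitted those fields unconditionally.
void readFields(DataInputStream in, int logVersion) throws IOException {
  // ... common transaction fields ...
  if (NameNodeLayoutVersion.supports(
      NameNodeLayoutVersion.Feature.ERASURE_CODING, logVersion)) {
    byte ecPolicyId = in.readByte(); // skipped for a pre-upgrade logVersion,
                                     // even though the byte is on disk
  }
  // ... every later field now reads from a shifted offset, producing
  // errors like the zero-length clientId in the sample output above ...
}
{code}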

[jira] [Updated] (HDFS-13218) Log audit event only used last EC policy name when add multiple policies from file

2018-03-21 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-13218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-13218:
---
Fix Version/s: (was: 3.0.1)
   (was: 3.1.0)

Removing 3.1.0 fix-version from all JIRAs which are Invalid / Won't Fix / 
Duplicate / Cannot Reproduce.

> Log audit event only used last EC policy name when add multiple policies from 
> file 
> ---
>
> Key: HDFS-13218
> URL: https://issues.apache.org/jira/browse/HDFS-13218
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: erasure-coding
>Affects Versions: 3.1.0
>Reporter: liaoyuxiangqin
>Priority: Major
>
> When I read the addErasureCodingPolicies() method of the FSNamesystem class 
> in the namenode, I found that the following code only uses the last EC policy 
> name for logAuditEvent. I think this audit log can't track the whole set of 
> policies when multiple erasure coding policies are added to the 
> ErasureCodingPolicyManager. Thanks.
> {code:java|title=FSNamesystem.java|borderStyle=solid}
> try {
>   checkOperation(OperationCategory.WRITE);
>   checkNameNodeSafeMode("Cannot add erasure coding policy");
>   for (ErasureCodingPolicy policy : policies) {
> try {
>   ErasureCodingPolicy newPolicy =
>   FSDirErasureCodingOp.addErasureCodingPolicy(this, policy,
>   logRetryCache);
>   addECPolicyName = newPolicy.getName();
>   responses.add(new AddErasureCodingPolicyResponse(newPolicy));
> } catch (HadoopIllegalArgumentException e) {
>   responses.add(new AddErasureCodingPolicyResponse(policy, e));
> }
>   }
>   success = true;
>   return responses.toArray(new AddErasureCodingPolicyResponse[0]);
> } finally {
>   writeUnlock(operationName);
>   if (success) {
> getEditLog().logSync();
>   }
>   logAuditEvent(success, operationName, addECPolicyName, null, null);
> }
> {code}
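A hedged sketch of one possible fix (not the committed patch): accumulate 
every successfully added policy name and emit them all in the audit event.

{code:java}
List<String> addedPolicyNames = new ArrayList<>();
for (ErasureCodingPolicy policy : policies) {
  try {
    ErasureCodingPolicy newPolicy =
        FSDirErasureCodingOp.addErasureCodingPolicy(this, policy,
            logRetryCache);
    addedPolicyNames.add(newPolicy.getName()); // keep all names, not just the last
    responses.add(new AddErasureCodingPolicyResponse(newPolicy));
  } catch (HadoopIllegalArgumentException e) {
    responses.add(new AddErasureCodingPolicyResponse(policy, e));
  }
}
// ... and in the finally block:
logAuditEvent(success, operationName,
    String.join(", ", addedPolicyNames), null, null);
{code}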






[jira] [Resolved] (HDFS-10675) [READ] Datanode support to read from external stores.

2018-03-21 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-10675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli resolved HDFS-10675.

Resolution: Fixed

> [READ] Datanode support to read from external stores.
> -
>
> Key: HDFS-10675
> URL: https://issues.apache.org/jira/browse/HDFS-10675
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Virajith Jalaparti
>Assignee: Virajith Jalaparti
>Priority: Major
> Fix For: 3.1.0
>
> Attachments: HDFS-10675-HDFS-9806.001.patch, 
> HDFS-10675-HDFS-9806.002.patch, HDFS-10675-HDFS-9806.003.patch, 
> HDFS-10675-HDFS-9806.004.patch, HDFS-10675-HDFS-9806.005.patch, 
> HDFS-10675-HDFS-9806.006.patch, HDFS-10675-HDFS-9806.007.patch, 
> HDFS-10675-HDFS-9806.008.patch, HDFS-10675-HDFS-9806.009.patch
>
>
> This JIRA introduces a new {{PROVIDED}} {{StorageType}} to represent external 
> stores, along with enabling the Datanode to read from such stores using a 
> {{ProvidedReplica}} and a {{ProvidedVolume}}.
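Conceptually, the new member sits alongside the existing storage types (a 
sketch only; the real enum also carries attributes such as transience):

{code:java}
public enum StorageType {
  RAM_DISK, SSD, DISK, ARCHIVE,
  PROVIDED; // replica bytes live in an external store and are served via
            // ProvidedReplica/ProvidedVolume rather than a local directory
}
{code}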






[jira] [Resolved] (HDFS-11190) [READ] Namenode support for data stored in external stores.

2018-03-21 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-11190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli resolved HDFS-11190.

Resolution: Fixed

> [READ] Namenode support for data stored in external stores.
> ---
>
> Key: HDFS-11190
> URL: https://issues.apache.org/jira/browse/HDFS-11190
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Virajith Jalaparti
>Assignee: Virajith Jalaparti
>Priority: Major
> Fix For: 3.1.0
>
> Attachments: HDFS-11190-HDFS-9806.001.patch, 
> HDFS-11190-HDFS-9806.002.patch, HDFS-11190-HDFS-9806.003.patch, 
> HDFS-11190-HDFS-9806.004.patch
>
>
> The goal of this JIRA is to enable the Namenode to know about blocks that are 
> in {{PROVIDED}} stores and are not necessarily stored on any Datanodes. 






[jira] [Reopened] (HDFS-10675) [READ] Datanode support to read from external stores.

2018-03-21 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-10675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli reopened HDFS-10675:


Resolving as Fixed instead of as Resolved per our conventions.

> [READ] Datanode support to read from external stores.
> -
>
> Key: HDFS-10675
> URL: https://issues.apache.org/jira/browse/HDFS-10675
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Virajith Jalaparti
>Assignee: Virajith Jalaparti
>Priority: Major
> Fix For: 3.1.0
>
> Attachments: HDFS-10675-HDFS-9806.001.patch, 
> HDFS-10675-HDFS-9806.002.patch, HDFS-10675-HDFS-9806.003.patch, 
> HDFS-10675-HDFS-9806.004.patch, HDFS-10675-HDFS-9806.005.patch, 
> HDFS-10675-HDFS-9806.006.patch, HDFS-10675-HDFS-9806.007.patch, 
> HDFS-10675-HDFS-9806.008.patch, HDFS-10675-HDFS-9806.009.patch
>
>
> This JIRA introduces a new {{PROVIDED}} {{StorageType}} to represent external 
> stores, along with enabling the Datanode to read from such stores using a 
> {{ProvidedReplica}} and a {{ProvidedVolume}}.






[jira] [Reopened] (HDFS-11190) [READ] Namenode support for data stored in external stores.

2018-03-21 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-11190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli reopened HDFS-11190:


Resolving as Fixed instead of as Resolved per our conventions.

> [READ] Namenode support for data stored in external stores.
> ---
>
> Key: HDFS-11190
> URL: https://issues.apache.org/jira/browse/HDFS-11190
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Virajith Jalaparti
>Assignee: Virajith Jalaparti
>Priority: Major
> Fix For: 3.1.0
>
> Attachments: HDFS-11190-HDFS-9806.001.patch, 
> HDFS-11190-HDFS-9806.002.patch, HDFS-11190-HDFS-9806.003.patch, 
> HDFS-11190-HDFS-9806.004.patch
>
>
> The goal of this JIRA is to enable the Namenode to know about blocks that are 
> in {{PROVIDED}} stores and are not necessarily stored on any Datanodes. 






[jira] [Updated] (HDFS-13106) Need to exercise all HDFS APIs for EC

2018-02-12 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-13106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-13106:
---

Committer of this patch, please set the *Fix Version/s* for this JIRA.

> Need to exercise all HDFS APIs for EC
> -
>
> Key: HDFS-13106
> URL: https://issues.apache.org/jira/browse/HDFS-13106
> Project: Hadoop HDFS
>  Issue Type: Test
>  Components: hdfs
>Affects Versions: 3.0.0
>Reporter: Haibo Yan
>Assignee: Haibo Yan
>Priority: Major
> Attachments: HDFS-13106.001.patch, HDFS-13106.002.patch, 
> HDFS-13106.003.patch
>
>
> Exercise the FileSystem API to make sure all APIs work as expected with the 
> Erasure Coding feature enabled






[jira] [Updated] (HDFS-12917) Fix description errors in testErasureCodingConf.xml

2018-02-12 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-12917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-12917:
---

Committer of this patch, please set the *Fix Version/s* for this JIRA.

> Fix description errors in testErasureCodingConf.xml
> ---
>
> Key: HDFS-12917
> URL: https://issues.apache.org/jira/browse/HDFS-12917
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: chencan
>Assignee: chencan
>Priority: Major
> Attachments: HADOOP-12917.002.patch, HADOOP-12917.patch
>
>
> In testErasureCodingConf.xml, there are two cases whose description should 
> probably be "getPolicy : get EC policy information at specified path, which 
> have an EC Policy".






[jira] [Commented] (HDFS-12665) [AliasMap] Create a version of the AliasMap that runs in memory in the Namenode (leveldb)

2017-12-18 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16295125#comment-16295125
 ] 

Vinod Kumar Vavilapalli commented on HDFS-12665:


[~virajith], the JIRA is missing fix-version, please set it.

> [AliasMap] Create a version of the AliasMap that runs in memory in the 
> Namenode (leveldb)
> -
>
> Key: HDFS-12665
> URL: https://issues.apache.org/jira/browse/HDFS-12665
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Ewan Higgs
>Assignee: Ewan Higgs
> Attachments: HDFS-12665-HDFS-9806.001.patch, 
> HDFS-12665-HDFS-9806.002.patch, HDFS-12665-HDFS-9806.003.patch, 
> HDFS-12665-HDFS-9806.004.patch, HDFS-12665-HDFS-9806.005.patch, 
> HDFS-12665-HDFS-9806.006.patch, HDFS-12665-HDFS-9806.007.patch, 
> HDFS-12665-HDFS-9806.008.patch, HDFS-12665-HDFS-9806.009.patch, 
> HDFS-12665-HDFS-9806.010.patch, HDFS-12665-HDFS-9806.011.patch, 
> HDFS-12665-HDFS-9806.012.patch
>
>
> The design of Provided Storage requires the use of an AliasMap to manage the 
> mapping between blocks of files on the local HDFS and ranges of files on a 
> remote storage system. To reduce load on the Namenode, this can be done 
> using a pluggable external service (e.g. AzureTable, Cassandra, Ratis). 
> However, to aid adoption and ease of deployment, we propose an in-memory 
> version.
> This AliasMap will be a wrapper around LevelDB (already a dependency from the 
> Timeline Service) and use protobuf for the key (blockpool, blockid, and 
> genstamp) and the value (url, offset, length, nonce). The in-memory service 
> will also have a configurable port on which it will listen for updates from 
> Storage Policy Satisfier (SPS) Coordinating Datanodes (C-DN).
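A minimal sketch of the wrapper described above, assuming the iq80 LevelDB 
artifact already pulled in by the Timeline Service (class and method names 
are illustrative, not the committed API):

{code:java}
import org.iq80.leveldb.DB;

public class InMemoryLevelDbAliasMap {
  private final DB db; // opened elsewhere, e.g. via Iq80DBFactory

  public InMemoryLevelDbAliasMap(DB db) { this.db = db; }

  // key:   protobuf-encoded (blockpool, blockid, genstamp)
  // value: protobuf-encoded (url, offset, length, nonce)
  public void write(byte[] blockKey, byte[] fileRegion) {
    db.put(blockKey, fileRegion);
  }

  public byte[] resolve(byte[] blockKey) {
    return db.get(blockKey); // null when the block has no provided mapping
  }
}
{code}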






[jira] [Commented] (HDFS-12591) [READ] Implement LevelDBFileRegionFormat

2017-12-18 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16295123#comment-16295123
 ] 

Vinod Kumar Vavilapalli commented on HDFS-12591:


[~virajith], the JIRA is missing fix-version, please set it.

> [READ] Implement LevelDBFileRegionFormat
> 
>
> Key: HDFS-12591
> URL: https://issues.apache.org/jira/browse/HDFS-12591
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: hdfs
>Reporter: Ewan Higgs
>Assignee: Ewan Higgs
>Priority: Minor
> Attachments: HDFS-12591-HDFS-9806.001.patch, 
> HDFS-12591-HDFS-9806.002.patch, HDFS-12591-HDFS-9806.003.patch, 
> HDFS-12591-HDFS-9806.004.patch, HDFS-12591-HDFS-9806.005.patch, 
> HDFS-12591-HDFS-9806.006.patch, HDFS-12591-HDFS-9806.007.patch
>
>
> The existing work for HDFS-9806 uses an implementation of the {{FileRegion}} 
> format read from a CSV file. This is good for testability and diagnostic 
> purposes, but it is not very efficient for larger systems.
> There should be a version that is similar to the {{TextFileRegionFormat}} 
> but that instead uses LevelDB.






[jira] [Commented] (HDFS-12713) [READ] Refactor FileRegion and BlockAliasMap to separate out HDFS metadata and PROVIDED storage metadata

2017-12-18 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16295118#comment-16295118
 ] 

Vinod Kumar Vavilapalli commented on HDFS-12713:


[~virajith], a couple of things:
 - Major: The JIRA is missing fix-version, please set it. In general, you 
should set the fix-version at commit time.
 - Minor: Once you've reviewed the patch, you can set the Reviewed flag, an 
option that pops up when you are resolving the JIRA.

> [READ] Refactor FileRegion and BlockAliasMap to separate out HDFS metadata 
> and PROVIDED storage metadata
> 
>
> Key: HDFS-12713
> URL: https://issues.apache.org/jira/browse/HDFS-12713
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Virajith Jalaparti
>Assignee: Ewan Higgs
> Attachments: HDFS-12713-HDFS-9806.001.patch, 
> HDFS-12713-HDFS-9806.002.patch, HDFS-12713-HDFS-9806.003.patch, 
> HDFS-12713-HDFS-9806.004.patch, HDFS-12713-HDFS-9806.005.patch, 
> HDFS-12713-HDFS-9806.006.patch, HDFS-12713-HDFS-9806.007.patch
>
>







[jira] [Updated] (HDFS-11737) Backport HDFS-7964 to branch-2.7: add support for async edit logging

2017-07-21 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-11737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-11737:
---
Resolution: Won't Fix
Status: Resolved  (was: Patch Available)

Thanks [~brahmareddy] for that ref.

Closing this as Won't Fix.

[~zhz], please reopen this if you want this in 2.7.5.

> Backport HDFS-7964 to branch-2.7: add support for async edit logging
> 
>
> Key: HDFS-11737
> URL: https://issues.apache.org/jira/browse/HDFS-11737
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: ha, namenode
>Reporter: Zhe Zhang
>Assignee: Zhe Zhang
>Priority: Critical
> Attachments: HDFS-11737-branch-2.7.00.patch
>
>







[jira] [Updated] (HDFS-11576) Block recovery will fail indefinitely if recovery time > heartbeat interval

2017-07-20 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-11576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-11576:
---
Status: Open  (was: Patch Available)

Canceling patch. [~lukmajercak], can you address [~shv]'s review comments above?

> Block recovery will fail indefinitely if recovery time > heartbeat interval
> ---
>
> Key: HDFS-11576
> URL: https://issues.apache.org/jira/browse/HDFS-11576
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, hdfs, namenode
>Affects Versions: 3.0.0-alpha2, 3.0.0-alpha1, 2.7.3, 2.7.2, 2.7.1
>Reporter: Lukas Majercak
>Assignee: Lukas Majercak
>Priority: Critical
>  Labels: release-blocker
> Attachments: HDFS-11576.001.patch, HDFS-11576.002.patch, 
> HDFS-11576.003.patch, HDFS-11576.004.patch, HDFS-11576.005.patch, 
> HDFS-11576.006.patch, HDFS-11576.007.patch, HDFS-11576.repro.patch
>
>
> Block recovery will fail indefinitely if the time to recover a block is 
> always longer than the heartbeat interval. Scenario:
> 1. DN sends heartbeat 
> 2. NN sends a recovery command to DN, recoveryID=X
> 3. DN starts recovery
> 4. DN sends another heartbeat
> 5. NN sends a recovery command to DN, recoveryID=X+1
> 6. DN calls commitBlockSynchronization on the NN after succeeding with the 
> first recovery, which fails because X < X+1
> ... 
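A hedged sketch of the check that fails at step 6 (method and field names 
are assumptions, not the exact NameNode code):

{code:java}
void commitBlockSynchronization(ExtendedBlock block, long recoveryId)
    throws IOException {
  long currentRecoveryId = getCurrentRecoveryId(block); // X+1 after step 5
  if (recoveryId != currentRecoveryId) {
    // Step 6: the first recovery completes with id X after the NN has
    // already issued X+1, so the commit is rejected -- indefinitely, if
    // every recovery takes longer than one heartbeat interval.
    throw new IOException("The recovery id " + recoveryId
        + " does not match current recovery id " + currentRecoveryId);
  }
  // ... finalize the replica ...
}
{code}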






[jira] [Commented] (HDFS-11737) Backport HDFS-7964 to branch-2.7: add support for async edit logging

2017-07-20 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-11737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16095214#comment-16095214
 ] 

Vinod Kumar Vavilapalli commented on HDFS-11737:


[~zhz], this looks like a big patch and a major change. Do you think this 
should block 2.7.4 or can we de-risk 2.7.4 by skipping this?

> Backport HDFS-7964 to branch-2.7: add support for async edit logging
> 
>
> Key: HDFS-11737
> URL: https://issues.apache.org/jira/browse/HDFS-11737
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: ha, namenode
>Reporter: Zhe Zhang
>Assignee: Zhe Zhang
>Priority: Critical
> Attachments: HDFS-11737-branch-2.7.00.patch
>
>







[jira] [Commented] (HDFS-11742) Improve balancer usability after HDFS-8818

2017-07-20 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-11742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16095212#comment-16095212
 ] 

Vinod Kumar Vavilapalli commented on HDFS-11742:


[~kihwal], do you think this is a blocker for 2.7.4 - it's marked as such?

> Improve balancer usability after HDFS-8818
> --
>
> Key: HDFS-11742
> URL: https://issues.apache.org/jira/browse/HDFS-11742
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Kihwal Lee
>Assignee: Kihwal Lee
>Priority: Blocker
>  Labels: release-blocker
> Attachments: balancer2.8.png, balancer_fix.png, 
> HDFS-11742.branch-2.8.patch, HDFS-11742.branch-2.patch, 
> HDFS-11742.trunk.patch, HDFS-11742.v2.trunk.patch, replaceBlockNumOps-8w.jpg
>
>
> We ran 2.8 balancer with HDFS-8818 on a 280-node and a 2,400-node cluster. In 
> both cases, it would hang forever after two iterations. The two iterations 
> were also moving things at a significantly lower rate. The hang itself is 
> fixed by HDFS-11377, but the design limitation remains, so the balancer 
> throughput ends up actually lower.
> Instead of reverting HDFS-8818 as originally suggested, I am making a small 
> change to make it less error prone and more usable.






[jira] [Updated] (HDFS-11924) FSPermissionChecker.checkTraverse doesn't pass FsAction access properly

2017-06-07 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-11924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-11924:
---
Target Version/s: 2.8.2
   Fix Version/s: (was: 2.8.1)

Please leave the fix-version field alone for a committer to set it at commit 
time. Updating it myself for now.

> FSPermissionChecker.checkTraverse doesn't pass FsAction access properly
> ---
>
> Key: HDFS-11924
> URL: https://issues.apache.org/jira/browse/HDFS-11924
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: security
>Affects Versions: 2.8.0
>Reporter: Zsombor Gegesy
>  Labels: hdfs, hdfspermission
> Attachments: 
> 0001-HDFS-11924-Pass-FsAction-to-the-external-AccessContr.patch
>
>
> In 2.7.1, during a file access check, the AccessControlEnforcer is called 
> with the access parameter filled with FsAction values.
> A thread dump in this case:
> {code}
>   FSPermissionChecker.checkPermission(INodesInPath, boolean, FsAction, 
> FsAction, FsAction, FsAction, boolean) line: 189   
>   FSDirectory.checkPermission(FSPermissionChecker, INodesInPath, boolean, 
> FsAction, FsAction, FsAction, FsAction, boolean) line: 1698 
>   FSDirectory.checkPermission(FSPermissionChecker, INodesInPath, boolean, 
> FsAction, FsAction, FsAction, FsAction) line: 1682  
>   FSDirectory.checkPathAccess(FSPermissionChecker, INodesInPath, 
> FsAction) line: 1656 
>   FSNamesystem.appendFileInternal(FSPermissionChecker, INodesInPath, 
> String, String, boolean, boolean) line: 2668 
>   FSNamesystem.appendFileInt(String, String, String, boolean, boolean) 
> line: 2985 
>   FSNamesystem.appendFile(String, String, String, EnumSet<CreateFlag>, 
> boolean) line: 2952
>   NameNodeRpcServer.append(String, String, EnumSetWritable<CreateFlag>) 
> line: 653 
>   ClientNamenodeProtocolServerSideTranslatorPB.append(RpcController, 
> ClientNamenodeProtocolProtos$AppendRequestProto) line: 421   
>   
> ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(Descriptors$MethodDescriptor,
>  RpcController, Message) line: not available  
>   ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(RPC$Server, String, 
> Writable, long) line: 616  
>   ProtobufRpcEngine$Server(RPC$Server).call(RPC$RpcKind, String, 
> Writable, long) line: 969
>   Server$Handler$1.run() line: 2049   
>   Server$Handler$1.run() line: 2045   
>   AccessController.doPrivileged(PrivilegedExceptionAction, 
> AccessControlContext) line: not available [native method]   
>   Subject.doAs(Subject, PrivilegedExceptionAction) line: 422   
>   UserGroupInformation.doAs(PrivilegedExceptionAction) line: 1657  
> {code}
> However, in 2.8.0 this value is changed to null, because 
> FSPermissionChecker.checkTraverse(FSPermissionChecker pc, INodesInPath iip, 
> boolean resolveLink) couldn't pass the required information, so it simply 
> uses 'null'.
> This is a regression between 2.7.1 and 2.8.0, because an external 
> AccessControlEnforcer can't work properly.
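A hedged sketch of the 2.8.0 shape being described, using the 
{{checkPermission}} signature from the thread dump above (simplified; the 
real traverse logic resolves ancestors too):

{code:java}
static void checkTraverse(FSPermissionChecker pc, INodesInPath iip,
    boolean resolveLink) throws AccessControlException {
  // All four FsAction slots are null here, so an external
  // AccessControlEnforcer cannot tell what access was actually requested.
  pc.checkPermission(iip, false, null, null, null, null, false);
}
{code}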






[jira] [Commented] (HDFS-11742) Improve balancer usability after HDFS-8818

2017-06-02 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-11742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16035658#comment-16035658
 ] 

Vinod Kumar Vavilapalli commented on HDFS-11742:


bq. If 2.8.1 is put up for vote with this, I will have to -1 the release.
bq. It will affect many users, if it is included in a release as is. I will -1 
the release if the issue is not properly addressed.
I'm pushing for the next 2.8 maint release as well as 2.7.x. [~kihwal] / 
[~szetszwo], can you please help get us to convergence? Thanks.

> Improve balancer usability after HDFS-8818
> --
>
> Key: HDFS-11742
> URL: https://issues.apache.org/jira/browse/HDFS-11742
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Kihwal Lee
>Assignee: Kihwal Lee
>Priority: Blocker
>  Labels: release-blocker
> Attachments: balancer2.8.png, HDFS-11742.branch-2.8.patch, 
> HDFS-11742.branch-2.patch, HDFS-11742.trunk.patch, HDFS-11742.v2.trunk.patch
>
>
> We ran 2.8 balancer with HDFS-8818 on a 280-node and a 2,400-node cluster. In 
> both cases, it would hang forever after two iterations. The two iterations 
> were also moving things at a significantly lower rate. The hang itself is 
> fixed by HDFS-11377, but the design limitation remains, so the balancer 
> throughput ends up actually lower.
> Instead of reverting HDFS-8818 as originally suggested, I am making a small 
> change to make it less error prone and more usable.






[jira] [Commented] (HDFS-11743) Revert HDFS-7933 from branch-2.7 (fsck reporting decommissioning replicas)

2017-06-02 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-11743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16035648#comment-16035648
 ] 

Vinod Kumar Vavilapalli commented on HDFS-11743:


[~zhz], I'm pushing for a 2.7.4, can you explain why it has to be reverted?



> Revert HDFS-7933 from branch-2.7 (fsck reporting decommissioning replicas)
> --
>
> Key: HDFS-11743
> URL: https://issues.apache.org/jira/browse/HDFS-11743
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Reporter: Zhe Zhang
>Assignee: Zhe Zhang
>Priority: Blocker
>  Labels: release-blocker
>







[jira] [Updated] (HDFS-10692) Point JDiff base version for HDFS from 2.6.0 to 2.7.2

2016-08-22 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-10692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-10692:
---
   Resolution: Fixed
Fix Version/s: 3.0.0-alpha1
   2.8.0
   Status: Resolved  (was: Patch Available)

Fixed the whitespace issue as part of the commit.

Pushed this to trunk, branch-3.0.0-alpha1, branch-2 and branch-2.8. Thanks 
[~leftnoteasy]!

> Point JDiff base version for HDFS from 2.6.0 to 2.7.2
> -
>
> Key: HDFS-10692
> URL: https://issues.apache.org/jira/browse/HDFS-10692
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Wangda Tan
>Assignee: Wangda Tan
>Priority: Blocker
> Fix For: 2.8.0, 3.0.0-alpha1
>
> Attachments: 3.0.0-alpha1-jdiff-hdfs.zip, HDFS-10692.1.patch, 
> HDFS-10692.2.patch
>
>
> JDiff currently points at 2.6.0; we need to upgrade it to the latest stable 
> release (2.7.2)






[jira] [Updated] (HDFS-10692) Point JDiff base version for HDFS from 2.6.0 to 2.7.2

2016-08-19 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-10692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-10692:
---
Assignee: Wangda Tan
Hadoop Flags: Reviewed
  Status: Patch Available  (was: Open)

Straightforward patch with the generated JDiff.

Will check this in if Jenkins says okay.

> Point JDiff base version for HDFS from 2.6.0 to 2.7.2
> -
>
> Key: HDFS-10692
> URL: https://issues.apache.org/jira/browse/HDFS-10692
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Wangda Tan
>Assignee: Wangda Tan
>Priority: Blocker
> Attachments: 3.0.0-alpha1-jdiff-hdfs.zip, HDFS-10692.1.patch, 
> HDFS-10692.2.patch
>
>
> JDiff currently points at 2.6.0; we need to upgrade it to the latest stable 
> release (2.7.2)






[jira] [Updated] (HDFS-10692) Point JDiff base version for HDFS from 2.6.0 to 2.7.3

2016-08-18 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-10692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-10692:
---
Target Version/s: 2.8.0, 3.0.0-alpha1  (was: 3.0.0-alpha1)

Let's get this into 2.8.0 also. Marking as such.

> Point JDiff base version for HDFS from 2.6.0 to 2.7.3
> -
>
> Key: HDFS-10692
> URL: https://issues.apache.org/jira/browse/HDFS-10692
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Wangda Tan
>Priority: Blocker
> Attachments: 3.0.0-alpha1-jdiff-hdfs.zip, HDFS-10692.1.patch
>
>
> JDiff currently points at 2.6.0; we need to upgrade it to the latest stable 
> release (2.7.3)






[jira] [Updated] (HDFS-10495) Block should be marked as missing if all the replicas are on Decommissioned nodes.

2016-08-17 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-10495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-10495:
---
Target Version/s: 2.8.0, 2.6.5, 2.7.4  (was: 2.8.0, 2.7.3, 2.6.5)

2.7.3 is under release process, changing target-version to 2.7.4.

> Block should be marked as missing if all the replicas are on 
> Decommissioned nodes.
> --
>
> Key: HDFS-10495
> URL: https://issues.apache.org/jira/browse/HDFS-10495
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.8.0
>Reporter: Rushabh S Shah
>Assignee: Rushabh S Shah
>
> As discussed on HDFS-8872, we should mark a block as missing if all the 
> replicas are on decommissioned nodes, since we can take the decommissioned 
> nodes out of rotation at any time.
> We have seen multiple cases where all the replicas land on decommissioned 
> nodes.
> After HDFS-7933, it doesn't mark as missing.






[jira] [Updated] (HDFS-8870) Lease is leaked on write failure

2016-08-17 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-8870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-8870:
--
Target Version/s: 2.6.5, 2.7.4  (was: 2.7.3, 2.6.5)

2.7.3 is under release process, changing target-version to 2.7.4.

> Lease is leaked on write failure
> 
>
> Key: HDFS-8870
> URL: https://issues.apache.org/jira/browse/HDFS-8870
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs-client
>Affects Versions: 2.6.0
>Reporter: Rushabh S Shah
>Assignee: Daryn Sharp
>
> Creating this ticket on behalf of [~daryn]
> We've seen this in one of our clusters. When a long-running process has a 
> write failure, the lease is leaked and gets renewed until the token is 
> expired.






[jira] [Updated] (HDFS-9500) datanodesSoftwareVersions map may counting wrong when rolling upgrade

2016-08-17 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-9500:
--
Target Version/s: 2.6.5, 2.7.4  (was: 2.7.3, 2.6.5)

2.7.3 is under release process, changing target-version to 2.7.4.

> datanodesSoftwareVersions map may counting wrong when rolling upgrade
> -
>
> Key: HDFS-9500
> URL: https://issues.apache.org/jira/browse/HDFS-9500
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.7.1, 2.6.2
>Reporter: Phil Yang
>Assignee: Phil Yang
> Attachments: 9500-v1.patch
>
>
> During a rolling upgrade, the namenode's website overview will report that 
> there are datanodes of two versions in the cluster, for example, 2.6.0 has x 
> nodes and 2.6.2 has y nodes. However, sometimes when I stop a datanode on the 
> old version and start a new-version one, the namenode only increases the 
> count for the new version and does not decrease the count for the old 
> version. So the total number x+y will be larger than the number of 
> datanodes. Even after all datanodes are upgraded, there will still be 
> messages saying that several datanodes are on the old version, and I must run 
> hdfs dfsadmin -refreshNodes to clear this message.
> I think this issue is caused by DatanodeManager.registerDatanode. If nodeS on 
> the old version is not alive because of shutting down, it will not pass 
> shouldCountVersion, so the count for the old version won't be decreased. But 
> this method only judges the heartbeat status and isAlive at that moment; if 
> the namenode has not yet removed this node (which would decrement the version 
> map) and the node restarts on the new version, the decrementVersionCount for 
> this node will never be executed.
> So the simplest way to fix this is to always recount the version map in 
> registerDatanode, since it is not a heavy operation.
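A hedged sketch of the proposed recount (names borrowed from the 
description; not a committed patch):

{code:java}
private void recountSoftwareVersions() {
  Map<String, Integer> counts = new HashMap<>();
  for (DatanodeDescriptor node : getAllDatanodes()) { // assumed accessor
    if (shouldCountVersion(node)) {
      counts.merge(node.getSoftwareVersion(), 1, Integer::sum);
    }
  }
  // Rebuild the map wholesale on every registerDatanode call instead of
  // adjusting it incrementally, so stale entries cannot linger.
  this.datanodesSoftwareVersions = counts;
}
{code}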






[jira] [Updated] (HDFS-9678) Standby NN sometimes does not clear needRollbackFsImage

2016-08-17 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-9678:
--
Target Version/s: 2.7.4  (was: 2.7.3)

2.7.3 is under release process, changing target-version to 2.7.4.

> Standby NN sometimes does not clear needRollbackFsImage
> ---
>
> Key: HDFS-9678
> URL: https://issues.apache.org/jira/browse/HDFS-9678
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Kihwal Lee
>
> When the edit log loader sees {{OP_ROLLING_UPGRADE_START}}, it calls 
> {{setNeedRollbackFsImage(true)}}. This is cleared on a standby NN only by the 
> checkpointer thread when it actually creates a rollback image. 
> On {{OP_ROLLING_UPGRADE_FINALIZE}}, the rolling upgrade is finalized, but 
> {{needRollbackFsImage}} is not cleared if a rollback image was never 
> created. This results in perpetual checkpointing by the standby NN.
> The standby NN thinks it needs to do checkpointing because it needs to create 
> a rollback image, but since it is not in upgrade mode, it creates a regular 
> checkpoint, not a rollback image. As a result, the status is not cleared even 
> after creating the checkpoint.
> The standby will keep checkpointing back-to-back and the checkpoints will get 
> uploaded to the active constantly. We noticed this because of increased sync 
> time on the active.






[jira] [Updated] (HDFS-8520) Patch for PPC64 block size

2016-08-17 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-8520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-8520:
--
Target Version/s: 2.7.4  (was: 2.7.3)

2.7.3 is under release process, changing target-version to 2.7.4.

> Patch for PPC64 block size
> --
>
> Key: HDFS-8520
> URL: https://issues.apache.org/jira/browse/HDFS-8520
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.7.1
> Environment: RHEL 7.1 /PPC64
>Reporter: Tony Reix
>Assignee: Tony Reix
>  Labels: ppc64
> Attachments: HDFS-8520-2.patch, HDFS-8520.patch
>
>
> The attached patch enables Hadoop to work on PPC64.
> It deals with SystemPageSize and BlockSize, which are not 4096 on PPC64.
> There are changes in 3 files:
> - 
> hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/nativeio/NativeIO.java
> - 
> hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/TestFsDatasetCache.java
> - 
> hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/TestCacheDirectives.java
> where 4096 is replaced by getOperatingSystemPageSize() or by using PAGE_SIZE.
> The patch has been built on branch-2.7.






[jira] [Updated] (HDFS-9038) DFS reserved space is erroneously counted towards non-DFS used.

2016-08-17 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-9038:
--
Target Version/s: 2.7.4  (was: 2.7.3)

2.7.3 is under release process, changing target-version to 2.7.4.

> DFS reserved space is erroneously counted towards non-DFS used.
> ---
>
> Key: HDFS-9038
> URL: https://issues.apache.org/jira/browse/HDFS-9038
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Affects Versions: 2.7.1
>Reporter: Chris Nauroth
>Assignee: Brahma Reddy Battula
> Attachments: GetFree.java, HDFS-9038-002.patch, HDFS-9038-003.patch, 
> HDFS-9038-004.patch, HDFS-9038-005.patch, HDFS-9038-006.patch, 
> HDFS-9038-007.patch, HDFS-9038-008.patch, HDFS-9038-009.patch, HDFS-9038.patch
>
>
> HDFS-5215 changed the DataNode volume available space calculation to consider 
> the reserved space held by the {{dfs.datanode.du.reserved}} configuration 
> property.  As a side effect, reserved space is now counted towards non-DFS 
> used.  I don't believe it was intentional to change the definition of non-DFS 
> used.  This issue proposes restoring the prior behavior: do not count 
> reserved space towards non-DFS used.
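A hedged illustration of the accounting in question (the method and field 
names, and the exact formula, are assumptions rather than the committed 
fix): if non-DFS used is derived by subtraction, forgetting to subtract the 
reserved bytes silently folds them into the non-DFS number.

{code:java}
long nonDfsUsed(long capacity, long dfsUsed, long remaining, long reserved) {
  // buggy: reserved ends up counted as "non-DFS used"
  //   return capacity - dfsUsed - remaining;
  // intended: reserved space is neither DFS-used nor non-DFS-used
  return Math.max(0L, capacity - dfsUsed - remaining - reserved);
}
{code}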






[jira] [Updated] (HDFS-3745) fsck prints that it's using KSSL even when it's in fact using SPNEGO for authentication

2016-08-17 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-3745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-3745:
--
Target Version/s: 2.7.4  (was: 2.7.3)

2.7.3 is under release process, changing target-version to 2.7.4.

> fsck prints that it's using KSSL even when it's in fact using SPNEGO for 
> authentication
> ---
>
> Key: HDFS-3745
> URL: https://issues.apache.org/jira/browse/HDFS-3745
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs-client, security
>Affects Versions: 2.0.0-alpha
>Reporter: Aaron T. Myers
>Priority: Trivial
>  Labels: newbie
> Attachments: HDFS-3745.patch
>
>
> In branch-2 (which exclusively uses SPNEGO for HTTP authentication) and in 
> branch-1 (which can optionally use SPNEGO for HTTP authentication), running 
> fsck will print the following, which isn't quite right:
> {quote}
> FSCK started by hdfs (auth:KERBEROS_SSL) from...
> {quote}






[jira] [Updated] (HDFS-10348) Namenode report bad block method doesn't check whether the block belongs to datanode before adding it to corrupt replicas map.

2016-08-17 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-10348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-10348:
---
Target Version/s: 2.7.4  (was: 2.7.3)

2.7.3 is under release process, changing target-version to 2.7.4.

> Namenode report bad block method doesn't check whether the block belongs to 
> datanode before adding it to corrupt replicas map.
> --
>
> Key: HDFS-10348
> URL: https://issues.apache.org/jira/browse/HDFS-10348
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.7.0
>Reporter: Rushabh S Shah
>Assignee: Rushabh S Shah
> Attachments: HDFS-10348-1.patch, HDFS-10348.patch
>
>
> The Namenode (via the report bad block method) doesn't check whether the 
> block belongs to the datanode before adding it to the corrupt replicas map.
> In one of our clusters we found that there were 3 lingering corrupt blocks.
> It happened in the following order.
> 1. Two clients called getBlockLocations for a particular file.
> 2. Client C1 tried to open the file, encountered a checksum error from 
> node N3, and reported the bad block (blk1) to the namenode.
> 3. The namenode added node N3 and block blk1 to the corrupt replicas map and 
> asked one of the good nodes (one of the 2 nodes) to replicate the block to 
> another node N4.
> 4. After receiving the block, N4 sends an IBR (with RECEIVED_BLOCK) to the 
> namenode.
> 5. The namenode removed the block and node N3 from the corrupt replicas map.
>    It also removed N3's storage from the triplets and queued an invalidate 
> request for N3.
> 6. In the meantime, client C2 tried to open the file and the request went to 
> node N3.
>    C2 also encountered the checksum exception and reported the bad block to 
> the namenode.
> 7. The namenode added the corrupt block blk1 and node N3 to the corrupt 
> replicas map without confirming whether node N3 has the block or not.
> After deleting the block, N3 sends an IBR (with DELETED) and the namenode 
> simply ignores the report since N3's storage is no longer in the triplets 
> (from step 5).
> We took the node out of rotation, but the block was still present only in the 
> corruptReplicasMap, because on removing a node, we only go through the blocks 
> that are present in the triplets for that datanode.
> [~kshukla]'s patch fixed this bug via 
> https://issues.apache.org/jira/browse/HDFS-9958.
> But I think the following check should be made in 
> BlockManager#markBlockAsCorrupt instead of 
> BlockManager#findAndMarkBlockAsCorrupt.
> {noformat}
> if (storage == null) {
>   storage = storedBlock.findStorageInfo(node);
> }
> if (storage == null) {
>   blockLog.debug("BLOCK* findAndMarkBlockAsCorrupt: {} not found on {}",
>   blk, dn);
>   return;
> }
> {noformat}
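
Read as a hedged sketch, the proposed placement would look roughly like the following, with the BlockManager types reduced to hypothetical stand-ins (the real DatanodeStorageInfo/BlockInfo signatures differ):

{code:java}
// Sketch only: StorageInfo/StoredBlock stand in for the real BlockManager types.
class MarkBlockAsCorruptSketch {
    static class StorageInfo {}
    interface StoredBlock {
        StorageInfo findStorageInfo(String node); // null if node has no replica
    }

    // Placing the guard here means every caller of markBlockAsCorrupt is
    // protected, not just findAndMarkBlockAsCorrupt.
    static void markBlockAsCorrupt(StoredBlock storedBlock, StorageInfo storage,
                                   String blk, String dn) {
        if (storage == null) {
            storage = storedBlock.findStorageInfo(dn);
        }
        if (storage == null) {
            // The reporting node does not actually hold the replica; do not
            // add a (block, node) entry to the corrupt replicas map.
            System.out.printf("BLOCK* markBlockAsCorrupt: %s not found on %s%n",
                blk, dn);
            return;
        }
        // ... existing corrupt-replica bookkeeping would follow here ...
    }
}
{code}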



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-10627) Volume Scanner marks a block as "suspect" even if the block sender encounters 'Broken pipe' or 'Connection reset by peer' exception

2016-08-17 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-10627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-10627:
---
Target Version/s: 2.7.4  (was: 2.7.3)

2.7.3 is under release process, changing target-version to 2.7.4.

> Volume Scanner marks a block as "suspect" even if the block sender encounters 
> 'Broken pipe' or 'Connection reset by peer' exception
> --
>
> Key: HDFS-10627
> URL: https://issues.apache.org/jira/browse/HDFS-10627
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 2.7.0
>Reporter: Rushabh S Shah
>Assignee: Rushabh S Shah
> Attachments: HDFS-10627.patch
>
>
> In the BlockSender code,
> {code:title=BlockSender.java|borderStyle=solid}
> if (!ioem.startsWith("Broken pipe") && !ioem.startsWith("Connection 
> reset")) {
>   LOG.error("BlockSender.sendChunks() exception: ", e);
> }
> datanode.getBlockScanner().markSuspectBlock(
>   volumeRef.getVolume().getStorageID(),
>   block);
> {code}
> Before HDFS-7686, the block was marked as suspect only if the exception 
> message didn't start with Broken pipe or Connection reset.
> But after HDFS-7686, the block is marked as suspect irrespective of the 
> exception message.
> In one of our datanode, it took approximately a whole day (22 hours) to go 
> through all the suspect blocks to scan one corrupt block.
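
A minimal sketch of the pre-HDFS-7686 behavior the report asks to restore, with the scanner reduced to a stand-in interface:

{code:java}
import java.io.IOException;

// Sketch only: Scanner stands in for the real BlockScanner; the storage ID
// and block ID are simplified to plain values.
class SuspectBlockSketch {
    interface Scanner { void markSuspectBlock(String storageId, long blockId); }

    static void onSendChunksFailure(IOException e, Scanner scanner,
                                    String storageId, long blockId) {
        String ioem = e.getMessage() == null ? "" : e.getMessage();
        if (ioem.startsWith("Broken pipe") || ioem.startsWith("Connection reset")) {
            return; // the client went away; the replica itself is not suspect
        }
        System.err.println("BlockSender.sendChunks() exception: " + e);
        scanner.markSuspectBlock(storageId, blockId); // only for real read errors
    }
}
{code}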



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-10326) Disable setting tcp socket send/receive buffers for write pipelines

2016-08-17 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-10326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-10326:
---
Target Version/s: 2.7.4  (was: 2.7.3)

2.7.3 is under release process, changing target-version to 2.7.4.

> Disable setting tcp socket send/receive buffers for write pipelines
> ---
>
> Key: HDFS-10326
> URL: https://issues.apache.org/jira/browse/HDFS-10326
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, hdfs
>Affects Versions: 2.6.0
>Reporter: Daryn Sharp
>Assignee: Daryn Sharp
> Attachments: HDFS-10326.000.patch, HDFS-10326.001.patch, 
> HDFS-10326.001.patch
>
>
> The DataStreamer and the Datanode use a hardcoded 
> DEFAULT_DATA_SOCKET_SIZE=128K for the send and receive buffers of a write 
> pipeline.  Explicitly setting tcp buffer sizes disables tcp stack 
> auto-tuning.  
> The hardcoded value will saturate a 1Gb link with a 1ms RTT, 105Mbps at 
> 10ms, and a paltry 11Mbps over a 100ms long haul.  10Gb networks are 
> underutilized.
> There should either be a configuration to completely disable setting the 
> buffers, or the setReceiveBuffer and setSendBuffer calls should be removed 
> entirely.
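
A hedged sketch of the first option, where a configured size of zero or less means "do not touch the buffers" (the config plumbing is assumed, not the actual HDFS key):

{code:java}
import java.net.Socket;
import java.net.SocketException;

// Sketch only: skipping the calls entirely leaves TCP auto-tuning enabled
// on both ends of the write pipeline.
class SocketBufferSketch {
    static void maybeSetBuffers(Socket s, int configuredBytes) throws SocketException {
        if (configuredBytes > 0) {
            s.setSendBufferSize(configuredBytes);    // explicit size disables auto-tuning
            s.setReceiveBufferSize(configuredBytes);
        }
        // else: leave both unset so the stack can grow the window to match
        // the link's bandwidth-delay product
    }
}
{code}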



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-8869) Don't mark storages as failed before first block report

2016-08-17 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-8869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-8869:
--
Target Version/s: 2.7.4  (was: 2.7.3)

2.7.3 is under release process, changing target-version to 2.7.4.

> Don't mark storages as failed before first block report
> ---
>
> Key: HDFS-8869
> URL: https://issues.apache.org/jira/browse/HDFS-8869
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.7.0
>Reporter: Rushabh S Shah
>Assignee: Daryn Sharp
>
> Creating this ticket on behalf of [~daryn].
> Heartbeat processing performs the failed-storage check. The DN reports its 
> storages, and any storages missing from a prior report (e.g. after the 
> unique storage id upgrade) are marked failed. The heartbeat monitor removes 
> all blocks associated with the failed storage. A replication storm ensues 
> for all blocks on the node.
> Eventually the DN sends block reports for the new storages - up to 15m later 
> for large clusters. Now the NN has many excess blocks to invalidate. If the 
> cluster has failed over in the past 24h, e.g. during a rolling upgrade, the 
> standby that went active will queue the block invalidations, which triggers 
> the severe performance degradation of HDFS-8674; that has been greatly 
> lessened but is still an issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-8675) IBRs from dead DNs go into infinite loop

2016-08-17 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-8675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-8675:
--
Target Version/s: 2.7.4  (was: 2.7.3)

2.7.3 is under release process, changing target-version to 2.7.4.

> IBRs from dead DNs go into infinite loop
> 
>
> Key: HDFS-8675
> URL: https://issues.apache.org/jira/browse/HDFS-8675
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Affects Versions: 2.6.0
>Reporter: Daryn Sharp
>
> If the DN sends an IBR after the NN declares it dead, the NN returns an IOE 
> saying the node is unregistered or dead.  The DN catches the IOE, ignores 
> it, and loops infinitely, spamming the NN with retries.
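
One plausible shape for a fix, sketched with stand-in types: treat the "unregistered or dead" response as terminal and re-register instead of retrying forever. The real DN code works with typed exceptions rather than message strings, so the string check below is purely illustrative.

{code:java}
import java.io.IOException;

// Sketch only: NameNodeRpc and the message check stand in for the real
// protocol classes and UnregisteredNodeException handling.
class IbrRetrySketch {
    interface NameNodeRpc { void blockReceivedAndDeleted(Object report) throws IOException; }

    static void sendIbr(NameNodeRpc nn, Object report, Runnable reRegister) {
        final int maxRetries = 3;                       // bounded, not infinite
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            try {
                nn.blockReceivedAndDeleted(report);
                return;
            } catch (IOException e) {
                String msg = String.valueOf(e.getMessage());
                if (msg.contains("Unregistered") || msg.contains("dead")) {
                    reRegister.run();                   // terminal: rejoin the cluster
                    return;
                }
                // otherwise fall through and retry a bounded number of times
            }
        }
    }
}
{code}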



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-9666) Enable hdfs-client to read even remote SSD/RAM prior to local disk replica to improve random read

2016-08-17 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-9666:
--
Target Version/s: 2.7.4  (was: 2.7.3)

2.7.3 is under release process, changing target-version to 2.7.4.

> Enable hdfs-client to read even remote SSD/RAM prior to local disk replica to 
> improve random read
> -
>
> Key: HDFS-9666
> URL: https://issues.apache.org/jira/browse/HDFS-9666
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs-client
>Affects Versions: 2.6.0, 2.7.0
>Reporter: ade
>Assignee: ade
> Attachments: HDFS-9666.0.patch
>
>
> We want to improve the random read performance of HDFS for HBase, so we 
> enabled heterogeneous storage in our cluster. But only ~50% of our datanode & 
> regionserver hosts have SSDs, so we can set hfiles to only the ONE_SSD (not 
> ALL_SSD) storage policy, and a regionserver on a non-SSD host can only read 
> the local disk replica. So we developed this feature in the hdfs client to 
> read even a remote SSD/RAM replica prior to the local disk replica.
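
A hedged sketch of the replica-ordering idea: sort candidate locations by storage speed first and network distance second. The enum ordering and distance metric below are assumptions for illustration, not the patch's actual logic.

{code:java}
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Sketch only: Replica is a stand-in for DatanodeInfo plus storage type.
class ReplicaOrderSketch {
    enum StorageType { RAM_DISK, SSD, DISK } // assumed fastest-to-slowest order

    static class Replica {
        final StorageType type;
        final int distance; // e.g. 0 = local, 2 = rack-local, 4 = remote
        Replica(StorageType type, int distance) { this.type = type; this.distance = distance; }
    }

    static void sortForRead(List<Replica> replicas) {
        replicas.sort(Comparator
            .comparingInt((Replica r) -> r.type.ordinal()) // prefer faster media
            .thenComparingInt(r -> r.distance));           // then prefer closer
    }

    public static void main(String[] args) {
        List<Replica> rs = new ArrayList<>(List.of(
            new Replica(StorageType.DISK, 0),   // local disk
            new Replica(StorageType.SSD, 4)));  // remote SSD
        sortForRead(rs);
        System.out.println(rs.get(0).type);     // SSD: the remote SSD wins
    }
}
{code}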



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-10477) Stop decommission a rack of DataNodes caused NameNode fail over to standby

2016-08-17 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-10477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-10477:
---
Target Version/s: 2.7.4  (was: 2.7.3)

2.7.3 is under release process, changing target-version to 2.7.4.

> Stop decommission a rack of DataNodes caused NameNode fail over to standby
> --
>
> Key: HDFS-10477
> URL: https://issues.apache.org/jira/browse/HDFS-10477
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.7.2
>Reporter: yunjiong zhao
>Assignee: yunjiong zhao
> Attachments: HDFS-10477.002.patch, HDFS-10477.003.patch, 
> HDFS-10477.004.patch, HDFS-10477.005.patch, HDFS-10477.patch
>
>
> In our cluster, when we stopped decommissioning a rack which has 46 DataNodes, 
> it locked the Namesystem for about 7 minutes, as the log below shows:
> {code}
> 2016-05-26 20:11:41,697 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.27:1004
> 2016-05-26 20:11:51,171 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 285258 over-replicated blocks on 10.142.27.27:1004 during recommissioning
> 2016-05-26 20:11:51,171 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.118:1004
> 2016-05-26 20:11:59,972 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 279923 over-replicated blocks on 10.142.27.118:1004 during recommissioning
> 2016-05-26 20:11:59,972 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.113:1004
> 2016-05-26 20:12:09,007 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 294307 over-replicated blocks on 10.142.27.113:1004 during recommissioning
> 2016-05-26 20:12:09,008 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.117:1004
> 2016-05-26 20:12:18,055 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 314381 over-replicated blocks on 10.142.27.117:1004 during recommissioning
> 2016-05-26 20:12:18,056 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.130:1004
> 2016-05-26 20:12:25,938 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 272779 over-replicated blocks on 10.142.27.130:1004 during recommissioning
> 2016-05-26 20:12:25,939 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.121:1004
> 2016-05-26 20:12:34,134 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 287248 over-replicated blocks on 10.142.27.121:1004 during recommissioning
> 2016-05-26 20:12:34,134 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.33:1004
> 2016-05-26 20:12:43,020 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 299868 over-replicated blocks on 10.142.27.33:1004 during recommissioning
> 2016-05-26 20:12:43,020 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.137:1004
> 2016-05-26 20:12:52,220 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 303914 over-replicated blocks on 10.142.27.137:1004 during recommissioning
> 2016-05-26 20:12:52,220 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.51:1004
> 2016-05-26 20:13:00,362 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 281175 over-replicated blocks on 10.142.27.51:1004 during recommissioning
> 2016-05-26 20:13:00,362 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.12:1004
> 2016-05-26 20:13:08,756 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 274880 over-replicated blocks on 10.142.27.12:1004 during recommissioning
> 2016-05-26 20:13:08,757 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.15:1004
> 2016-05-26 20:13:17,185 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 286334 over-replicated blocks on 10.142.27.15:1004 during recommissioning
> 2016-05-26 20:13:17,185 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop 
> Decommissioning 10.142.27.14:1004
> 2016-05-26 20:13:25,369 INFO 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 
> 280219 over-replicated blocks on 10.142.27.14:1004 during recommissioning
> 2016-05-26 20:13:25,370 INFO 
> 

[jira] [Updated] (HDFS-8307) Spurious DNS Queries from hdfs shell

2016-08-17 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-8307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-8307:
--
Target Version/s: 2.7.4  (was: 2.7.3)

2.7.3 is under release process, changing target-version to 2.7.4.

> Spurious DNS Queries from hdfs shell
> 
>
> Key: HDFS-8307
> URL: https://issues.apache.org/jira/browse/HDFS-8307
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs-client
>Affects Versions: 2.7.1
>Reporter: Anu Engineer
>Priority: Trivial
>
> With HA configured, the hdfs shell (org.apache.hadoop.fs.FsShell) seems to 
> issue a DNS query for the cluster name. If fs.defaultFS is set to 
> hdfs://mycluster, then the shell seems to issue a DNS query for 
> mycluster.FQDN or mycluster.
> Since mycluster is not a machine name, the DNS query always fails with 
> "DNS 85 Standard query response 0x2aeb No such name"
> Repro Steps:
> # Setup an HA cluster 
> # Log on to any node
> # Run wireshark monitoring port 53 - "sudo tshark 'port 53'"
> # Run "sudo -u hdfs hdfs dfs -ls /" 
> # You should be able to see DNS queries to mycluster.FQDN in wireshark



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-9530) ReservedSpace is not cleared for abandoned Blocks

2016-08-17 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-9530:
--
Resolution: Fixed
Status: Resolved  (was: Patch Available)

bq. Reopening the issue to attach the branch-2.6 patch and run jenkins against 
this..
Closing this again for the 2.7.3 release process. If you just want to use 
Jenkins for the 2.6 patch, you can create a clone and use that.

> ReservedSpace is not cleared for abandoned Blocks
> -
>
> Key: HDFS-9530
> URL: https://issues.apache.org/jira/browse/HDFS-9530
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.6.0, 2.7.1
>Reporter: Fei Hui
>Assignee: Brahma Reddy Battula
>Priority: Critical
> Fix For: 2.7.3
>
> Attachments: HDFS-9530-01.patch, HDFS-9530-02.patch, 
> HDFS-9530-03.patch, HDFS-9530-branch-2.6.patch, 
> HDFS-9530-branch-2.7-001.patch, HDFS-9530-branch-2.7-002.patch
>
>
> I think there are bugs in HDFS.
> ===
> Here is the config:
> <property>
>   <name>dfs.datanode.data.dir</name>
>   <value>file:///mnt/disk4,file:///mnt/disk1,file:///mnt/disk3,file:///mnt/disk2</value>
> </property>
> here is dfsadmin report 
> [hadoop@worker-1 ~]$ hadoop dfsadmin -report
> DEPRECATED: Use of this script to execute hdfs command is deprecated.
> Instead use the hdfs command for it.
> Configured Capacity: 240769253376 (224.23 GB)
> Present Capacity: 238604832768 (222.22 GB)
> DFS Remaining: 215772954624 (200.95 GB)
> DFS Used: 22831878144 (21.26 GB)
> DFS Used%: 9.57%
> Under replicated blocks: 4
> Blocks with corrupt replicas: 0
> Missing blocks: 0
> -
> Live datanodes (3):
> Name: 10.117.60.59:50010 (worker-2)
> Hostname: worker-2
> Decommission Status : Normal
> Configured Capacity: 80256417792 (74.74 GB)
> DFS Used: 7190958080 (6.70 GB)
> Non DFS Used: 721473536 (688.05 MB)
> DFS Remaining: 72343986176 (67.38 GB)
> DFS Used%: 8.96%
> DFS Remaining%: 90.14%
> Configured Cache Capacity: 0 (0 B)
> Cache Used: 0 (0 B)
> Cache Remaining: 0 (0 B)
> Cache Used%: 100.00%
> Cache Remaining%: 0.00%
> Xceivers: 1
> Last contact: Wed Dec 09 15:55:02 CST 2015
> Name: 10.168.156.0:50010 (worker-3)
> Hostname: worker-3
> Decommission Status : Normal
> Configured Capacity: 80256417792 (74.74 GB)
> DFS Used: 7219073024 (6.72 GB)
> Non DFS Used: 721473536 (688.05 MB)
> DFS Remaining: 72315871232 (67.35 GB)
> DFS Used%: 9.00%
> DFS Remaining%: 90.11%
> Configured Cache Capacity: 0 (0 B)
> Cache Used: 0 (0 B)
> Cache Remaining: 0 (0 B)
> Cache Used%: 100.00%
> Cache Remaining%: 0.00%
> Xceivers: 1
> Last contact: Wed Dec 09 15:55:03 CST 2015
> Name: 10.117.15.38:50010 (worker-1)
> Hostname: worker-1
> Decommission Status : Normal
> Configured Capacity: 80256417792 (74.74 GB)
> DFS Used: 8421847040 (7.84 GB)
> Non DFS Used: 721473536 (688.05 MB)
> DFS Remaining: 71113097216 (66.23 GB)
> DFS Used%: 10.49%
> DFS Remaining%: 88.61%
> Configured Cache Capacity: 0 (0 B)
> Cache Used: 0 (0 B)
> Cache Remaining: 0 (0 B)
> Cache Used%: 100.00%
> Cache Remaining%: 0.00%
> Xceivers: 1
> Last contact: Wed Dec 09 15:55:03 CST 2015
> 
> when running hive job , dfsadmin report as follows
> [hadoop@worker-1 ~]$ hadoop dfsadmin -report
> DEPRECATED: Use of this script to execute hdfs command is deprecated.
> Instead use the hdfs command for it.
> Configured Capacity: 240769253376 (224.23 GB)
> Present Capacity: 108266011136 (100.83 GB)
> DFS Remaining: 80078416384 (74.58 GB)
> DFS Used: 28187594752 (26.25 GB)
> DFS Used%: 26.04%
> Under replicated blocks: 7
> Blocks with corrupt replicas: 0
> Missing blocks: 0
> -
> Live datanodes (3):
> Name: 10.117.60.59:50010 (worker-2)
> Hostname: worker-2
> Decommission Status : Normal
> Configured Capacity: 80256417792 (74.74 GB)
> DFS Used: 9015627776 (8.40 GB)
> Non DFS Used: 44303742464 (41.26 GB)
> DFS Remaining: 26937047552 (25.09 GB)
> DFS Used%: 11.23%
> DFS Remaining%: 33.56%
> Configured Cache Capacity: 0 (0 B)
> Cache Used: 0 (0 B)
> Cache Remaining: 0 (0 B)
> Cache Used%: 100.00%
> Cache Remaining%: 0.00%
> Xceivers: 693
> Last contact: Wed Dec 09 15:37:35 CST 2015
> Name: 10.168.156.0:50010 (worker-3)
> Hostname: worker-3
> Decommission Status : Normal
> Configured Capacity: 80256417792 (74.74 GB)
> DFS Used: 9163116544 (8.53 GB)
> Non DFS Used: 47895897600 (44.61 GB)
> DFS Remaining: 23197403648 (21.60 GB)
> DFS Used%: 11.42%
> DFS Remaining%: 28.90%
> Configured Cache Capacity: 0 (0 B)
> Cache Used: 0 (0 B)
> Cache Remaining: 0 (0 B)
> Cache Used%: 100.00%
> Cache Remaining%: 0.00%
> Xceivers: 750
> Last contact: Wed Dec 09 

[jira] [Updated] (HDFS-9395) Make HDFS audit logging consistent

2016-08-17 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-9395:
--
Target Version/s:   (was: 2.7.3)

> Make HDFS audit logging consistent
> --
>
> Key: HDFS-9395
> URL: https://issues.apache.org/jira/browse/HDFS-9395
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Kihwal Lee
>Assignee: Kuhu Shukla
> Fix For: 3.0.0-alpha1
>
> Attachments: HDFS-9395-branch-2.7.001.patch, 
> HDFS-9395-branch-2.7.002.patch, HDFS-9395.001.patch, HDFS-9395.002.patch, 
> HDFS-9395.003.patch, HDFS-9395.004.patch, HDFS-9395.005.patch, 
> HDFS-9395.006.patch, HDFS-9395.007.patch
>
>
> So, the big question here is what should go in the audit log? All failures, 
> or just "permission denied" failures? Or, to put it a different way, if 
> someone attempts to do something and it fails because a file doesn't exist, 
> is that worth an audit log entry?
> We are currently inconsistent on this point. For example, concat, 
> getContentSummary, addCacheDirective, and setErasureEncodingPolicy create an 
> audit log entry for all failures, but setOwner, delete, and setAclEntries 
> attempt to only create an entry for AccessControlException-based failures. 
> There are a few operations, like allowSnapshot, disallowSnapshot, and 
> startRollingUpgrade that never create audit log failure entries at all. They 
> simply log nothing for any failure, and log success for a successful 
> operation.
> So to summarize, different HDFS operations currently fall into 3 categories:
> 1. audit-log all failures
> 2. audit-log only AccessControlException failures
> 3. never audit-log failures
> Which category is right?  And how can we fix the inconsistency?
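
If category 1 were chosen, one wrapper could make every operation consistent. A hedged sketch (AuditLog is a stand-in for the real FSNamesystem audit logger; AccessControlException is the one from hadoop-common):

{code:java}
import org.apache.hadoop.security.AccessControlException;

// Sketch only: route every operation through one helper so success, ACL
// failures, and all other failures are logged the same way everywhere.
class AuditSketch {
    interface AuditLog { void log(boolean allowed, String cmd, String detail); }
    interface Op<T> { T run() throws Exception; }

    static <T> T audited(AuditLog audit, String cmd, Op<T> op) throws Exception {
        try {
            T result = op.run();
            audit.log(true, cmd, "");
            return result;
        } catch (AccessControlException ace) {
            audit.log(false, cmd, "permission denied");          // category-2 style entry
            throw ace;
        } catch (Exception e) {
            audit.log(false, cmd, e.getClass().getSimpleName()); // every other failure
            throw e;
        }
    }
}
{code}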



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-9395) Make HDFS audit logging consistent

2016-08-17 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-9395:
--
Fix Version/s: (was: 2.7.3)
   3.0.0-alpha1

Just reverted this incompatible change from branch-2, branch-2.8 and 
branch-2.7.3 after [~aw] pointed this out on 2.7.3 RC1 voting thread.

[~kshukla] / [~kihwal], can you comment on why this incompatible change is 
pushed into branch-2.*?

> Make HDFS audit logging consistent
> --
>
> Key: HDFS-9395
> URL: https://issues.apache.org/jira/browse/HDFS-9395
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Kihwal Lee
>Assignee: Kuhu Shukla
> Fix For: 3.0.0-alpha1
>
> Attachments: HDFS-9395-branch-2.7.001.patch, 
> HDFS-9395-branch-2.7.002.patch, HDFS-9395.001.patch, HDFS-9395.002.patch, 
> HDFS-9395.003.patch, HDFS-9395.004.patch, HDFS-9395.005.patch, 
> HDFS-9395.006.patch, HDFS-9395.007.patch
>
>
> So, the big question here is what should go in the audit log? All failures, 
> or just "permission denied" failures? Or, to put it a different way, if 
> someone attempts to do something and it fails because a file doesn't exist, 
> is that worth an audit log entry?
> We are currently inconsistent on this point. For example, concat, 
> getContentSummary, addCacheDirective, and setErasureEncodingPolicy create an 
> audit log entry for all failures, but setOwner, delete, and setAclEntries 
> attempt to only create an entry for AccessControlException-based failures. 
> There are a few operations, like allowSnapshot, disallowSnapshot, and 
> startRollingUpgrade that never create audit log failure entries at all. They 
> simply log nothing for any failure, and log success for a successful 
> operation.
> So to summarize, different HDFS operations currently fall into 3 categories:
> 1. audit-log all failures
> 2. audit-log only AccessControlException failures
> 3. never audit-log failures
> Which category is right?  And how can we fix the inconsistency?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-8402) Fsck exit codes are not reliable

2016-08-17 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-8402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-8402:
--
Target Version/s:   (was: 2.7.3)

Dropping 2.7.x target-version till the above compatibility discussion closes.

> Fsck exit codes are not reliable
> 
>
> Key: HDFS-8402
> URL: https://issues.apache.org/jira/browse/HDFS-8402
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.7.0
>Reporter: Daryn Sharp
>Assignee: Daryn Sharp
> Attachments: HDFS-8402.patch
>
>
> HDFS-6663 added the ability to check specific blocks.  The exit code is 
> non-deterministic: it is based on the state (corrupt, healthy, etc.) of the 
> last displayed block's last storage location, instead of on whether any of 
> the checked blocks' storages are corrupt.  Blocks with decommissioning or 
> decommissioned nodes should not be flagged as an error.
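
A minimal sketch of the intended exit-code rule, with the per-replica state model collapsed to an enum: the code should reflect whether *any* checked block is corrupt, and decommission states should never count as errors.

{code:java}
// Sketch only: Status is a simplification of fsck's per-replica states.
class FsckExitCodeSketch {
    enum Status { HEALTHY, DECOMMISSIONING, DECOMMISSIONED, CORRUPT }

    static int exitCode(Iterable<Status> checkedBlocks) {
        for (Status s : checkedBlocks) {
            if (s == Status.CORRUPT) {
                return 1;            // any corrupt block makes the run a failure
            }
            // DECOMMISSIONING/DECOMMISSIONED are expected states, not errors
        }
        return 0;                    // deterministic, independent of iteration order
    }
}
{code}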



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-10544) Balancer doesn't work with IPFailoverProxyProvider

2016-07-20 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-10544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-10544:
---
Fix Version/s: (was: 2.7.3)
   2.7.4

2.7.3 was already under process, changing the fix-version to 2.7.4.

> Balancer doesn't work with IPFailoverProxyProvider
> --
>
> Key: HDFS-10544
> URL: https://issues.apache.org/jira/browse/HDFS-10544
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: balancer & mover, ha
>Affects Versions: 2.6.1
>Reporter: Zhe Zhang
>Assignee: Zhe Zhang
> Fix For: 2.8.0, 2.9.0, 2.6.5, 2.7.4, 3.0.0-alpha1
>
> Attachments: HDFS-10544-branch-2.7.patch, HDFS-10544.00.patch, 
> HDFS-10544.01.patch, HDFS-10544.02.patch, HDFS-10544.03.patch, 
> HDFS-10544.04.patch, HDFS-10544.05.patch
>
>
> Right now {{Balancer}} gets the NN URIs through 
> {{DFSUtil#getNameServiceUris}}, which returns logical URIs when HA is 
> enabled. If {{IPFailoverProxyProvider}} is used, {{Balancer}} will not be 
> able to start.
> I think the bug is at {{DFSUtil#getNameServiceUris}}:
> {code}
> for (String nsId : getNameServiceIds(conf)) {
>   if (HAUtil.isHAEnabled(conf, nsId)) {
> // Add the logical URI of the nameservice.
> try {
>   ret.add(new URI(HdfsConstants.HDFS_URI_SCHEME + "://" + nsId));
> {code}
> The {{if}} clause should also consider whether the {{FailoverProxyProvider}} 
> has {{useLogicalURI}} enabled. If not, {{getNameServiceUris}} should try to 
> resolve the physical URI for this nsId.
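
A hedged sketch of how the quoted {{if}} clause might look with that extra condition; {{useLogicalUri}} and {{getPhysicalUris}} are illustrative names, not the actual HAUtil/DFSUtil API:

{code}
for (String nsId : getNameServiceIds(conf)) {
  if (HAUtil.isHAEnabled(conf, nsId)
      && useLogicalUri(conf, nsId)) {   // new: proxy provider uses logical URIs
    // Add the logical URI of the nameservice.
    ret.add(new URI(HdfsConstants.HDFS_URI_SCHEME + "://" + nsId));
  } else {
    // IPFailoverProxyProvider (and non-HA): resolve to the physical address.
    ret.addAll(getPhysicalUris(conf, nsId)); // illustrative helper
  }
}
{code}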



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-8914) Document HA support in the HDFS HdfsDesign.md

2016-07-20 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-8914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli resolved HDFS-8914.
---
Resolution: Fixed

Closing this again.

> Document HA support in the HDFS HdfsDesign.md
> -
>
> Key: HDFS-8914
> URL: https://issues.apache.org/jira/browse/HDFS-8914
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: documentation
>Affects Versions: 2.7.1
> Environment: Documentation page in live
>Reporter: Ravindra Babu
>Assignee: Lars Francke
> Fix For: 2.8.0, 2.7.3
>
> Attachments: HDFS-8914.1.patch, HDFS-8914.2.patch
>
>
> Please refer to these two links and correct one of them.
> http://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html
> The NameNode machine is a single point of failure for an HDFS cluster. If the 
> NameNode machine fails, manual intervention is necessary. Currently, 
> automatic restart and failover of the NameNode software to another machine is 
> not supported.
> http://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html
> The HDFS High Availability feature addresses the above problems by providing 
> the option of running two redundant NameNodes in the same cluster in an 
> Active/Passive configuration with a hot standby. This allows a fast failover 
> to a new NameNode in the case that a machine crashes, or a graceful 
> administrator-initiated failover for the purpose of planned maintenance.
> Please update the HdfsDesign article with the same facts to avoid confusing 
> readers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-10488) Update WebHDFS documentation regarding CREATE and MKDIR default permissions

2016-07-11 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15371800#comment-15371800
 ] 

Vinod Kumar Vavilapalli commented on HDFS-10488:


This never made it to branch-2.7.3. I just merged it in.

> Update WebHDFS documentation regarding CREATE and MKDIR default permissions
> ---
>
> Key: HDFS-10488
> URL: https://issues.apache.org/jira/browse/HDFS-10488
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: documentation, webhdfs
>Affects Versions: 2.6.0
>Reporter: Wellington Chevreuil
>Assignee: Wellington Chevreuil
>Priority: Minor
> Fix For: 2.8.0, 2.7.3, 3.0.0-alpha1
>
> Attachments: HDFS-10488.002.patch, HDFS-10488.003.patch, 
> HDFS-10488.005.patch, HDFS-10488.006.patch, HDFS-10488.patch
>
>
> WebHDFS methods for creating files/directories were always applying 755 
> permissions as the default for both files and directories. 
> The configured *fs.permissions.umask-mode* is intentionally ignored.
> This jira is to update the documentation properly, explaining that *umask* is 
> not applied when using WebHDFS-related methods.
> HDFS-6434 also modified the default permission for files, which is now 
> *644*. This will also be updated in the current documentation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-10537) eclipse running NameNode Class Exception

2016-07-11 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-10537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-10537:
---
Fix Version/s: (was: 2.7.3)

> eclipse running NameNode Class Exception
> 
>
> Key: HDFS-10537
> URL: https://issues.apache.org/jira/browse/HDFS-10537
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 2.7.0
>Reporter: .D.
>Priority: Minor
>
> I imported the hadoop-2.7.0 source code into my Eclipse workspace and tried 
> running NameNode.java in Eclipse with
> args = "-format"
> and got this error message:
> 2016-06-16 22:50:09,074 ERROR namenode.NameNode (NameNode.java:main(1558)) - 
> Failed to start namenode.
> java.lang.IllegalArgumentException: URI has an authority component
>   at java.io.File.(File.java:423)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NNStorage.getStorageDirectory(NNStorage.java:329)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLog.initJournals(FSEditLog.java:276)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLog.initJournalsForWrite(FSEditLog.java:247)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.format(NameNode.java:984)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1428)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1553)
> 2016-06-16 22:50:09,076 INFO  util.ExitUtil (ExitUtil.java:terminate(124)) - 
> Exiting with status 1
> 2016-06-16 22:50:09,078 INFO  namenode.NameNode (LogAdapter.java:info(47)) - 
> SHUTDOWN_MSG: 
> /
> SHUTDOWN_MSG: Shutting down NameNode at -Pro.local/127.0.0.1
> /
> My core-site.xml and hdfs-site.xml from the hadoop project are below.
> core-site.xml
> <configuration>
>   <property>
>     <name>fs.defaultFS</name>
>     <value>hdfs://master:9000/</value>
>   </property>
>   <property>
>     <name>hadoop.tmp.dir</name>
>     <value>file:///Users/Joker/tmp</value>
>     <description>A base for other temporary directories.</description>
>   </property>
> </configuration>
> hdfs-site.xml
> <configuration>
>   <property>
>     <name>dfs.replication</name>
>     <value>1</value>
>   </property>
>   <property>
>     <name>dfs.namenode.name.dir</name>
>     <value>file:///Users/Joker/Documents/code_framework/java/hadoop-2.7.0/dfs/name</value>
>   </property>
>   <property>
>     <name>dfs.datanode.data.dir</name>
>     <value>file:///Users/Joker/Documents/code_framework/java/hadoop-2.7.0/dfs/data</value>
>   </property>
> </configuration>
> Thanks !



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-7597) DNs should not open new NN connections when webhdfs clients seek

2016-06-12 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15326712#comment-15326712
 ] 

Vinod Kumar Vavilapalli commented on HDFS-7597:
---

[~yzhangal] / [~daryn], [~cnauroth] and [~xiaobingo], please get this committed 
asap if you want this in the alpha 2.8.0 release. If not, I'll move it out 
within a week to the next release.

> DNs should not open new NN connections when webhdfs clients seek
> 
>
> Key: HDFS-7597
> URL: https://issues.apache.org/jira/browse/HDFS-7597
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: webhdfs
>Affects Versions: 2.0.0-alpha
>Reporter: Daryn Sharp
>Assignee: Daryn Sharp
>Priority: Critical
>  Labels: BB2015-05-TBR
> Attachments: HDFS-7597.patch, HDFS-7597.patch, HDFS-7597.patch
>
>
> Webhdfs seeks involve closing the current connection, and reissuing a new 
> open request with the new offset.  The RPC layer caches connections so the DN 
> keeps a lingering connection open to the NN.  Connection caching is in part 
> based on UGI.  Although the client used the same token for the new offset 
> request, the UGI is different which forces the DN to open another unnecessary 
> connection to the NN.
> A job that performs many seeks will easily crash the NN due to fd exhaustion.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-10289) Balancer configures DNs directly

2016-06-12 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-10289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-10289:
---
Target Version/s:   (was: 2.8.0)

Not much going on here for a while, dropping from 2.8.0.

Not putting any target-version either anymore, let's target this depending on 
when there is patch activity.

> Balancer configures DNs directly
> 
>
> Key: HDFS-10289
> URL: https://issues.apache.org/jira/browse/HDFS-10289
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: balancer & mover
>Affects Versions: 2.6.0
>Reporter: John Zhuge
>Assignee: John Zhuge
>Priority: Critical
>
> Balancer directly configures the 2 balance-related properties 
> (bandwidthPerSec and concurrentMoves) on the DNs involved.
> Details:
> * Before each balancing iteration, set the properties on all DNs involved in 
> the current iteration.
> * The DN property changes will not survive restart.
> * Balancer gets the property values from command line or its config file.
> * Need new DN APIs to query and set the 2 properties.
> * No need to edit the config file on each DN or run {{hdfs dfsadmin 
> -setBalancerBandwidth}} to configure every DN in the cluster.
> Pros:
> * Improve ease of use because all configurations are done at one place, the 
> balancer. We saw many customers often forgot to set concurrentMoves properly 
> since it is required on both DN and Balancer.
> * Support new DNs added between iterations
> * Handle DN restarts between iterations
> * May be able to dynamically adjust the thresholds in different iterations. 
> Don't know how useful though.
> Cons:
> * New DN property API
> * A malicious/misconfigured balancer may overwhelm DNs. {{hdfs dfsadmin 
> -setBalancerBandwidth}} has the same issue. Also Balancer can only be run by 
> admin.
> Questions:
> * Can we create {{BalancerConcurrentMovesCommand}} similar to 
> {{BalancerBandwidthCommand}}? Can Balancer use them directly without going 
> through NN?
> One proposal to implement HDFS-7466 calls for an API to query DN properties. 
> DN Conf Servlet returns all config properties. It does not return individual 
> property and it does not return the value set by {{hdfs dfsadmin 
> -setBalancerBandwidth}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-9053) Support large directories efficiently using B-Tree

2016-06-12 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-9053:
--
Target Version/s:   (was: 2.8.0)

Not much going on here for a long time, dropping from 2.8.0.

Not putting any target-version either anymore, let's target this depending on 
when there is patch activity.

> Support large directories efficiently using B-Tree
> --
>
> Key: HDFS-9053
> URL: https://issues.apache.org/jira/browse/HDFS-9053
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: Yi Liu
>Assignee: Yi Liu
>Priority: Critical
> Attachments: HDFS-9053 (BTree with simple benchmark).patch, HDFS-9053 
> (BTree).patch, HDFS-9053.001.patch, HDFS-9053.002.patch, HDFS-9053.003.patch, 
> HDFS-9053.004.patch, HDFS-9053.005.patch, HDFS-9053.006.patch, 
> HDFS-9053.007.patch
>
>
> This is a long-standing issue; we have tried to improve it in the past.  
> Currently we use an ArrayList for the children under a directory, and the 
> children are kept ordered in the list. For insert/delete the time complexity 
> is O\(n) (the search is O(log n), but insertion/deletion causes 
> re-allocations and copies of arrays), so for a large directory the 
> operations are expensive.  If the children grow to 1M entries, the ArrayList 
> will resize to more than 1M capacity, which needs more than 1M * 8 bytes = 
> 8MB (the reference size is 8 on a 64-bit system/JVM) of contiguous heap 
> memory; this easily causes full GC in an HDFS cluster where namenode heap 
> memory is already highly used.  I recap the 3 main issues:
> # Insertion/deletion operations in large directories are expensive because 
> of re-allocations and copies of big arrays.
> # Dynamically allocating several MB of contiguous heap memory which will be 
> long-lived can easily cause full GC problems.
> # Even if most children are removed later, the directory INode still 
> occupies the same amount of heap memory, since the ArrayList never shrinks.
> This JIRA is similar to HDFS-7174 created by [~kihwal], but uses a B-Tree to 
> solve the problem, as suggested by [~shv]. 
> So the target of this JIRA is to implement a low-memory-footprint B-Tree and 
> use it to replace the ArrayList. 
> If the element count is not large (less than the maximum degree of a B-Tree 
> node), the B-Tree has only one root node, which contains an array for the 
> elements. If the size grows large enough it will split automatically, and if 
> elements are removed, B-Tree nodes can merge automatically (see more: 
> https://en.wikipedia.org/wiki/B-tree).  This solves all 3 issues above.
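
A self-contained micro-benchmark sketch of the complexity argument, using {{TreeMap}} as a stand-in for the proposed B-Tree (both give O(log n) inserts, though TreeMap has a much larger per-entry memory footprint than the low-footprint structure proposed here):

{code:java}
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.TreeMap;

// Sketch only: the ArrayList path mirrors today's sorted children list,
// where every insert shifts the tail of the backing array.
public class DirInsertSketch {
    public static void main(String[] args) {
        final int n = 200_000;

        long t0 = System.nanoTime();
        List<String> list = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            String name = "f" + i;
            int pos = Collections.binarySearch(list, name); // O(log n) search
            list.add(pos < 0 ? -pos - 1 : pos, name);       // O(n) shift per insert
        }
        long t1 = System.nanoTime();

        TreeMap<String, Object> tree = new TreeMap<>();
        for (int i = 0; i < n; i++) {
            tree.put("f" + i, Boolean.TRUE);                // O(log n) insert
        }
        long t2 = System.nanoTime();

        System.out.printf("ArrayList: %d ms, TreeMap: %d ms%n",
            (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000);
    }
}
{code}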



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-4754) Add an API in the namenode to mark a datanode as stale

2016-06-12 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-4754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-4754:
--
Target Version/s:   (was: 2.8.0)

Not much going on here for a long time, dropping from 2.8.0.

Not putting any target-version either anymore, let's target this depending on 
when there is patch activity.

> Add an API in the namenode to mark a datanode as stale
> --
>
> Key: HDFS-4754
> URL: https://issues.apache.org/jira/browse/HDFS-4754
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs-client, namenode
>Reporter: Nicolas Liochon
>Assignee: Nicolas Liochon
>Priority: Critical
>  Labels: BB2015-05-TBR
> Attachments: 4754.v1.patch, 4754.v2.patch, 4754.v4.patch, 
> 4754.v4.patch
>
>
> There has been detection of stale datanodes in HDFS since HDFS-3703, with a 
> timeout defaulted to 30s.
> There are two reasons to add an API to mark a node as stale even if the 
> timeout is not yet reached:
>  1) ZooKeeper can detect that a client is dead at any moment. So, for HBase, 
> we sometimes start the recovery before a node is marked stale (even with 
> reasonable settings such as: stale: 20s; HBase ZK timeout: 30s).
>  2) Some third parties could detect that a node is dead before the timeout, 
> hence saving us the cost of retrying. An example of such hardware is Arista, 
> presented here by [~tsuna] 
> http://tsunanet.net/~tsuna/fsf-hbase-meetup-april13.pdf, and confirmed in 
> HBASE-6290.
> As usual, even if the node is dead it can come back before the 10-minute 
> limit. So I would propose to set a time bound. The API would be
> namenode.markStale(String ipAddress, int port, long durationInMs);
> After durationInMs, the namenode would again rely only on its heartbeat to 
> decide.
> Thoughts?
> If there are no objections, and if nobody in the hdfs dev team has the time 
> to spend on it, I will give it a try for branch 2 & 3.
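
A hedged sketch of how a caller such as HBase's recovery path might use the proposed API; note that the interface below is the proposal from this issue, not an existing Hadoop API:

{code:java}
// Sketch only: NameNodeStaleApi is the proposed interface, and the 60s
// duration is an arbitrary illustrative choice.
interface NameNodeStaleApi {
    void markStale(String ipAddress, int port, long durationInMs);
}

class MarkStaleSketch {
    // Called when an external detector (e.g. a ZooKeeper session expiry)
    // decides the DN is gone before the NN's own stale timeout fires.
    static void onExternalDeathSignal(NameNodeStaleApi namenode,
                                      String dnHost, int dnPort) {
        // After 60s the NN falls back to its heartbeat-based judgement.
        namenode.markStale(dnHost, dnPort, 60_000L);
    }
}
{code}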



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-7959) WebHdfs logging is missing on Datanode

2016-06-12 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-7959:
--
Target Version/s:   (was: 2.8.0)

Not much going on here for a long time, dropping from 2.8.0.

Not putting any target-version either anymore, let's target this depending on 
when there is patch activity.

> WebHdfs logging is missing on Datanode
> --
>
> Key: HDFS-7959
> URL: https://issues.apache.org/jira/browse/HDFS-7959
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Kihwal Lee
>Assignee: Kihwal Lee
>Priority: Critical
>  Labels: BB2015-05-TBR
> Attachments: HDFS-7959.patch, HDFS-7959.patch, HDFS-7959.patch
>
>
> After the conversion to netty, webhdfs requests are not logged on datanodes. 
> The existing jetty log only logs the non-webhdfs requests that come through 
> the internal proxy.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-7174) Support for more efficient large directories

2016-06-12 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-7174:
--
Target Version/s:   (was: 2.8.0)

Not much going on here for a long time, dropping from 2.8.0.

Not putting any target-version either anymore, let's target this depending on 
when there is patch activity.

> Support for more efficient large directories
> 
>
> Key: HDFS-7174
> URL: https://issues.apache.org/jira/browse/HDFS-7174
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Kihwal Lee
>Assignee: Kihwal Lee
>Priority: Critical
>  Labels: BB2015-05-TBR
> Attachments: HDFS-7174.new.patch, HDFS-7174.patch, HDFS-7174.patch
>
>
> When the number of children under a directory grows very large, insertion 
> becomes very costly.  E.g. creating 1M entries takes 10s of minutes.  This is 
> because the complexity of an insertion is O\(n\). As the size of the list 
> grows, the total overhead grows as O\(n^2\) (the integral of a linear 
> function).  It also causes allocations and copies of big arrays.
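
For concreteness, summing the per-insert shift costs makes the quadratic total explicit (a back-of-the-envelope estimate, not a measurement):

{noformat}
cost(n) = sum of i for i = 1..n  ~=  n^2 / 2 element shifts

n = 1,000,000  ->  ~5 * 10^11 shifts (plus periodic re-allocations and
copies of the backing array), versus ~n * log2(n) ~= 2 * 10^7 comparisons
for a balanced tree structure.
{noformat}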



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-7822) Make webhdfs handling of URI standard compliant

2016-06-05 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-7822:
--
Target Version/s:   (was: 2.8.0)

Not much going on here for a long time, dropping from 2.8.0.

Not putting any target-version either anymore, let's target this depending on 
when there is patch activity.

> Make webhdfs handling of URI standard compliant
> ---
>
> Key: HDFS-7822
> URL: https://issues.apache.org/jira/browse/HDFS-7822
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: webhdfs
>Reporter: Kihwal Lee
>Priority: Critical
>
> As seen in HDFS-7816, webhdfs client is not encoding URI properly. But since 
> webhdfs is often used as the compatibility layer, we cannot simply fix it and 
> break the compatibility. Instead, we should stage the fix so that breakages 
> caused by incompatibility can be minimized.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-5500) Critical datanode threads may terminate silently on uncaught exceptions

2016-06-05 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-5500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-5500:
--
Target Version/s:   (was: 2.8.0)

Not much going on here for a long time, dropping from 2.8.0.

Not putting any target-version either anymore, let's target this depending on 
when there is patch activity.

> Critical datanode threads may terminate silently on uncaught exceptions
> ---
>
> Key: HDFS-5500
> URL: https://issues.apache.org/jira/browse/HDFS-5500
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Kihwal Lee
>Priority: Critical
>
> We've seen the refreshUsed (DU) thread disappearing on uncaught exceptions. 
> This can go unnoticed for a long time.  If OOM occurs, more things can go 
> wrong.  On one occasion, the Timer, multiple refreshUsed, and the 
> DataXceiverServer threads had terminated.  
> DataXceiverServer catches OutOfMemoryError and sleeps for 30 seconds, but I 
> am not sure it is really helpful. In one case, the thread did this multiple 
> times and then terminated. I suspect another OOM was thrown while in a catch 
> block.  As a result, the server socket was not closed and clients hung on 
> connect. If it had at least closed the socket, the client side would have 
> been impacted less.
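
One common mitigation, sketched under the assumption that a fast, loud death is preferable to a silent one: install an uncaught-exception handler on every critical daemon thread.

{code:java}
// Sketch only: in the real DN this would call ExitUtil.terminate(...) so a
// supervisor can restart the process, instead of letting it limp on without
// its DU-refresh or DataXceiverServer threads.
class CriticalThreadSketch {
    static Thread newCriticalThread(String name, Runnable task) {
        Thread t = new Thread(task, name);
        t.setDaemon(true);
        t.setUncaughtExceptionHandler((thread, e) -> {
            // Log loudly; a silent thread death is the failure mode reported here.
            System.err.println("FATAL: critical thread " + thread.getName()
                + " terminated: " + e);
        });
        return t;
    }
}
{code}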



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-7607) Use random rack-local node for webhdfs opens to avoid OOM on DNs

2016-06-05 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-7607:
--
Target Version/s:   (was: 2.8.0)

Not much going on here for a long time, dropping from 2.8.0.

Not putting any target-version either anymore, let's target this depending on 
when there is patch activity.

> Use random rack-local node for webhdfs opens to avoid OOM on DNs
> 
>
> Key: HDFS-7607
> URL: https://issues.apache.org/jira/browse/HDFS-7607
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode, webhdfs
>Affects Versions: 2.0.0-alpha
>Reporter: Daryn Sharp
>Assignee: Daryn Sharp
>Priority: Critical
>
> Webhdfs currently redirects a client to the DN that physically has one of the
> replicas.  Unlike the hdfs data streamer protocol which can easily handle 
> hundreds or thousands of connections, jetty has poor performance under heavy 
> load.  Webhdfs clients can easily overwhelm the DNs and likely cause OOMs or 
> excessive GC.
> The NN should redirect the client to a rack-local location to distribute the 
> webhdfs load across multiple hosts.  The rack-local node can then fetch the 
> replica over the lightweight streamer protocol.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-6358) WebHdfs DN's DFSClient should not use a retry policy

2016-06-05 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-6358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-6358:
--
Target Version/s:   (was: 2.8.0)

Not much going on here for a long time, dropping from 2.8.0.

Not putting any target-version either anymore, let's target this depending on 
when there is patch activity.

> WebHdfs DN's DFSClient should not use a retry policy
> 
>
> Key: HDFS-6358
> URL: https://issues.apache.org/jira/browse/HDFS-6358
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: webhdfs
>Affects Versions: 2.0.0-alpha, 3.0.0-alpha1
>Reporter: Daryn Sharp
>Assignee: Daryn Sharp
>Priority: Critical
>
> DFSClient retries on the DN are useless.  The webhdfs client is going to 
> time out before the retries complete.  The DFSClient will also continue to 
> run until it times out.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-6359) WebHdfs NN servlet issues redirects in safemode or standby

2016-06-05 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-6359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-6359:
--
Target Version/s:   (was: 2.8.0)

Not much going on here for a long time, dropping from 2.8.0.

Not putting any target-version either anymore, let's target this depending on 
when there is patch activity.

> WebHdfs NN servlet issues redirects in safemode or standby
> --
>
> Key: HDFS-6359
> URL: https://issues.apache.org/jira/browse/HDFS-6359
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: webhdfs
>Affects Versions: 2.0.0-alpha, 3.0.0-alpha1
>Reporter: Daryn Sharp
>Assignee: Daryn Sharp
>Priority: Critical
>
> Webhdfs does not check for safemode or standby when issuing a redirect for 
> open/create/checksum calls.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-5946) Webhdfs DN choosing code is flawed

2016-06-05 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-5946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-5946:
--
Target Version/s:   (was: 2.8.0)

Not much going on here for a long time, dropping from 2.8.0.

Not putting any target-version either anymore, let's target this depending on 
when there is patch activity.

> Webhdfs DN choosing code is flawed
> --
>
> Key: HDFS-5946
> URL: https://issues.apache.org/jira/browse/HDFS-5946
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode, webhdfs
>Affects Versions: 2.4.0, 3.0.0-alpha1
>Reporter: Daryn Sharp
>Priority: Critical
>
> HDFS-5891 improved the performance of redirecting webhdfs clients to a DN.  
> Instead of attempting a connection with a 1-minute timeout, the NN skips 
> decommissioned nodes.
> The logic appears flawed.  It finds the index of the first decommissioned 
> node, if any, then:
> * Throws an exception if index = 0, even if other nodes later in the list are 
> not decommissioned.
> * Else picks a random node prior to the index.  Say there are 10 replicas and 
> the 2nd location is decommissioned: all clients will be redirected to the 
> first location even though there are 8 other valid locations.
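
A hedged sketch of the presumably intended behavior: drop decommissioned locations first, then choose uniformly among whatever remains (Location is a stand-in for DatanodeInfo):

{code:java}
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Sketch only: filter first, then randomize, so one decommissioned replica
// never hides the other valid locations.
class ChooseDatanodeSketch {
    static class Location {
        final String host;
        final boolean decommissioned;
        Location(String host, boolean decommissioned) {
            this.host = host;
            this.decommissioned = decommissioned;
        }
    }

    static Location choose(List<Location> locations, Random rand) throws IOException {
        List<Location> live = new ArrayList<>();
        for (Location l : locations) {
            if (!l.decommissioned) {
                live.add(l);
            }
        }
        if (live.isEmpty()) {
            throw new IOException("no live replica locations");
        }
        return live.get(rand.nextInt(live.size())); // e.g. 9 of 10 stay eligible
    }
}
{code}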



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-5293) Symlink resolution requires unnecessary RPCs

2016-06-04 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-5293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-5293:
--
Target Version/s:   (was: 2.8.0)

Not much going on here for a long time, dropping from 2.8.0.

Not putting any target-version either anymore, let's target this depending on 
when there is patch activity.

> Symlink resolution requires unnecessary RPCs
> 
>
> Key: HDFS-5293
> URL: https://issues.apache.org/jira/browse/HDFS-5293
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Affects Versions: 2.0.0-alpha, 3.0.0-alpha1
>Reporter: Daryn Sharp
>Priority: Critical
>
> When the NN encounters a symlink, it throws an {{UnresolvedLinkException}}.  
> This exception contains only the path that is a symlink.  The client issues 
> another RPC to obtain the link target, followed by another RPC with the link 
> target + remainder of the original path.
> {{UnresolvedLinkException}} should be returning both the link and the target 
> to avoid a costly and unnecessary intermediate RPC to obtain the link target.
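
A hedged sketch of the proposed exception shape; the field and method names are illustrative, not the actual UnresolvedLinkException API:

{code:java}
import java.io.IOException;

// Sketch only: carrying the target alongside the link lets the client
// splice the remaining path locally instead of issuing an extra RPC to
// fetch the link target.
class UnresolvedLinkWithTarget extends IOException {
    final String link;    // the symlink that interrupted resolution
    final String target;  // what it points at (the addition proposed here)

    UnresolvedLinkWithTarget(String link, String target) {
        super("unresolved link " + link + " -> " + target);
        this.link = link;
        this.target = target;
    }

    // Client side: one exception, zero extra RPCs before the retry.
    static String resolvedPath(UnresolvedLinkWithTarget e, String remainder) {
        return e.target + "/" + remainder;
    }
}
{code}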



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-6401) WebHdfs should always use the network failover policy

2016-06-04 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-6401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-6401:
--
Target Version/s:   (was: 2.8.0)

Not much going on here for a long time, dropping from 2.8.0.

Not putting any target-version either anymore, let's target this depending on 
when there is patch activity.

> WebHdfs should always use the network failover policy
> -
>
> Key: HDFS-6401
> URL: https://issues.apache.org/jira/browse/HDFS-6401
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: webhdfs
>Affects Versions: 2.0.0-alpha, 3.0.0-alpha1
>Reporter: Daryn Sharp
>Assignee: Daryn Sharp
>Priority: Critical
>
> Webhdfs only uses the network failover policy if HA is enabled.  The policy 
> adds retries for exceptions such as connect failures, which are always useful. 
>  The proxy also provides support for standby and retriable exceptions which 
> are required for HA IP-based failover because the client does not know if the 
> NN is HA capable or not.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-10271) Extra bytes are getting released from reservedSpace for append

2016-04-11 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15235758#comment-15235758
 ] 

Vinod Kumar Vavilapalli commented on HDFS-10271:


Can somebody from HDFS please help review / commit this? It's one of the last 
few blocking tickets for 2.7.3!

> Extra bytes are getting released from reservedSpace for append
> --
>
> Key: HDFS-10271
> URL: https://issues.apache.org/jira/browse/HDFS-10271
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Brahma Reddy Battula
>Assignee: Brahma Reddy Battula
>Priority: Critical
> Attachments: HDFS-10271-01.patch, HDFS-10271-branch-2.7-01.patch
>
>
> 1. File already have some bytes available in block. (ex: 1024B)
> 2. Re-open the file for append, (Here reserving for (BlockSize-1024) bytes)
> 3. write one byte and flush, 
> 4. close()
> After close(), releasing *BlockSize-1* bytes from reservedspace instead of 
> *BlockSize-1025* bytes.
> Extra bytes reserved may create problems for other writers.
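
The arithmetic, as a small self-contained illustration (the 128 MB block size 
is an assumption, not from this issue):

{code:java}
class ReservedSpaceArithmetic {
  public static void main(String[] args) {
    long blockSize = 128L * 1024 * 1024; // illustrative block size
    long onDisk    = 1024;               // bytes already in the block
    long written   = 1;                  // appended before close()

    long reserved      = blockSize - onDisk;  // reserved at append time
    long shouldRelease = reserved - written;  // BlockSize-1025
    long bugRelease    = blockSize - written; // BlockSize-1 (observed)

    // The discrepancy is exactly the 1024 bytes that were never reserved:
    System.out.println(bugRelease - shouldRelease); // prints 1024
  }
}
{code}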





[jira] [Updated] (HDFS-8674) Improve performance of postponed block scans

2016-04-10 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-8674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-8674:
--
Target Version/s: 2.7.4  (was: 2.7.3)

Haven't gotten any update, so dropping this into 2.7.4.

> Improve performance of postponed block scans
> 
>
> Key: HDFS-8674
> URL: https://issues.apache.org/jira/browse/HDFS-8674
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Affects Versions: 2.6.0
>Reporter: Daryn Sharp
>Assignee: Daryn Sharp
>Priority: Critical
> Attachments: HDFS-8674.patch, HDFS-8674.patch
>
>
> When a standby goes active, it marks all nodes as "stale" which will cause 
> block invalidations for over-replicated blocks to be queued until full block 
> reports are received from the nodes with the block.  The replication monitor 
> scans the queue with O(N) runtime.  It picks a random offset and iterates 
> through the set to randomize blocks scanned.
> The result is devastating when a cluster loses multiple nodes during a 
> rolling upgrade. Re-replication occurs, the nodes come back, the excess block 
> invalidations are postponed. Rescanning just 2k blocks out of millions of 
> postponed blocks may take multiple seconds. During the scan, the write lock 
> is held which stalls all other processing.
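
A sketch of why each pass is O(N) (illustrative, not the actual 
ReplicationMonitor code): reaching a random offset in a hash-based set costs a 
full iterator walk, regardless of how few blocks are actually rescanned.

{code:java}
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Random;
import java.util.Set;

class PostponedScanSketch {
  // Sketch: skip to a random offset, then rescan only 'limit' blocks.
  // The skip loop alone is O(offset), i.e. O(N) in the worst case.
  static <T> List<T> scanFromRandomOffset(Set<T> postponed, int limit) {
    int offset = new Random().nextInt(Math.max(1, postponed.size()));
    Iterator<T> it = postponed.iterator();
    for (int i = 0; i < offset && it.hasNext(); i++) {
      it.next(); // skipped elements dominate the cost
    }
    List<T> scanned = new ArrayList<>();
    for (int i = 0; i < limit && it.hasNext(); i++) {
      scanned.add(it.next()); // e.g. the 2k blocks actually rescanned
    }
    return scanned;
  }
}
{code}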





[jira] [Updated] (HDFS-8871) Decommissioning of a node with a failed volume may not start

2016-04-10 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-8871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-8871:
--
Target Version/s:   (was: 2.7.3, 2.6.5)

Removing the target version from this long-standing issue; please add it back 
once there is a patch available for release. Tx.

> Decommissioning of a node with a failed volume may not start
> 
>
> Key: HDFS-8871
> URL: https://issues.apache.org/jira/browse/HDFS-8871
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.6.0
>Reporter: Kihwal Lee
>Assignee: Daryn Sharp
>Priority: Critical
>
> Since staleness may not be properly cleared, a node with a failed volume may 
> not actually get scanned for block replication. Nothing is being replicated 
> from these nodes.
> This bug does not manifest unless the datanode has a unique storage ID per 
> volume. 





[jira] [Commented] (HDFS-8791) block ID-based DN storage layout can be very slow for datanode on ext4

2016-04-06 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-8791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15229398#comment-15229398
 ] 

Vinod Kumar Vavilapalli commented on HDFS-8791:
---

Tx [~kihwal]!

As [~kihwal] pointed out 
[here|https://issues.apache.org/jira/browse/HDFS-8791?focusedCommentId=15226253=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15226253],
 we had downgrades even in dot releases as a requirement in the original design 
doc, but we haven't been respecting those.

Before we move on, I think we should converge to a plan of action for future 
changes that may affect downgrade scenarios - perhaps as a different JIRA - 
[~cnauroth] / [~kihwal] / [~szetszwo], is it possible for one of you to take 
charge on this? Thanks!

> block ID-based DN storage layout can be very slow for datanode on ext4
> --
>
> Key: HDFS-8791
> URL: https://issues.apache.org/jira/browse/HDFS-8791
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Affects Versions: 2.6.0, 2.8.0, 2.7.1
>Reporter: Nathan Roberts
>Assignee: Chris Trezzo
>Priority: Blocker
> Fix For: 2.8.0
>
> Attachments: 32x32DatanodeLayoutTesting-v1.pdf, 
> 32x32DatanodeLayoutTesting-v2.pdf, HDFS-8791-trunk-v1.patch, 
> HDFS-8791-trunk-v2-bin.patch, HDFS-8791-trunk-v2.patch, 
> HDFS-8791-trunk-v2.patch, HDFS-8791-trunk-v3-bin.patch, 
> hadoop-56-layout-datanode-dir.tgz, test-node-upgrade.txt
>
>
> We are seeing cases where the new directory layout causes the datanode to 
> keep the disks seeking for 10s of minutes. This can be when the 
> datanode is running du, and it can also be when it is performing a 
> checkDirs(). Both of these operations currently scan all directories in the 
> block pool and that's very expensive in the new layout.
> The new layout creates 256 subdirs, each with 256 subdirs. Essentially 64K 
> leaf directories where block files are placed.
> So, what we have on disk is:
> - 256 inodes for the first level directories
> - 256 directory blocks for the first level directories
> - 256*256 inodes for the second level directories
> - 256*256 directory blocks for the second level directories
> - Then the inodes and blocks to store the HDFS blocks themselves.
> The main problem is the 256*256 directory blocks. 
> inodes and dentries will be cached by linux and one can configure how likely 
> the system is to prune those entries (vfs_cache_pressure). However, ext4 
> relies on the buffer cache to cache the directory blocks and I'm not aware of 
> any way to tell linux to favor buffer cache pages (even if it did I'm not 
> sure I would want it to in general).
> Also, ext4 tries hard to spread directories evenly across the entire volume, 
> this basically means the 64K directory blocks are probably randomly spread 
> across the entire disk. A du type scan will look at directories one at a 
> time, so the ioscheduler can't optimize the corresponding seeks, meaning the 
> seeks will be random and far. 
> In a system I was using to diagnose this, I had 60K blocks. A DU when things 
> are hot is less than 1 second. When things are cold, about 20 minutes.
> How do things get cold?
> - A large set of tasks run on the node. This pushes almost all of the buffer 
> cache out, causing the next DU to hit this situation. We are seeing cases 
> where a large job can cause a seek storm across the entire cluster.
> Why didn't the previous layout see this?
> - It might have but it wasn't nearly as pronounced. The previous layout would 
> be a few hundred directory blocks. Even when completely cold, these would 
> only take a few hundred seeks, which would mean single-digit seconds.  
> - With only a few hundred directories, the odds of the directory blocks 
> getting modified is quite high, this keeps those blocks hot and much less 
> likely to be evicted.
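
The directory-block arithmetic above, as a worked example (the per-seek cost is 
an assumption, not a measurement from this JIRA):

{code:java}
class LayoutSeekArithmetic {
  public static void main(String[] args) {
    int dirBlocks = 256 + 256 * 256;  // first + second level: 65,792

    double seekMs = 15.0;             // assumed cold random-seek cost
    double minutes = dirBlocks * seekMs / 1000.0 / 60.0;
    System.out.printf("~%.0f minutes for a fully cold scan%n", minutes);
    // ~16 minutes under these assumptions, in line with the observed
    // "less than 1 second hot, about 20 minutes cold".
  }
}
{code}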





[jira] [Commented] (HDFS-8791) block ID-based DN storage layout can be very slow for datanode on ext4

2016-04-04 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-8791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15225482#comment-15225482
 ] 

Vinod Kumar Vavilapalli commented on HDFS-8791:
---

bq.  I didn't think that it broke rolling upgrade (you should still be able to 
upgrade from an earlier layout version to this one). Did I miss something?
My point was mainly about rolling downgrade. Just used upgrade/downgrade 
together in my comment because in my mind the expectations are the same.

bq. Do we actually support downgrade between 2.7 and 2.6? We changed the 
NameNode LayoutVersion, so I don't think so. These branches don't have 
HDFS-8432 either.
[~andrew.wang], tx for this info.

This is really unfortunate. Can you give a reference to the NameNode 
LayoutVersion change?

Did we ever establish clear rules about downgrades? We need to lay out our 
story around supporting downgrades continuously and codify it. I'd vote for 
keeping strict rules for downgrades too, otherwise users are left to fend for 
themselves in deciding the risk associated with every version upgrade - are we 
in a place where we can support this?

For upgrades, there is tribal knowledge amongst committers/reviewers in the 
minimum. And on YARN side, we've proposed (but made little progress) for tools 
to automatically catch some of it - YARN-3292.

To conclude, is the consensus to document all these downgrade-related breakages 
but keep them in 2.7.x and 2.8?

> block ID-based DN storage layout can be very slow for datanode on ext4
> --
>
> Key: HDFS-8791
> URL: https://issues.apache.org/jira/browse/HDFS-8791
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Affects Versions: 2.6.0, 2.8.0, 2.7.1
>Reporter: Nathan Roberts
>Assignee: Chris Trezzo
>Priority: Blocker
> Fix For: 2.7.3
>
> Attachments: 32x32DatanodeLayoutTesting-v1.pdf, 
> 32x32DatanodeLayoutTesting-v2.pdf, HDFS-8791-trunk-v1.patch, 
> HDFS-8791-trunk-v2-bin.patch, HDFS-8791-trunk-v2.patch, 
> HDFS-8791-trunk-v2.patch, HDFS-8791-trunk-v3-bin.patch, 
> hadoop-56-layout-datanode-dir.tgz, test-node-upgrade.txt
>
>
> We are seeing cases where the new directory layout causes the datanode to 
> keep the disks seeking for 10s of minutes. This can be when the 
> datanode is running du, and it can also be when it is performing a 
> checkDirs(). Both of these operations currently scan all directories in the 
> block pool and that's very expensive in the new layout.
> The new layout creates 256 subdirs, each with 256 subdirs. Essentially 64K 
> leaf directories where block files are placed.
> So, what we have on disk is:
> - 256 inodes for the first level directories
> - 256 directory blocks for the first level directories
> - 256*256 inodes for the second level directories
> - 256*256 directory blocks for the second level directories
> - Then the inodes and blocks to store the HDFS blocks themselves.
> The main problem is the 256*256 directory blocks. 
> inodes and dentries will be cached by linux and one can configure how likely 
> the system is to prune those entries (vfs_cache_pressure). However, ext4 
> relies on the buffer cache to cache the directory blocks and I'm not aware of 
> any way to tell linux to favor buffer cache pages (even if it did I'm not 
> sure I would want it to in general).
> Also, ext4 tries hard to spread directories evenly across the entire volume, 
> this basically means the 64K directory blocks are probably randomly spread 
> across the entire disk. A du type scan will look at directories one at a 
> time, so the ioscheduler can't optimize the corresponding seeks, meaning the 
> seeks will be random and far. 
> In a system I was using to diagnose this, I had 60K blocks. A DU when things 
> are hot is less than 1 second. When things are cold, about 20 minutes.
> How do things get cold?
> - A large set of tasks run on the node. This pushes almost all of the buffer 
> cache out, causing the next DU to hit this situation. We are seeing cases 
> where a large job can cause a seek storm across the entire cluster.
> Why didn't the previous layout see this?
> - It might have but it wasn't nearly as pronounced. The previous layout would 
> be a few hundred directory blocks. Even when completely cold, these would 
> only take a few hundred seeks, which would mean single-digit seconds.  
> - With only a few hundred directories, the odds of the directory blocks 
> getting modified is quite high, this keeps those blocks hot and much less 
> likely to be evicted.





[jira] [Commented] (HDFS-8791) block ID-based DN storage layout can be very slow for datanode on ext4

2016-04-04 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-8791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15225026#comment-15225026
 ] 

Vinod Kumar Vavilapalli commented on HDFS-8791:
---

I commented on the JIRA way back (see 
https://issues.apache.org/jira/browse/HDFS-8791?focusedCommentId=1503=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-1503),
 saying what I said below. Unfortunately, I haven’t followed the patch since 
my initial comment. 

This isn’t about any specific release - starting 2.6 we declared support for 
rolling upgrades and downgrades. Any patch that breaks this should not be in 
branch-2.

Two options from where I stand
 # For folks who worked on the patch: Is there a way to make (a) the 
upgrade-downgrade seamless for people who don’t care about this and (b) have 
explicit documentation for people who care to switch this behavior on and are 
willing to risk not having downgrades. If this means a new configuration 
property, so be it. It’s a necessary evil.
 # Just let specific users backport this into specific 2.x branches they need 
and leave it only on trunk.

Unless this behavior stops breaking rolling upgrades/downgrades, I think we 
should just revert it from branch-2 and definitely 2.7.3 as it stands today.

> block ID-based DN storage layout can be very slow for datanode on ext4
> --
>
> Key: HDFS-8791
> URL: https://issues.apache.org/jira/browse/HDFS-8791
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Affects Versions: 2.6.0, 2.8.0, 2.7.1
>Reporter: Nathan Roberts
>Assignee: Chris Trezzo
>Priority: Blocker
> Fix For: 2.7.3
>
> Attachments: 32x32DatanodeLayoutTesting-v1.pdf, 
> 32x32DatanodeLayoutTesting-v2.pdf, HDFS-8791-trunk-v1.patch, 
> HDFS-8791-trunk-v2-bin.patch, HDFS-8791-trunk-v2.patch, 
> HDFS-8791-trunk-v2.patch, HDFS-8791-trunk-v3-bin.patch, 
> hadoop-56-layout-datanode-dir.tgz, test-node-upgrade.txt
>
>
> We are seeing cases where the new directory layout causes the datanode to 
> keep the disks seeking for 10s of minutes. This can be when the 
> datanode is running du, and it can also be when it is performing a 
> checkDirs(). Both of these operations currently scan all directories in the 
> block pool and that's very expensive in the new layout.
> The new layout creates 256 subdirs, each with 256 subdirs. Essentially 64K 
> leaf directories where block files are placed.
> So, what we have on disk is:
> - 256 inodes for the first level directories
> - 256 directory blocks for the first level directories
> - 256*256 inodes for the second level directories
> - 256*256 directory blocks for the second level directories
> - Then the inodes and blocks to store the HDFS blocks themselves.
> The main problem is the 256*256 directory blocks. 
> inodes and dentries will be cached by linux and one can configure how likely 
> the system is to prune those entries (vfs_cache_pressure). However, ext4 
> relies on the buffer cache to cache the directory blocks and I'm not aware of 
> any way to tell linux to favor buffer cache pages (even if it did I'm not 
> sure I would want it to in general).
> Also, ext4 tries hard to spread directories evenly across the entire volume, 
> this basically means the 64K directory blocks are probably randomly spread 
> across the entire disk. A du type scan will look at directories one at a 
> time, so the ioscheduler can't optimize the corresponding seeks, meaning the 
> seeks will be random and far. 
> In a system I was using to diagnose this, I had 60K blocks. A DU when things 
> are hot is less than 1 second. When things are cold, about 20 minutes.
> How do things get cold?
> - A large set of tasks run on the node. This pushes almost all of the buffer 
> cache out, causing the next DU to hit this situation. We are seeing cases 
> where a large job can cause a seek storm across the entire cluster.
> Why didn't the previous layout see this?
> - It might have but it wasn't nearly as pronounced. The previous layout would 
> be a few hundred directory blocks. Even when completely cold, these would 
> only take a few hundred seeks, which would mean single-digit seconds.  
> - With only a few hundred directories, the odds of the directory blocks 
> getting modified is quite high, this keeps those blocks hot and much less 
> likely to be evicted.





[jira] [Commented] (HDFS-8893) DNs with failed volumes stop serving during rolling upgrade

2016-04-04 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-8893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15225002#comment-15225002
 ] 

Vinod Kumar Vavilapalli commented on HDFS-8893:
---

Tx for the quick response, [~shahrs87]!

> DNs with failed volumes stop serving during rolling upgrade
> ---
>
> Key: HDFS-8893
> URL: https://issues.apache.org/jira/browse/HDFS-8893
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.6.0
>Reporter: Rushabh S Shah
>Assignee: Daryn Sharp
>Priority: Critical
>
> When a rolling upgrade starts, all DNs try to write a rolling_upgrade marker 
> to each of their volumes. If one of the volumes is bad, this will fail. When 
> this failure happens, the DN does not update the key it received from the NN.
> Unfortunately we had one failed volume on all three of the datanodes that 
> were holding the replica.
> Keys expire after 20 hours, so at about 20 hours into the rolling upgrade, the 
> DNs with failed volumes will stop serving clients.
> Here is the stack trace on the datanode side:
> {noformat}
> 2015-08-11 07:32:28,827 [DataNode: heartbeating to 8020] WARN 
> datanode.DataNode: IOException in offerService
> java.io.IOException: Read-only file system
> at java.io.UnixFileSystem.createFileExclusively(Native Method)
> at java.io.File.createNewFile(File.java:947)
> at 
> org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.setRollingUpgradeMarkers(BlockPoolSliceStorage.java:721)
> at 
> org.apache.hadoop.hdfs.server.datanode.DataStorage.setRollingUpgradeMarker(DataStorage.java:173)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.setRollingUpgradeMarker(FsDatasetImpl.java:2357)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.signalRollingUpgrade(BPOfferService.java:480)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.handleRollingUpgradeStatus(BPServiceActor.java:626)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:677)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:833)
> at java.lang.Thread.run(Thread.java:722)
> {noformat}
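
A hedged sketch of the tolerance the DN needs here (types and names are 
illustrative, not the actual BlockPoolSliceStorage code): skip the bad volume 
instead of letting one read-only disk abort the whole marker pass, so the 
heartbeat path can still refresh the block access keys.

{code:java}
import java.io.File;
import java.io.IOException;
import java.util.List;

class RollingUpgradeMarkerSketch {
  // Sketch: write the marker per volume, tolerating failed volumes.
  static void setMarkers(List<File> volumeRoots) {
    for (File vol : volumeRoots) {
      try {
        File marker = new File(vol, "RollingUpgradeInProgress");
        if (!marker.exists() && !marker.createNewFile()) {
          throw new IOException("could not create " + marker);
        }
      } catch (IOException e) {
        // Log and continue: a single read-only volume must not stop
        // the offerService loop from updating the DN's keys.
      }
    }
  }
}
{code}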





[jira] [Commented] (HDFS-8871) Decommissioning of a node with a failed volume may not start

2016-04-04 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-8871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15224994#comment-15224994
 ] 

Vinod Kumar Vavilapalli commented on HDFS-8871:
---

[~daryn] / [~kihwal], any update on this? Is this still a bug? Considering this 
for a 2.7.3 RC later this week. Thanks.


> Decommissioning of a node with a failed volume may not start
> 
>
> Key: HDFS-8871
> URL: https://issues.apache.org/jira/browse/HDFS-8871
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.6.0
>Reporter: Kihwal Lee
>Assignee: Daryn Sharp
>Priority: Critical
>
> Since staleness may not be properly cleared, a node with a failed volume may 
> not actually get scanned for block replication. Nothing is being replicated 
> from these nodes.
> This bug does not manifest unless the datanode has a unique storage ID per 
> volume. 





[jira] [Commented] (HDFS-8893) DNs with failed volumes stop serving during rolling upgrade

2016-04-04 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-8893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15224988#comment-15224988
 ] 

Vinod Kumar Vavilapalli commented on HDFS-8893:
---

[~daryn] / [~shahrs87], any progress on this? Is this still a bug? 
Considering this for a 2.7.3 RC later this week.


> DNs with failed volumes stop serving during rolling upgrade
> ---
>
> Key: HDFS-8893
> URL: https://issues.apache.org/jira/browse/HDFS-8893
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.6.0
>Reporter: Rushabh S Shah
>Assignee: Daryn Sharp
>Priority: Critical
>
> When a rolling upgrade starts, all DNs try to write a rolling_upgrade marker 
> to each of their volumes. If one of the volumes is bad, this will fail. When 
> this failure happens, the DN does not update the key it received from the NN.
> Unfortunately we had one failed volume on all three of the datanodes that 
> were holding the replica.
> Keys expire after 20 hours, so at about 20 hours into the rolling upgrade, the 
> DNs with failed volumes will stop serving clients.
> Here is the stack trace on the datanode side:
> {noformat}
> 2015-08-11 07:32:28,827 [DataNode: heartbeating to 8020] WARN 
> datanode.DataNode: IOException in offerService
> java.io.IOException: Read-only file system
> at java.io.UnixFileSystem.createFileExclusively(Native Method)
> at java.io.File.createNewFile(File.java:947)
> at 
> org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.setRollingUpgradeMarkers(BlockPoolSliceStorage.java:721)
> at 
> org.apache.hadoop.hdfs.server.datanode.DataStorage.setRollingUpgradeMarker(DataStorage.java:173)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.setRollingUpgradeMarker(FsDatasetImpl.java:2357)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.signalRollingUpgrade(BPOfferService.java:480)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.handleRollingUpgradeStatus(BPServiceActor.java:626)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:677)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:833)
> at java.lang.Thread.run(Thread.java:722)
> {noformat}





[jira] [Commented] (HDFS-9263) tests are using /test/build/data; breaking Jenkins

2016-03-30 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15219123#comment-15219123
 ] 

Vinod Kumar Vavilapalli commented on HDFS-9263:
---

Either way, can you please see if this can get into 2.7.3 within a week? Tx.

> tests are using /test/build/data; breaking Jenkins
> --
>
> Key: HDFS-9263
> URL: https://issues.apache.org/jira/browse/HDFS-9263
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: test
>Affects Versions: 3.0.0
> Environment: Jenkins
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Blocker
> Attachments: HDFS-9263-001.patch, HDFS-9263-002.patch
>
>
> Some of the HDFS tests are using the path {{test/build/data}} to store files, 
> leaking files which fail the new post-build RAT test checks on Jenkins 
> (and dirtying all development systems with paths which {{mvn clean}} will 
> miss).





[jira] [Commented] (HDFS-8674) Improve performance of postponed block scans

2016-03-30 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-8674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15219121#comment-15219121
 ] 

Vinod Kumar Vavilapalli commented on HDFS-8674:
---

[~daryn] / [~mingma], is it possible to get this into 2.7.3 in a week? Tx.

> Improve performance of postponed block scans
> 
>
> Key: HDFS-8674
> URL: https://issues.apache.org/jira/browse/HDFS-8674
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Affects Versions: 2.6.0
>Reporter: Daryn Sharp
>Assignee: Daryn Sharp
>Priority: Critical
> Attachments: HDFS-8674.patch, HDFS-8674.patch
>
>
> When a standby goes active, it marks all nodes as "stale" which will cause 
> block invalidations for over-replicated blocks to be queued until full block 
> reports are received from the nodes with the block.  The replication monitor 
> scans the queue with O(N) runtime.  It picks a random offset and iterates 
> through the set to randomize blocks scanned.
> The result is devastating when a cluster loses multiple nodes during a 
> rolling upgrade. Re-replication occurs, the nodes come back, the excess block 
> invalidations are postponed. Rescanning just 2k blocks out of millions of 
> postponed blocks may take multiple seconds. During the scan, the write lock 
> is held which stalls all other processing.





[jira] [Commented] (HDFS-8498) Blocks can be committed with wrong size

2016-03-30 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-8498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15219057#comment-15219057
 ] 

Vinod Kumar Vavilapalli commented on HDFS-8498:
---

[~zhz] / [~kihwal] / [~daryn], I think we should close this as Won't fix or as 
a dup of HDFS-9289.

This bug keeps appearing in the blocker/critical list for releases, but we 
don't seem to be progressing.


> Blocks can be committed with wrong size
> ---
>
> Key: HDFS-8498
> URL: https://issues.apache.org/jira/browse/HDFS-8498
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.5.0
>Reporter: Daryn Sharp
>Assignee: Daryn Sharp
>Priority: Critical
>
> When an IBR for a UC block arrives, the NN updates the expected location's 
> block and replica state _only_ if it's on an unexpected storage for an 
> expected DN.  If it's for an expected storage, only the genstamp is updated.  
> When the block is committed, and the expected locations are verified, only 
> the genstamp is checked.  The size is not checked but it wasn't updated in 
> the expected locations anyway.
> A faulty client may misreport the size when committing the block.  The block 
> is effectively corrupted.  If the NN issues replications, the received IBR is 
> considered corrupt, the NN invalidates the block, immediately issues another 
> replication.  The NN eventually realizes all the original replicas are 
> corrupt after full BRs are received from the original DNs.
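
A minimal sketch of the missing check (illustrative, not the actual 
BlockManager code):

{code:java}
class CommitCheckSketch {
  // Sketch: verify length as well as genstamp when committing, so a
  // faulty client's misreported size is rejected instead of corrupting
  // the block's expected locations.
  static boolean commitMatches(long committedLen, long committedGs,
                               long replicaLen, long replicaGs) {
    return committedGs == replicaGs    // the only check done today
        && committedLen == replicaLen; // the proposed additional check
  }
}
{code}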





[jira] [Updated] (HDFS-8817) Make StorageType for Volumes in DataNode visible through JMX

2016-02-04 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-8817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-8817:
--
Fix Version/s: (was: 2.8.0)

> Make StorageType for Volumes in DataNode visible through JMX
> 
>
> Key: HDFS-8817
> URL: https://issues.apache.org/jira/browse/HDFS-8817
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 2.8.0
>Reporter: Anu Engineer
>Assignee: Anu Engineer
> Attachments: HDFS-8817.001.patch
>
>
> StorageTypes are part of Volumes on DataNodes. Right now VolumeInfo does not 
> contain the StorageType Info in the {{VolumeInfo}}.  This JIRA proposes to 
> expose that info through VolumeInfo JSON.





[jira] [Updated] (HDFS-8344) NameNode doesn't recover lease for files with missing blocks

2016-02-04 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-8344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-8344:
--
Fix Version/s: (was: 2.8.0)

Removing the fix version given the patch was reverted and is awaiting a final 
commit.

> NameNode doesn't recover lease for files with missing blocks
> 
>
> Key: HDFS-8344
> URL: https://issues.apache.org/jira/browse/HDFS-8344
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.7.0
>Reporter: Ravi Prakash
>Assignee: Ravi Prakash
> Attachments: HDFS-8344.01.patch, HDFS-8344.02.patch, 
> HDFS-8344.03.patch, HDFS-8344.04.patch, HDFS-8344.05.patch, 
> HDFS-8344.06.patch, HDFS-8344.07.patch, HDFS-8344.08.patch, 
> HDFS-8344.09.patch, HDFS-8344.10.patch, TestHadoop.java
>
>
> I found another(?) instance in which the lease is not recovered. This is 
> easily reproducible on a pseudo-distributed single-node cluster.
> # Before you start, it helps if you set the following. This is not necessary, 
> but it simply reduces how long you have to wait.
> {code}
>   public static final long LEASE_SOFTLIMIT_PERIOD = 30 * 1000;
>   public static final long LEASE_HARDLIMIT_PERIOD = 2 * 
> LEASE_SOFTLIMIT_PERIOD;
> {code}
> # Client starts to write a file. (could be less than 1 block, but it hflushed 
> so some of the data has landed on the datanodes) (I'm copying the client code 
> I am using. I generate a jar and run it using $ hadoop jar TestHadoop.jar)
> # Client crashes. (I simulate this by kill -9 the $(hadoop jar 
> TestHadoop.jar) process after it has printed "Wrote to the bufferedWriter")
> # Shoot the datanode. (Since I ran on a pseudo-distributed cluster, there was 
> only 1)
> I believe the lease should be recovered and the block should be marked 
> missing. However this is not happening. The lease is never recovered.
> The effect of this bug for us was that nodes could not be decommissioned 
> cleanly. Although we knew that the client had crashed, the Namenode never 
> released the leases (even after restarting the Namenode, even months 
> afterwards). There are actually several other cases too where we don't 
> consider what happens if ALL the datanodes die while the file is being 
> written, but I am going to punt on that for another time.





[jira] [Updated] (HDFS-9666) Enable hdfs-client to read even remote SSD/RAM prior to local disk replica to improve random read

2016-01-26 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-9666:
--
Fix Version/s: (was: 2.7.2)

[~aderen], please use Target Version to express your intention and leave the 
fix-version for committers to fill in at commit time. FYI, I fixed this JIRA 
myself. Tx.

> Enable hdfs-client to read even remote SSD/RAM prior to local disk replica to 
> improve random read
> -
>
> Key: HDFS-9666
> URL: https://issues.apache.org/jira/browse/HDFS-9666
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs-client
>Affects Versions: 2.6.0, 2.7.0
>Reporter: ade
>Assignee: ade
> Attachments: HDFS-9666.0.patch
>
>
> We want to improve the random read performance of HDFS for HBase, so we 
> enabled heterogeneous storage in our cluster. But only ~50% of the datanode & 
> regionserver hosts have SSDs, so we can set hfiles to only the ONE_SSD (not 
> ALL_SSD) storage policy, and a regionserver on a non-SSD host can only read 
> the local disk replica. So we developed this feature in the hdfs client to 
> read even a remote SSD/RAM replica in preference to the local disk replica.
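
A hedged sketch of the ordering change in the client (the types are stand-ins 
for the real located-block structures, not the actual hdfs-client code):

{code:java}
import java.util.Comparator;
import java.util.List;

class ReplicaOrderSketch {
  enum Media { RAM_DISK, SSD, DISK, ARCHIVE } // fastest to slowest

  static class Replica {
    final Media media;
    final boolean local;
    Replica(Media media, boolean local) {
      this.media = media;
      this.local = local;
    }
  }

  // Sketch: sort replicas by medium first and locality second, so a
  // remote SSD replica is tried before a local spinning-disk replica.
  static void orderForRandomRead(List<Replica> replicas) {
    replicas.sort(Comparator
        .comparingInt((Replica r) -> r.media.ordinal())
        .thenComparing(r -> !r.local)); // false (local) sorts first
  }
}
{code}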





[jira] [Updated] (HDFS-815) FileContext tests fail on Windows

2016-01-26 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-815:
-
Fix Version/s: (was: 2.7.2)

> FileContext tests fail on Windows
> -
>
> Key: HDFS-815
> URL: https://issues.apache.org/jira/browse/HDFS-815
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: test
>Affects Versions: 0.21.0
> Environment: Windows
>Reporter: Konstantin Shvachko
>
> The following FileContext-related tests are failing on Windows because of 
> incorrect use of the "test.build.data" system property for setting hdfs paths, 
> which end up containing "C:" as a path component, which hdfs does not support.
> {code}
> org.apache.hadoop.fs.TestFcHdfsCreateMkdir
> org.apache.hadoop.fs.TestFcHdfsPermission
> org.apache.hadoop.fs.TestHDFSFileContextMainOperations
> org.apache.hadoop.hdfs.server.namenode.metrics.TestNameNodeMetrics
> {code}





[jira] [Updated] (HDFS-8676) Delayed rolling upgrade finalization can cause heartbeat expiration and write failures

2016-01-26 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-8676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-8676:
--
Fix Version/s: (was: 3.0.0)

> Delayed rolling upgrade finalization can cause heartbeat expiration and write 
> failures
> --
>
> Key: HDFS-8676
> URL: https://issues.apache.org/jira/browse/HDFS-8676
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Kihwal Lee
>Assignee: Walter Su
>Priority: Critical
> Fix For: 2.7.2
>
> Attachments: HDFS-8676.01.patch, HDFS-8676.02.patch
>
>
> In big busy clusters where the deletion rate is also high, a lot of blocks 
> can pile up in the datanode trash directories until an upgrade is finalized.  
> When it is finally finalized, the deletion of trash is done in the service 
> actor thread's context synchronously.  This blocks the heartbeat and can 
> cause heartbeat expiration.  
> We have seen a namenode lose hundreds of nodes after a delayed upgrade 
> finalization.  The deletion of trash directories should be made asynchronous.
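
A minimal sketch of the proposed asynchrony (illustrative; the real fix lives 
in the DN's dataset/trash handling):

{code:java}
import java.io.File;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class AsyncTrashDeletionSketch {
  // Sketch: push trash deletion onto a background thread so the
  // heartbeating service actor thread is never blocked by bulk deletes.
  private final ExecutorService deleter =
      Executors.newSingleThreadExecutor();

  void clearTrash(File trashRoot) {
    deleter.execute(() -> deleteRecursively(trashRoot));
  }

  private void deleteRecursively(File f) {
    File[] children = f.listFiles();
    if (children != null) {
      for (File c : children) {
        deleteRecursively(c);
      }
    }
    f.delete(); // best effort; real code would log failures
  }
}
{code}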





[jira] [Updated] (HDFS-8950) NameNode refresh doesn't remove DataNodes that are no longer in the allowed list

2016-01-26 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-8950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-8950:
--
Fix Version/s: (was: 3.0.0)

> NameNode refresh doesn't remove DataNodes that are no longer in the allowed 
> list
> 
>
> Key: HDFS-8950
> URL: https://issues.apache.org/jira/browse/HDFS-8950
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, namenode
>Affects Versions: 2.6.0
>Reporter: Daniel Templeton
>Assignee: Daniel Templeton
>  Labels: 2.7.2-candidate
> Fix For: 2.7.2
>
> Attachments: HDFS-8950.001.patch, HDFS-8950.002.patch, 
> HDFS-8950.003.patch, HDFS-8950.004.patch, HDFS-8950.005.patch, 
> HDFS-8950.branch-2.7.patch
>
>
> If you remove a DN from the NN's allowed host list (HDFS was HA) and then do an 
> NN refresh, it doesn't actually remove it, and the NN UI keeps showing that 
> node. The NN may try to allocate some blocks to that DN as well during an MR 
> job.  This issue is independent of DN decommission.
> To reproduce:
> 1. Add a DN to dfs_hosts_allow
> 2. Refresh NN
> 3. Start DN. Now NN starts seeing DN.
> 4. Stop DN
> 5. Remove DN from dfs_hosts_allow
> 6. Refresh NN -> NN is still reporting DN as being used by HDFS.
> This is different from decom because there DN is added to exclude list in 
> addition to being removed from allowed list, and in that case everything 
> works correctly.





[jira] [Updated] (HDFS-9220) Reading small file (< 512 bytes) that is open for append fails due to incorrect checksum

2016-01-26 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-9220:
--
Fix Version/s: (was: 3.0.0)

> Reading small file (< 512 bytes) that is open for append fails due to 
> incorrect checksum
> 
>
> Key: HDFS-9220
> URL: https://issues.apache.org/jira/browse/HDFS-9220
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.7.1
>Reporter: Bogdan Raducanu
>Assignee: Jing Zhao
>Priority: Blocker
> Fix For: 2.7.2
>
> Attachments: HDFS-9220.000.patch, HDFS-9220.001.patch, 
> HDFS-9220.002.patch, test2.java
>
>
> Exception:
> 2015-10-09 14:59:40 WARN  DFSClient:1150 - fetchBlockByteRange(). Got a 
> checksum exception for /tmp/file0.05355529331575182 at 
> BP-353681639-10.10.10.10-1437493596883:blk_1075692769_9244882:0 from 
> DatanodeInfoWithStorage[10.10.10.10]:5001
> All 3 replicas cause this exception and the read fails entirely with:
> BlockMissingException: Could not obtain block: 
> BP-353681639-10.10.10.10-1437493596883:blk_1075692769_9244882 
> file=/tmp/file0.05355529331575182
> Code to reproduce is attached.
> Does not happen in 2.7.0.
> Data is read correctly if checksum verification is disabled.
> More generally, the failure happens when reading from the last block of a 
> file and the last block has <= 512 bytes.





[jira] [Updated] (HDFS-8850) VolumeScanner thread exits with exception if there is no block pool to be scanned but there are suspicious blocks

2016-01-26 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-8850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-8850:
--
Fix Version/s: (was: 3.0.0)

> VolumeScanner thread exits with exception if there is no block pool to be 
> scanned but there are suspicious blocks
> -
>
> Key: HDFS-8850
> URL: https://issues.apache.org/jira/browse/HDFS-8850
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Affects Versions: 2.7.0
>Reporter: Colin Patrick McCabe
>Assignee: Colin Patrick McCabe
> Fix For: 2.7.2
>
> Attachments: HDFS-8850.001.patch
>
>
> The VolumeScanner threads inside the BlockScanner exit with an exception if 
> there is no block pool to be scanned but there are suspicious blocks.





[jira] [Updated] (HDFS-8879) Quota by storage type usage incorrectly initialized upon namenode restart

2016-01-26 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-8879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-8879:
--
Fix Version/s: (was: 3.0.0)

> Quota by storage type usage incorrectly initialized upon namenode restart
> -
>
> Key: HDFS-8879
> URL: https://issues.apache.org/jira/browse/HDFS-8879
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.7.0
>Reporter: Kihwal Lee
>Assignee: Xiaoyu Yao
> Fix For: 2.7.2
>
> Attachments: HDFS-8879.01.patch
>
>
> This was found by [~kihwal] as part of HDFS-8865 work in this 
> [comment|https://issues.apache.org/jira/browse/HDFS-8865?focusedCommentId=14660904=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14660904].
> The unit test 
> testQuotaByStorageTypePersistenceInFsImage/testQuotaByStorageTypePersistenceInFsEdit
>  failed to detect this because they were using an obsolete
> FsDirectory instance. Once the highlighted line below is added, the issue can 
> be reproduced.
> {code}
> >fsdir = cluster.getNamesystem().getFSDirectory();
> INode testDirNodeAfterNNRestart = fsdir.getINode4Write(testDir.toString());
> {code}





[jira] [Updated] (HDFS-7645) Rolling upgrade is restoring blocks from trash multiple times

2016-01-26 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-7645:
--
Fix Version/s: (was: 3.0.0)

> Rolling upgrade is restoring blocks from trash multiple times
> -
>
> Key: HDFS-7645
> URL: https://issues.apache.org/jira/browse/HDFS-7645
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 2.6.0
>Reporter: Nathan Roberts
>Assignee: Keisuke Ogiwara
> Fix For: 2.7.2
>
> Attachments: HDFS-7645.01.patch, HDFS-7645.02.patch, 
> HDFS-7645.03.patch, HDFS-7645.04.patch, HDFS-7645.05.patch, 
> HDFS-7645.06.patch, HDFS-7645.07.patch
>
>
> When performing an HDFS rolling upgrade, the trash directory is getting 
> restored twice when under normal circumstances it shouldn't need to be 
> restored at all. IIUC, the only time these blocks should be restored is if we 
> need to roll back a rolling upgrade. 
> On a busy cluster, this can cause significant and unnecessary block churn 
> both on the datanodes, and more importantly in the namenode.
> The two times this happens are:
> 1) restart of DN onto new software
> {code}
>   private void doTransition(DataNode datanode, StorageDirectory sd,
>   NamespaceInfo nsInfo, StartupOption startOpt) throws IOException {
> if (startOpt == StartupOption.ROLLBACK && sd.getPreviousDir().exists()) {
>   Preconditions.checkState(!getTrashRootDir(sd).exists(),
>   sd.getPreviousDir() + " and " + getTrashRootDir(sd) + " should not 
> " +
>   " both be present.");
>   doRollback(sd, nsInfo); // rollback if applicable
> } else {
>   // Restore all the files in the trash. The restored files are retained
>   // during rolling upgrade rollback. They are deleted during rolling
>   // upgrade downgrade.
>   int restored = restoreBlockFilesFromTrash(getTrashRootDir(sd));
>   LOG.info("Restored " + restored + " block files from trash.");
> }
> {code}
> 2) When heartbeat response no longer indicates a rollingupgrade is in progress
> {code}
>   /**
>* Signal the current rolling upgrade status as indicated by the NN.
>* @param inProgress true if a rolling upgrade is in progress
>*/
>   void signalRollingUpgrade(boolean inProgress) throws IOException {
> String bpid = getBlockPoolId();
> if (inProgress) {
>   dn.getFSDataset().enableTrash(bpid);
>   dn.getFSDataset().setRollingUpgradeMarker(bpid);
> } else {
>   dn.getFSDataset().restoreTrash(bpid);
>   dn.getFSDataset().clearRollingUpgradeMarker(bpid);
> }
>   }
> {code}
> HDFS-6800 and HDFS-6981 modified this behavior, making it not completely 
> clear whether this is somehow intentional. 





[jira] [Updated] (HDFS-9221) HdfsServerConstants#ReplicaState#getState should avoid calling values() since it creates a temporary array

2016-01-26 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-9221:
--
Fix Version/s: (was: 3.0.0)

> HdfsServerConstants#ReplicaState#getState should avoid calling values() since 
> it creates a temporary array
> --
>
> Key: HDFS-9221
> URL: https://issues.apache.org/jira/browse/HDFS-9221
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: performance
>Affects Versions: 2.7.1
>Reporter: Staffan Friberg
>Assignee: Staffan Friberg
> Fix For: 2.7.2
>
> Attachments: HADOOP-9221.001.patch
>
>
> When the BufferDecoder in BlockListAsLongs converts the stored value to a 
> ReplicaState enum, it calls ReplicaState.getState(int). Unfortunately, this 
> method creates a ReplicaState[] for each call since it calls 
> ReplicaState.values().
> This patch creates a cached version of the values and thus avoids all 
> allocation when doing the conversion.
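
The caching pattern, sketched (the enum constants mirror ReplicaState's, but 
this is an illustration, not the committed patch):

{code:java}
enum ReplicaStateSketch {
  FINALIZED, RBW, RWR, RUR, TEMPORARY;

  // values() clones the backing array on every call; cache one copy.
  private static final ReplicaStateSketch[] CACHED = values();

  static ReplicaStateSketch getState(int v) {
    return CACHED[v]; // no temporary array per conversion
  }
}
{code}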





[jira] [Updated] (HDFS-8656) Preserve compatibility of ClientProtocol#rollingUpgrade after finalization

2016-01-26 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-8656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-8656:
--
Fix Version/s: (was: 3.0.0)

> Preserve compatibility of ClientProtocol#rollingUpgrade after finalization
> --
>
> Key: HDFS-8656
> URL: https://issues.apache.org/jira/browse/HDFS-8656
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: rolling upgrades
>Affects Versions: 2.8.0
>Reporter: Andrew Wang
>Assignee: Andrew Wang
>Priority: Critical
> Fix For: 2.7.2
>
> Attachments: hdfs-8656.001.patch, hdfs-8656.002.patch, 
> hdfs-8656.003.patch, hdfs-8656.004.patch
>
>
> HDFS-7645 changed rollingUpgradeInfo to still return an RUInfo after 
> finalization, so the DNs can differentiate between rollback and a 
> finalization. However, this breaks compatibility for the user facing APIs, 
> which always expect a null after finalization. Let's fix this and edify it in 
> unit tests.
> As an additional improvement, isFinalized and isStarted are part of the Java 
> API, but not in the JMX output of RollingUpgradeInfo. It'd be nice to expose 
> these booleans so JMX users don't need to do the != 0 check that possibly 
> exposes our implementation details.





[jira] [Updated] (HDFS-9178) Slow datanode I/O can cause a wrong node to be marked bad

2016-01-26 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-9178:
--
Fix Version/s: (was: 3.0.0)

> Slow datanode I/O can cause a wrong node to be marked bad
> -
>
> Key: HDFS-9178
> URL: https://issues.apache.org/jira/browse/HDFS-9178
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Kihwal Lee
>Assignee: Kihwal Lee
>Priority: Critical
> Fix For: 2.7.2
>
> Attachments: HDFS-9178.branch-2.6.patch, HDFS-9178.patch
>
>
> When a non-leaf datanode in a pipeline is slow on or stuck at disk I/O, the 
> downstream node can time out on reading packets, since even the heartbeat 
> packets will not be relayed down.  
> The packet read timeout is set in {{DataXceiver#run()}}:
> {code}
>   peer.setReadTimeout(dnConf.socketTimeout);
> {code}
> When the downstream node times out and closes the connection to the upstream, 
> the upstream node's {{PacketResponder}} gets {{EOFException}} and it sends an 
> ack upstream with the downstream node status set to {{ERROR}}.  This causes 
> the client to exclude the downstream node, even though the upstream node was 
> the one that got stuck.
> The connection to the downstream has a longer timeout, so the downstream will 
> always time out first. The downstream timeout is set in {{writeBlock()}}
> {code}
>   int timeoutValue = dnConf.socketTimeout +
>   (HdfsConstants.READ_TIMEOUT_EXTENSION * targets.length);
>   int writeTimeout = dnConf.socketWriteTimeout +
>   (HdfsConstants.WRITE_TIMEOUT_EXTENSION * targets.length);
>   NetUtils.connect(mirrorSock, mirrorTarget, timeoutValue);
>   OutputStream unbufMirrorOut = NetUtils.getOutputStream(mirrorSock,
>   writeTimeout);
> {code}
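
The timeout arithmetic, worked through for a three-node pipeline (the constants 
below are assumed defaults, not taken from this JIRA):

{code:java}
class PipelineTimeoutArithmetic {
  public static void main(String[] args) {
    int socketTimeout = 60_000; // assumed client read timeout (ms)
    int readExtension = 5_000;  // assumed READ_TIMEOUT_EXTENSION (ms)

    // The head of a 3-node pipeline has 2 downstream targets; the tail
    // has 0 and reads with the bare socketTimeout.
    int headReadTimeout = socketTimeout + readExtension * 2; // 70,000 ms
    int tailReadTimeout = socketTimeout;                     // 60,000 ms

    // The tail's shorter budget expires first, so a stuck middle node
    // gets its healthy downstream reported as ERROR.
    System.out.println(headReadTimeout - tailReadTimeout);   // 10,000 ms
  }
}
{code}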





[jira] [Updated] (HDFS-9290) DFSClient#callAppend() is not backward compatible for slightly older NameNodes

2016-01-26 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-9290:
--
Fix Version/s: (was: 3.0.0)

> DFSClient#callAppend() is not backward compatible for slightly older NameNodes
> --
>
> Key: HDFS-9290
> URL: https://issues.apache.org/jira/browse/HDFS-9290
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.7.1
>Reporter: Tony Wu
>Assignee: Tony Wu
>Priority: Blocker
> Fix For: 2.7.2
>
> Attachments: HDFS-9290.001.patch, HDFS-9290.002.patch
>
>
> HDFS-7210 combined 2 RPC calls used at file append into a single one. 
> Specifically, {{getFileInfo()}} is combined with {{append()}}. While backward 
> compatibility for older clients is handled by the new NameNode (protobuf), a 
> newer client's {{append()}} call does not work with older NameNodes. One will 
> run into an exception like the following:
> {code:java}
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hdfs.DFSOutputStream.isLazyPersist(DFSOutputStream.java:1741)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream.getChecksum4Compute(DFSOutputStream.java:1550)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream.(DFSOutputStream.java:1560)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream.(DFSOutputStream.java:1670)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream.newStreamForAppend(DFSOutputStream.java:1717)
> at org.apache.hadoop.hdfs.DFSClient.callAppend(DFSClient.java:1861)
> at org.apache.hadoop.hdfs.DFSClient.append(DFSClient.java:1922)
> at org.apache.hadoop.hdfs.DFSClient.append(DFSClient.java:1892)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:340)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:336)
> at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.append(DistributedFileSystem.java:336)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.append(DistributedFileSystem.java:318)
> at org.apache.hadoop.fs.FileSystem.append(FileSystem.java:1164)
> {code}
> The cause is that the new client code is expecting both the last block and 
> file info in the same RPC but the old NameNode only replied with the first. 
> The exception itself does not reflect this and one will have to look at the 
> HDFS source code to really understand what happened.
> We can have the client detect that it's talking to an old NameNode and send an 
> extra {{getFileInfo()}} RPC. Or we should improve the exception being thrown 
> to accurately reflect the cause of failure.
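
A hedged sketch of the first option, with stand-in types (the real RPC result 
carries the last block plus the file status):

{code:java}
import java.io.IOException;

class AppendCompatSketch {
  static class FileStatus {}                     // stand-in type
  static class AppendResult { FileStatus stat; } // stand-in RPC result

  interface NameNodeRpc {
    AppendResult append(String src) throws IOException;
    FileStatus getFileInfo(String src) throws IOException;
  }

  // Sketch: if the combined append() came back without file status
  // (older NameNode), fall back to the pre-HDFS-7210 extra RPC instead
  // of dereferencing null.
  static AppendResult appendCompat(NameNodeRpc nn, String src)
      throws IOException {
    AppendResult r = nn.append(src);
    if (r.stat == null) {
      r.stat = nn.getFileInfo(src); // the restored second RPC
    }
    return r;
  }
}
{code}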





[jira] [Updated] (HDFS-9445) Datanode may deadlock while handling a bad volume

2016-01-26 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-9445:
--
Fix Version/s: (was: 3.0.0)

> Datanode may deadlock while handling a bad volume
> -
>
> Key: HDFS-9445
> URL: https://issues.apache.org/jira/browse/HDFS-9445
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.7.2
>Reporter: Kihwal Lee
>Assignee: Walter Su
>Priority: Blocker
> Fix For: 2.7.2, 2.6.4
>
> Attachments: HDFS-9445-branch-2.6.02.patch, 
> HDFS-9445-branch-2.6_02.patch, HDFS-9445.00.patch, HDFS-9445.01.patch, 
> HDFS-9445.02.patch
>
>
> {noformat}
> Found one Java-level deadlock:
> =
> "DataXceiver for client DFSClient_attempt_xxx at /1.2.3.4:100 [Sending block 
> BP-x:blk_123_456]":
>   waiting to lock monitor 0x7f77d0731768 (object 0xd60d9930, a 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl),
>   which is held by "Thread-565"
> "Thread-565":
>   waiting for ownable synchronizer 0xd55613c8, (a 
> java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync),
>   which is held by "DataNode: heartbeating to my-nn:8020"
> "DataNode: heartbeating to my-nn:8020":
>   waiting to lock monitor 0x7f77d0731768 (object 0xd60d9930, a 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl),
>   which is held by "Thread-565"
> {noformat}





[jira] [Updated] (HDFS-8384) Allow NN to startup if there are files having a lease but are not under construction

2016-01-26 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-8384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-8384:
--
Fix Version/s: (was: 2.8.0)

> Allow NN to startup if there are files having a lease but are not under 
> construction
> 
>
> Key: HDFS-8384
> URL: https://issues.apache.org/jira/browse/HDFS-8384
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: Tsz Wo Nicholas Sze
>Assignee: Jing Zhao
>Priority: Minor
>  Labels: 2.6.1-candidate
> Fix For: 2.6.1, 2.7.2
>
> Attachments: HDFS-8384-branch-2.6.patch, HDFS-8384-branch-2.7.patch, 
> HDFS-8384.000.patch
>
>
> When there are files that have a lease but are not under construction, the NN 
> will fail to start up with
> {code}
> 15/05/12 00:36:31 ERROR namenode.FSImage: Unable to save image for 
> /hadoop/hdfs/namenode
> java.lang.IllegalStateException
> at 
> com.google.common.base.Preconditions.checkState(Preconditions.java:129)
> at 
> org.apache.hadoop.hdfs.server.namenode.LeaseManager.getINodesUnderConstruction(LeaseManager.java:412)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getFilesUnderConstruction(FSNamesystem.java:7124)
> ...
> {code}
> The actual problem is that the image could be corrupted by bugs like 
> HDFS-7587.  We should have an option/conf to allow NN to start up so that the 
> problematic files could possibly be deleted.





[jira] [Updated] (HDFS-7609) Avoid retry cache collision when Standby NameNode loading edits

2016-01-26 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-7609:
--
Fix Version/s: (was: 2.8.0)

> Avoid retry cache collision when Standby NameNode loading edits
> ---
>
> Key: HDFS-7609
> URL: https://issues.apache.org/jira/browse/HDFS-7609
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.2.0
>Reporter: Carrey Zhan
>Assignee: Ming Ma
>Priority: Critical
>  Labels: 2.6.1-candidate, 2.7.2-candidate
> Fix For: 2.6.1, 2.7.2
>
> Attachments: HDFS-7609-2.patch, HDFS-7609-3.patch, 
> HDFS-7609-CreateEditsLogWithRPCIDs.patch, HDFS-7609-branch-2.7.2.txt, 
> HDFS-7609.patch, recovery_do_not_use_retrycache.patch
>
>
> One day my namenode crashed because two journal nodes timed out at the same 
> time under very high load, leaving behind about 100 million transactions in 
> the edits log. (I still have no idea why they were not rolled into the fsimage.)
> I tried to restart the namenode, but it showed that almost 20 hours would be 
> needed to finish, and it was loading fsedits most of the time. I also tried 
> to restart the namenode in recovery mode; the loading speed was no different.
> I looked into the stack trace and judged that it was caused by the retry cache. 
> So I set dfs.namenode.enable.retrycache to false, and the restart process 
> finished in half an hour.
> I think the retry cache is useless during startup, at least during the 
> recovery process.
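
The workaround above, as a sketch (dfs.namenode.enable.retrycache is the key 
named in the report; in practice it is set in hdfs-site.xml before the restart):

{code:java}
import org.apache.hadoop.conf.Configuration;

class RetryCacheToggleSketch {
  // Sketch: disable the retry cache for a one-off recovery restart.
  static Configuration withoutRetryCache() {
    Configuration conf = new Configuration();
    conf.setBoolean("dfs.namenode.enable.retrycache", false);
    return conf;
  }
}
{code}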





[jira] [Updated] (HDFS-8846) Add a unit test for INotify functionality across a layout version upgrade

2016-01-26 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-8846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-8846:
--
Fix Version/s: (was: 2.8.0)

> Add a unit test for INotify functionality across a layout version upgrade
> -
>
> Key: HDFS-8846
> URL: https://issues.apache.org/jira/browse/HDFS-8846
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.7.1
>Reporter: Zhe Zhang
>Assignee: Zhe Zhang
>  Labels: 2.6.1-candidate, 2.7.2-candidate
> Fix For: 2.6.1, 2.7.2
>
> Attachments: HDFS-8846-branch-2.6.1.txt, HDFS-8846.00.patch, 
> HDFS-8846.01.patch, HDFS-8846.02.patch, HDFS-8846.03.patch
>
>
> Per discussion under HDFS-8480, we should create some edit log files with old 
> layout version, to test whether they can be correctly handled in upgrades.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-9294) DFSClient deadlock when close file and failed to renew lease

2016-01-26 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-9294:
--
Fix Version/s: (was: 2.8.0)

> DFSClient  deadlock when close file and failed to renew lease
> -
>
> Key: HDFS-9294
> URL: https://issues.apache.org/jira/browse/HDFS-9294
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs-client
>Affects Versions: 2.2.0, 2.7.1
> Environment: Hadoop 2.2.0
>Reporter: DENG FEI
>Assignee: Brahma Reddy Battula
>Priority: Blocker
> Fix For: 2.7.2, 2.6.4
>
> Attachments: HDFS-9294-002.patch, HDFS-9294-002.patch, 
> HDFS-9294-branch-2.6.patch, HDFS-9294-branch-2.7.patch, 
> HDFS-9294-branch-2.patch, HDFS-9294.patch
>
>
> We found a deadlock on our HBase (0.98) cluster (the Hadoop version is 
> 2.2.0), and it appears to be an HDFS bug; at the time our network was not stable.
> Below is the stack:
> *
> Found one Java-level deadlock:
> =
> "MemStoreFlusher.1":
>   waiting to lock monitor 0x7ff27cfa5218 (object 0x0002fae5ebe0, a 
> org.apache.hadoop.hdfs.LeaseRenewer),
>   which is held by "LeaseRenewer:hbaseadmin@hbase-ns-gdt-sh-marvel"
> "LeaseRenewer:hbaseadmin@hbase-ns-gdt-sh-marvel":
>   waiting to lock monitor 0x7ff2e67e16a8 (object 0x000486ce6620, a 
> org.apache.hadoop.hdfs.DFSOutputStream),
>   which is held by "MemStoreFlusher.0"
> "MemStoreFlusher.0":
>   waiting to lock monitor 0x7ff27cfa5218 (object 0x0002fae5ebe0, a 
> org.apache.hadoop.hdfs.LeaseRenewer),
>   which is held by "LeaseRenewer:hbaseadmin@hbase-ns-gdt-sh-marvel"
> Java stack information for the threads listed above:
> ===
> "MemStoreFlusher.1":
>   at org.apache.hadoop.hdfs.LeaseRenewer.addClient(LeaseRenewer.java:216)
>   - waiting to lock <0x0002fae5ebe0> (a 
> org.apache.hadoop.hdfs.LeaseRenewer)
>   at org.apache.hadoop.hdfs.LeaseRenewer.getInstance(LeaseRenewer.java:81)
>   at org.apache.hadoop.hdfs.DFSClient.getLeaseRenewer(DFSClient.java:648)
>   at org.apache.hadoop.hdfs.DFSClient.endFileLease(DFSClient.java:659)
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:1882)
>   - locked <0x00055b606cb0> (a org.apache.hadoop.hdfs.DFSOutputStream)
>   at 
> org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:71)
>   at 
> org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:104)
>   at 
> org.apache.hadoop.hbase.io.hfile.AbstractHFileWriter.finishClose(AbstractHFileWriter.java:250)
>   at 
> org.apache.hadoop.hbase.io.hfile.HFileWriterV2.close(HFileWriterV2.java:402)
>   at 
> org.apache.hadoop.hbase.regionserver.StoreFile$Writer.close(StoreFile.java:974)
>   at 
> org.apache.hadoop.hbase.regionserver.StoreFlusher.finalizeWriter(StoreFlusher.java:78)
>   at 
> org.apache.hadoop.hbase.regionserver.DefaultStoreFlusher.flushSnapshot(DefaultStoreFlusher.java:75)
>   - locked <0x00059869eed8> (a java.lang.Object)
>   at 
> org.apache.hadoop.hbase.regionserver.HStore.flushCache(HStore.java:812)
>   at 
> org.apache.hadoop.hbase.regionserver.HStore$StoreFlusherImpl.flushCache(HStore.java:1974)
>   at 
> org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:1795)
>   at 
> org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:1678)
>   at 
> org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:1591)
>   at 
> org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:472)
>   at 
> org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushOneForGlobalPressure(MemStoreFlusher.java:211)
>   at 
> org.apache.hadoop.hbase.regionserver.MemStoreFlusher.access$500(MemStoreFlusher.java:66)
>   at 
> org.apache.hadoop.hbase.regionserver.MemStoreFlusher$FlushHandler.run(MemStoreFlusher.java:238)
>   at java.lang.Thread.run(Thread.java:744)
> "LeaseRenewer:hbaseadmin@hbase-ns-gdt-sh-marvel":
>   at 
> org.apache.hadoop.hdfs.DFSOutputStream.abort(DFSOutputStream.java:1822)
>   - waiting to lock <0x000486ce6620> (a 
> org.apache.hadoop.hdfs.DFSOutputStream)
>   at 
> org.apache.hadoop.hdfs.DFSClient.closeAllFilesBeingWritten(DFSClient.java:780)
>   at org.apache.hadoop.hdfs.DFSClient.abort(DFSClient.java:753)
>   at org.apache.hadoop.hdfs.LeaseRenewer.run(LeaseRenewer.java:453)
>   - locked <0x0002fae5ebe0> (a org.apache.hadoop.hdfs.LeaseRenewer)
>   at 
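> The stack above is a classic lock-order inversion. A stripped-down sketch of 
> the same hazard (hedged: plain Java with stand-in locks, not the HDFS code; 
> running it can genuinely hang):
> {code}
> // Thread "flusher" takes the stream lock then wants the renewer lock;
> // thread "renewer" takes the renewer lock then wants the stream lock.
> public class LockInversionSketch {
>   static final Object renewerLock = new Object(); // stands in for LeaseRenewer
>   static final Object streamLock  = new Object(); // stands in for DFSOutputStream
>
>   public static void main(String[] args) {
>     new Thread(() -> {
>       synchronized (streamLock) {        // like DFSOutputStream.close()
>         pause();
>         synchronized (renewerLock) { }   // like LeaseRenewer.addClient()
>       }
>     }, "flusher").start();
>     new Thread(() -> {
>       synchronized (renewerLock) {       // like LeaseRenewer.run()
>         pause();
>         synchronized (streamLock) { }    // like DFSOutputStream.abort()
>       }
>     }, "renewer").start();
>   }
>
>   static void pause() {
>     try { Thread.sleep(100); } catch (InterruptedException ignored) { }
>   }
> }
> {code}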

[jira] [Updated] (HDFS-9273) ACLs on root directory may be lost after NN restart

2016-01-26 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-9273:
--
Fix Version/s: (was: 2.8.0)

> ACLs on root directory may be lost after NN restart
> ---
>
> Key: HDFS-9273
> URL: https://issues.apache.org/jira/browse/HDFS-9273
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.7.1
>Reporter: Xiao Chen
>Assignee: Xiao Chen
>Priority: Critical
> Fix For: 2.7.2, 2.6.3
>
> Attachments: HDFS-9273.001.patch, HDFS-9273.002.patch
>
>
> After restarting the namenode, the ACLs on the root directory ("/") may be 
> lost if they are rolled over into the fsimage.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-8219) setStoragePolicy with folder behavior is different after cluster restart

2016-01-26 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-8219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-8219:
--
Fix Version/s: (was: 2.8.0)

> setStoragePolicy with folder behavior is different after cluster restart
> 
>
> Key: HDFS-8219
> URL: https://issues.apache.org/jira/browse/HDFS-8219
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Peter Shi
>Assignee: Surendra Singh Lilhore
>  Labels: 2.6.1-candidate, 2.7.2-candidate, BB2015-05-RFC
> Fix For: 2.6.1, 2.7.2
>
> Attachments: HDFS-8219.patch, HDFS-8219.unittest-norepro.patch
>
>
> Reproduction steps:
> 1) mkdir a directory named /temp
> 2) put one file A under /temp
> 3) change the storage policy of /temp to COLD
> 4) use -getStoragePolicy to query file A's storage policy; it is the same as 
> /temp's
> 5) change the /temp folder's storage policy again; file A's storage policy 
> stays in sync with the parent folder's.
> Then restart the cluster and repeat steps 3) and 4): file A's storage policy 
> no longer changes when the parent folder's does. The behavior is different.
> While debugging, I found this code
> in INodeFile.getStoragePolicyID:
> {code}
>   public byte getStoragePolicyID() {
> byte id = getLocalStoragePolicyID();
> if (id == BLOCK_STORAGE_POLICY_ID_UNSPECIFIED) {
>   return this.getParent() != null ?
>   this.getParent().getStoragePolicyID() : id;
> }
> return id;
>   }
> {code}
> If the file does not have its own storage policy, it uses the parent's. But 
> after a cluster restart, the file turns out to have its own storage policy (a 
> small model below illustrates this).
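> A self-contained model of that lookup (hedged: simplified names, not the HDFS 
> classes) shows why persisting the parent's policy onto the file freezes it:
> {code}
> // While localPolicy stays UNSPECIFIED the file tracks its parent; once a
> // concrete id is stamped on the file (as after the restart), parent
> // changes no longer show through.
> public class PolicyInheritanceSketch {
>   static final byte UNSPECIFIED = 0;
>
>   static class Node {
>     Node parent;
>     byte localPolicy = UNSPECIFIED;
>
>     byte effectivePolicy() {
>       if (localPolicy == UNSPECIFIED) {
>         return parent != null ? parent.effectivePolicy() : UNSPECIFIED;
>       }
>       return localPolicy;
>     }
>   }
>
>   public static void main(String[] args) {
>     Node dir = new Node();
>     Node file = new Node();
>     file.parent = dir;
>
>     dir.localPolicy = 7;                        // like setting /temp to COLD
>     System.out.println(file.effectivePolicy()); // 7: file follows the parent
>
>     file.localPolicy = 7;                       // like the restart persisting it
>     dir.localPolicy = 5;                        // change the parent again
>     System.out.println(file.effectivePolicy()); // still 7: file is frozen
>   }
> }
> {code}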



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-8431) hdfs crypto class not found in Windows

2016-01-26 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-8431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-8431:
--
Fix Version/s: (was: 2.8.0)

> hdfs crypto class not found in Windows
> --
>
> Key: HDFS-8431
> URL: https://issues.apache.org/jira/browse/HDFS-8431
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: scripts
>Affects Versions: 2.6.0
> Environment: Windows only
>Reporter: Sumana Sathish
>Assignee: Anu Engineer
>Priority: Critical
>  Labels: 2.6.1-candidate, 2.7.2-candidate, encryption, scripts, 
> windows
> Fix For: 2.6.1, 2.7.2
>
> Attachments: Screen Shot 2015-05-18 at 6.27.11 PM.png, 
> hdfs-8431.001.patch, hdfs-8431.002.patch
>
>
> The attached screenshot shows that hdfs could not find the 'crypto' class on 
> Windows.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7314) When the DFSClient lease cannot be renewed, abort open-for-write files rather than the entire DFSClient

2016-01-26 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-7314:
--
Fix Version/s: (was: 2.8.0)

> When the DFSClient lease cannot be renewed, abort open-for-write files rather 
> than the entire DFSClient
> ---
>
> Key: HDFS-7314
> URL: https://issues.apache.org/jira/browse/HDFS-7314
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Ming Ma
>Assignee: Ming Ma
>  Labels: 2.6.1-candidate, 2.7.2-candidate, BB2015-05-TBR
> Fix For: 2.6.1, 2.7.2
>
> Attachments: HDFS-7314-2.patch, HDFS-7314-3.patch, HDFS-7314-4.patch, 
> HDFS-7314-5.patch, HDFS-7314-6.patch, HDFS-7314-7.patch, HDFS-7314-8.patch, 
> HDFS-7314-9.patch, HDFS-7314-branch-2.7.2.txt, HDFS-7314.patch
>
>
> This happened in a YARN NodeManager scenario, but it could happen to any 
> long-running service that uses a cached instance of DistributedFileSystem.
> 1. The active NN was under heavy load, so it became unavailable for 10 
> minutes; any DFSClient request got a ConnectTimeoutException.
> 2. The YARN NodeManager uses DFSClient for certain write operations, such as 
> the log aggregator or the shared cache in YARN-1492. The DFSClient used by 
> the YARN NM's renewLease RPC got a ConnectTimeoutException.
> {noformat}
> 2014-10-29 01:36:19,559 WARN org.apache.hadoop.hdfs.LeaseRenewer: Failed to 
> renew lease for [DFSClient_NONMAPREDUCE_-550838118_1] for 372 seconds.  
> Aborting ...
> {noformat}
> 3. After the DFSClient is in the Aborted state, the YARN NM can't use that 
> cached instance of DistributedFileSystem.
> {noformat}
> 2014-10-29 20:26:23,991 INFO 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
>  Failed to download rsrc...
> java.io.IOException: Filesystem closed
> at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:727)
> at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1780)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1124)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1120)
> at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1120)
> at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:237)
> at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:340)
> at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:57)
> at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
> at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
> We can make YARN or DFSClient more tolerant of temporary NN unavailability. 
> Given that the call stack is YARN -> DistributedFileSystem -> DFSClient, this 
> can be addressed at different layers:
> * YARN closes the DistributedFileSystem object when it receives certain 
> well-defined exceptions; the next HDFS call then creates a new instance of 
> DistributedFileSystem. We would have to fix all the relevant places in YARN, 
> and other HDFS applications would need to address this as well.
> * DistributedFileSystem detects an aborted DFSClient and creates a new 
> instance of DFSClient. We would need to fix all the places where 
> DistributedFileSystem calls DFSClient.
> * After DFSClient gets into the Aborted state, it doesn't have to reject all 
> requests; instead it can retry, and if the NN becomes available again it can 
> transition back to the healthy state (a sketch of this option follows below).
> Comments?
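> A minimal sketch of that third option (hedged: illustrative names, not the 
> DFSClient implementation):
> {code}
> // Instead of a terminal Aborted state, track health and let a later
> // successful lease renewal restore the client.
> import java.io.IOException;
> import java.util.concurrent.atomic.AtomicBoolean;
>
> public class RecoverableLeaseClientSketch {
>   private final AtomicBoolean healthy = new AtomicBoolean(true);
>
>   /** Called periodically by the renewer thread. */
>   void renewLease() {
>     try {
>       doRenewRpc();
>       healthy.set(true);     // NN reachable again: recover automatically
>     } catch (Exception connectTimeout) {
>       healthy.set(false);    // degrade, but keep retrying on later ticks
>     }
>   }
>
>   /** Called at the top of every client operation. */
>   void checkOpen() throws IOException {
>     if (!healthy.get()) {
>       // Previously: "Filesystem closed" forever. Here the caller can
>       // retry; the next successful renewLease() clears the flag.
>       throw new IOException("NameNode temporarily unreachable, retry later");
>     }
>   }
>
>   private void doRenewRpc() throws Exception {
>     // Placeholder for the real renewLease RPC to the NameNode.
>   }
>
>   public static void main(String[] args) throws IOException {
>     RecoverableLeaseClientSketch c = new RecoverableLeaseClientSketch();
>     c.renewLease();
>     c.checkOpen();           // passes while healthy
>     System.out.println("client healthy");
>   }
> }
> {code}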



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-8046) Allow better control of getContentSummary

2016-01-26 Thread Vinod Kumar Vavilapalli (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-8046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HDFS-8046:
--
Fix Version/s: (was: 2.8.0)

> Allow better control of getContentSummary
> -
>
> Key: HDFS-8046
> URL: https://issues.apache.org/jira/browse/HDFS-8046
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Kihwal Lee
>Assignee: Kihwal Lee
>  Labels: 2.6.1-candidate, 2.7.2-candidate
> Fix For: 2.6.1, 2.7.2
>
> Attachments: HDFS-8046-branch-2.6.1.txt, HDFS-8046.v1.patch
>
>
> On busy clusters, users performing quota checks against a big directory 
> structure can affect namenode performance. This has become a lot better 
> since HDFS-4995, but as clusters get bigger and busier, it is apparent that 
> we need finer-grained control to avoid a long read lock causing a throughput 
> drop.
> Even with the unfair namesystem lock setting, a long read lock (tens of 
> milliseconds) can starve many readers and especially writers, so the locking 
> duration should be reduced. That can be done by imposing a lower 
> count-per-iteration limit in the existing implementation, but HDFS-4995 came 
> with a fixed amount of sleep between locks. The sleep needs to be made 
> configurable so that {{getContentSummary()}} doesn't become exceedingly slow 
> (see the sketch below).
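> A sketch of the yielding pattern this describes (hedged: parameter names are 
> illustrative, not the committed configuration keys):
> {code}
> // Hold the read lock for at most countPerLock items, then release it and
> // sleep for a configurable interval (HDFS-4995 used a fixed one) so that
> // writers are not starved by a long-running summary.
> import java.util.concurrent.locks.ReentrantReadWriteLock;
>
> public class YieldingSummarySketch {
>   private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock(false);
>
>   long summarize(int totalItems, int countPerLock, long sleepMs)
>       throws InterruptedException {
>     long count = 0;
>     int processed = 0;
>     while (processed < totalItems) {
>       lock.readLock().lock();
>       try {
>         // Do a bounded chunk of work under the lock.
>         int end = Math.min(processed + countPerLock, totalItems);
>         for (; processed < end; processed++) {
>           count++;           // stand-in for visiting one inode
>         }
>       } finally {
>         lock.readLock().unlock();
>       }
>       if (processed < totalItems) {
>         Thread.sleep(sleepMs); // configurable pause between acquisitions
>       }
>     }
>     return count;
>   }
>
>   public static void main(String[] args) throws InterruptedException {
>     System.out.println(new YieldingSummarySketch().summarize(10000, 1000, 1));
>   }
> }
> {code}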



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

