[jira] [Comment Edited] (HDFS-16028) Add a configuration item for special trash dir

2021-05-19 Thread Qi Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17347406#comment-17347406
 ] 

Qi Zhu edited comment on HDFS-16028 at 5/19/21, 8:24 AM:
-

Thanks [~zhengzhuobinzzb] for the patch.

1. We'd better add an enable flag to trigger this, rather than relying only on a null check.

2. We should also add the new conf to core-default.xml. 

3. We should add documentation for the getTrashHome method, consistent with 
getHomeDirectory.

cc [~hexiaoqiao] [~ayushtkn]   [~weichiu]  [~sodonnell] 

Could you help review this when you are free?
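To make point 3 concrete, here is a minimal sketch of what such a getTrashHome method could look like (only the "fs.trash.dir" key and the ${fs.trash.dir}/$USER/.Trash layout come from this issue; the enable-flag key and class name are hypothetical):
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

// Sketch only, not the actual patch: resolve the trash root from "fs.trash.dir"
// when the (hypothetical) enable flag is set, otherwise fall back to the home dir.
class TrashHomeResolver {
  static Path getTrashHome(Path homeDirectory, String userName, Configuration conf) {
    boolean enabled = conf.getBoolean("fs.trash.dir.enabled", false); // hypothetical flag
    String trashDir = conf.get("fs.trash.dir");                       // key from this issue
    if (!enabled || trashDir == null || trashDir.isEmpty()) {
      return homeDirectory;                           // default: trash under the home directory
    }
    return new Path(trashDir, userName + "/.Trash");  // ${fs.trash.dir}/$USER/.Trash
  }
}
{code}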

 


was (Author: zhuqi):
Thanks [~zhengzhuobinzzb] for the patch.

We should also add the new conf to core-default.xml. And we should add 
documentation for the getTrashHome method, consistent with getHomeDirectory.

cc [~hexiaoqiao] [~ayushtkn]   [~weichiu]  [~sodonnell] 

Could you help review this when you are free?

 

> Add a configuration item for special trash dir
> --
>
> Key: HDFS-16028
> URL: https://issues.apache.org/jira/browse/HDFS-16028
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: zhuobin zheng
>Assignee: zhuobin zheng
>Priority: Minor
>  Labels: pull-request-available
> Attachments: HDFS-16028.001.patch, HDFS-16028.002.patch
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> In some situations, we don't want to put trash in the home dir, for example:
>  # To immediately reduce the quota occupation of the home directory
>  # In RBF: we want the directory mounting strategy for trash to differ 
> from the home directory, and we don't want to mount it per user
> This patch adds the option "fs.trash.dir" to specify the trash dir 
> (${fs.trash.dir}/$USER/.Trash).






[jira] [Commented] (HDFS-16028) Add a configuration item for special trash dir

2021-05-19 Thread Qi Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17347406#comment-17347406
 ] 

Qi Zhu commented on HDFS-16028:
---

Thanks [~zhengzhuobinzzb] for the patch.

We should also add the new conf to core-default.xml. And we should add 
documentation for the getTrashHome method, consistent with getHomeDirectory.

cc [~hexiaoqiao] [~ayushtkn]   [~weichiu]  [~sodonnell] 

Could you help review this when you are free?

 

> Add a configuration item for special trash dir
> --
>
> Key: HDFS-16028
> URL: https://issues.apache.org/jira/browse/HDFS-16028
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: zhuobin zheng
>Assignee: zhuobin zheng
>Priority: Minor
>  Labels: pull-request-available
> Attachments: HDFS-16028.001.patch, HDFS-16028.002.patch
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> In some situations, we don't want to put trash in the home dir, for example:
>  # To immediately reduce the quota occupation of the home directory
>  # In RBF: we want the directory mounting strategy for trash to differ 
> from the home directory, and we don't want to mount it per user
> This patch adds the option "fs.trash.dir" to specify the trash dir 
> (${fs.trash.dir}/$USER/.Trash).






[jira] [Commented] (HDFS-8708) DFSClient should ignore dfs.client.retry.policy.enabled for HA proxies

2021-05-18 Thread Qi Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-8708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17346631#comment-17346631
 ] 

Qi Zhu commented on HDFS-8708:
--

cc [~hexiaoqiao] [~ayushtkn] 

I agree with [~chengbing.liu]: we should handle both the HA and non-HA cases. 
It actually happened in our production clusters.

Could you take a look at this?

Thanks.

> DFSClient should ignore dfs.client.retry.policy.enabled for HA proxies
> --
>
> Key: HDFS-8708
> URL: https://issues.apache.org/jira/browse/HDFS-8708
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 3.2.0, 3.1.2
>Reporter: Jitendra Nath Pandey
>Assignee: Chengbing Liu
>Priority: Critical
> Attachments: HDFS-8708.001.patch, HDFS-8708.002.patch
>
>
> DFSClient should ignore dfs.client.retry.policy.enabled for HA proxies to 
> ensure fast failover. Otherwise, the DFSClient retries the NameNode that is no 
> longer active and delays the failover.
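As an illustrative sketch of the intended behavior (not the actual DFSClient code; the class and helper here are hypothetical):
{code:java}
import org.apache.hadoop.conf.Configuration;

// Sketch: when an HA failover proxy is in use, ignore
// dfs.client.retry.policy.enabled so the client fails over immediately instead
// of retrying a NameNode that is no longer active.
class RetryPolicyChooser {
  static boolean useConfiguredRetryPolicy(Configuration conf, boolean isHaProxy) {
    boolean retryPolicyEnabled =
        conf.getBoolean("dfs.client.retry.policy.enabled", false);
    return retryPolicyEnabled && !isHaProxy;  // HA proxies rely on failover, not retries
  }
}
{code}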






[jira] [Commented] (HDFS-16003) ProcessReport print invalidatedBlocks should judge debug level at first

2021-05-10 Thread Qi Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17341776#comment-17341776
 ] 

Qi Zhu commented on HDFS-16003:
---

LGTM +1. 

> ProcessReport print invalidatedBlocks should judge debug level at first
> ---
>
> Key: HDFS-16003
> URL: https://issues.apache.org/jira/browse/HDFS-16003
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Affects Versions: 3.3.0
>Reporter: lei w
>Assignee: lei w
>Priority: Minor
> Attachments: HDFS-16003.001.path, HDFS-16003.patch
>
>
> In the BlockManager#processReport() method, we print the invalidated blocks if 
> the log level is debug. We always traverse this invalidatedBlocks list without 
> considering the log level. I suggest checking the log level before 
> printing, which saves the traversal time when the log level is info.
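For illustration, the suggested guard could look like the sketch below (not the exact BlockManager code; the logger and collection are stand-ins):
{code:java}
import java.util.Collection;
import org.slf4j.Logger;

// Sketch: check the log level first so the invalidatedBlocks list is only
// traversed when debug logging is actually enabled.
class InvalidatedBlockLogger {
  static void logInvalidatedBlocks(Logger blockLog, Collection<?> invalidatedBlocks) {
    if (blockLog.isDebugEnabled()) {
      for (Object block : invalidatedBlocks) {
        blockLog.debug("BLOCK* processReport: invalidated block {}", block);
      }
    }
  }
}
{code}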






[jira] [Commented] (HDFS-15968) Improve the log for The DecayRpcScheduler

2021-05-10 Thread Qi Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17341675#comment-17341675
 ] 

Qi Zhu commented on HDFS-15968:
---

Thanks [~bpatel] for the contribution.

The patch LGTM +1.

> Improve the log for The DecayRpcScheduler 
> --
>
> Key: HDFS-15968
> URL: https://issues.apache.org/jira/browse/HDFS-15968
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Bhavik Patel
>Assignee: Bhavik Patel
>Priority: Minor
> Attachments: HDFS-15968.001.patch
>
>
> Improve the logging in DecayRpcScheduler to make use of the SLF4J logger 
> factory.
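For reference, the kind of change described would look roughly like the sketch below (illustrative only, not the actual patch):
{code:java}
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Sketch: obtain the logger from the SLF4J LoggerFactory instead of the older
// commons-logging LogFactory.
class DecayRpcSchedulerLoggingSketch {
  static final Logger LOG =
      LoggerFactory.getLogger(DecayRpcSchedulerLoggingSketch.class);
}
{code}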






[jira] [Comment Edited] (HDFS-15994) Deletion should sleep some time, when there are too many pending deletion blocks.

2021-05-07 Thread Qi Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17340857#comment-17340857
 ] 

Qi Zhu edited comment on HDFS-15994 at 5/7/21, 2:08 PM:


Thanks a lot [~hexiaoqiao] for the reply.

The deletion process:

1) Namespace delete: remove the file-related metadata from the inode tree.

2) Remove blocks: remove the blocks from the BlockMap and add them to 
InvalidateBlocks.

3) Wait for the ReplicationMonitor to trigger the delete work and send 
heartbeats to the DataNodes for deletion.

Step 2 accounts for about 90% of the RPC handler cost of deletion; step 3 is 
async and does not affect the RPC handlers.

About using `release lock - sleep - acquire lock` to avoid the NameNode hanging 
for a long time: I am not sure how else to keep the lock from being too busy, 
so I just offered this option for releasing it.

As for multi-threaded deletion possibly using too many handlers, we can make 
the deletion async to release the handlers in a follow-up Jira.

For this Jira, we can discuss how to avoid the lock being too busy besides the 
`release lock - sleep - acquire lock` option.

cc [~weichiu]  [~sodonnell] [~ayushtkn]

What are your opinions?

Thanks.
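For illustration, a minimal sketch of the `release lock - sleep - acquire lock` idea (the class, helpers, threshold, and sleep interval below are hypothetical stand-ins, not the actual FSNamesystem/BlockManager code):
{code:java}
import java.util.List;
import java.util.Queue;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Sketch only: remove blocks in batches and yield the write lock between
// batches when too many deletion blocks are pending, so other waiters can run.
class BatchedBlockRemover {
  private static final int PENDING_DELETION_LIMIT = 1_000_000; // hypothetical threshold
  private static final long SLEEP_BETWEEN_BATCHES_MS = 10;     // hypothetical sleep time

  void removeBlocks(List<List<Long>> batches, ReentrantReadWriteLock nsLock,
                    Queue<Long> invalidateBlocks) throws InterruptedException {
    nsLock.writeLock().lock();
    try {
      for (List<Long> batch : batches) {
        invalidateBlocks.addAll(batch);              // step 2: move blocks to InvalidateBlocks
        if (invalidateBlocks.size() > PENDING_DELETION_LIMIT) {
          nsLock.writeLock().unlock();               // release lock ...
          try {
            Thread.sleep(SLEEP_BETWEEN_BATCHES_MS);  // ... sleep ...
          } finally {
            nsLock.writeLock().lock();               // ... reacquire lock
          }
        }
      }
    } finally {
      nsLock.writeLock().unlock();
    }
  }
}
{code}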

 

 


was (Author: zhuqi):
Thanks a lot [~hexiaoqiao] for the reply.

The deletion process:

1) Namespace delete: remove the file-related metadata from the inode tree.

2) Remove blocks: remove the blocks from the BlockMap and add them to 
InvalidateBlocks.

3) Wait for the ReplicationMonitor to trigger the delete work and send 
heartbeats to the DataNodes for deletion.

Step 2 accounts for about 90% of the RPC handler cost of deletion; step 3 is 
async and does not affect the RPC handlers.

About using `release lock - sleep - acquire lock` to avoid the NameNode hanging 
for a long time: I am not sure how else to keep the lock from being too busy, 
so I just offered this option for releasing it.

As for multi-threaded deletion possibly using too many handlers, we can make 
the deletion async to release the handlers in a follow-up Jira.

For this Jira, we can discuss how to avoid the lock being too busy besides the 
`release lock - sleep - acquire lock` option.

cc [~weichiu]  [~sodonnell] [~ayushtkn]

What are your opinions?

Thanks.

 

 

> Deletion should sleep some time, when there are too many pending deletion 
> blocks.
> -
>
> Key: HDFS-15994
> URL: https://issues.apache.org/jira/browse/HDFS-15994
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: HDFS-15994.001.patch
>
>
> HDFS-13831 realized that we can control how frequently other waiters get a 
> chance at the lock.
> But in our big cluster with heavy deletion the problem still happens: the 
> pending deletion blocks sometimes exceed ten million, and regularly exceed 
> one million in huge clusters.
> So I think we should sleep for some time when too many deletion blocks are 
> pending.






[jira] [Commented] (HDFS-15994) Deletion should sleep some time, when there are too many pending deletion blocks.

2021-05-07 Thread Qi Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17340857#comment-17340857
 ] 

Qi Zhu commented on HDFS-15994:
---

Thanks a lot [~hexiaoqiao] for the reply.

The deletion process:

1) Namespace delete: remove the file-related metadata from the inode tree.

2) Remove blocks: remove the blocks from the BlockMap and add them to 
InvalidateBlocks.

3) Wait for the ReplicationMonitor to trigger the delete work and send 
heartbeats to the DataNodes for deletion.

Step 2 accounts for about 90% of the RPC handler cost of deletion; step 3 is 
async and does not affect the RPC handlers.

About using `release lock - sleep - acquire lock` to avoid the NameNode hanging 
for a long time: I am not sure how else to keep the lock from being too busy, 
so I just offered this option for releasing it.

As for multi-threaded deletion possibly using too many handlers, we can make 
the deletion async to release the handlers in a follow-up Jira.

For this Jira, we can discuss how to avoid the lock being too busy besides the 
`release lock - sleep - acquire lock` option.

cc [~weichiu]  [~sodonnell] [~ayushtkn]

What are your opinions?

Thanks.

 

 

> Deletion should sleep some time, when there are too many pending deletion 
> blocks.
> -
>
> Key: HDFS-15994
> URL: https://issues.apache.org/jira/browse/HDFS-15994
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: HDFS-15994.001.patch
>
>
> HDFS-13831 realized that we can control how frequently other waiters get a 
> chance at the lock.
> But in our big cluster with heavy deletion the problem still happens: the 
> pending deletion blocks sometimes exceed ten million, and regularly exceed 
> one million in huge clusters.
> So I think we should sleep for some time when too many deletion blocks are 
> pending.






[jira] [Commented] (HDFS-15934) Make DirectoryScanner reconcile blocks batch size and interval between batch configurable.

2021-05-05 Thread Qi Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17339963#comment-17339963
 ] 

Qi Zhu commented on HDFS-15934:
---

Thanks [~ayushtkn] for the commit.

> Make DirectoryScanner reconcile blocks batch size and interval between batch 
> configurable.
> --
>
> Key: HDFS-15934
> URL: https://issues.apache.org/jira/browse/HDFS-15934
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> HDFS-14476 made this a batched operation to avoid holding the lock for too long, 
> but different clusters have different demands, so we should make the batch size 
> and the batch interval configurable.
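For illustration, reading such settings could look like the sketch below (the key names and defaults here are hypothetical examples, not necessarily the keys added by the patch):
{code:java}
import org.apache.hadoop.conf.Configuration;

// Sketch: make the reconcile batch size and the interval between batches
// configurable instead of hard-coded. Key names below are hypothetical.
class ReconcileConfigSketch {
  static int batchSize(Configuration conf) {
    return conf.getInt("dfs.datanode.reconcile.blocks.batch.size", 1000);
  }
  static long batchIntervalMs(Configuration conf) {
    return conf.getLong("dfs.datanode.reconcile.blocks.batch.interval", 2000L);
  }
}
{code}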






[jira] [Commented] (HDFS-13904) ContentSummary does not always respect processing limit, resulting in long lock acquisitions

2021-04-26 Thread Qi Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-13904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17331878#comment-17331878
 ] 

Qi Zhu commented on HDFS-13904:
---

Is this still in progress?

I met this problem as well.

Thanks.

> ContentSummary does not always respect processing limit, resulting in long 
> lock acquisitions
> 
>
> Key: HDFS-13904
> URL: https://issues.apache.org/jira/browse/HDFS-13904
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs, namenode
>Reporter: Erik Krogen
>Assignee: Erik Krogen
>Priority: Major
>
> HDFS-4995 added a config {{dfs.content-summary.limit}} which allows for an 
> administrator to set a limit on the number of entries processed during a 
> single acquisition of the {{FSNamesystemLock}} during the creation of a 
> content summary. This is useful to prevent very long (multiple seconds) 
> pauses on the NameNode when {{getContentSummary}} is called on large 
> directories.
> However, even on versions with HDFS-4995, we have seen warnings like:
> {code}
> INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: FSNamesystem read 
> lock held for 9398 ms via
> java.lang.Thread.getStackTrace(Thread.java:1552)
> org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:950)
> org.apache.hadoop.hdfs.server.namenode.FSNamesystemLock.readUnlock(FSNamesystemLock.java:188)
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.readUnlock(FSNamesystem.java:1486)
> org.apache.hadoop.hdfs.server.namenode.ContentSummaryComputationContext.yield(ContentSummaryComputationContext.java:109)
> org.apache.hadoop.hdfs.server.namenode.INodeDirectory.computeDirectoryContentSummary(INodeDirectory.java:679)
> org.apache.hadoop.hdfs.server.namenode.INodeDirectory.computeContentSummary(INodeDirectory.java:642)
> org.apache.hadoop.hdfs.server.namenode.INodeDirectory.computeDirectoryContentSummary(INodeDirectory.java:656)
> {code}
> happen quite consistently when {{getContentSummary}} was called on a large 
> directory on a heavily-loaded NameNode. Such long pauses completely destroy 
> the performance of the NameNode. We have the limit set to its default of 
> 5000; if it was respected, clearly there would not be a 10-second pause.
> The current {{yield()}} code within {{ContentSummaryComputationContext}} 
> looks like:
> {code}
>   public boolean yield() {
> // Are we set up to do this?
> if (limitPerRun <= 0 || dir == null || fsn == null) {
>   return false;
> }
> // Have we reached the limit?
> long currentCount = counts.getFileCount() +
> counts.getSymlinkCount() +
> counts.getDirectoryCount() +
> counts.getSnapshotableDirectoryCount();
> if (currentCount <= nextCountLimit) {
>   return false;
> }
> // Update the next limit
> nextCountLimit = currentCount + limitPerRun;
> boolean hadDirReadLock = dir.hasReadLock();
> boolean hadDirWriteLock = dir.hasWriteLock();
> boolean hadFsnReadLock = fsn.hasReadLock();
> boolean hadFsnWriteLock = fsn.hasWriteLock();
> // sanity check.
> if (!hadDirReadLock || !hadFsnReadLock || hadDirWriteLock ||
> hadFsnWriteLock || dir.getReadHoldCount() != 1 ||
> fsn.getReadHoldCount() != 1) {
>   // cannot relinquish
>   return false;
> }
> // unlock
> dir.readUnlock();
> fsn.readUnlock("contentSummary");
> try {
>   Thread.sleep(sleepMilliSec, sleepNanoSec);
> } catch (InterruptedException ie) {
> } finally {
>   // reacquire
>   fsn.readLock();
>   dir.readLock();
> }
> yieldCount++;
> return true;
>   }
> {code}
> We believe that this check in particular is the culprit:
> {code}
> if (!hadDirReadLock || !hadFsnReadLock || hadDirWriteLock ||
> hadFsnWriteLock || dir.getReadHoldCount() != 1 ||
> fsn.getReadHoldCount() != 1) {
>   // cannot relinquish
>   return false;
> }
> {code}
> The content summary computation will only relinquish the lock if it is 
> currently the _only_ holder of the lock. Given the high volume of read 
> requests on a heavily loaded NameNode, especially when unfair locking is 
> enabled, it is likely there may be another holder of the read lock performing 
> some short-lived operation. By refusing to give up the lock in this case, the 
> content summary computation ends up never relinquishing the lock.
> We propose to simply remove the readHoldCount checks from this {{yield()}}. 
> This should alleviate the case described above by giving up the read lock and 
> allowing other short-lived operations to complete (while the content summary 
> thread sleeps) so that the lock can finally be given up completely. This has 
> the drawback that sometimes, the content summary may give up the lock 
> 

[jira] [Comment Edited] (HDFS-14617) Improve fsimage load time by writing sub-sections to the fsimage index

2021-04-24 Thread Qi Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17331414#comment-17331414
 ] 

Qi Zhu edited comment on HDFS-14617 at 4/25/21, 5:49 AM:
-

cc [~weichiu] [~sodonnell] [~hexiaoqiao] 

Could you help backport this to 3.2.2 and 3.2.1? Our production clusters need to 
use it in 3.2.2.

Thanks.


was (Author: zhuqi):
cc [~sodonnell] [~hexiaoqiao] 

Could you help backport this to 3.2.2 and 3.2.1? Our production clusters need to 
use it in 3.2.2.

Thanks.

> Improve fsimage load time by writing sub-sections to the fsimage index
> --
>
> Key: HDFS-14617
> URL: https://issues.apache.org/jira/browse/HDFS-14617
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: Stephen O'Donnell
>Assignee: Stephen O'Donnell
>Priority: Major
> Fix For: 2.10.0, 3.3.0
>
> Attachments: HDFS-14617.001.patch, ParallelLoading.svg, 
> SerialLoading.svg, dirs-single.svg, flamegraph.parallel.svg, 
> flamegraph.serial.svg, inodes.svg
>
>
> Loading an fsimage is basically a single threaded process. The current 
> fsimage is written out in sections, eg iNode, iNode_Directory, Snapshots, 
> Snapshot_Diff etc. Then at the end of the file, an index is written that 
> contains the offset and length of each section. The image loader code uses 
> this index to initialize an input stream to read and process each section. It 
> is important that one section is fully loaded before another is started, as 
> the next section depends on the results of the previous one.
> What I would like to propose is the following:
> 1. When writing the image, we can optionally output sub_sections to the 
> index. That way, a given section would effectively be split into several 
> sections, eg:
> {code:java}
>inode_section offset 10 length 1000
>  inode_sub_section offset 10 length 500
>  inode_sub_section offset 510 length 500
>  
>inode_dir_section offset 1010 length 1000
>  inode_dir_sub_section offset 1010 length 500
>  inode_dir_sub_section offset 1010 length 500
> {code}
> Here you can see we still have the original section index, but then we also 
> have sub-section entries that cover the entire section. Then a processor can 
> either read the full section in serial, or read each sub-section in parallel.
> 2. In the Image Writer code, we should set a target number of sub-sections, 
> and then based on the total inodes in memory, it will create that many 
> sub-sections per major image section. I think the only sections worth doing 
> this for are inode, inode_reference, inode_dir and snapshot_diff. All others 
> tend to be fairly small in practice.
> 3. If there are under some threshold of inodes (eg 10M) then don't bother 
> with the sub-sections as a serial load only takes a few seconds at that scale.
> 4. The image loading code can then have a switch to enable 'parallel loading' 
> and a 'number of threads' where it uses the sub-sections, or if not enabled 
> falls back to the existing logic to read the entire section in serial.
> Working with a large image of 316M inodes and 35GB on disk, I have a proof of 
> concept of this change working, allowing just inode and inode_dir to be 
> loaded in parallel, but I believe inode_reference and snapshot_diff can be 
> made parallel with the same technique.
> Some benchmarks I have are as follows:
> {code:java}
> Threads   1 2 3 4 
> 
> inodes448   290   226   189 
> inode_dir 326   211   170   161 
> Total 927   651   535   488 (MD5 calculation about 100 seconds)
> {code}
> The above table shows the time in seconds to load the inode section and the 
> inode_directory section, and then the total load time of the image.
> With 4 threads using the above technique, we are able to more than halve the 
> load time of the two sections. With the patch in HDFS-13694 it would take a 
> further 100 seconds off the run time, going from 927 seconds to 388, which is 
> a significant improvement. Adding more threads beyond 4 has diminishing 
> returns as there are some synchronized points in the loading code to protect 
> the in memory structures.






[jira] [Commented] (HDFS-14617) Improve fsimage load time by writing sub-sections to the fsimage index

2021-04-24 Thread Qi Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17331414#comment-17331414
 ] 

Qi Zhu commented on HDFS-14617:
---

cc [~sodonnell] [~hexiaoqiao] 

Could you help backport this to 3.2.2 and 3.2.1? Our production clusters need to 
use it in 3.2.2.

Thanks.

> Improve fsimage load time by writing sub-sections to the fsimage index
> --
>
> Key: HDFS-14617
> URL: https://issues.apache.org/jira/browse/HDFS-14617
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: Stephen O'Donnell
>Assignee: Stephen O'Donnell
>Priority: Major
> Fix For: 2.10.0, 3.3.0
>
> Attachments: HDFS-14617.001.patch, ParallelLoading.svg, 
> SerialLoading.svg, dirs-single.svg, flamegraph.parallel.svg, 
> flamegraph.serial.svg, inodes.svg
>
>
> Loading an fsimage is basically a single threaded process. The current 
> fsimage is written out in sections, eg iNode, iNode_Directory, Snapshots, 
> Snapshot_Diff etc. Then at the end of the file, an index is written that 
> contains the offset and length of each section. The image loader code uses 
> this index to initialize an input stream to read and process each section. It 
> is important that one section is fully loaded before another is started, as 
> the next section depends on the results of the previous one.
> What I would like to propose is the following:
> 1. When writing the image, we can optionally output sub_sections to the 
> index. That way, a given section would effectively be split into several 
> sections, eg:
> {code:java}
>inode_section offset 10 length 1000
>  inode_sub_section offset 10 length 500
>  inode_sub_section offset 510 length 500
>  
>inode_dir_section offset 1010 length 1000
>  inode_dir_sub_section offset 1010 length 500
>  inode_dir_sub_section offset 1010 length 500
> {code}
> Here you can see we still have the original section index, but then we also 
> have sub-section entries that cover the entire section. Then a processor can 
> either read the full section in serial, or read each sub-section in parallel.
> 2. In the Image Writer code, we should set a target number of sub-sections, 
> and then based on the total inodes in memory, it will create that many 
> sub-sections per major image section. I think the only sections worth doing 
> this for are inode, inode_reference, inode_dir and snapshot_diff. All others 
> tend to be fairly small in practice.
> 3. If there are under some threshold of inodes (eg 10M) then don't bother 
> with the sub-sections as a serial load only takes a few seconds at that scale.
> 4. The image loading code can then have a switch to enable 'parallel loading' 
> and a 'number of threads' where it uses the sub-sections, or if not enabled 
> falls back to the existing logic to read the entire section in serial.
> Working with a large image of 316M inodes and 35GB on disk, I have a proof of 
> concept of this change working, allowing just inode and inode_dir to be 
> loaded in parallel, but I believe inode_reference and snapshot_diff can be 
> made parallel with the same technique.
> Some benchmarks I have are as follows:
> {code:java}
> Threads   1 2 3 4 
> 
> inodes448   290   226   189 
> inode_dir 326   211   170   161 
> Total 927   651   535   488 (MD5 calculation about 100 seconds)
> {code}
> The above table shows the time in seconds to load the inode section and the 
> inode_directory section, and then the total load time of the image.
> With 4 threads using the above technique, we are able to more than halve the 
> load time of the two sections. With the patch in HDFS-13694 it would take a 
> further 100 seconds off the run time, going from 927 seconds to 388, which is 
> a significant improvement. Adding more threads beyond 4 has diminishing 
> returns as there are some synchronized points in the loading code to protect 
> the in memory structures.






[jira] [Commented] (HDFS-13831) Make block increment deletion number configurable

2021-04-22 Thread Qi Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-13831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17327227#comment-17327227
 ] 

Qi Zhu commented on HDFS-13831:
---

[~weichiu] [~linyiqun] [~gaofeng6] 

I created HDFS-15994 to make this more usable in huge clusters with heavy 
deletion.

Thanks.

> Make block increment deletion number configurable
> -
>
> Key: HDFS-13831
> URL: https://issues.apache.org/jira/browse/HDFS-13831
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Affects Versions: 3.1.0
>Reporter: Yiqun Lin
>Assignee: Ryan Wu
>Priority: Major
> Fix For: 2.10.0, 3.2.0, 3.0.4, 3.1.2
>
> Attachments: HDFS-13831.001.patch, HDFS-13831.002.patch, 
> HDFS-13831.003.patch, HDFS-13831.004.patch, HDFS-13831.branch-3.0.001.patch
>
>
> When the NN deletes a large directory, it will hold the write lock for a long 
> time. To improve this, we remove the blocks in a batched way so that other 
> waiters have a chance to get the lock. But right now, the batch number is a 
> hard-coded value.
> {code:java}
>   static int BLOCK_DELETION_INCREMENT = 1000;
> {code}
> We can make this value configurable, so that we can control how frequently 
> other waiters get a chance at the lock.
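For illustration, making the increment configurable amounts to something like the sketch below (treat the key name and default as illustrative here):
{code:java}
import org.apache.hadoop.conf.Configuration;

// Sketch: read the block deletion batch size from configuration instead of the
// hard-coded BLOCK_DELETION_INCREMENT = 1000.
class BlockDeletionIncrementSketch {
  static int blockDeletionIncrement(Configuration conf) {
    return conf.getInt("dfs.namenode.block.deletion.increment", 1000);
  }
}
{code}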






[jira] [Comment Edited] (HDFS-15994) Deletion should sleep some time, when there are too many pending deletion blocks.

2021-04-22 Thread Qi Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17327218#comment-17327218
 ] 

Qi Zhu edited comment on HDFS-15994 at 4/22/21, 9:50 AM:
-

cc [~weichiu] [~hexiaoqiao] [~sodonnell] [~ayushtkn]  [~linyiqun] 
[~jianliang.wu]  

I submitted a patch for review; I think we should improve block deletions 
in huge clusters.

What are your opinions on this?

Thanks.


was (Author: zhuqi):
cc [~weichiu] [~hexiaoqiao] [~sodonnell] [~ayushtkn]  [~linyiqun] 
[~jianliang.wu]  

I submitted a patch for review; I think we should improve block deletions 
in huge clusters.

What are your opinions on this?

Thanks.

> Deletion should sleep some time, when there are too many pending deletion 
> blocks.
> -
>
> Key: HDFS-15994
> URL: https://issues.apache.org/jira/browse/HDFS-15994
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: HDFS-15994.001.patch
>
>
> HDFS-13831 realized that we can control how frequently other waiters get a 
> chance at the lock.
> But in our big cluster with heavy deletion the problem still happens: the 
> pending deletion blocks sometimes exceed ten million, and regularly exceed 
> one million in huge clusters.
> So I think we should sleep for some time when too many deletion blocks are 
> pending.






[jira] [Commented] (HDFS-15994) Deletion should sleep some time, when there are too many pending deletion blocks.

2021-04-22 Thread Qi Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17327218#comment-17327218
 ] 

Qi Zhu commented on HDFS-15994:
---

cc [~weichiu] [~hexiaoqiao] [~sodonnell] [~ayushtkn]  [~linyiqun] 
[~jianliang.wu]  

I submitted a patch for review; I think we should improve block deletions 
in huge clusters.

What are your opinions on this?

Thanks.

> Deletion should sleep some time, when there are too many pending deletion 
> blocks.
> -
>
> Key: HDFS-15994
> URL: https://issues.apache.org/jira/browse/HDFS-15994
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: HDFS-15994.001.patch
>
>
> HDFS-13831 realized that we can control how frequently other waiters get a 
> chance at the lock.
> But in our big cluster with heavy deletion the problem still happens: the 
> pending deletion blocks sometimes exceed ten million, and regularly exceed 
> one million in huge clusters.
> So I think we should sleep for some time when too many deletion blocks are 
> pending.






[jira] [Updated] (HDFS-15994) Deletion should sleep some time, when there are too many pending deletion blocks.

2021-04-22 Thread Qi Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qi Zhu updated HDFS-15994:
--
Attachment: HDFS-15994.001.patch
Status: Patch Available  (was: Open)

> Deletion should sleep some time, when there are too many pending deletion 
> blocks.
> -
>
> Key: HDFS-15994
> URL: https://issues.apache.org/jira/browse/HDFS-15994
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: HDFS-15994.001.patch
>
>
> HDFS-13831 realized that we can control how frequently other waiters get a 
> chance at the lock.
> But in our big cluster with heavy deletion the problem still happens: the 
> pending deletion blocks sometimes exceed ten million, and regularly exceed 
> one million in huge clusters.
> So I think we should sleep for some time when too many deletion blocks are 
> pending.






[jira] [Updated] (HDFS-15994) Deletion should sleep some time, when there are too many pending deletion blocks.

2021-04-22 Thread Qi Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qi Zhu updated HDFS-15994:
--
Description: 
HDFS-13831 realized that we can control how frequently other waiters get a 
chance at the lock.

But in our big cluster with heavy deletion the problem still happens: the 
pending deletion blocks sometimes exceed ten million, and regularly exceed one 
million in huge clusters.

So I think we should sleep for some time when too many deletion blocks are 
pending.

> Deletion should sleep some time, when there are too many pending deletion 
> blocks.
> -
>
> Key: HDFS-15994
> URL: https://issues.apache.org/jira/browse/HDFS-15994
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
>
> HDFS-13831 realized that we can control how frequently other waiters get a 
> chance at the lock.
> But in our big cluster with heavy deletion the problem still happens: the 
> pending deletion blocks sometimes exceed ten million, and regularly exceed 
> one million in huge clusters.
> So I think we should sleep for some time when too many deletion blocks are 
> pending.






[jira] [Created] (HDFS-15994) Deletion should sleep some time, when there are too many pending deletion blocks.

2021-04-22 Thread Qi Zhu (Jira)
Qi Zhu created HDFS-15994:
-

 Summary: Deletion should sleep some time, when there are too many 
pending deletion blocks.
 Key: HDFS-15994
 URL: https://issues.apache.org/jira/browse/HDFS-15994
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: Qi Zhu
Assignee: Qi Zhu









[jira] [Comment Edited] (HDFS-14525) JspHelper ignores hadoop.http.authentication.type

2021-04-22 Thread Qi Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17327104#comment-17327104
 ] 

Qi Zhu edited comment on HDFS-14525 at 4/22/21, 6:52 AM:
-

[~prabhujoseph]  [~eyang] [~daryn]

I also think this is needed; we could add an option to support it.

We can add an option to make these two independent: 
hadoop.security.authentication is specific to RPC authentication, whereas 
hadoop.http.authentication.type is specific to HTTP authentication.

We want HTTP without authentication, but RPC with authentication.

 

How do we handle the following case:

1. The HTTP authentication is simple, and we don't want browser access to 
require a keytab.

2. The service RPC is Kerberos based.

3. For WebHDFS we also want to use Kerberos.

 

But with HADOOP-16354,

JspHelper#getUGI:
{code:java}
if (UserGroupInformation.isSecurityEnabled()) {
  remoteUser = request.getRemoteUser();
  final String tokenString = request.getParameter(DELEGATION_PARAMETER_NAME);
  if (tokenString != null) {

// user.name, doas param is ignored in the token-based auth
ugi = getTokenUGI(context, request, tokenString, conf);
  } else if (remoteUser == null) {
throw new IOException(
"Security enabled but user not authenticated by filter");
  }
}
{code}
will get a null remoteUser here, because we don't get a principal with the 
simple auth type.

So the command hadoop fs -ls webhdfs://host:port/ 
will throw "Security enabled but user not authenticated by filter".

What are your opinions, and how should we solve it?

Thanks.
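For illustration, the alternative check discussed above could look like this sketch (illustrative only, not the actual JspHelper code):
{code:java}
import org.apache.hadoop.conf.Configuration;

// Sketch: decide whether HTTP is "secure" from hadoop.http.authentication.type
// rather than from the RPC-level UserGroupInformation.isSecurityEnabled(), so
// simple HTTP auth can coexist with Kerberos RPC.
class HttpAuthCheckSketch {
  static boolean isHttpKerberos(Configuration conf) {
    return "kerberos".equals(conf.get("hadoop.http.authentication.type", "simple"));
  }
}
{code}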

 


was (Author: zhuqi):
[~prabhujoseph] 

I also think this is needed; we could add an option to support it.

We can add an option to make these two independent: 
hadoop.security.authentication is specific to RPC authentication, whereas 
hadoop.http.authentication.type is specific to HTTP authentication.

We want HTTP without authentication, but RPC with authentication.

 

> JspHelper ignores hadoop.http.authentication.type
> -
>
> Key: HDFS-14525
> URL: https://issues.apache.org/jira/browse/HDFS-14525
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: webhdfs
>Affects Versions: 3.2.0
>Reporter: Prabhu Joseph
>Priority: Major
>
> On a secure cluster with hadoop.http.authentication.type simple and 
> hadoop.http.authentication.anonymous.allowed true, the WebHDFS REST API fails 
> when user.name is not set. It runs fine if user.name=ambari-qa is set.
> {code}
> [knox@pjosephdocker-1 ~]$ curl -sS -L -w '%{http_code}' -X GET -d '' -H 
> 'Content-Length: 0' --negotiate -u : 
> 'http://pjosephdocker-1.openstacklocal:50070/webhdfs/v1/services/sync/yarn-ats?op=GETFILESTATUS'
> {"RemoteException":{"exception":"SecurityException","javaClassName":"java.lang.SecurityException","message":"Failed
>  to obtain user group information: java.io.IOException: Security enabled but 
> user not authenticated by filter"}}403[knox@pjosephdocker-1 ~]$ 
> {code}
> JspHelper#getUGI checks UserGroupInformation.isSecurityEnabled() instead of 
> conf.get(hadoop.http.authentication.type).equals("kerberos") to decide whether 
> HTTP is secure, which causes the issue.






[jira] [Commented] (HDFS-14525) JspHelper ignores hadoop.http.authentication.type

2021-04-21 Thread Qi Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-14525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17327104#comment-17327104
 ] 

Qi Zhu commented on HDFS-14525:
---

[~prabhujoseph] 

I also think this is needed; we could add an option to support it.

We can add an option to make these two independent: 
hadoop.security.authentication is specific to RPC authentication, whereas 
hadoop.http.authentication.type is specific to HTTP authentication.

We want HTTP without authentication, but RPC with authentication.

 

> JspHelper ignores hadoop.http.authentication.type
> -
>
> Key: HDFS-14525
> URL: https://issues.apache.org/jira/browse/HDFS-14525
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: webhdfs
>Affects Versions: 3.2.0
>Reporter: Prabhu Joseph
>Priority: Major
>
> On a secure cluster with hadoop.http.authentication.type simple and 
> hadoop.http.authentication.anonymous.allowed true, the WebHDFS REST API fails 
> when user.name is not set. It runs fine if user.name=ambari-qa is set.
> {code}
> [knox@pjosephdocker-1 ~]$ curl -sS -L -w '%{http_code}' -X GET -d '' -H 
> 'Content-Length: 0' --negotiate -u : 
> 'http://pjosephdocker-1.openstacklocal:50070/webhdfs/v1/services/sync/yarn-ats?op=GETFILESTATUS'
> {"RemoteException":{"exception":"SecurityException","javaClassName":"java.lang.SecurityException","message":"Failed
>  to obtain user group information: java.io.IOException: Security enabled but 
> user not authenticated by filter"}}403[knox@pjosephdocker-1 ~]$ 
> {code}
> JspHelper#getUGI checks UserGroupInformation.isSecurityEnabled() instead of 
> conf.get(hadoop.http.authentication.type).equals("kerberos") to decide whether 
> HTTP is secure, which causes the issue.






[jira] [Commented] (HDFS-15601) Batch listing: gracefully fallback to use non-batched listing when NameNode doesn't support the feature

2021-04-04 Thread Qi Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17314657#comment-17314657
 ] 

Qi Zhu commented on HDFS-15601:
---

[~ayushtkn] [~hexiaoqiao]

Could you help review this when you are free?

> Batch listing: gracefully fallback to use non-batched listing when NameNode 
> doesn't support the feature
> ---
>
> Key: HDFS-15601
> URL: https://issues.apache.org/jira/browse/HDFS-15601
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: hdfs
>Reporter: Chao Sun
>Assignee: Qi Zhu
>Priority: Major
> Attachments: HDFS-15601.001.patch
>
>
> HDFS-13616 requires both server- and client-side changes. However, it is common 
> that users use a newer client to talk to an older HDFS (say 2.10). Currently the 
> client will simply fail in this scenario. A better approach, perhaps, is to 
> have the client fall back to non-batched listing on the input directories.
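For illustration, the fallback could be shaped like the sketch below (the batched helper is a hypothetical stand-in, not the real DistributedFileSystem API):
{code:java}
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: try the batched listing first and, if the NameNode does not support
// it, fall back to the ordinary non-batched listStatus per input directory.
class BatchedListingFallbackSketch {
  List<FileStatus> listAll(FileSystem fs, List<Path> paths) throws IOException {
    try {
      return batchedList(fs, paths);
    } catch (UnsupportedOperationException e) {
      // Older NameNode: non-batched fallback.
      List<FileStatus> results = new ArrayList<>();
      for (Path p : paths) {
        Collections.addAll(results, fs.listStatus(p));
      }
      return results;
    }
  }

  // Hypothetical stand-in for the batched listing RPC.
  List<FileStatus> batchedList(FileSystem fs, List<Path> paths) throws IOException {
    throw new UnsupportedOperationException("batched listing not supported");
  }
}
{code}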






[jira] [Commented] (HDFS-15601) Batch listing: gracefully fallback to use non-batched listing when NameNode doesn't support the feature

2021-04-04 Thread Qi Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17314516#comment-17314516
 ] 

Qi Zhu commented on HDFS-15601:
---

I will take it, since no one has taken it yet.

> Batch listing: gracefully fallback to use non-batched listing when NameNode 
> doesn't support the feature
> ---
>
> Key: HDFS-15601
> URL: https://issues.apache.org/jira/browse/HDFS-15601
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: hdfs
>Reporter: Chao Sun
>Assignee: Qi Zhu
>Priority: Major
> Attachments: HDFS-15601.001.patch
>
>
> HDFS-13616 requires both server- and client-side changes. However, it is common 
> that users use a newer client to talk to an older HDFS (say 2.10). Currently the 
> client will simply fail in this scenario. A better approach, perhaps, is to 
> have the client fall back to non-batched listing on the input directories.






[jira] [Updated] (HDFS-15601) Batch listing: gracefully fallback to use non-batched listing when NameNode doesn't support the feature

2021-04-04 Thread Qi Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qi Zhu updated HDFS-15601:
--
Attachment: HDFS-15601.001.patch
Status: Patch Available  (was: Open)

> Batch listing: gracefully fallback to use non-batched listing when NameNode 
> doesn't support the feature
> ---
>
> Key: HDFS-15601
> URL: https://issues.apache.org/jira/browse/HDFS-15601
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: hdfs
>Reporter: Chao Sun
>Assignee: Qi Zhu
>Priority: Major
> Attachments: HDFS-15601.001.patch
>
>
> HDFS-13616 requires both server- and client-side changes. However, it is common 
> that users use a newer client to talk to an older HDFS (say 2.10). Currently the 
> client will simply fail in this scenario. A better approach, perhaps, is to 
> have the client fall back to non-batched listing on the input directories.






[jira] [Assigned] (HDFS-15601) Batch listing: gracefully fallback to use non-batched listing when NameNode doesn't support the feature

2021-04-04 Thread Qi Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qi Zhu reassigned HDFS-15601:
-

Assignee: Qi Zhu

> Batch listing: gracefully fallback to use non-batched listing when NameNode 
> doesn't support the feature
> ---
>
> Key: HDFS-15601
> URL: https://issues.apache.org/jira/browse/HDFS-15601
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: hdfs
>Reporter: Chao Sun
>Assignee: Qi Zhu
>Priority: Major
>
> HDFS-13616 requires both server- and client-side changes. However, it is common 
> that users use a newer client to talk to an older HDFS (say 2.10). Currently the 
> client will simply fail in this scenario. A better approach, perhaps, is to 
> have the client fall back to non-batched listing on the input directories.






[jira] [Commented] (HDFS-15924) Log4j will cause Server handler blocked when audit log boom.

2021-03-30 Thread Qi Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17311354#comment-17311354
 ] 

Qi Zhu commented on HDFS-15924:
---

Thanks a lot [~weichiu] for the reply. I think HDFS-15720 may give me a little 
relief.

And I am looking forward to HADOOP-16206,

which will update to Log4j 2.

Thanks.

> Log4j will cause Server handler blocked when audit log boom.
> 
>
> Key: HDFS-15924
> URL: https://issues.apache.org/jira/browse/HDFS-15924
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Qi Zhu
>Priority: Major
> Attachments: image-2021-03-26-16-18-03-341.png, 
> image-2021-03-26-16-19-42-165.png
>
>
> !image-2021-03-26-16-18-03-341.png|width=707,height=234!
> !image-2021-03-26-16-19-42-165.png|width=824,height=198!
> The blocked threads when the audit log booms are shown above.
> As in [https://dzone.com/articles/log4j-thread-deadlock-case], it seems to be 
> the same case under heavy load. Should we update to Log4j 2, or is there 
> anything else we can do to improve this under heavy audit logging?
>  
> {code:java}
>  /**
>  Call the appenders in the hierrachy starting at
>  this.  If no appenders could be found, emit a
>  warning.
>  This method calls all the appenders inherited from the
>  hierarchy circumventing any evaluation of whether to log or not
>  to log the particular log request.
>  @param event the event to log.  */
> public void callAppenders(LoggingEvent event) {
> int writes = 0;
> for(Category c = this; c != null; c=c.parent) {
>   // Protected against simultaneous call to addAppender, 
> removeAppender,...
>   synchronized(c) {
> if(c.aai != null) {
> writes += c.aai.appendLoopOnAppenders(event);
> }
> if(!c.additive) {
> break;
> }
>   }
> }
> if(writes == 0) {
>   repository.emitNoAppenderWarning(this);
> }
>   }{code}
> The log4j code uses a global synchronized block, which causes this to happen.
> cc [~weichiu] [~hexiaoqiao] [~ayushtkn]  [~shv] [~ferhui]






[jira] [Updated] (HDFS-15934) Make DirectoryScanner reconcile blocks batch size and interval between batch configurable.

2021-03-29 Thread Qi Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qi Zhu updated HDFS-15934:
--
Issue Type: Improvement  (was: Bug)

> Make DirectoryScanner reconcile blocks batch size and interval between batch 
> configurable.
> --
>
> Key: HDFS-15934
> URL: https://issues.apache.org/jira/browse/HDFS-15934
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
>
> HDFS-14476 made this a batched operation to avoid holding the lock for too long, 
> but different clusters have different demands, so we should make the batch size 
> and the batch interval configurable.






[jira] [Updated] (HDFS-15934) Make DirectoryScanner reconcile blocks batch size and interval between batch configurable.

2021-03-29 Thread Qi Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qi Zhu updated HDFS-15934:
--
Status: Patch Available  (was: Open)

> Make DirectoryScanner reconcile blocks batch size and interval between batch 
> configurable.
> --
>
> Key: HDFS-15934
> URL: https://issues.apache.org/jira/browse/HDFS-15934
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> HDFS-14476 made this a batched operation to avoid holding the lock for too long, 
> but different clusters have different demands, so we should make the batch size 
> and the batch interval configurable.






[jira] [Created] (HDFS-15934) Make DirectoryScanner reconcile blocks batch size and interval between batch configurable.

2021-03-29 Thread Qi Zhu (Jira)
Qi Zhu created HDFS-15934:
-

 Summary: Make DirectoryScanner reconcile blocks batch size and 
interval between batch configurable.
 Key: HDFS-15934
 URL: https://issues.apache.org/jira/browse/HDFS-15934
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Qi Zhu
Assignee: Qi Zhu


HDFS-14476 made this a batched operation to avoid holding the lock for too long, 
but different clusters have different demands, so we should make the batch size 
and the batch interval configurable.






[jira] [Updated] (HDFS-15930) Fix some @param errors in DirectoryScanner.

2021-03-28 Thread Qi Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qi Zhu updated HDFS-15930:
--
Status: Patch Available  (was: Open)

> Fix some @param errors in DirectoryScanner.
> ---
>
> Key: HDFS-15930
> URL: https://issues.apache.org/jira/browse/HDFS-15930
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>







[jira] [Created] (HDFS-15930) Fix some @param errors in DirectoryScanner.

2021-03-28 Thread Qi Zhu (Jira)
Qi Zhu created HDFS-15930:
-

 Summary: Fix some @param errors in DirectoryScanner.
 Key: HDFS-15930
 URL: https://issues.apache.org/jira/browse/HDFS-15930
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Qi Zhu
Assignee: Qi Zhu









[jira] [Commented] (HDFS-15924) Log4j will cause Server handler blocked when audit log boom.

2021-03-26 Thread Qi Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17309442#comment-17309442
 ] 

Qi Zhu commented on HDFS-15924:
---

[~marvelrock]

The version of this cluster is 2.6.0-cdh5.11.0.

> Log4j will cause Server handler blocked when audit log boom.
> 
>
> Key: HDFS-15924
> URL: https://issues.apache.org/jira/browse/HDFS-15924
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Qi Zhu
>Priority: Major
> Attachments: image-2021-03-26-16-18-03-341.png, 
> image-2021-03-26-16-19-42-165.png
>
>
> !image-2021-03-26-16-18-03-341.png|width=707,height=234!
> !image-2021-03-26-16-19-42-165.png|width=824,height=198!
> The blocked threads when the audit log booms are shown above.
> As in [https://dzone.com/articles/log4j-thread-deadlock-case], it seems to be 
> the same case under heavy load. Should we update to Log4j 2, or is there 
> anything else we can do to improve this under heavy audit logging?
>  
> {code:java}
>  /**
>  Call the appenders in the hierrachy starting at
>  this.  If no appenders could be found, emit a
>  warning.
>  This method calls all the appenders inherited from the
>  hierarchy circumventing any evaluation of whether to log or not
>  to log the particular log request.
>  @param event the event to log.  */
> public void callAppenders(LoggingEvent event) {
> int writes = 0;
> for(Category c = this; c != null; c=c.parent) {
>   // Protected against simultaneous call to addAppender, 
> removeAppender,...
>   synchronized(c) {
> if(c.aai != null) {
> writes += c.aai.appendLoopOnAppenders(event);
> }
> if(!c.additive) {
> break;
> }
>   }
> }
> if(writes == 0) {
>   repository.emitNoAppenderWarning(this);
> }
>   }{code}
> The log4j code uses a global synchronized block, which causes this to happen.
> cc [~weichiu] [~hexiaoqiao] [~ayushtkn]  [~shv] [~ferhui]






[jira] [Commented] (HDFS-15924) Log4j will cause Server handler blocked when audit log boom.

2021-03-26 Thread Qi Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17309403#comment-17309403
 ] 

Qi Zhu commented on HDFS-15924:
---

Thanks [~sodonnell] for the reply.

Actually, async audit logging is already enabled in our cluster:
{code:xml}
<property>
  <name>dfs.namenode.audit.log.async</name>
  <value>true</value>
</property>
{code}
But the handler threads still block when the audit log booms, and this single 
cluster has thousands of nodes.

> Log4j will cause Server handler blocked when audit log boom.
> 
>
> Key: HDFS-15924
> URL: https://issues.apache.org/jira/browse/HDFS-15924
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Qi Zhu
>Priority: Major
> Attachments: image-2021-03-26-16-18-03-341.png, 
> image-2021-03-26-16-19-42-165.png
>
>
> !image-2021-03-26-16-18-03-341.png|width=707,height=234!
> !image-2021-03-26-16-19-42-165.png|width=824,height=198!
> The blocked threads when the audit log booms are shown above.
> As in [https://dzone.com/articles/log4j-thread-deadlock-case], it seems to be 
> the same case under heavy load. Should we update to Log4j 2, or is there 
> anything else we can do to improve this under heavy audit logging?
>  
> {code:java}
>  /**
>  Call the appenders in the hierrachy starting at
>  this.  If no appenders could be found, emit a
>  warning.
>  This method calls all the appenders inherited from the
>  hierarchy circumventing any evaluation of whether to log or not
>  to log the particular log request.
>  @param event the event to log.  */
> public void callAppenders(LoggingEvent event) {
> int writes = 0;
> for(Category c = this; c != null; c=c.parent) {
>   // Protected against simultaneous call to addAppender, 
> removeAppender,...
>   synchronized(c) {
> if(c.aai != null) {
> writes += c.aai.appendLoopOnAppenders(event);
> }
> if(!c.additive) {
> break;
> }
>   }
> }
> if(writes == 0) {
>   repository.emitNoAppenderWarning(this);
> }
>   }{code}
> The log4j code uses a global synchronized block, which causes this to happen.
> cc [~weichiu] [~hexiaoqiao] [~ayushtkn]  [~shv] [~ferhui]






[jira] [Updated] (HDFS-15924) Log4j will cause Server handler blocked when audit log boom.

2021-03-26 Thread Qi Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qi Zhu updated HDFS-15924:
--
Description: 
!image-2021-03-26-16-18-03-341.png|width=707,height=234!

!image-2021-03-26-16-19-42-165.png|width=824,height=198!

The blocked threads during an audit log burst are shown above.

As in [https://dzone.com/articles/log4j-thread-deadlock-case], this looks like the 
same situation under heavy load. Should we move to Log4j2, or is there anything 
else we can do to cope with heavy audit logging?

 
{code:java}
 /**
 Call the appenders in the hierrachy starting at
 this.  If no appenders could be found, emit a
 warning.

 This method calls all the appenders inherited from the
 hierarchy circumventing any evaluation of whether to log or not
 to log the particular log request.

 @param event the event to log.  */
public void callAppenders(LoggingEvent event) {
int writes = 0;

for(Category c = this; c != null; c=c.parent) {
  // Protected against simultaneous call to addAppender, removeAppender,...
  synchronized(c) {
if(c.aai != null) {
writes += c.aai.appendLoopOnAppenders(event);
}
if(!c.additive) {
break;
}
  }
}

if(writes == 0) {
  repository.emitNoAppenderWarning(this);
}
  }{code}
The log4j code uses a global synchronized block, which is what causes this to happen.

cc [~weichiu] [~hexiaoqiao] [~ayushtkn]  [~shv] [~ferhui]

  was:
!image-2021-03-26-16-18-03-341.png|width=707,height=234!

!image-2021-03-26-16-19-42-165.png|width=824,height=198!

The blocked threads during an audit log burst are shown above.

As in [https://dzone.com/articles/log4j-thread-deadlock-case], this looks like the 
same situation under heavy load. Should we move to Log4j2, or is there anything 
else we can do to cope with heavy audit logging?

cc [~weichiu] [~hexiaoqiao] [~ayushtkn]  [~shv] [~ferhui]


> Log4j will cause Server handler blocked when audit log boom.
> 
>
> Key: HDFS-15924
> URL: https://issues.apache.org/jira/browse/HDFS-15924
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Qi Zhu
>Priority: Major
> Attachments: image-2021-03-26-16-18-03-341.png, 
> image-2021-03-26-16-19-42-165.png
>
>
> !image-2021-03-26-16-18-03-341.png|width=707,height=234!
> !image-2021-03-26-16-19-42-165.png|width=824,height=198!
> The blocked threads during an audit log burst are shown above.
> As in [https://dzone.com/articles/log4j-thread-deadlock-case], this looks like 
> the same situation under heavy load. Should we move to Log4j2, or is there 
> anything else we can do to cope with heavy audit logging?
>  
> {code:java}
>  /**
>  Call the appenders in the hierrachy starting at
>  this.  If no appenders could be found, emit a
>  warning.
>  This method calls all the appenders inherited from the
>  hierarchy circumventing any evaluation of whether to log or not
>  to log the particular log request.
>  @param event the event to log.  */
> public void callAppenders(LoggingEvent event) {
> int writes = 0;
> for(Category c = this; c != null; c=c.parent) {
>   // Protected against simultaneous call to addAppender, 
> removeAppender,...
>   synchronized(c) {
> if(c.aai != null) {
> writes += c.aai.appendLoopOnAppenders(event);
> }
> if(!c.additive) {
> break;
> }
>   }
> }
> if(writes == 0) {
>   repository.emitNoAppenderWarning(this);
> }
>   }{code}
> The log4j code uses a global synchronized block, which is what causes this to happen.
> cc [~weichiu] [~hexiaoqiao] [~ayushtkn]  [~shv] [~ferhui]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-15924) Log4j will cause Server handler blocked when audit log boom.

2021-03-26 Thread Qi Zhu (Jira)
Qi Zhu created HDFS-15924:
-

 Summary: Log4j will cause Server handler blocked when audit log 
boom.
 Key: HDFS-15924
 URL: https://issues.apache.org/jira/browse/HDFS-15924
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Qi Zhu
 Attachments: image-2021-03-26-16-18-03-341.png, 
image-2021-03-26-16-19-42-165.png

!image-2021-03-26-16-18-03-341.png|width=707,height=234!

!image-2021-03-26-16-19-42-165.png|width=824,height=198!

The blocked threads during an audit log burst are shown above.

As in [https://dzone.com/articles/log4j-thread-deadlock-case], this looks like the 
same situation under heavy load. Should we move to Log4j2, or is there anything 
else we can do to cope with heavy audit logging?

cc [~weichiu] [~hexiaoqiao] [~ayushtkn]  [~shv] [~ferhui]
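If we do move to Log4j2, the audit logger could be made fully asynchronous with an 
AsyncLogger backed by the LMAX Disruptor, so handler threads only publish an event to a 
ring buffer instead of synchronizing in Category.callAppenders(). A rough sketch only; the 
file paths, sizes and levels are placeholders, and the disruptor jar must be on the classpath:
{code:xml}
<Configuration status="WARN">
  <Appenders>
    <RollingFile name="NNAudit"
                 fileName="/var/log/hadoop/hdfs-audit.log"
                 filePattern="/var/log/hadoop/hdfs-audit-%i.log.gz">
      <PatternLayout pattern="%m%n"/>
      <Policies>
        <SizeBasedTriggeringPolicy size="256 MB"/>
      </Policies>
    </RollingFile>
  </Appenders>
  <Loggers>
    <!-- Fully asynchronous audit logger: callers only enqueue to a ring buffer. -->
    <AsyncLogger name="org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit"
                 level="info" additivity="false">
      <AppenderRef ref="NNAudit"/>
    </AsyncLogger>
    <Root level="info">
      <AppenderRef ref="NNAudit"/>
    </Root>
  </Loggers>
</Configuration>
{code}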



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-15874) Extend TopMetrics to support callerContext aggregation.

2021-03-10 Thread Qi Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17297235#comment-17297235
 ] 

Qi Zhu edited comment on HDFS-15874 at 3/10/21, 2:37 PM:
-

[~leosun08] [~linyiqun]  [~weichiu] [~hexiaoqiao] [~ayushtkn]  [~shv] [~ferhui]

Could you help review this?

Thanks.


was (Author: zhuqi):
[~weichiu] [~hexiaoqiao] [~ayushtkn]  [~shv] [~ferhui]

Could you help review this?

Thanks.

> Extend TopMetrics to support callerContext aggregation.
> ---
>
> Key: HDFS-15874
> URL: https://issues.apache.org/jira/browse/HDFS-15874
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
>  Labels: pull-request-available
> Attachments: image-2021-03-05-12-01-16-852.png
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Top metrics for NameNode ops currently only support aggregation by user, and 
> the top-user-op view is useful.
> But callerContext aggregation would be even more useful: by extending the 
> callerContext in schedulers such as Oozie or Airflow, we can aggregate by 
> running YARN application or scheduled job and see in real time which sources 
> put the most pressure on the NameNode.
> cc [~weichiu] [~hexiaoqiao] [~ayushtkn]  [~shv] [~ferhui]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15874) Extend TopMetrics to support callerContext aggregation.

2021-03-08 Thread Qi Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17297235#comment-17297235
 ] 

Qi Zhu commented on HDFS-15874:
---

[~weichiu] [~hexiaoqiao] [~ayushtkn]  [~shv] [~ferhui]

Could you help review this?

Thanks.

> Extend TopMetrics to support callerContext aggregation.
> ---
>
> Key: HDFS-15874
> URL: https://issues.apache.org/jira/browse/HDFS-15874
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
>  Labels: pull-request-available
> Attachments: image-2021-03-05-12-01-16-852.png
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Top metrics for NameNode ops currently only support aggregation by user, and 
> the top-user-op view is useful.
> But callerContext aggregation would be even more useful: by extending the 
> callerContext in schedulers such as Oozie or Airflow, we can aggregate by 
> running YARN application or scheduled job and see in real time which sources 
> put the most pressure on the NameNode.
> cc [~weichiu] [~hexiaoqiao] [~ayushtkn]  [~shv] [~ferhui]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15879) Exclude slow nodes when choose targets for blocks

2021-03-08 Thread Qi Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17297228#comment-17297228
 ] 

Qi Zhu commented on HDFS-15879:
---

Thanks [~tomscut] for the proposal.

This seems to duplicate HDFS-14789.

 

> Exclude slow nodes when choose targets for blocks
> -
>
> Key: HDFS-15879
> URL: https://issues.apache.org/jira/browse/HDFS-15879
> Project: Hadoop HDFS
>  Issue Type: Wish
>Reporter: tomscut
>Assignee: tomscut
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> We already monitor slow nodes, related to 
> [HDFS-11194|https://issues.apache.org/jira/browse/HDFS-11194].
> We can use a thread to periodically collect these slow nodes into a set, then 
> use the set to filter out slow nodes when choosing targets for blocks.
> This feature can be turned on via configuration when needed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15874) Extend TopMetrics to support callerContext aggregation.

2021-03-07 Thread Qi Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qi Zhu updated HDFS-15874:
--
Status: Patch Available  (was: In Progress)

> Extend TopMetrics to support callerContext aggregation.
> ---
>
> Key: HDFS-15874
> URL: https://issues.apache.org/jira/browse/HDFS-15874
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
>  Labels: pull-request-available
> Attachments: image-2021-03-05-12-01-16-852.png
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Top metrics for NameNode ops currently only support aggregation by user, and 
> the top-user-op view is useful.
> But callerContext aggregation would be even more useful: by extending the 
> callerContext in schedulers such as Oozie or Airflow, we can aggregate by 
> running YARN application or scheduled job and see in real time which sources 
> put the most pressure on the NameNode.
> cc [~weichiu] [~hexiaoqiao] [~ayushtkn]  [~shv] [~ferhui]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15874) Extend TopMetrics to support callerContext aggregation.

2021-03-07 Thread Qi Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qi Zhu updated HDFS-15874:
--
Issue Type: Improvement  (was: New Feature)

> Extend TopMetrics to support callerContext aggregation.
> ---
>
> Key: HDFS-15874
> URL: https://issues.apache.org/jira/browse/HDFS-15874
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
>  Labels: pull-request-available
> Attachments: image-2021-03-05-12-01-16-852.png
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Top metrics for NameNode ops currently only support aggregation by user, and 
> the top-user-op view is useful.
> But callerContext aggregation would be even more useful: by extending the 
> callerContext in schedulers such as Oozie or Airflow, we can aggregate by 
> running YARN application or scheduled job and see in real time which sources 
> put the most pressure on the NameNode.
> cc [~weichiu] [~hexiaoqiao] [~ayushtkn]  [~shv] [~ferhui]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15874) Extend TopMetrics to support callerContext aggregation.

2021-03-04 Thread Qi Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qi Zhu updated HDFS-15874:
--
Description: 
Top metrics for NameNode ops currently only support aggregation by user, and the 
top-user-op view is useful.

But callerContext aggregation would be even more useful: by extending the 
callerContext in schedulers such as Oozie or Airflow, we can aggregate by running 
YARN application or scheduled job and see in real time which sources put the most 
pressure on the NameNode.

cc [~weichiu] [~hexiaoqiao] [~ayushtkn]  [~shv] [~ferhui]

  was:
Top metrics for NameNode ops currently only support aggregation by user, and the 
top-user-op view is useful.

But callerContext aggregation would be even more useful: by extending the 
callerContext in schedulers such as Oozie or Airflow, we can aggregate by running 
YARN application or scheduled job and see in real time which sources put the most 
pressure on the NameNode.

cc [~weichiu] [~hexiaoqiao] [~ayushtkn]  [~shv]


> Extend TopMetrics to support callerContext aggregation.
> ---
>
> Key: HDFS-15874
> URL: https://issues.apache.org/jira/browse/HDFS-15874
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
>  Labels: pull-request-available
> Attachments: image-2021-03-05-12-01-16-852.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Top metrics for NameNode ops currently only support aggregation by user, and 
> the top-user-op view is useful.
> But callerContext aggregation would be even more useful: by extending the 
> callerContext in schedulers such as Oozie or Airflow, we can aggregate by 
> running YARN application or scheduled job and see in real time which sources 
> put the most pressure on the NameNode.
> cc [~weichiu] [~hexiaoqiao] [~ayushtkn]  [~shv] [~ferhui]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15874) Extend TopMetrics to support callerContext aggregation.

2021-03-04 Thread Qi Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17295744#comment-17295744
 ] 

Qi Zhu commented on HDFS-15874:
---

Tested the PoC pull request in my local cluster. 

!image-2021-03-05-12-01-16-852.png|width=1151,height=306!

> Extend TopMetrics to support callerContext aggregation.
> ---
>
> Key: HDFS-15874
> URL: https://issues.apache.org/jira/browse/HDFS-15874
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
>  Labels: pull-request-available
> Attachments: image-2021-03-05-12-01-16-852.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Top metrics for NameNode ops currently only support aggregation by user, and 
> the top-user-op view is useful.
> But callerContext aggregation would be even more useful: by extending the 
> callerContext in schedulers such as Oozie or Airflow, we can aggregate by 
> running YARN application or scheduled job and see in real time which sources 
> put the most pressure on the NameNode.
> cc [~weichiu] [~hexiaoqiao] [~ayushtkn]  [~shv]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15874) Extend TopMetrics to support callerContext aggregation.

2021-03-04 Thread Qi Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qi Zhu updated HDFS-15874:
--
Attachment: image-2021-03-05-12-01-16-852.png

> Extend TopMetrics to support callerContext aggregation.
> ---
>
> Key: HDFS-15874
> URL: https://issues.apache.org/jira/browse/HDFS-15874
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
>  Labels: pull-request-available
> Attachments: image-2021-03-05-12-01-16-852.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Top metrics for NameNode ops currently only support aggregation by user, and 
> the top-user-op view is useful.
> But callerContext aggregation would be even more useful: by extending the 
> callerContext in schedulers such as Oozie or Airflow, we can aggregate by 
> running YARN application or scheduled job and see in real time which sources 
> put the most pressure on the NameNode.
> cc [~weichiu] [~hexiaoqiao] [~ayushtkn]  [~shv]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Work started] (HDFS-15874) Extend TopMetrics to support callerContext aggregation.

2021-03-04 Thread Qi Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on HDFS-15874 started by Qi Zhu.
-
> Extend TopMetrics to support callerContext aggregation.
> ---
>
> Key: HDFS-15874
> URL: https://issues.apache.org/jira/browse/HDFS-15874
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
>
> Top metrics for NameNode ops currently only support aggregation by user, and 
> the top-user-op view is useful.
> But callerContext aggregation would be even more useful: by extending the 
> callerContext in schedulers such as Oozie or Airflow, we can aggregate by 
> running YARN application or scheduled job and see in real time which sources 
> put the most pressure on the NameNode.
> cc [~weichiu] [~hexiaoqiao] [~ayushtkn]  [~shv]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-15874) Extend TopMetrics to support callerContext aggregation.

2021-03-04 Thread Qi Zhu (Jira)
Qi Zhu created HDFS-15874:
-

 Summary: Extend TopMetrics to support callerContext aggregation.
 Key: HDFS-15874
 URL: https://issues.apache.org/jira/browse/HDFS-15874
 Project: Hadoop HDFS
  Issue Type: New Feature
Reporter: Qi Zhu
Assignee: Qi Zhu


Top metrics for NameNode ops currently only support aggregation by user, and the 
top-user-op view is useful.

But callerContext aggregation would be even more useful: by extending the 
callerContext in schedulers such as Oozie or Airflow, we can aggregate by running 
YARN application or scheduled job and see in real time which sources put the most 
pressure on the NameNode.

cc [~weichiu] [~hexiaoqiao] [~ayushtkn]  [~shv]
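For illustration, a scheduler or job submitter could tag its RPCs like this (a sketch only; 
the context string is just an example, and hadoop.caller.context.enabled must be true on 
the NameNode side for the context to be recorded at all):
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.ipc.CallerContext;

public class CallerContextTagging {
  public static void main(String[] args) throws Exception {
    // Every subsequent RPC from this thread carries the context,
    // e.g. an Airflow DAG id or a YARN application id.
    CallerContext.setCurrent(
        new CallerContext.Builder("airflow_dag:daily_etl").build());

    Configuration conf = new Configuration();
    try (FileSystem fs = FileSystem.get(conf)) {
      fs.listStatus(new Path("/tmp"));  // this listing RPC is tagged
    }
  }
}
{code}
With contexts set per job, aggregating top NameNode ops by callerContext instead of by 
user points directly at the job that is generating the load.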



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15856) Make recover the pipeline in same packet exceed times for stream closed configurable.

2021-03-01 Thread Qi Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17292964#comment-17292964
 ] 

Qi Zhu commented on HDFS-15856:
---

[~hexiaoqiao] [~ayushtkn]

Could you help review this?

Thanks.

> Make recover the pipeline in same packet exceed times for stream closed 
> configurable.
> -
>
> Key: HDFS-15856
> URL: https://issues.apache.org/jira/browse/HDFS-15856
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> Currently, recovering the pipeline five times in a row for the same packet 
> closes the stream, but I think this limit should be configurable to suit 
> different clusters' needs.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-15856) Make recover the pipeline in same packet exceed times for stream closed configurable.

2021-02-25 Thread Qi Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17291417#comment-17291417
 ] 

Qi Zhu edited comment on HDFS-15856 at 2/26/21, 6:33 AM:
-

[~ayushtkn] [~hexiaoqiao] 

The failing test is not related to this change.

Do you have any other advice?

Thanks.


was (Author: zhuqi):
[~ayushtkn] [~hexiaoqiao] 

Do you have any other advice?

Thanks.

> Make recover the pipeline in same packet exceed times for stream closed 
> configurable.
> -
>
> Key: HDFS-15856
> URL: https://issues.apache.org/jira/browse/HDFS-15856
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Currently, recovering the pipeline five times in a row for the same packet 
> closes the stream, but I think this limit should be configurable to suit 
> different clusters' needs.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15856) Make recover the pipeline in same packet exceed times for stream closed configurable.

2021-02-25 Thread Qi Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17291417#comment-17291417
 ] 

Qi Zhu commented on HDFS-15856:
---

[~ayushtkn] [~hexiaoqiao] 

Do you have any other advice?

Thanks.

> Make recover the pipeline in same packet exceed times for stream closed 
> configurable.
> -
>
> Key: HDFS-15856
> URL: https://issues.apache.org/jira/browse/HDFS-15856
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Currently, recovering the pipeline five times in a row for the same packet 
> closes the stream, but I think this limit should be configurable to suit 
> different clusters' needs.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15180) DataNode FsDatasetImpl Fine-Grained Locking via BlockPool.

2021-02-25 Thread Qi Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17290808#comment-17290808
 ] 

Qi Zhu commented on HDFS-15180:
---

Thanks [~hexiaoqiao] for the reply. Great work as expected; I will help review once it is 
rebased onto trunk, and I look forward to the benchmark results.
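For readers catching up on the idea: the split is essentially one lock per block pool 
instead of one dataset-wide lock, so requests for different namespaces in a federated 
cluster stop contending. A toy sketch of the pattern only, not the actual FsDatasetImpl 
change:
{code:java}
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class BlockPoolLockManager {
  private final ConcurrentHashMap<String, ReentrantReadWriteLock> locks =
      new ConcurrentHashMap<>();

  // One lock per block pool id, created lazily on first use.
  private ReentrantReadWriteLock lockFor(String bpid) {
    return locks.computeIfAbsent(bpid, k -> new ReentrantReadWriteLock());
  }

  public void readLock(String bpid)    { lockFor(bpid).readLock().lock(); }
  public void readUnlock(String bpid)  { lockFor(bpid).readLock().unlock(); }
  public void writeLock(String bpid)   { lockFor(bpid).writeLock().lock(); }
  public void writeUnlock(String bpid) { lockFor(bpid).writeLock().unlock(); }
}
{code}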

>  DataNode FsDatasetImpl Fine-Grained Locking via BlockPool.
> ---
>
> Key: HDFS-15180
> URL: https://issues.apache.org/jira/browse/HDFS-15180
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 3.2.0
>Reporter: Qi Zhu
>Assignee: Aiphago
>Priority: Major
> Attachments: HDFS-15180.001.patch, HDFS-15180.002.patch, 
> HDFS-15180.003.patch, HDFS-15180.004.patch, 
> image-2020-03-10-17-22-57-391.png, image-2020-03-10-17-31-58-830.png, 
> image-2020-03-10-17-34-26-368.png, image-2020-04-09-11-20-36-459.png
>
>
> The FsDatasetImpl datasetLock is heavy when there are many namespaces in a 
> big cluster. We could split the FsDatasetImpl datasetLock per block pool. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-15856) Make recover the pipeline in same packet exceed times for stream closed configurable.

2021-02-24 Thread Qi Zhu (Jira)
Qi Zhu created HDFS-15856:
-

 Summary: Make recover the pipeline in same packet exceed times for 
stream closed configurable.
 Key: HDFS-15856
 URL: https://issues.apache.org/jira/browse/HDFS-15856
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: Qi Zhu
Assignee: Qi Zhu


Currently, recovering the pipeline five times in a row for the same packet closes 
the stream, but I think this limit should be configurable to suit different 
clusters' needs.
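
A rough sketch of the proposed behaviour is below. The configuration key is only a 
placeholder (the real key is whatever the patch introduces), and 5 is the limit that is 
currently hard-coded in the client:
{code:java}
import org.apache.hadoop.conf.Configuration;

public class PipelineRecoveryLimit {
  // Hypothetical key, for illustration only.
  static final String KEY = "dfs.client.pipeline.recovery.max-retries";
  static final int DEFAULT_LIMIT = 5;  // current hard-coded value

  private final int maxRecoveriesPerPacket;
  private int recoveriesForCurrentPacket = 0;

  PipelineRecoveryLimit(Configuration conf) {
    this.maxRecoveriesPerPacket = conf.getInt(KEY, DEFAULT_LIMIT);
  }

  /** @return true when the stream should be closed instead of retried again. */
  boolean recordRecoveryAttempt() {
    return ++recoveriesForCurrentPacket > maxRecoveriesPerPacket;
  }

  /** Reset once the packet is finally acked. */
  void onPacketAcked() {
    recoveriesForCurrentPacket = 0;
  }
}
{code}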

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15180) DataNode FsDatasetImpl Fine-Grained Locking via BlockPool.

2021-02-24 Thread Qi Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17289853#comment-17289853
 ] 

Qi Zhu commented on HDFS-15180:
---

[~hexiaoqiao] [~sodonnell] [~Aiphag0]

Is this still moving forward? I think we can push this now.

>  DataNode FsDatasetImpl Fine-Grained Locking via BlockPool.
> ---
>
> Key: HDFS-15180
> URL: https://issues.apache.org/jira/browse/HDFS-15180
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 3.2.0
>Reporter: Qi Zhu
>Assignee: Aiphago
>Priority: Major
> Attachments: HDFS-15180.001.patch, HDFS-15180.002.patch, 
> HDFS-15180.003.patch, HDFS-15180.004.patch, 
> image-2020-03-10-17-22-57-391.png, image-2020-03-10-17-31-58-830.png, 
> image-2020-03-10-17-34-26-368.png, image-2020-04-09-11-20-36-459.png
>
>
> The FsDatasetImpl datasetLock is heavy when there are many namespaces in a 
> big cluster. We could split the FsDatasetImpl datasetLock per block pool. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15584) Improve HDFS large deletion cause namenode lockqueue boom and pending deletion boom.

2021-02-23 Thread Qi Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17289721#comment-17289721
 ] 

Qi Zhu commented on HDFS-15584:
---

[~sodonnell]  [~hexiaoqiao] 

Do you have any other advice on this? It would be helpful for large deletions.

Thanks.

> Improve HDFS large deletion cause namenode lockqueue boom and pending 
> deletion boom.
> 
>
> Key: HDFS-15584
> URL: https://issues.apache.org/jira/browse/HDFS-15584
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Affects Versions: 3.4.0
>Reporter: Qi Zhu
>Assignee: Qi Zhu
>Priority: Major
> Attachments: HDFS-15584.001.patch
>
>
> In our production cluster, large deletions flood the NameNode lock queue and 
> also cause the pending deletions in the invalidate blocks list to balloon.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15849) ExpiredHeartbeats metric should be of Type.COUNTER

2021-02-23 Thread Qi Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17288998#comment-17288998
 ] 

Qi Zhu commented on HDFS-15849:
---

[~shv] [~tomscut]

Fixed the metric-related test in the latest patch.

> ExpiredHeartbeats metric should be of Type.COUNTER
> --
>
> Key: HDFS-15849
> URL: https://issues.apache.org/jira/browse/HDFS-15849
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: metrics
>Reporter: Konstantin Shvachko
>Assignee: Qi Zhu
>Priority: Major
>  Labels: newbie
> Attachments: HDFS-15849.001.patch, HDFS-15849.002.patch
>
>
> Currently {{ExpiredHeartbeats}} metric has default type, which makes it 
> {{Type.GAUGE}}. It should be {{Type.COUNTER}} for proper graphing. See 
> discussion in HDFS-15808.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15849) ExpiredHeartbeats metric should be of Type.COUNTER

2021-02-23 Thread Qi Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qi Zhu updated HDFS-15849:
--
Attachment: HDFS-15849.002.patch

> ExpiredHeartbeats metric should be of Type.COUNTER
> --
>
> Key: HDFS-15849
> URL: https://issues.apache.org/jira/browse/HDFS-15849
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: metrics
>Reporter: Konstantin Shvachko
>Assignee: Qi Zhu
>Priority: Major
>  Labels: newbie
> Attachments: HDFS-15849.001.patch, HDFS-15849.002.patch
>
>
> Currently {{ExpiredHeartbeats}} metric has default type, which makes it 
> {{Type.GAUGE}}. It should be {{Type.COUNTER}} for proper graphing. See 
> discussion in HDFS-15808.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15808) Add metrics for FSNamesystem read/write lock hold long time

2021-02-23 Thread Qi Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17288982#comment-17288982
 ] 

Qi Zhu commented on HDFS-15808:
---

Thanks [~tomscut] for the contribution. The patch LGTM, +1.

> Add metrics for FSNamesystem read/write lock hold long time
> ---
>
> Key: HDFS-15808
> URL: https://issues.apache.org/jira/browse/HDFS-15808
> Project: Hadoop HDFS
>  Issue Type: Wish
>  Components: hdfs
>Reporter: tomscut
>Assignee: tomscut
>Priority: Major
>  Labels: hdfs, lock, metrics, pull-request-available
> Attachments: ExpiredHeartbeat.png, lockLongHoldCount
>
>  Time Spent: 5.5h
>  Remaining Estimate: 0h
>
> To monitor how often read/write lock hold times exceed the thresholds, we can add 
> two metrics (ReadLockWarning/WriteLockWarning), exposed via JMX.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-15849) ExpiredHeartbeats metric should be of Type.COUNTER

2021-02-22 Thread Qi Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17288849#comment-17288849
 ] 

Qi Zhu edited comment on HDFS-15849 at 2/23/21, 5:27 AM:
-

[~shv] [~xkrogen] [~tomscut] 

In our production cluster, we also want to compute deltas of this metric for graphing.

Making it Type.COUNTER makes those deltas easy to graph; I will take this on.

Submitted a patch for review.

Thanks. 


was (Author: zhuqi):
[~shv] 

In our production cluster, we also want to compute deltas of this metric for graphing.

Making it Type.COUNTER makes those deltas easy to graph; I will take this on.

Thanks. 

> ExpiredHeartbeats metric should be of Type.COUNTER
> --
>
> Key: HDFS-15849
> URL: https://issues.apache.org/jira/browse/HDFS-15849
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: metrics
>Reporter: Konstantin Shvachko
>Assignee: Qi Zhu
>Priority: Major
>  Labels: newbie
> Attachments: HDFS-15849.001.patch
>
>
> Currently {{ExpiredHeartbeats}} metric has default type, which makes it 
> {{Type.GAUGE}}. It should be {{Type.COUNTER}} for proper graphing. See 
> discussion in HDFS-15808.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15849) ExpiredHeartbeats metric should be of Type.COUNTER

2021-02-22 Thread Qi Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qi Zhu updated HDFS-15849:
--
Attachment: HDFS-15849.001.patch
Status: Patch Available  (was: Open)

> ExpiredHeartbeats metric should be of Type.COUNTER
> --
>
> Key: HDFS-15849
> URL: https://issues.apache.org/jira/browse/HDFS-15849
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: metrics
>Reporter: Konstantin Shvachko
>Assignee: Qi Zhu
>Priority: Major
>  Labels: newbie
> Attachments: HDFS-15849.001.patch
>
>
> Currently {{ExpiredHeartbeats}} metric has default type, which makes it 
> {{Type.GAUGE}}. It should be {{Type.COUNTER}} for proper graphing. See 
> discussion in HDFS-15808.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15849) ExpiredHeartbeats metric should be of Type.COUNTER

2021-02-22 Thread Qi Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17288849#comment-17288849
 ] 

Qi Zhu commented on HDFS-15849:
---

[~shv] 

In our production cluster, we also want to compute deltas of this metric for graphing.

Making it Type.COUNTER makes those deltas easy to graph; I will take this on.

Thanks. 
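For reference, the change is roughly just declaring an explicit type on the annotation. A 
sketch only, not the actual FSNamesystem code (the real getter delegates to the datanode 
statistics):
{code:java}
import org.apache.hadoop.metrics2.annotation.Metric;
import org.apache.hadoop.metrics2.annotation.Metric.Type;

public class ExpiredHeartbeatsMetricSketch {
  private long expiredHeartbeats = 0;

  // Declaring the type explicitly makes the metrics system emit a COUNTER,
  // so downstream graphing tools can compute per-interval deltas (rates).
  @Metric(value = {"ExpiredHeartbeats", "Number of expired heartbeats"},
      type = Type.COUNTER)
  public long getExpiredHeartbeats() {
    return expiredHeartbeats;
  }
}
{code}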

> ExpiredHeartbeats metric should be of Type.COUNTER
> --
>
> Key: HDFS-15849
> URL: https://issues.apache.org/jira/browse/HDFS-15849
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: metrics
>Reporter: Konstantin Shvachko
>Assignee: Qi Zhu
>Priority: Major
>  Labels: newbie
>
> Currently {{ExpiredHeartbeats}} metric has default type, which makes it 
> {{Type.GAUGE}}. It should be {{Type.COUNTER}} for proper graphing. See 
> discussion in HDFS-15808.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDFS-15849) ExpiredHeartbeats metric should be of Type.COUNTER

2021-02-22 Thread Qi Zhu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qi Zhu reassigned HDFS-15849:
-

Assignee: Qi Zhu

> ExpiredHeartbeats metric should be of Type.COUNTER
> --
>
> Key: HDFS-15849
> URL: https://issues.apache.org/jira/browse/HDFS-15849
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: metrics
>Reporter: Konstantin Shvachko
>Assignee: Qi Zhu
>Priority: Major
>  Labels: newbie
>
> Currently {{ExpiredHeartbeats}} metric has default type, which makes it 
> {{Type.GAUGE}}. It should be {{Type.COUNTER}} for proper graphing. See 
> discussion in HDFS-15808.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org