[jira] [Commented] (MAPREDUCE-7241) FileInputFormat listStatus with less memory footprint

2020-04-01 Thread Zhihua Deng (Jira)


[ 
https://issues.apache.org/jira/browse/MAPREDUCE-7241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17073229#comment-17073229
 ] 

Zhihua Deng commented on MAPREDUCE-7241:


Thanks for reviewing, [~jlowe]!

> FileInputFormat listStatus with less memory footprint
> -
>
> Key: MAPREDUCE-7241
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7241
> Project: Hadoop Map/Reduce
> Issue Type: Improvement
> Components: job submission
> Affects Versions: 2.6.1
> Reporter: Zhihua Deng
> Assignee: Zhihua Deng
> Priority: Major
> Fix For: 3.4.0
>
> Attachments: MAPREDUCE-7241.03.patch, MAPREDUCE-7241.04.patch, 
> MAPREDUCE-7241.05.patch, MAPREDUCE-7241.06.patch, 
> MAPREDUCE-7241.trunk.02.patch, MAPREDUCE-7241.trunk.patch, filestatus.png
>
>
> This case is sometimes seen in Hive when a user mistakenly issues a query 
> over all partitions. The file statuses cached while listing can accumulate to 
> over 3 GB. A heap dump shows that LocatedBlock accounts for about 50% 
> (sometimes over 60%) of the memory retained by LocatedFileStatus, as shown 
> below:
> !filestatus.png!
> Right now we only extract the block location info from LocatedFileStatus; 
> the datanode infos (types) and the block token are not used. So there is no 
> need to cache LocatedBlock, and we can do something like this:
> // Replace the cached status with one that keeps only plain BlockLocation copies.
> BlockLocation[] blockLocations = dup(stat.getBlockLocations());
> LocatedFileStatus shrink = new LocatedFileStatus(stat, blockLocations);
>
> // Copy each location through the BlockLocation copy constructor.
> private static BlockLocation[] dup(BlockLocation[] blockLocations) {
>   BlockLocation[] copyLocs = new BlockLocation[blockLocations.length];
>   int i = 0;
>   for (BlockLocation location : blockLocations) {
>     copyLocs[i++] = new BlockLocation(location);
>   }
>   return copyLocs;
> }
>  
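
The copy-to-shrink idea in the description can be illustrated without a Hadoop dependency. The sketch below is only an analogy: SlimLocation, HeavyLocation and the payload field are hypothetical stand-ins (they are not Hadoop classes), assuming the locations returned by the file system are a subclass that also pins a larger object, much as the description says LocatedBlock is retained. Copying through the base-class copy constructor keeps the hosts, offset and length and drops the reference to the heavy payload.

```java
import java.util.Arrays;

public class ShrinkLocationsSketch {

  /** Hypothetical stand-in for a plain block location: hosts, offset, length. */
  static class SlimLocation {
    final String[] hosts;
    final long offset;
    final long length;

    SlimLocation(String[] hosts, long offset, long length) {
      this.hosts = hosts;
      this.offset = offset;
      this.length = length;
    }

    /** Copy constructor: copies only the slim fields of the source. */
    SlimLocation(SlimLocation other) {
      this(other.hosts.clone(), other.offset, other.length);
    }
  }

  /** Hypothetical stand-in for a subclass that also retains a large object. */
  static class HeavyLocation extends SlimLocation {
    // Assumption: plays the role of the retained LocatedBlock.
    final byte[] payload;

    HeavyLocation(String[] hosts, long offset, long length, byte[] payload) {
      super(hosts, offset, length);
      this.payload = payload;
    }
  }

  /** Copies every element as a plain SlimLocation, dropping any payload. */
  static SlimLocation[] dup(SlimLocation[] locations) {
    SlimLocation[] copies = new SlimLocation[locations.length];
    for (int i = 0; i < locations.length; i++) {
      copies[i] = new SlimLocation(locations[i]);
    }
    return copies;
  }

  public static void main(String[] args) {
    SlimLocation[] heavy = {
      new HeavyLocation(new String[] {"dn1", "dn2"}, 0L, 128L, new byte[1024]),
      new HeavyLocation(new String[] {"dn2", "dn3"}, 128L, 64L, new byte[1024])
    };
    SlimLocation[] slim = dup(heavy);
    for (SlimLocation loc : slim) {
      // Each copy is a SlimLocation, so the payload is no longer reachable from it.
      System.out.println(loc.getClass().getSimpleName() + " "
          + Arrays.toString(loc.hosts) + " " + loc.offset + "+" + loc.length);
    }
  }
}
```

Once the original array is no longer referenced, the payloads become eligible for garbage collection, which is the effect the description aims for when it replaces the cached status with the shrunken one.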



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org



[jira] [Updated] (MAPREDUCE-7241) FileInputFormat listStatus with less memory footprint

2020-04-01 Thread Zhihua Deng (Jira)


 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-7241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhihua Deng updated MAPREDUCE-7241:
---
Attachment: MAPREDUCE-7241.06.patch




[jira] [Commented] (MAPREDUCE-7241) FileInputFormat listStatus with less memory footprint

2020-04-01 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/MAPREDUCE-7241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17072768#comment-17072768
 ] 

Hudson commented on MAPREDUCE-7241:
---

SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #18108 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/18108/])
MAPREDUCE-7241. FileInputFormat listStatus with less memory footprint. (jlowe: 
rev c613296dc85ac7b22c171c84f578106b315cc012)
* (edit) 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/LocatedFileStatusFetcher.java
* (edit) 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/test/java/org/apache/hadoop/mapreduce/lib/input/TestFileInputFormat.java
* (edit) 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/FileInputFormat.java
* (edit) 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/FileInputFormat.java





[jira] [Updated] (MAPREDUCE-7241) FileInputFormat listStatus with less memory footprint

2020-04-01 Thread Jason Darrell Lowe (Jira)


 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-7241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Darrell Lowe updated MAPREDUCE-7241:
--
Fix Version/s: 3.4.0
 Hadoop Flags: Reviewed
   Resolution: Fixed
   Status: Resolved  (was: Patch Available)

Thanks for the contribution, [~dengzh]!  I committed this to trunk.




[jira] [Assigned] (MAPREDUCE-7241) FileInputFormat listStatus with less memory footprint

2020-04-01 Thread Jason Darrell Lowe (Jira)


 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-7241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Darrell Lowe reassigned MAPREDUCE-7241:
-

Assignee: Zhihua Deng
