[jira] [Updated] (MAPREDUCE-6785) ContainerLauncherImpl support for reusing the containers
[ https://issues.apache.org/jira/browse/MAPREDUCE-6785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Devaraj K updated MAPREDUCE-6785:
---------------------------------
    Assignee: Naganarasimha G R  (was: Devaraj K)
      Status: Open  (was: Patch Available)

Thanks [~Naganarasimha] for the patch. The patch looks fine to me except for the checkstyle warnings. I think we can handle these. Can you update the patch with fixes for the checkstyle issues shown in the report?

> ContainerLauncherImpl support for reusing the containers
> --------------------------------------------------------
>
>                 Key: MAPREDUCE-6785
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6785
>             Project: Hadoop Map/Reduce
>          Issue Type: Sub-task
>          Components: applicationmaster, mrv2
>            Reporter: Devaraj K
>            Assignee: Naganarasimha G R
>         Attachments: MAPREDUCE-6785-MR-6749.003.patch, MAPREDUCE-6785-MR-6749.004.patch, MAPREDUCE-6785-MR-6749.005.patch, MAPREDUCE-6785-MR-6749.006.patch, MAPREDUCE-6785-v0.patch, MAPREDUCE-6785-v1.patch, MAPREDUCE-6785-v2.patch
>
> Add support to the ContainerLauncher for reuse of containers.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org
[jira] [Commented] (MAPREDUCE-6829) Add peak memory usage counter for each task
[ https://issues.apache.org/jira/browse/MAPREDUCE-6829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15954433#comment-15954433 ]

Miklos Szegedi commented on MAPREDUCE-6829:
-------------------------------------------
Thank you for the comment, [~mingma]. This jira primarily targeted branch-2. I agree that ATS should be the primary source of this information in versions 3 and above. Just as a side note, it was important to separate map and reduce numbers. Is this possible with YARN-3045?

> Add peak memory usage counter for each task
> -------------------------------------------
>
>                 Key: MAPREDUCE-6829
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6829
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: mrv2
>            Reporter: Yufei Gu
>            Assignee: Miklos Szegedi
>             Fix For: 2.9.0
>
>         Attachments: MAPREDUCE-6829.000.patch, MAPREDUCE-6829.001.patch, MAPREDUCE-6829.002.patch, MAPREDUCE-6829.003.patch, MAPREDUCE-6829.004.patch, MAPREDUCE-6829.005.patch
>
> Each task has the counters PHYSICAL_MEMORY_BYTES and VIRTUAL_MEMORY_BYTES, which are snapshots of the memory usage of that task. They are not sufficient for users to understand the peak memory usage of that task, e.g. in order to diagnose task failures, tune job parameters, or change application design. This new feature adds two more counters for each task: PHYSICAL_MEMORY_BYTES_MAX and VIRTUAL_MEMORY_BYTES_MAX.
> This JIRA provides the same feature as MAPREDUCE-4710. I filed this new JIRA since MAPREDUCE-4710 is a pretty old one from the MR 1.x era; it more or less assumes a branch-1 architecture and should be closed at this point.
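The difference between the existing snapshot counters and the proposed peak counters can be sketched outside Hadoop. A snapshot counter only reflects the most recent sample, so a task that spikes mid-run and then settles reports a misleadingly low value; a running maximum retains the peak. The sketch below uses a hypothetical `track_memory` helper and made-up sample values, not the Hadoop implementation:

```python
# Hypothetical sketch: why a snapshot counter misses the peak.
# 'snapshot' mimics PHYSICAL_MEMORY_BYTES (latest sample only);
# 'peak' mimics PHYSICAL_MEMORY_BYTES_MAX (running maximum).

def track_memory(samples):
    """Return (last_snapshot, peak) over a series of memory-usage samples."""
    snapshot = 0
    peak = 0
    for usage in samples:
        snapshot = usage          # overwritten on every sample
        peak = max(peak, usage)   # monotonically tracks the high-water mark
    return snapshot, peak

# A task that spikes mid-run, then settles down before finishing:
snapshot, peak = track_memory([100, 900, 400, 200])
print(snapshot, peak)  # 200 900
```

The final snapshot (200) would suggest the task fit comfortably in memory, while the peak (900) is what actually matters for diagnosing container-limit failures or tuning memory requests.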
[jira] [Commented] (MAPREDUCE-6846) Fragments specified for libjar paths are not handled correctly
[ https://issues.apache.org/jira/browse/MAPREDUCE-6846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15954409#comment-15954409 ]

Sangjin Lee commented on MAPREDUCE-6846:
----------------------------------------
The latest patch LGTM. [~dan...@cloudera.com] what do you think?

> Fragments specified for libjar paths are not handled correctly
> ---------------------------------------------------------------
>
>                 Key: MAPREDUCE-6846
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6846
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>    Affects Versions: 2.6.0, 2.7.3, 3.0.0-alpha2
>            Reporter: Chris Trezzo
>            Assignee: Chris Trezzo
>            Priority: Minor
>         Attachments: MAPREDUCE-6846-trunk.001.patch, MAPREDUCE-6846-trunk.002.patch, MAPREDUCE-6846-trunk.003.patch, MAPREDUCE-6846-trunk.004.patch, MAPREDUCE-6846-trunk.005.patch, MAPREDUCE-6846-trunk.006.patch
>
> If a user specifies a fragment for a libjars path via the generic options parser, the client crashes with a FileNotFoundException:
> {noformat}
> java.io.FileNotFoundException: File file:/home/mapred/test.txt#testFrag.txt does not exist
>     at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:638)
>     at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:864)
>     at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:628)
>     at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:442)
>     at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:363)
>     at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:314)
>     at org.apache.hadoop.mapreduce.JobResourceUploader.copyRemoteFiles(JobResourceUploader.java:387)
>     at org.apache.hadoop.mapreduce.JobResourceUploader.uploadLibJars(JobResourceUploader.java:154)
>     at org.apache.hadoop.mapreduce.JobResourceUploader.uploadResources(JobResourceUploader.java:105)
>     at org.apache.hadoop.mapreduce.JobSubmitter.copyAndConfigureFiles(JobSubmitter.java:102)
>     at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:197)
>     at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1344)
>     at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1341)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:422)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1892)
>     at org.apache.hadoop.mapreduce.Job.submit(Job.java:1341)
>     at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1362)
>     at org.apache.hadoop.examples.QuasiMonteCarlo.estimatePi(QuasiMonteCarlo.java:306)
>     at org.apache.hadoop.examples.QuasiMonteCarlo.run(QuasiMonteCarlo.java:359)
>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
>     at org.apache.hadoop.examples.QuasiMonteCarlo.main(QuasiMonteCarlo.java:367)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:498)
>     at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:71)
>     at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:144)
>     at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:74)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:498)
>     at org.apache.hadoop.util.RunJar.run(RunJar.java:239)
>     at org.apache.hadoop.util.RunJar.main(RunJar.java:153)
> {noformat}
> This is actually inconsistent with the behavior for files and archives. Here is a table showing the current behavior for each type of path and resource:
> ||            || Qualified path (i.e. file://home/mapred/test.txt#frag.txt) || Absolute path (i.e. /home/mapred/test.txt#frag.txt) || Relative path (i.e. test.txt#frag.txt) ||
> || -libjars   | FileNotFound | FileNotFound | FileNotFound |
> || -files     | (/) | (/) | (/) |
> || -archives  | (/) | (/) | (/) |
[jira] [Commented] (MAPREDUCE-6829) Add peak memory usage counter for each task
[ https://issues.apache.org/jira/browse/MAPREDUCE-6829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15954401#comment-15954401 ]

Ming Ma commented on MAPREDUCE-6829:
------------------------------------
With YARN-3045, is this still necessary? Container-level metrics like this seem quite useful for frameworks other than MR, and this is something YARN can provide if it hasn't been done already.

> Add peak memory usage counter for each task
> -------------------------------------------
>
>                 Key: MAPREDUCE-6829
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6829
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: mrv2
>            Reporter: Yufei Gu
>            Assignee: Miklos Szegedi
>             Fix For: 2.9.0
[jira] [Commented] (MAPREDUCE-6846) Fragments specified for libjar paths are not handled correctly
[ https://issues.apache.org/jira/browse/MAPREDUCE-6846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15954025#comment-15954025 ]

Chris Trezzo commented on MAPREDUCE-6846:
-----------------------------------------
Any additional comments [~templedf]? Thanks!

> Fragments specified for libjar paths are not handled correctly
> ---------------------------------------------------------------
>
>                 Key: MAPREDUCE-6846
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6846
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>    Affects Versions: 2.6.0, 2.7.3, 3.0.0-alpha2
>            Reporter: Chris Trezzo
>            Assignee: Chris Trezzo
>            Priority: Minor
>         Attachments: MAPREDUCE-6846-trunk.001.patch, MAPREDUCE-6846-trunk.002.patch, MAPREDUCE-6846-trunk.003.patch, MAPREDUCE-6846-trunk.004.patch, MAPREDUCE-6846-trunk.005.patch, MAPREDUCE-6846-trunk.006.patch
[jira] [Commented] (MAPREDUCE-6824) TaskAttemptImpl#createCommonContainerLaunchContext is longer than 150 lines
[ https://issues.apache.org/jira/browse/MAPREDUCE-6824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15953886#comment-15953886 ]

Chris Trezzo commented on MAPREDUCE-6824:
-----------------------------------------
Thanks [~ajisakaa] and [~haibochen]!

> TaskAttemptImpl#createCommonContainerLaunchContext is longer than 150 lines
> ---------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-6824
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6824
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>            Reporter: Chris Trezzo
>            Assignee: Chris Trezzo
>            Priority: Trivial
>              Labels: newbie
>             Fix For: 3.0.0-alpha3
>
>         Attachments: MAPREDUCE-6824-trunk.001.patch, MAPREDUCE-6824-trunk.002.patch, MAPREDUCE-6824-trunk.003.patch
>
> bq. ./hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TaskAttemptImpl.java:752: private static ContainerLaunchContext createCommonContainerLaunchContext(:3: Method length is 172 lines (max allowed is 150).
> {{TaskAttemptImpl#createCommonContainerLaunchContext}} is longer than 150 lines and needs to be refactored.
[jira] [Commented] (MAPREDUCE-6874) Make DistributedCache check if the content of a directory has changed
[ https://issues.apache.org/jira/browse/MAPREDUCE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15953500#comment-15953500 ]

Attila Sasvari commented on MAPREDUCE-6874:
-------------------------------------------
[~jlowe] Thanks for the detailed explanation. I was wondering why the distributed cache accepted a directory as input, but now I understand this is for legacy reasons. The negative impact on performance is also clear. Closing this as Won't Fix.

> Make DistributedCache check if the content of a directory has changed
> ----------------------------------------------------------------------
>
>                 Key: MAPREDUCE-6874
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6874
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>            Reporter: Attila Sasvari
>
> DistributedCache does not check recursively whether the content of a directory has changed when adding files to it with {{DistributedCache.addCacheFile()}}.
> h5. Background
> I have an Oozie workflow on HDFS:
> {code}
> example_workflow
> ├── job.properties
> ├── lib
> │   ├── components
> │   │   ├── sub-component.sh
> │   │   └── subsub
> │   │       └── subsub.sh
> │   ├── main.sh
> │   └── sub.sh
> └── workflow.xml
> {code}
> I executed the workflow, then made some changes in {{subsub.sh}} and replaced the file on HDFS. When I re-ran the workflow, DistributedCache did not notice the changes because the timestamp on the {{components}} directory did not change. As a result, the old script was materialized.
> This behaviour might be related to [determineTimestamps()|https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/filecache/ClientDistributedCacheManager.java#L84].
> In order to use the new script during workflow execution, I had to update the whole {{components}} directory.
> h6. Some more info:
> In Oozie, [DistributedCache.addCacheFile()|https://github.com/apache/oozie/blob/master/core/src/main/java/org/apache/oozie/action/hadoop/JavaActionExecutor.java#L625] is used to add files to the distributed cache.
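The timestamp behaviour described in this issue is easy to reproduce outside Hadoop: on a typical POSIX filesystem, rewriting a file deep in a tree updates that file's mtime but not the mtime of an ancestor directory, which is all a single-timestamp check can see. A small self-contained Python demonstration (the paths mirror the workflow layout above but are created in a temporary directory; this is a sketch of the filesystem behaviour, not of Hadoop code):

```python
import os
import tempfile
import time

def demo():
    """Rewrite a nested script and report whether the ancestor
    directory's mtime changed (it does not on POSIX filesystems)."""
    root = tempfile.mkdtemp()
    components = os.path.join(root, "components")
    subsub = os.path.join(components, "subsub")
    os.makedirs(subsub)
    script = os.path.join(subsub, "subsub.sh")
    with open(script, "w") as f:
        f.write("echo v1\n")

    before = os.path.getmtime(components)
    time.sleep(0.01)
    with open(script, "w") as f:   # "replace" the nested script
        f.write("echo v2\n")
    after = os.path.getmtime(components)

    # True: the 'components' timestamp is unchanged, so a check based
    # on that single mtime cannot detect the modified subsub.sh.
    return before == after

print(demo())  # True
```

This is exactly why the old script kept being materialized: the one timestamp the cache records for the directory entry never moved.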
[jira] [Resolved] (MAPREDUCE-6874) Make DistributedCache check if the content of a directory has changed
[ https://issues.apache.org/jira/browse/MAPREDUCE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Attila Sasvari resolved MAPREDUCE-6874.
---------------------------------------
    Resolution: Won't Fix

> Make DistributedCache check if the content of a directory has changed
> ----------------------------------------------------------------------
>
>                 Key: MAPREDUCE-6874
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6874
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>            Reporter: Attila Sasvari
[jira] [Commented] (MAPREDUCE-6874) Make DistributedCache check if the content of a directory has changed
[ https://issues.apache.org/jira/browse/MAPREDUCE-6874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15953441#comment-15953441 ]

Jason Lowe commented on MAPREDUCE-6874:
---------------------------------------
This is a limitation of the distributed cache. It can be very expensive to do a full-depth traversal of a directory tree, and the API only supports one timestamp per distributed cache entry. Not only is it expensive to stat the tree to see whether it has changed, it's also expensive to localize the files: there's RPC overhead for each file in the tree.

It is much more efficient, and safer, to use an archive (e.g. .tar.gz, .zip, etc.) instead of a directory. Then there's only one timestamp we need to check to know whether anything in the "tree" has changed. Arguably, directory trees shouldn't be supported in the distributed cache at all, but I believe they were added way back when to support use cases where a chain of MapReduce jobs needed the output of a previous job (i.e. a directory) to be used as a cache file for the next job (e.g. a map-side join).

> Make DistributedCache check if the content of a directory has changed
> ----------------------------------------------------------------------
>
>                 Key: MAPREDUCE-6874
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6874
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>            Reporter: Attila Sasvari
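The archive alternative suggested in this comment can be sketched with Python's standard library: packing the tree into a single .tar.gz means one file, one timestamp to check, and one localization. The sketch below covers only the packaging step (function and path names are illustrative; shipping the resulting archive via `-archives` or `addCacheArchive()` is the usual Hadoop workflow and is not shown):

```python
import os
import tarfile
import tempfile

def pack_components(src_dir, archive_path):
    """Pack a directory tree into one gzipped tar.

    The archive is a single file, so a single mtime check (and a
    single localization RPC) covers every file in the tree.
    """
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(src_dir, arcname=os.path.basename(src_dir))
    return archive_path

# Build a tiny stand-in for the 'components' tree and pack it.
root = tempfile.mkdtemp()
components = os.path.join(root, "components")
os.makedirs(os.path.join(components, "subsub"))
with open(os.path.join(components, "subsub", "subsub.sh"), "w") as f:
    f.write("echo hi\n")

archive = pack_components(components, os.path.join(root, "components.tar.gz"))
with tarfile.open(archive) as tar:
    names = tar.getnames()   # every path the archive preserves
print(sorted(names))
```

Rebuilding the archive after editing `subsub.sh` updates the archive file's own mtime, so a single-timestamp check detects the change, which the per-directory timestamp in the original report could not.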
[jira] [Created] (MAPREDUCE-6874) Make DistributedCache check if the content of a directory has changed
Attila Sasvari created MAPREDUCE-6874:
-----------------------------------------
             Summary: Make DistributedCache check if the content of a directory has changed
                 Key: MAPREDUCE-6874
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6874
             Project: Hadoop Map/Reduce
          Issue Type: New Feature
            Reporter: Attila Sasvari

DistributedCache does not check recursively whether the content of a directory has changed when adding files to it with {{DistributedCache.addCacheFile()}}.

h5. Background
I have an Oozie workflow on HDFS:
{code}
example_workflow
├── job.properties
├── lib
│   ├── components
│   │   ├── sub-component.sh
│   │   └── subsub
│   │       └── subsub.sh
│   ├── main.sh
│   └── sub.sh
└── workflow.xml
{code}
I executed the workflow, then made some changes in {{subsub.sh}} and replaced the file on HDFS. When I re-ran the workflow, DistributedCache did not notice the changes because the timestamp on the {{components}} directory did not change. As a result, the old script was materialized.

This behaviour might be related to [determineTimestamps()|https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/filecache/ClientDistributedCacheManager.java#L84].

In order to use the new script during workflow execution, I had to update the whole {{components}} directory.

h6. Some more info:
In Oozie, [DistributedCache.addCacheFile()|https://github.com/apache/oozie/blob/master/core/src/main/java/org/apache/oozie/action/hadoop/JavaActionExecutor.java#L625] is used to add files to the distributed cache.