[jira] [Resolved] (MAPREDUCE-6992) Race for temp dir in LocalDistributedCacheManager.java
[ https://issues.apache.org/jira/browse/MAPREDUCE-6992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Philip Zeyliger resolved MAPREDUCE-6992.
Resolution: Duplicate

I agree; this is a dupe. Thanks!

> Race for temp dir in LocalDistributedCacheManager.java
> --
>
> Key: MAPREDUCE-6992
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6992
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Reporter: Philip Zeyliger
>
> When localizing distributed cache files in "local" mode,
> LocalDistributedCacheManager.java chooses a "unique" directory based on a
> millisecond time stamp. When running code with some parallelism, it's
> possible to run into this.
> The error message looks like
> {code}
> bq. java.io.FileNotFoundException: jenkins/mapred/local/1508958341829_tmp
> does not exist
> {code}
> I ran into this in Impala's data loading. There, we run a HiveServer2 which
> runs in MapReduce. If multiple queries are submitted simultaneously to the
> HS2, they conflict on this directory. Googling found that StreamSets ran into
> something very similar looking at
> https://issues.streamsets.com/browse/SDC-5473.
> I believe the buggy code is (link:
> https://github.com/apache/hadoop/blob/2da654e34a436aae266c1fbdec5c1067da8d854e/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-common/src/main/java/org/apache/hadoop/mapred/LocalDistributedCacheManager.java#L94)
> {code}
> // Generating unique numbers for FSDownload.
> AtomicLong uniqueNumberGenerator =
>     new AtomicLong(System.currentTimeMillis());
> {code}
> Notably, a similar code path uses an actual random number generator
> ({{LocalJobRunner.java}},
> https://github.com/apache/hadoop/blob/2da654e34a436aae266c1fbdec5c1067da8d854e/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-common/src/main/java/org/apache/hadoop/mapred/LocalJobRunner.java#L912).
> {code}
> public String getStagingAreaDir() throws IOException {
>   Path stagingRootDir = new Path(conf.get(JTConfig.JT_STAGING_AREA_ROOT,
>       "/tmp/hadoop/mapred/staging"));
>   UserGroupInformation ugi = UserGroupInformation.getCurrentUser();
>   String user;
>   randid = rand.nextInt(Integer.MAX_VALUE);
> {code}

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org
[jira] [Created] (MAPREDUCE-6992) Race for temp dir in LocalDistributedCacheManager.java
Philip Zeyliger created MAPREDUCE-6992:
--

Summary: Race for temp dir in LocalDistributedCacheManager.java
Key: MAPREDUCE-6992
URL: https://issues.apache.org/jira/browse/MAPREDUCE-6992
Project: Hadoop Map/Reduce
Issue Type: Bug
Reporter: Philip Zeyliger

When localizing distributed cache files in "local" mode, LocalDistributedCacheManager.java chooses a "unique" directory based on a millisecond time stamp. When running code with some parallelism, it's possible to run into this.

The error message looks like
{code}
bq. java.io.FileNotFoundException: jenkins/mapred/local/1508958341829_tmp does not exist
{code}

I ran into this in Impala's data loading. There, we run a HiveServer2 which runs in MapReduce. If multiple queries are submitted simultaneously to the HS2, they conflict on this directory. Googling found that StreamSets ran into something very similar looking at https://issues.streamsets.com/browse/SDC-5473.

I believe the buggy code is (link: https://github.com/apache/hadoop/blob/2da654e34a436aae266c1fbdec5c1067da8d854e/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-common/src/main/java/org/apache/hadoop/mapred/LocalDistributedCacheManager.java#L94)
{code}
// Generating unique numbers for FSDownload.
AtomicLong uniqueNumberGenerator =
    new AtomicLong(System.currentTimeMillis());
{code}

Notably, a similar code path uses an actual random number generator ({{LocalJobRunner.java}}, https://github.com/apache/hadoop/blob/2da654e34a436aae266c1fbdec5c1067da8d854e/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-common/src/main/java/org/apache/hadoop/mapred/LocalJobRunner.java#L912).
{code}
public String getStagingAreaDir() throws IOException {
  Path stagingRootDir = new Path(conf.get(JTConfig.JT_STAGING_AREA_ROOT,
      "/tmp/hadoop/mapred/staging"));
  UserGroupInformation ugi = UserGroupInformation.getCurrentUser();
  String user;
  randid = rand.nextInt(Integer.MAX_VALUE);
{code}
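The collision described in this report can be reproduced in a small standalone sketch. The class `TempDirNames` and both helpers are hypothetical and are not Hadoop code; the sketch only illustrates why a counter seeded from a clock value does not give unique names across independent instances, and shows one possible alternative, `Files.createTempDirectory`, which creates a directory that did not exist before. It is not claimed to be the fix applied in Hadoop.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical illustration; not the Hadoop implementation.
final class TempDirNames {

    // Mirrors the pattern in the report: each manager seeds its own counter
    // from a clock value, so two managers with the same seed hand out the
    // same "unique" name.
    static String timestampName(AtomicLong generator) {
        return generator.incrementAndGet() + "_tmp";
    }

    // One possible alternative (an assumption, not the upstream fix):
    // let the JDK create a directory that did not exist before.
    static Path freshDir(Path parent) throws IOException {
        return Files.createTempDirectory(parent, "distcache_");
    }

    public static void main(String[] args) throws IOException {
        long sameMillisecond = 1508958341828L; // example clock reading
        AtomicLong jobA = new AtomicLong(sameMillisecond);
        AtomicLong jobB = new AtomicLong(sameMillisecond);
        // Both jobs derive the same directory name: prints true.
        System.out.println(timestampName(jobA).equals(timestampName(jobB)));

        Path parent = Files.createTempDirectory("local");
        Path a = freshDir(parent);
        Path b = freshDir(parent);
        // The JDK-created directories differ: prints false.
        System.out.println(a.equals(b));
    }
}
```

A process-wide counter would avoid collisions inside one JVM, but separate processes started in the same millisecond would still clash, which is why the sketch leans on the filesystem to arbitrate.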
[jira] [Commented] (MAPREDUCE-5653) DistCp does not honour config-overrides for mapreduce.[map,reduce].memory.mb
[ https://issues.apache.org/jira/browse/MAPREDUCE-5653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14358974#comment-14358974 ]

Philip Zeyliger commented on MAPREDUCE-5653:

You could make an argument that DistCp, as a YARN application, knows better than the defaults about how much memory it uses. I.e., that the bug is that DistCp isn't setting both intimately related settings ({{mapred.job.{map|reduce}.memory.mb}} and {{mapreduce.map.java.opts}}), but rather just one. If the defaults in your cluster were to use a lot of memory, and DistCp uses very little (after all, it's copying a buffer around), that's wasteful.

DistCp does not honour config-overrides for mapreduce.[map,reduce].memory.mb
Key: MAPREDUCE-5653
URL: https://issues.apache.org/jira/browse/MAPREDUCE-5653
Project: Hadoop Map/Reduce
Issue Type: Bug
Components: distcp
Affects Versions: 0.23.9, 2.2.0
Reporter: Mithun Radhakrishnan
Assignee: Ratandeep Ratti
Fix For: 3.0.0
Attachments: MAPREDUCE-5653.branch-0.23.patch, MAPREDUCE-5653.branch-2.patch, MAPREDUCE-5653.trunk.2.patch, MAPREDUCE-5653.trunk.patch

When a DistCp job is run through Oozie (through a Java action that launches DistCp), one sees that mapred.child.java.opts as set from the caller is honoured by DistCp. But, DistCp doesn't seem to honour any overrides for configs mapreduce.[map,reduce].memory.mb. Problem has been identified. I'll post a patch shortly.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MAPREDUCE-5653) DistCp does not honour config-overrides for mapreduce.[map,reduce].memory.mb
[ https://issues.apache.org/jira/browse/MAPREDUCE-5653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14359045#comment-14359045 ]

Philip Zeyliger commented on MAPREDUCE-5653:

Allen, do you think there's more than just this one Xmx passthrough that's affecting DistCp? There's not much smarts it needs: it's not like it's ever doing anything besides copying files. Hadn't seen MAPREDUCE-5785. Agree that that's an excellent direction.

DistCp does not honour config-overrides for mapreduce.[map,reduce].memory.mb
Key: MAPREDUCE-5653
URL: https://issues.apache.org/jira/browse/MAPREDUCE-5653
Project: Hadoop Map/Reduce
Issue Type: Bug
Components: distcp
Affects Versions: 0.23.9, 2.2.0
Reporter: Mithun Radhakrishnan
Assignee: Ratandeep Ratti
Fix For: 3.0.0
Attachments: MAPREDUCE-5653.branch-0.23.patch, MAPREDUCE-5653.branch-2.patch, MAPREDUCE-5653.trunk.2.patch, MAPREDUCE-5653.trunk.patch

When a DistCp job is run through Oozie (through a Java action that launches DistCp), one sees that mapred.child.java.opts as set from the caller is honoured by DistCp. But, DistCp doesn't seem to honour any overrides for configs mapreduce.[map,reduce].memory.mb. Problem has been identified. I'll post a patch shortly.
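The remark in this thread that the container size and the JVM heap are "intimately related" can be illustrated with a small sketch. It uses a plain `Map` as a stand-in for a job configuration; the class `MemorySettings`, the helper `pairedSettings`, and the 4/5 heap-to-container ratio are assumptions made for illustration and do not describe what DistCp or the attached patches do.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: derive the JVM heap from the container size so the
// two settings cannot drift apart.
final class MemorySettings {

    static Map<String, String> pairedSettings(int containerMb) {
        // Assumed ratio: leave one fifth of the container for non-heap memory.
        int heapMb = containerMb * 4 / 5;
        Map<String, String> conf = new LinkedHashMap<>();
        conf.put("mapreduce.map.memory.mb", Integer.toString(containerMb));
        conf.put("mapreduce.map.java.opts", "-Xmx" + heapMb + "m");
        return conf;
    }

    public static void main(String[] args) {
        // Prints {mapreduce.map.memory.mb=1024, mapreduce.map.java.opts=-Xmx819m}
        System.out.println(pairedSettings(1024));
    }
}
```

Setting only one of the two keys is the failure mode discussed in the comment: a small heap inside a large container wastes memory, and a large heap inside a small container risks the task being killed.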
[jira] [Updated] (MAPREDUCE-5577) Allow querying the JobHistoryServer by job arrival time
[ https://issues.apache.org/jira/browse/MAPREDUCE-5577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Philip Zeyliger updated MAPREDUCE-5577: --- Description: The JobHistoryServer REST APIs currently allow querying by job submit time and finish time. However, jobs don't necessarily arrive in order of their finish time, meaning that a client who wants to stay on top of all completed jobs needs to query large time intervals to make sure they're not missing anything. Exposing functionality to allow querying by the time a job lands at the JobHistoryServer would allow clients to set the start of their query interval to the time of their last query. The arrival time of a job would be defined as the time that it lands in the done directory and can be picked up using the last modified date on history files. was: The JobHistoryServer REST APIs currently allow querying by job submit time and finish time. However, jobs don't necessarily arrive in order of their finish time, meaning that a client who wants to stay on top of all completed jobs needs to query large time intervals to make sure they're not missing anything. Exposing functionality to allow querying by the time a job lands at the JobHistoryServer would allow clients to set the start of their query interval to the time of their last query. The arrival time of a job would be defined as the time that it lands in the done directory and can be picked up using the last modified date on history files. Allow querying the JobHistoryServer by job arrival time --- Key: MAPREDUCE-5577 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5577 Project: Hadoop Map/Reduce Issue Type: Improvement Components: jobhistoryserver Reporter: Sandy Ryza Assignee: Sandy Ryza Attachments: MAPREDUCE-5577.patch The JobHistoryServer REST APIs currently allow querying by job submit time and finish time. 
However, jobs don't necessarily arrive in order of their finish time, meaning that a client who wants to stay on top of all completed jobs needs to query large time intervals to make sure they're not missing anything. Exposing functionality to allow querying by the time a job lands at the JobHistoryServer would allow clients to set the start of their query interval to the time of their last query. The arrival time of a job would be defined as the time that it lands in the done directory and can be picked up using the last modified date on history files. -- This message was sent by Atlassian JIRA (v6.1#6144)
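The polling pattern proposed in this issue can be sketched as follows. The sketch uses a local directory and `java.nio.file` as a stand-in for the done directory; the class `ArrivalTimeQuery` and the method `arrivedSince` are hypothetical names and are not part of the JobHistoryServer REST API.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.FileTime;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: treat a file's last-modified time as its arrival time.
final class ArrivalTimeQuery {

    // Returns the files in doneDir whose last-modified time is strictly
    // after sinceMillis (the time of the client's previous query).
    static List<Path> arrivedSince(Path doneDir, long sinceMillis) throws IOException {
        List<Path> result = new ArrayList<>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(doneDir)) {
            for (Path p : files) {
                if (Files.getLastModifiedTime(p).toMillis() > sinceMillis) {
                    result.add(p);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        Path done = Files.createTempDirectory("done");
        Path oldJob = Files.createFile(done.resolve("job_old.jhist"));
        Path newJob = Files.createFile(done.resolve("job_new.jhist"));
        Files.setLastModifiedTime(oldJob, FileTime.fromMillis(1_000_000L));
        Files.setLastModifiedTime(newJob, FileTime.fromMillis(2_000_000L));
        // A client whose last query was at t=1_500_000 sees only job_new.jhist.
        System.out.println(arrivedSince(done, 1_500_000L));
    }
}
```

With this filter a client can set the start of each query interval to the time of its previous query, instead of re-scanning a wide finish-time window.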
[jira] [Commented] (MAPREDUCE-4469) Resource calculation in child tasks is CPU-heavy
[ https://issues.apache.org/jira/browse/MAPREDUCE-4469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13493443#comment-13493443 ]

Philip Zeyliger commented on MAPREDUCE-4469:

If you're looking for the resource usage of a process and its children, look at {{man getrusage}}, which includes a flag to get the CPU usage of the children. Mind you, you'd need native code to get at it.

Resource calculation in child tasks is CPU-heavy
Key: MAPREDUCE-4469
URL: https://issues.apache.org/jira/browse/MAPREDUCE-4469
Project: Hadoop Map/Reduce
Issue Type: Bug
Components: performance, task
Affects Versions: 1.0.3
Reporter: Todd Lipcon
Assignee: Ahmed Radwan
Attachments: MAPREDUCE-4469.patch, MAPREDUCE-4469_rev2.patch, MAPREDUCE-4469_rev3.patch, MAPREDUCE-4469_rev4.patch

In doing some benchmarking on a hadoop-1 derived codebase, I noticed that each of the child tasks was doing a ton of syscalls. Upon stracing, I noticed that it's spending a lot of time looping through all the files in /proc to calculate resource usage. As a test, I added a flag to disable use of the ResourceCalculatorPlugin within the tasks. On a CPU-bound 500G-sort workload, this improved total job runtime by about 10% (map slot-seconds by 14%, reduce slot-seconds by 8%).

--
This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (MAPREDUCE-4610) Support deprecated mapreduce.job.counters.limit property in MR2
[ https://issues.apache.org/jira/browse/MAPREDUCE-4610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13445110#comment-13445110 ] Philip Zeyliger commented on MAPREDUCE-4610: +1 LGTM. Support deprecated mapreduce.job.counters.limit property in MR2 --- Key: MAPREDUCE-4610 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4610 Project: Hadoop Map/Reduce Issue Type: Bug Components: mrv2 Affects Versions: 2.0.0-alpha Reporter: Tom White Assignee: Tom White Attachments: MAPREDUCE-4610.patch The property mapreduce.job.counters.limit was introduced in MAPREDUCE-1943, but the mechanism was changed in MAPREDUCE-901 where the property name was changed to mapreduce.job.counters.max without supporting the old name. We should deprecate but honour the old name to make it easier for folks to move from Hadoop 1 to Hadoop 2. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
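The "deprecate but honour" behaviour requested in this issue can be sketched as a lookup that prefers the new name and falls back to the old one. The sketch uses a plain `Map` rather than Hadoop's `Configuration`; the class `CounterLimit`, the method `resolve`, and the caller-supplied default are hypothetical and do not reproduce the attached patch.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of honouring a deprecated property name.
final class CounterLimit {
    static final String NEW_KEY = "mapreduce.job.counters.max";
    static final String OLD_KEY = "mapreduce.job.counters.limit";

    // The new name wins; the old name is consulted only when the new one is absent.
    static int resolve(Map<String, String> conf, int defaultLimit) {
        String value = conf.get(NEW_KEY);
        if (value == null) {
            value = conf.get(OLD_KEY); // deprecated name, still honoured
        }
        return value == null ? defaultLimit : Integer.parseInt(value);
    }

    public static void main(String[] args) {
        Map<String, String> hadoop1Style = new HashMap<>();
        hadoop1Style.put(OLD_KEY, "500");
        // A config written against the old name keeps working: prints 500.
        System.out.println(resolve(hadoop1Style, 100));
    }
}
```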
[jira] [Commented] (MAPREDUCE-279) Map-Reduce 2.0
[ https://issues.apache.org/jira/browse/MAPREDUCE-279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13085979#comment-13085979 ] Philip Zeyliger commented on MAPREDUCE-279: --- I will return on the 24th. For urgent matters, please contact my teammates or Amr. Thanks, -- Philip Map-Reduce 2.0 -- Key: MAPREDUCE-279 URL: https://issues.apache.org/jira/browse/MAPREDUCE-279 Project: Hadoop Map/Reduce Issue Type: Improvement Components: mrv2 Reporter: Arun C Murthy Assignee: Arun C Murthy Fix For: 0.23.0 Attachments: MR-279-script.sh, MR-279.patch, MR-279.patch, MR-279.sh, MR-279_MR_files_to_move.txt, MR-279_MR_files_to_move.txt, MapReduce_NextGen_Architecture.pdf, capacity-scheduler-dark-theme.png, hadoop_contributors_meet_07_01_2011.pdf, multi-column-stable-sort-default-theme.png, post-move.patch, yarn-state-machine.job.dot, yarn-state-machine.job.png, yarn-state-machine.task-attempt.dot, yarn-state-machine.task-attempt.png, yarn-state-machine.task.dot, yarn-state-machine.task.png Re-factor MapReduce into a generic resource scheduler and a per-job, user-defined component that manages the application execution. Check it out by following [the instructions|http://goo.gl/rSJJC]. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (MAPREDUCE-2803) Separate client and server configs
[ https://issues.apache.org/jira/browse/MAPREDUCE-2803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13082803#comment-13082803 ]

Philip Zeyliger commented on MAPREDUCE-2803:

I'm a huge +1 (like +1.1 or +1.2) for separating out client and server configs, fwiw. I've seen countless folks (mostly myself, of course) get confused about whether a given config is client-side, jobtracker-side, or task-tracker side. Since configs aren't going to be compatible anyway, this is a reasonable time to try to separate that.

Separate client and server configs
--
Key: MAPREDUCE-2803
URL: https://issues.apache.org/jira/browse/MAPREDUCE-2803
Project: Hadoop Map/Reduce
Issue Type: Improvement
Components: mrv2
Affects Versions: 0.23.0
Reporter: Luke Lu
Fix For: 0.23.0

yarn-{site,default}.xml contains many knobs that non-ops users don't need to know about (e.g., server principals and keytab locations). It's confusing to users. Let's separate the server config into separate files, yarn-server-{site,default}.xml. YARN common and client configs would remain in yarn-{site,default}.xml, and YarnServerConfig shall read both.
[jira] [Commented] (MAPREDUCE-2463) Job History files are not moving to done folder when job history location is hdfs location
[ https://issues.apache.org/jira/browse/MAPREDUCE-2463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13027564#comment-13027564 ] Philip Zeyliger commented on MAPREDUCE-2463: I recall MAPREDUCE-2351 changing how this code path worked. Don't entirely remember the details any more... Job History files are not moving to done folder when job history location is hdfs location -- Key: MAPREDUCE-2463 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2463 Project: Hadoop Map/Reduce Issue Type: Bug Components: jobtracker Affects Versions: 0.23.0 Reporter: Devaraj K Assignee: Devaraj K If mapreduce.jobtracker.jobhistory.location is configured as HDFS location then either during initialization of Job Tracker (while moving old job history files) or after completion of the job, history files are not moving to done and giving following exception. {code:xml} 2011-04-29 15:27:27,813 ERROR org.apache.hadoop.mapreduce.jobhistory.JobHistory: Unable to move history file to DONE folder. 
java.lang.IllegalArgumentException: Wrong FS: hdfs://10.18.52.146:9000/history/job_201104291518_0001_root, expected: file:///
	at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:402)
	at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:58)
	at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:419)
	at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:294)
	at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:215)
	at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:1516)
	at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:1492)
	at org.apache.hadoop.fs.FileSystem.moveFromLocalFile(FileSystem.java:1482)
	at org.apache.hadoop.mapreduce.jobhistory.JobHistory.moveToDoneNow(JobHistory.java:348)
	at org.apache.hadoop.mapreduce.jobhistory.JobHistory.access$200(JobHistory.java:61)
	at org.apache.hadoop.mapreduce.jobhistory.JobHistory$1.run(JobHistory.java:439)
	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
	at java.lang.Thread.run(Thread.java:619)
{code}
[jira] Commented: (MAPREDUCE-279) Map-Reduce 2.0
[ https://issues.apache.org/jira/browse/MAPREDUCE-279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13008882#comment-13008882 ] Philip Zeyliger commented on MAPREDUCE-279: --- I'm traveling and will return to the office on Monday, March 28th. For urgent matters, please contact Aparna Ramani. Thanks! -- Philip Map-Reduce 2.0 -- Key: MAPREDUCE-279 URL: https://issues.apache.org/jira/browse/MAPREDUCE-279 Project: Hadoop Map/Reduce Issue Type: Improvement Components: jobtracker, tasktracker Reporter: Arun C Murthy Assignee: Arun C Murthy Fix For: 0.23.0 Attachments: MR-279.patch, MR-279.patch, MR-279.sh, MR-279_MR_files_to_move.txt Re-factor MapReduce into a generic resource scheduler and a per-job, user-defined component that manages the application execution. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Created: (MAPREDUCE-2381) JobTracker instrumentation not consistent about error handling
JobTracker instrumentation not consistent about error handling -- Key: MAPREDUCE-2381 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2381 Project: Hadoop Map/Reduce Issue Type: Improvement Reporter: Philip Zeyliger Attachments: MAPREDUCE-2381.patch.txt In the current code, if the class specified by the JobTracker instrumentation config property is not there, the JobTracker fails to start with a ClassNotFound. If it's there, but it can't load for whatever reason, the JobTracker continues with the default. Having two different error-handling routes is a bit confusing; I propose to move one line so that it's consistent. (On the TaskTracker instrumentation side, if any of the multiple instrumentations aren't available, the default is used.) The attached patch merely moves a line inside of the try block that's already there. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Updated: (MAPREDUCE-2381) JobTracker instrumentation not consistent about error handling
[ https://issues.apache.org/jira/browse/MAPREDUCE-2381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Philip Zeyliger updated MAPREDUCE-2381: --- Attachment: MAPREDUCE-2381.patch.txt JobTracker instrumentation not consistent about error handling -- Key: MAPREDUCE-2381 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2381 Project: Hadoop Map/Reduce Issue Type: Improvement Reporter: Philip Zeyliger Attachments: MAPREDUCE-2381.patch.txt In the current code, if the class specified by the JobTracker instrumentation config property is not there, the JobTracker fails to start with a ClassNotFound. If it's there, but it can't load for whatever reason, the JobTracker continues with the default. Having two different error-handling routes is a bit confusing; I propose to move one line so that it's consistent. (On the TaskTracker instrumentation side, if any of the multiple instrumentations aren't available, the default is used.) The attached patch merely moves a line inside of the try block that's already there. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Updated: (MAPREDUCE-2381) JobTracker instrumentation not consistent about error handling
[ https://issues.apache.org/jira/browse/MAPREDUCE-2381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Philip Zeyliger updated MAPREDUCE-2381: --- Assignee: Philip Zeyliger Status: Patch Available (was: Open) JobTracker instrumentation not consistent about error handling -- Key: MAPREDUCE-2381 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2381 Project: Hadoop Map/Reduce Issue Type: Improvement Reporter: Philip Zeyliger Assignee: Philip Zeyliger Attachments: MAPREDUCE-2381.patch.txt In the current code, if the class specified by the JobTracker instrumentation config property is not there, the JobTracker fails to start with a ClassNotFound. If it's there, but it can't load for whatever reason, the JobTracker continues with the default. Having two different error-handling routes is a bit confusing; I propose to move one line so that it's consistent. (On the TaskTracker instrumentation side, if any of the multiple instrumentations aren't available, the default is used.) The attached patch merely moves a line inside of the try block that's already there. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
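The inconsistency described in MAPREDUCE-2381 comes down to which statements sit inside the `try` block. A minimal sketch of the consistent version follows; the class `InstrumentationLoader`, the nested `DefaultInstrumentation`, and the method `load` are hypothetical stand-ins and are not the JobTracker code or the attached patch.

```java
// Hypothetical sketch: class lookup and instantiation share one try block,
// so both kinds of failure take the same fallback path.
final class InstrumentationLoader {

    // Stand-in for the default instrumentation implementation.
    static class DefaultInstrumentation {}

    static Object load(String className) {
        try {
            // If the lookup were outside the try, a missing class would
            // propagate to the caller while an instantiation failure
            // would fall back to the default.
            Class<?> cls = Class.forName(className);
            return cls.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            return new DefaultInstrumentation();
        }
    }

    public static void main(String[] args) {
        // Missing class: falls back to the default.
        System.out.println(load("no.such.Instrumentation").getClass().getSimpleName());
        // Class exists but cannot be instantiated (an interface): same fallback.
        System.out.println(load("java.lang.Runnable").getClass().getSimpleName());
    }
}
```

Both calls in `main` print `DefaultInstrumentation`, which is the single error-handling route the issue asks for.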
[jira] Updated: (MAPREDUCE-2043) TaskTrackerInstrumentation and JobTrackerInstrumentation should be public
[ https://issues.apache.org/jira/browse/MAPREDUCE-2043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Philip Zeyliger updated MAPREDUCE-2043: --- Attachment: MAPREDUCE-2043.patch.txt Chris and Luke, Thanks for the context. I think the experimental developers ought to embed/extend Hadoop in a different package than org.apache.hadoop.mapred, so there's a reasonable argument for 'public', with the interface caveats. Agree wholeheartedly that this interface should be evolving. It's proven a convenient way, actually, to try some things out. I've taken Luke's suggestion and added the interface annotations. I've tested this with ant compile-core only (ant compile breaks on mumak in contrib). Attached is that new patch. -- Philip TaskTrackerInstrumentation and JobTrackerInstrumentation should be public - Key: MAPREDUCE-2043 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2043 Project: Hadoop Map/Reduce Issue Type: Improvement Components: tasktracker Affects Versions: 0.22.0 Reporter: Philip Zeyliger Assignee: Philip Zeyliger Attachments: MAPREDUCE-2043.patch.txt, MAPREDUCE-2043.patch.txt Hadoop administrators can specify classes to be loaded as TaskTrackerInstrumentation and JobTrackerInstrumentation implementations, which, roughly, define listeners on TT and JT events. Unfortunately, since the class has default access, extending it requires setting the extension's package to org.apache.hadoop.mapred, which seems like poor form. I propose we make the two instrumentation classes public, so they can be extended wherever. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (MAPREDUCE-2043) TaskTrackerInstrumentation and JobTrackerInstrumentation should be public
[ https://issues.apache.org/jira/browse/MAPREDUCE-2043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Philip Zeyliger updated MAPREDUCE-2043:
---
Attachment: MAPREDUCE-2043.patch.txt

Patch attached. Here are the changed lines, to save people some clicks:

{noformat}
-class JobTrackerInstrumentation {
+public class JobTrackerInstrumentation {

-class TaskTrackerInstrumentation {
+public class TaskTrackerInstrumentation {
{noformat}

TaskTrackerInstrumentation and JobTrackerInstrumentation should be public
-
Key: MAPREDUCE-2043
URL: https://issues.apache.org/jira/browse/MAPREDUCE-2043
Project: Hadoop Map/Reduce
Issue Type: Improvement
Components: tasktracker
Affects Versions: 0.22.0
Reporter: Philip Zeyliger
Assignee: Philip Zeyliger
Attachments: MAPREDUCE-2043.patch.txt

Hadoop administrators can specify classes to be loaded as TaskTrackerInstrumentation and JobTrackerInstrumentation implementations, which, roughly, define listeners on TT and JT events. Unfortunately, since the class has default access, extending it requires setting the extension's package to org.apache.hadoop.mapred, which seems like poor form. I propose we make the two instrumentation classes public, so they can be extended wherever.
[jira] Created: (MAPREDUCE-2043) TaskTrackerInstrumentation and JobTrackerInstrumentation should be public
TaskTrackerInstrumentation and JobTrackerInstrumentation should be public - Key: MAPREDUCE-2043 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2043 Project: Hadoop Map/Reduce Issue Type: Improvement Components: tasktracker Affects Versions: 0.22.0 Reporter: Philip Zeyliger Assignee: Philip Zeyliger Attachments: MAPREDUCE-2043.patch.txt Hadoop administrators can specify classes to be loaded as TaskTrackerInstrumentation and JobTrackerInstrumentation implementations, which, roughly, define listeners on TT and JT events. Unfortunately, since the class has default access, extending it requires setting the extension's package to org.apache.hadoop.mapred, which seems like poor form. I propose we make the two instrumentation classes public, so they can be extended wherever. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (MAPREDUCE-1881) Improve TaskTrackerInstrumentation
[ https://issues.apache.org/jira/browse/MAPREDUCE-1881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12897334#action_12897334 ]

Philip Zeyliger commented on MAPREDUCE-1881:

I'll chime in that I'm using the instrumentation classes and find them a useful way to listen to some events that are otherwise hard to get at.

Improve TaskTrackerInstrumentation
--
Key: MAPREDUCE-1881
URL: https://issues.apache.org/jira/browse/MAPREDUCE-1881
Project: Hadoop Map/Reduce
Issue Type: Improvement
Reporter: Matei Zaharia
Assignee: Matei Zaharia
Priority: Minor
Attachments: mapreduce-1881-v2.patch, mapreduce-1881-v2b.patch, mapreduce-1881.patch

The TaskTrackerInstrumentation class provides a useful way to capture key events at the TaskTracker for use in various reporting tools, but it is currently rather limited, because only one TaskTrackerInstrumentation can be added to a given TaskTracker and this object receives minimal information about tasks (only their IDs). I propose enhancing the functionality through two changes:
# Support a comma-separated list of TaskTrackerInstrumentation classes rather than just a single one in the JobConf, and report events to all of them.
# Make the reportTaskLaunch and reportTaskEnd methods in TaskTrackerInstrumentation receive a reference to a whole Task object rather than just its TaskAttemptID. It might also be useful to make the latter receive the task's final state, i.e. failed, killed, or successful.

I'm just posting this here to get a sense of whether this is a good idea. If people think it's okay, I will make a patch against trunk that implements these changes.
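The first proposed change in this issue, a comma-separated list of instrumentation classes that all receive each event, can be sketched as below. Everything here is hypothetical: the `Listener` interface stands in for TaskTrackerInstrumentation, a `String` task id stands in for the Task object, and `parse` and `reportTaskLaunch` are illustrative helpers, not the attached patch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of fanning one event out to several listeners that
// were named in a comma-separated configuration value.
final class MultiInstrumentation {

    interface Listener {
        void reportTaskLaunch(String taskId);
    }

    // Simple listener that records what it was told, for demonstration.
    static final class Recording implements Listener {
        final List<String> seen = new ArrayList<>();

        @Override
        public void reportTaskLaunch(String taskId) {
            seen.add(taskId);
        }
    }

    // Instantiates every class named in the list; empty entries are skipped.
    static List<Listener> parse(String classNames) throws ReflectiveOperationException {
        List<Listener> listeners = new ArrayList<>();
        for (String name : classNames.split(",")) {
            String trimmed = name.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            Object instance = Class.forName(trimmed).getDeclaredConstructor().newInstance();
            listeners.add((Listener) instance);
        }
        return listeners;
    }

    // Reports the event to all configured listeners, not just the first.
    static void reportTaskLaunch(List<Listener> listeners, String taskId) {
        for (Listener listener : listeners) {
            listener.reportTaskLaunch(taskId);
        }
    }

    public static void main(String[] args) throws ReflectiveOperationException {
        String name = Recording.class.getName();
        List<Listener> listeners = parse(name + ", " + name);
        reportTaskLaunch(listeners, "attempt_0001_m_000000_0");
        System.out.println(listeners.size()); // prints 2
    }
}
```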
[jira] Commented: (MAPREDUCE-220) Collecting cpu and memory usage for MapReduce tasks
[ https://issues.apache.org/jira/browse/MAPREDUCE-220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12895129#action_12895129 ]

Philip Zeyliger commented on MAPREDUCE-220:
---
Hi Scott,

You could also reset the counters to 0 when the new task is started (sort of like a tare button on a scale). If resourceCalculator.getProcCumulativeCpuTime() were instead resourceCalculator.getCumulativeCpuTimeDelta() [cumulative CPU time since last call], you could use counter.incr() for the CPU usage.

It's also worth mentioning that the memory usage here is the last-known memory usage value. It's not byte-seconds (which wouldn't be that useful), nor is it maximum memory. That seems useful, but it's a bit unintuitive.

{noformat}
+long cpuTime = resourceCalculator.getProcCumulativeCpuTime();
+long pMem = resourceCalculator.getProcPhysicalMemorySize();
+long vMem = resourceCalculator.getProcVirtualMemorySize();
+counters.findCounter(TaskCounter.CPU_MILLISECONDS).setValue(cpuTime);
+counters.findCounter(TaskCounter.PHYSICAL_MEMORY_BYTES).setValue(pMem);
+counters.findCounter(TaskCounter.VIRTUAL_MEMORY_BYTES).setValue(vMem);
{noformat}

Collecting cpu and memory usage for MapReduce tasks
---
Key: MAPREDUCE-220
URL: https://issues.apache.org/jira/browse/MAPREDUCE-220
Project: Hadoop Map/Reduce
Issue Type: New Feature
Components: task, tasktracker
Reporter: Hong Tang
Assignee: Scott Chen
Fix For: 0.22.0
Attachments: MAPREDUCE-220-20100616.txt, MAPREDUCE-220-v1.txt, MAPREDUCE-220.txt

It would be nice for TaskTracker to collect cpu and memory usage for individual Map or Reduce tasks over time.
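The delta-based suggestion in this comment can be sketched independently of Hadoop. The class `CpuDelta` and its `delta` method are hypothetical; they play the role of the proposed `getCumulativeCpuTimeDelta()`, which is only suggested in the comment and is not an existing method as far as this thread shows.

```java
// Hypothetical sketch: turn cumulative CPU readings into per-call deltas,
// so a per-task counter can be incremented instead of overwritten.
final class CpuDelta {
    private long lastCumulativeMillis;

    // Returns the CPU time consumed since the previous call.
    long delta(long cumulativeMillis) {
        long d = cumulativeMillis - lastCumulativeMillis;
        lastCumulativeMillis = cumulativeMillis;
        return d;
    }

    public static void main(String[] args) {
        CpuDelta tracker = new CpuDelta();

        // First task in the JVM: process-wide readings of 100 ms, then 250 ms.
        long firstTaskCounter = 0;
        firstTaskCounter += tracker.delta(100);
        firstTaskCounter += tracker.delta(250);
        System.out.println(firstTaskCounter); // prints 250

        // Second task reuses the JVM: its counter starts at 0 and only
        // picks up CPU time spent after the previous reading.
        long secondTaskCounter = 0;
        secondTaskCounter += tracker.delta(400);
        System.out.println(secondTaskCounter); // prints 150
    }
}
```

Overwriting the counter with the cumulative value, as in the quoted patch lines, would charge the second task with 400 ms when the JVM is reused; incrementing by the delta charges it only the 150 ms it actually used.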
[jira] Commented: (MAPREDUCE-220) Collecting cpu and memory usage for MapReduce tasks
[ https://issues.apache.org/jira/browse/MAPREDUCE-220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12894821#action_12894821 ] Philip Zeyliger commented on MAPREDUCE-220: --- Scott, Quick question: have you tried this patch with JVM re-use enabled? On my quick-reading, this patch doesn't handle that case; I don't know if it's a real problem or not. Cheers, -- Philip Collecting cpu and memory usage for MapReduce tasks --- Key: MAPREDUCE-220 URL: https://issues.apache.org/jira/browse/MAPREDUCE-220 Project: Hadoop Map/Reduce Issue Type: New Feature Components: task, tasktracker Reporter: Hong Tang Assignee: Scott Chen Fix For: 0.22.0 Attachments: MAPREDUCE-220-20100616.txt, MAPREDUCE-220-v1.txt, MAPREDUCE-220.txt It would be nice for TaskTracker to collect cpu and memory usage for individual Map or Reduce tasks over time. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (MAPREDUCE-1126) shuffle should use serialization to get comparator
[ https://issues.apache.org/jira/browse/MAPREDUCE-1126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12806042#action_12806042 ] Philip Zeyliger commented on MAPREDUCE-1126: @Scott: the annotations for Input/OutputFormat seem to be misplaced. It seems desirable to be able to write a single Map function that does wordcount on Strings, regardless of whether those strings are stored in newline-delimited text, sequence files, avro data files, or whatever. @Chris: 1) throwing away all Java type hierarchies. Only sometimes, no? This is only in the case where you explicitly want to do unions (and Java's union support is either Object, type hierarchies, or wrappers). In the typical case, your map functions on SomeSpecificRecordType, outputs SomeSpecificMapOutputKey/ValueType, and so forth. You still get type safety in many of the recommended use cases. shuffle should use serialization to get comparator -- Key: MAPREDUCE-1126 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1126 Project: Hadoop Map/Reduce Issue Type: Improvement Components: task Reporter: Doug Cutting Assignee: Aaron Kimball Fix For: 0.22.0 Attachments: MAPREDUCE-1126.2.patch, MAPREDUCE-1126.3.patch, MAPREDUCE-1126.4.patch, MAPREDUCE-1126.5.patch, MAPREDUCE-1126.6.patch, MAPREDUCE-1126.patch, MAPREDUCE-1126.patch Currently the key comparator is defined as a Java class. Instead we should use the Serialization API to create key comparators. This would permit, e.g., Avro-based comparators to be used, permitting efficient sorting of complex data types without having to write a RawComparator in Java. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (MAPREDUCE-1368) Vertica adapter doesn't use explicity transactions or report progress
[ https://issues.apache.org/jira/browse/MAPREDUCE-1368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12798758#action_12798758 ] Philip Zeyliger commented on MAPREDUCE-1368: Would transactions help you? A speculative task can show up right after the map task decides to commit the transaction, and you're in the same place. Vertica adapter doesn't use explicity transactions or report progress - Key: MAPREDUCE-1368 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1368 Project: Hadoop Map/Reduce Issue Type: Bug Affects Versions: 0.21.0 Reporter: Omer Trajman Assignee: Omer Trajman Fix For: 0.21.0 The vertica adapter doesn't use explicit transactions, so speculative tasks can result in duplicate loads. The JDBC driver supports it so the fix is pretty minor. Also the JDBC driver commits synchronously and the adapter needs to report progress even if it takes longer than the timeout. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (MAPREDUCE-1368) Vertica adapter doesn't use explicity transactions or report progress
[ https://issues.apache.org/jira/browse/MAPREDUCE-1368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12798810#action_12798810 ] Philip Zeyliger commented on MAPREDUCE-1368: Sorry, I wasn't clear. I think that even if you had transactions, you could still have data inserted twice. A map task looks like: (1) start map task, (2) begin transaction, (3) insert many rows, (4) commit transaction, (5) end map task. If you crash between (4) and (5), MapReduce will schedule another worker. Vertica adapter doesn't use explicity transactions or report progress - Key: MAPREDUCE-1368 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1368 Project: Hadoop Map/Reduce Issue Type: Bug Affects Versions: 0.21.0 Reporter: Omer Trajman Assignee: Omer Trajman Fix For: 0.21.0 The vertica adapter doesn't use explicit transactions, so speculative tasks can result in duplicate loads. The JDBC driver supports it so the fix is pretty minor. Also the JDBC driver commits synchronously and the adapter needs to report progress even if it takes longer than the timeout. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
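The five-step sequence in the comment above can be modelled with a toy simulation. Everything here is hypothetical: the in-memory `table` list stands in for the database, a transaction is modelled as a buffer appended on commit, and the retry loop is a simplification of how a framework reschedules a task that never reported success.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the commit-then-crash window; not Vertica or Hadoop code.
final class DuplicateLoadDemo {
    static final List<String> table = new ArrayList<>();

    // Runs one task attempt; returns true only if the attempt reached step (5).
    static boolean runAttempt(List<String> rows, boolean crashAfterCommit) {
        List<String> txn = new ArrayList<>();   // (2) begin transaction
        txn.addAll(rows);                       // (3) insert many rows
        table.addAll(txn);                      // (4) commit transaction
        if (crashAfterCommit) {
            return false;                       // crash before (5): success is never reported
        }
        return true;                            // (5) end map task
    }

    // Retries until an attempt reports success; only the first attempt crashes.
    static void runTask(List<String> rows) {
        boolean crash = true;
        while (!runAttempt(rows, crash)) {
            crash = false;
        }
    }
}
```

In this model, running a task with two rows leaves four rows in `table`, because the first attempt committed before it crashed and the retry committed the same rows again.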
[jira] Commented: (MAPREDUCE-1154) Large-scale, automated test framwork for Map-Reduce
[ https://issues.apache.org/jira/browse/MAPREDUCE-1154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12769898#action_12769898 ] Philip Zeyliger commented on MAPREDUCE-1154: I'm vaguely uncomfortable with having a lot of code, even though it's test code, weaved in via AspectJ. It seems like it will make it very easy to make changes that break the testing code (because the testing code is not visible to the regular tools, and is in an unexpected place). I understand, of course, that the build system will check that the weaving can happen, but since these tests are inherently large-scale and not run at every Hudson (or are they?), it worries me a bit. Has anyone done the reverse and unweaved functions from classes? Seems like we could annotate functions with @RemoveInProduction, and then use some tool to forcibly remove methods from the resulting .class files. Still opaque, but at least it's clear where the testing code is. If you've already been working with this in AspectJ, I'm curious how the experience has been. -- Philip Large-scale, automated test framwork for Map-Reduce --- Key: MAPREDUCE-1154 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1154 Project: Hadoop Map/Reduce Issue Type: New Feature Components: test Reporter: Arun C Murthy Fix For: 0.21.0 Attachments: testing.patch HADOOP-6332 proposes a large-scale, automated, junit-based test-framework for Hadoop. This jira is meant to track relevant work to Map-Reduce. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
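As a rough illustration of the @RemoveInProduction idea from the comment above, the sketch below defines such an annotation and lists the methods that carry it. The annotation, `TestHookScanner`, and `SampleTracker` are all hypothetical names. The sketch only locates the marked methods by reflection; actually stripping them from .class files would need a bytecode tool, which is not shown, and such a tool might prefer class-file retention over the runtime retention used here.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.List;

// Hypothetical marker for test-only methods; RUNTIME retention so reflection can see it.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface RemoveInProduction {}

final class TestHookScanner {
    // Returns the names of the methods declared on 'type' that carry the marker.
    static List<String> findTestOnlyMethods(Class<?> type) {
        List<String> names = new ArrayList<>();
        for (Method m : type.getDeclaredMethods()) {
            if (m.isAnnotationPresent(RemoveInProduction.class)) {
                names.add(m.getName());
            }
        }
        return names;
    }
}

// Example class: the test hook sits next to the production code it exercises.
final class SampleTracker {
    void heartbeat() {}

    @RemoveInProduction
    void injectFailureForTest() {}
}
```

The point of the design is visibility: the testing code lives in the class it touches, so regular tools and refactorings see it, while a build step decides whether it ships.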
[jira] Commented: (MAPREDUCE-989) Allow segregation of DistributedCache for maps and reduces
[ https://issues.apache.org/jira/browse/MAPREDUCE-989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12756676#action_12756676 ]

Philip Zeyliger commented on MAPREDUCE-989:
---

The use cases definitely make sense. Unpacking archives on setup tasks is often going to be pointless.

I've been thinking about what a reasonable API for this would be (especially after working on MAPREDUCE-476), from the job submitter's point of view. One thought is:

bq. addCacheFile(URI path, Set<TaskType> tasks, Set<DistributedCacheOptions> options);

Where the default for tasks is an ImmutableSet(EnumSet<TaskType>) containing MAP and REDUCE. DistributedCacheOptions include

{code}
ADD_TO_CLASSPATH
UNARCHIVE
CREATE_SYMLINK
{code}

The defaults are to not add to classpath, not unarchive, and not create the symlink. (Note that we'd be creating symlinks per-file, instead of globally, which is the only place to set the option currently.)

What I like about this is that it replaces 5 methods (addCacheFile, addCacheArchive, addFileToClassPath, addArchiveToClassPath, createSymlink) with one method, and doesn't lose much in the way of readability. You could also use booleans or enums (boolean add_to_classpath, boolean unarchive, boolean create_symlink), but that is often difficult to read.

On the back-end, you'd need to revisit how the files to be cached are stored. The current scheme of using

{code}
mapred.cache.archives.timestamps
mapred.cache.localFiles
mapred.job.classpath.files
mapred.job.classpath.archives
mapred.cache.archives
mapred.cache.files
mapred.create.symlink
{code}

probably needs to remain for backwards compatibility, but it would be great to just stick that into one configuration property:

bq. mapred.filecache = [ { path: ..., tasks: [ MAP, REDUCE ], ... }, ... ]

or, if it's legal

{code}
mapred.filecache.0 = { path: ..., ... }
mapred.filecache.1 = ...
...
{code}

Thoughts?
Allow segregation of DistributedCache for maps and reduces -- Key: MAPREDUCE-989 URL: https://issues.apache.org/jira/browse/MAPREDUCE-989 Project: Hadoop Map/Reduce Issue Type: Improvement Components: client Reporter: Arun C Murthy Applications might have differing needs for files in the DistributedCache wrt maps and reduces. We should allow them to specify them separately. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
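The single-method proposal in the comment above might look something like the following sketch. It is an assumption-laden illustration rather than a Hadoop API: `FileCacheSpec`, `Entry`, and `entriesFor` are hypothetical names, and the two nested enums are local stand-ins (the task types beyond MAP and REDUCE are guesses).

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Collections;
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

// Hypothetical job-side registry illustrating the one-method proposal.
final class FileCacheSpec {
    // Local stand-ins for illustration; not Hadoop's own types.
    enum TaskType { MAP, REDUCE, JOB_SETUP, JOB_CLEANUP }
    enum DistributedCacheOptions { ADD_TO_CLASSPATH, UNARCHIVE, CREATE_SYMLINK }

    static final class Entry {
        final URI path;
        final Set<TaskType> tasks;
        final Set<DistributedCacheOptions> options;

        Entry(URI path, Set<TaskType> tasks, Set<DistributedCacheOptions> options) {
            this.path = path;
            this.tasks = Collections.unmodifiableSet(copy(tasks, TaskType.class));
            this.options = Collections.unmodifiableSet(copy(options, DistributedCacheOptions.class));
        }
    }

    private final List<Entry> entries = new ArrayList<>();

    // EnumSet.copyOf rejects an empty non-EnumSet collection, so copy via addAll.
    private static <E extends Enum<E>> EnumSet<E> copy(Set<E> source, Class<E> type) {
        EnumSet<E> result = EnumSet.noneOf(type);
        result.addAll(source);
        return result;
    }

    void addCacheFile(URI path, Set<TaskType> tasks, Set<DistributedCacheOptions> options) {
        entries.add(new Entry(path, tasks, options));
    }

    // Defaults from the comment: MAP and REDUCE, with no options set.
    void addCacheFile(URI path) {
        addCacheFile(path, EnumSet.of(TaskType.MAP, TaskType.REDUCE),
                EnumSet.noneOf(DistributedCacheOptions.class));
    }

    // Entries that a task of the given type would need to localize.
    List<Entry> entriesFor(TaskType type) {
        List<Entry> result = new ArrayList<>();
        for (Entry e : entries) {
            if (e.tasks.contains(type)) {
                result.add(e);
            }
        }
        return result;
    }
}
```

A call such as `addCacheFile(uri, EnumSet.of(TaskType.REDUCE), EnumSet.of(DistributedCacheOptions.ADD_TO_CLASSPATH))` reads reasonably well at the call site, which is the readability argument the comment makes against a row of boolean parameters.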
[jira] Commented: (MAPREDUCE-980) Modify JobHistory to use Avro for serialization instead of raw JSON
[ https://issues.apache.org/jira/browse/MAPREDUCE-980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12756752#action_12756752 ] Philip Zeyliger commented on MAPREDUCE-980: --- My experience with generated objects (from a couple of years using protocol buffers) is that one ends up wrapping them often (preferably with composition). The generated class is responsible for serialization and deserialization, and the wrapper class is responsible for added logic. It's hard to make the generator do something reasonable for logic (or even inheritance) cross-language. Having a wrapper also allows you to have two ways to use something, in two different contexts, where you might want different surrounding logic. (So, if you had an Avro schema for an Event, the code that generates the Event might use one wrapper, and the code that consumes it might use the raw object, or have a different object.) Modify JobHistory to use Avro for serialization instead of raw JSON --- Key: MAPREDUCE-980 URL: https://issues.apache.org/jira/browse/MAPREDUCE-980 Project: Hadoop Map/Reduce Issue Type: New Feature Reporter: Jothi Padmanabhan Assignee: Doug Cutting Fix For: 0.21.0 Attachments: MAPREDUCE-980.patch, MAPREDUCE-980.patch, MAPREDUCE-980.patch, MAPREDUCE-980.patch MAPREDUCE-157 modifies JobHistory to log events using Json Format. This can be modified to use Avro instead. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
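The wrap-by-composition pattern described in the comment above can be sketched as below. Both classes are hypothetical: `GeneratedEvent` stands in for whatever a schema compiler would emit (fields only, with serialization omitted), and `TimedEvent` is one possible hand-written wrapper that adds logic without touching the generated type.

```java
// Stand-in for a generated class: data fields only, serialization omitted.
final class GeneratedEvent {
    String jobId;
    long startMillis;
    long finishMillis;
}

// Hand-written wrapper: holds the generated object and adds the logic.
final class TimedEvent {
    private final GeneratedEvent raw;

    TimedEvent(GeneratedEvent raw) {
        this.raw = raw;
    }

    // Derived value that the generator would not know how to produce.
    long durationMillis() {
        return raw.finishMillis - raw.startMillis;
    }

    // Code that only needs to serialize the event can still get the raw object.
    GeneratedEvent unwrap() {
        return raw;
    }
}
```

Because the wrapper owns rather than extends the generated class, a producer and a consumer can each use a different wrapper (or none) around the same serialized form.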
[jira] Commented: (MAPREDUCE-977) Missing jackson jars from Eclipse template
[ https://issues.apache.org/jira/browse/MAPREDUCE-977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12756103#action_12756103 ] Philip Zeyliger commented on MAPREDUCE-977: --- +1. I can confirm that without this patch the Eclipse build is broken, and with it, it's not broken. Missing jackson jars from Eclipse template -- Key: MAPREDUCE-977 URL: https://issues.apache.org/jira/browse/MAPREDUCE-977 Project: Hadoop Map/Reduce Issue Type: Bug Components: build Reporter: Tom White Assignee: Tom White Fix For: 0.21.0 Attachments: MAPREDUCE-977.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (MAPREDUCE-990) Making distributed cache getters in JobContext never return null
[ https://issues.apache.org/jira/browse/MAPREDUCE-990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Philip Zeyliger updated MAPREDUCE-990: -- Attachment: MAPREDUCE-990.patch.txt This changes javadocs and implementations of accessors to never return null. I've also made getFileTimestamps and getArchiveTimestamps package private. Ideally those interfaces aren't leaked to the user at all--the only person who accesses them is the mapreduce framework itself, so they should remain an implementation detail. Making distributed cache getters in JobContext never return null Key: MAPREDUCE-990 URL: https://issues.apache.org/jira/browse/MAPREDUCE-990 Project: Hadoop Map/Reduce Issue Type: Improvement Reporter: Philip Zeyliger Assignee: Philip Zeyliger Priority: Minor Attachments: MAPREDUCE-990.patch.txt MAPREDUCE-898 moved distributed cache setters and getters into Job and JobContext. Since the API is new, I'd like to propose that those getters never return null, but instead always return an array, even if it's empty. If people don't like this change, I can instead merely update the javadoc to reflect the fact that null may be returned. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
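The never-return-null convention proposed above can be illustrated with a small sketch. `CacheView` is a hypothetical class, not the JobContext code from the patch; it only shows the shape of a getter that hands back an empty array when nothing is configured.

```java
import java.net.URI;

// Hypothetical read-only view; the real accessors live in JobContext.
final class CacheView {
    private static final URI[] NO_URIS = new URI[0];

    // May be null internally when no cache files were configured.
    private final URI[] cacheFiles;

    CacheView(URI[] cacheFiles) {
        this.cacheFiles = cacheFiles;
    }

    /** Never returns null; returns an empty array when no files are set. */
    URI[] getCacheFiles() {
        // A zero-length array is immutable in practice, so it can be shared.
        return cacheFiles == null ? NO_URIS : cacheFiles.clone();
    }
}
```

Callers can then loop over the result directly, without a null check in front of every `for`.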
[jira] Updated: (MAPREDUCE-990) Making distributed cache getters in JobContext never return null
[ https://issues.apache.org/jira/browse/MAPREDUCE-990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Philip Zeyliger updated MAPREDUCE-990: -- Status: Patch Available (was: Open) Making distributed cache getters in JobContext never return null Key: MAPREDUCE-990 URL: https://issues.apache.org/jira/browse/MAPREDUCE-990 Project: Hadoop Map/Reduce Issue Type: Improvement Reporter: Philip Zeyliger Assignee: Philip Zeyliger Priority: Minor Attachments: MAPREDUCE-990.patch.txt MAPREDUCE-898 moved distributed cache setters and getters into Job and JobContext. Since the API is new, I'd like to propose that those getters never return null, but instead always return an array, even if it's empty. If people don't like this change, I can instead merely update the javadoc to reflect the fact that null may be returned. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (MAPREDUCE-990) Making distributed cache getters in JobContext never return null
[ https://issues.apache.org/jira/browse/MAPREDUCE-990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12756363#action_12756363 ] Philip Zeyliger commented on MAPREDUCE-990: --- bq. As javadoc says JobContext is a readonly view of the job provided to tasks. get*TimeStamps cannot be package private methods. Framework would call them from a different packge. The framework doesn't currently call them at all. The framework probably has more than the JobContext, so they ought to move somewhere else. I don't think they should be part of the user API--can I delete them entirely until they actually get used? -- Philip Making distributed cache getters in JobContext never return null Key: MAPREDUCE-990 URL: https://issues.apache.org/jira/browse/MAPREDUCE-990 Project: Hadoop Map/Reduce Issue Type: Improvement Reporter: Philip Zeyliger Assignee: Philip Zeyliger Priority: Minor Attachments: MAPREDUCE-990.patch.txt MAPREDUCE-898 moved distributed cache setters and getters into Job and JobContext. Since the API is new, I'd like to propose that those getters never return null, but instead always return an array, even if it's empty. If people don't like this change, I can instead merely update the javadoc to reflect the fact that null may be returned. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Moved: (MAPREDUCE-987) Exposing MiniDFS and MiniMR clusters as a single process command-line
[ https://issues.apache.org/jira/browse/MAPREDUCE-987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Philip Zeyliger moved HDFS-621 to MAPREDUCE-987: Component/s: (was: tools) (was: test) test build Key: MAPREDUCE-987 (was: HDFS-621) Project: Hadoop Map/Reduce (was: Hadoop HDFS) Exposing MiniDFS and MiniMR clusters as a single process command-line - Key: MAPREDUCE-987 URL: https://issues.apache.org/jira/browse/MAPREDUCE-987 Project: Hadoop Map/Reduce Issue Type: New Feature Components: build, test Reporter: Philip Zeyliger Assignee: Philip Zeyliger Priority: Minor Attachments: HDFS-621-0.20-patch, HDFS-621.patch It's hard to test non-Java programs that rely on significant mapreduce functionality. The patch I'm proposing shortly will let you just type bin/hadoop jar hadoop-hdfs-hdfswithmr-test.jar minicluster to start a cluster (internally, it's using Mini{MR,HDFS}Cluster) with a specified number of daemons, etc. A test that checks how some external process interacts with Hadoop might start minicluster as a subprocess, run through its thing, and then simply kill the java subprocess. I've been using just such a system for a couple of weeks, and I like it. It's significantly easier than developing a lot of scripts to start a pseudo-distributed cluster, and then clean up after it. I figure others might find it useful as well. I'm at a bit of a loss as to where to put it in 0.21. hdfs-with-mr tests have all the required libraries, so I've put it there. I could conceivably split this into minimr and minihdfs, but it's specifically the fact that they're configured to talk to each other that I like about having them together. And one JVM is better than two for my test programs. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
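The start-as-a-subprocess-then-kill workflow described above could be wrapped in a small test fixture like the one below. This is a generic sketch: `SubprocessFixture` is a hypothetical helper that knows nothing about Hadoop, and the ten-second grace period is an arbitrary choice. The command to run is supplied by the caller.

```java
import java.io.IOException;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical fixture: starts an external process and stops it on close().
final class SubprocessFixture implements AutoCloseable {
    private final Process process;

    SubprocessFixture(List<String> command) throws IOException {
        // inheritIO() forwards the child's output to the test's own console.
        this.process = new ProcessBuilder(command).inheritIO().start();
    }

    boolean isRunning() {
        return process.isAlive();
    }

    @Override
    public void close() throws InterruptedException {
        process.destroy();                           // ask the child to exit
        if (!process.waitFor(10, TimeUnit.SECONDS)) {
            process.destroyForcibly().waitFor();     // escalate if it lingers
        }
    }
}
```

A test would pass the command line from the issue description (bin/hadoop, jar, hadoop-hdfs-hdfswithmr-test.jar, minicluster) as the list and use try-with-resources, so the cluster process is torn down even when the test fails.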
[jira] Updated: (MAPREDUCE-987) Exposing MiniDFS and MiniMR clusters as a single process command-line
[ https://issues.apache.org/jira/browse/MAPREDUCE-987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Philip Zeyliger updated MAPREDUCE-987: -- Attachment: MAPREDUCE-987.patch Nicholas, Agreed that circular dependencies are to be avoided. I've moved this issue into MAPREDUCE, and spun up a new patch. Do we anticipate a world where MR doesn't depend statically on HDFS (i.e., it only depends on the FileSystem interfaces)? -- Philip Exposing MiniDFS and MiniMR clusters as a single process command-line - Key: MAPREDUCE-987 URL: https://issues.apache.org/jira/browse/MAPREDUCE-987 Project: Hadoop Map/Reduce Issue Type: New Feature Components: build, test Reporter: Philip Zeyliger Assignee: Philip Zeyliger Priority: Minor Attachments: HDFS-621-0.20-patch, HDFS-621.patch, MAPREDUCE-987.patch It's hard to test non-Java programs that rely on significant mapreduce functionality. The patch I'm proposing shortly will let you just type bin/hadoop jar hadoop-hdfs-hdfswithmr-test.jar minicluster to start a cluster (internally, it's using Mini{MR,HDFS}Cluster) with a specified number of daemons, etc. A test that checks how some external process interacts with Hadoop might start minicluster as a subprocess, run through its thing, and then simply kill the java subprocess. I've been using just such a system for a couple of weeks, and I like it. It's significantly easier than developing a lot of scripts to start a pseudo-distributed cluster, and then clean up after it. I figure others might find it useful as well. I'm at a bit of a loss as to where to put it in 0.21. hdfs-with-mr tests have all the required libraries, so I've put it there. I could conceivably split this into minimr and minihdfs, but it's specifically the fact that they're configured to talk to each other that I like about having them together. And one JVM is better than two for my test programs. -- This message is automatically generated by JIRA. 
[jira] Commented: (MAPREDUCE-777) A method for finding and tracking jobs from the new API
[ https://issues.apache.org/jira/browse/MAPREDUCE-777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12754821#action_12754821 ] Philip Zeyliger commented on MAPREDUCE-777: --- I may be crazy to harp on this, but ClientProtocol still reads as very generic to me. Perhaps JobTrackerClientProtocol, to at least indicate one of the components involved? -- Philip A method for finding and tracking jobs from the new API --- Key: MAPREDUCE-777 URL: https://issues.apache.org/jira/browse/MAPREDUCE-777 Project: Hadoop Map/Reduce Issue Type: New Feature Components: client Reporter: Owen O'Malley Assignee: Amareshwari Sriramadasu Fix For: 0.21.0 Attachments: m-777.patch, patch-777-1.txt, patch-777-2.txt, patch-777-3.txt, patch-777-4.txt, patch-777-5.txt, patch-777-6.txt, patch-777-7.txt, patch-777.txt We need to create a replacement interface for the JobClient API in the new interface. In particular, the user needs to be able to query and track jobs that were launched by other processes. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (MAPREDUCE-973) Move FailJob from examples to test
[ https://issues.apache.org/jira/browse/MAPREDUCE-973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12753976#action_12753976 ] Philip Zeyliger commented on MAPREDUCE-973: --- If you haven't done this already, I'm happy to move it to test. Should SleepJob move too? Move FailJob from examples to test Key: MAPREDUCE-973 URL: https://issues.apache.org/jira/browse/MAPREDUCE-973 Project: Hadoop Map/Reduce Issue Type: Bug Components: examples, test Affects Versions: 0.21.0 Reporter: Chris Douglas Fix For: 0.21.0 The FailJob class (MAPREDUCE-567) is more a test utility than an example. It should either move to src/test, ideally with a unit test built around it, or be removed. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (MAPREDUCE-777) A method for finding and tracking jobs from the new API
[ https://issues.apache.org/jira/browse/MAPREDUCE-777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12751289#action_12751289 ]

Philip Zeyliger commented on MAPREDUCE-777:
---

Took a quick pass at your patch. Some comments, mostly documentation-related.

bq. + static Counters downgrade(org.apache.hadoop.mapreduce.Counters counters) {

You might add some JavaDoc for this method. Also, the variables would be clearer if everything were named old_counter and new_counter, since it's hard to keep track of what's what.

bq. ClientProtocol

Are we settled on the name ClientProtocol? It's quite generic sounding, and, without the package, hard to decipher. Since these protocols will be the names of the public-ish wire APIs, perhaps JobClientProtocol would be more descriptive?

bq. +public class CLI extends Configured implements Tool {

Some of Hadoop uses apache.commons.cli to parse command line arguments. (And there's CLI2 too, referred to in Maven, though I don't see any usages of it.) You might consider using a command-line parsing library. You might also consider splitting up the run() method into separate methods (even classes) for each piece of functionality. This will make it much easier to test, and easier to parse, too.

bq. +public interface ClientProtocol extends VersionedProtocol {

In the javadoc here documenting the history of this protocol, you might mention the rename.

bq. Changed protocol to use new api

This is not very descriptive for someone unfamiliar with this ticket.

Cheers,
-- Philip

A method for finding and tracking jobs from the new API
---
Key: MAPREDUCE-777
URL: https://issues.apache.org/jira/browse/MAPREDUCE-777
Project: Hadoop Map/Reduce
Issue Type: New Feature
Components: client
Reporter: Owen O'Malley
Assignee: Amareshwari Sriramadasu
Fix For: 0.21.0
Attachments: m-777.patch, patch-777-1.txt, patch-777-2.txt, patch-777-3.txt, patch-777.txt

We need to create a replacement interface for the JobClient API in the new interface. In particular, the user needs to be able to query and track jobs that were launched by other processes.
[jira] Commented: (MAPREDUCE-777) A method for finding and tracking jobs from the new API
[ https://issues.apache.org/jira/browse/MAPREDUCE-777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12748223#action_12748223 ]

Philip Zeyliger commented on MAPREDUCE-777:
---

Overall, +1 on having this interface!

Some thoughts:
* Can getReasonForBlackList return an enum?
* Is there a reason why getJobs returns Job[] and not Collection<Job>?
* It seems like people may want to push filters down in getJobs.
* Instead of get(Map,Reduce,SetupAndCleanup)TaskReports, should that just be a getTaskReport(TaskType)? The number of task types is likely to increase.

-- Philip

A method for finding and tracking jobs from the new API
---
Key: MAPREDUCE-777
URL: https://issues.apache.org/jira/browse/MAPREDUCE-777
Project: Hadoop Map/Reduce
Issue Type: New Feature
Components: client
Reporter: Owen O'Malley
Assignee: Amareshwari Sriramadasu
Fix For: 0.21.0
Attachments: m-777.patch, patch-777-1.txt, patch-777-2.txt, patch-777.txt

We need to create a replacement interface for the JobClient API in the new interface. In particular, the user needs to be able to query and track jobs that were launched by other processes.
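The last suggestion in the comment above, one accessor keyed by task type instead of one method per type, could be sketched as follows. `TaskReportIndex` and its nested `TaskType` enum are hypothetical stand-ins, and reports are plain strings here purely to keep the example self-contained.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

// Hypothetical holder illustrating a single getTaskReports(TaskType) accessor.
final class TaskReportIndex {
    // Local stand-in; a new task type only needs a new constant here.
    enum TaskType { MAP, REDUCE, JOB_SETUP, JOB_CLEANUP }

    private final Map<TaskType, List<String>> reports = new EnumMap<>(TaskType.class);

    void add(TaskType type, String report) {
        reports.computeIfAbsent(type, t -> new ArrayList<>()).add(report);
    }

    // Returns a copy of the reports for one task type; empty, never null.
    List<String> getTaskReports(TaskType type) {
        return new ArrayList<>(reports.getOrDefault(type, Collections.<String>emptyList()));
    }
}
```

With this shape, adding a task type does not change the interface, which is the concern the comment raises about the per-type methods.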
[jira] Updated: (MAPREDUCE-476) extend DistributedCache to work locally (LocalJobRunner)
[ https://issues.apache.org/jira/browse/MAPREDUCE-476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Philip Zeyliger updated MAPREDUCE-476: -- Status: Patch Available (was: Open) extend DistributedCache to work locally (LocalJobRunner) Key: MAPREDUCE-476 URL: https://issues.apache.org/jira/browse/MAPREDUCE-476 Project: Hadoop Map/Reduce Issue Type: Improvement Reporter: sam rash Assignee: Philip Zeyliger Priority: Minor Attachments: HADOOP-2914-v1-full.patch, HADOOP-2914-v1-since-4041.patch, HADOOP-2914-v2.patch, HADOOP-2914-v3.patch, MAPREDUCE-476-20090814.1.txt, MAPREDUCE-476-20090818.txt, MAPREDUCE-476-v2-vs-v3.patch, MAPREDUCE-476-v2-vs-v3.try2.patch, MAPREDUCE-476-v2-vs-v4.txt, MAPREDUCE-476-v2.patch, MAPREDUCE-476-v3.patch, MAPREDUCE-476-v3.try2.patch, MAPREDUCE-476-v4-requires-MR711.patch, MAPREDUCE-476-v5-requires-MR711.patch, MAPREDUCE-476-v7.patch, MAPREDUCE-476-v8.patch, MAPREDUCE-476-v9.patch, MAPREDUCE-476.patch, v6-to-v7.patch The DistributedCache does not work locally when using the outlined recipe at http://hadoop.apache.org/core/docs/r0.16.0/api/org/apache/hadoop/filecache/DistributedCache.html Ideally, LocalJobRunner would take care of populating the JobConf and copying remote files to the local file sytem (http, assume hdfs = default fs = local fs when doing local development. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (MAPREDUCE-476) extend DistributedCache to work locally (LocalJobRunner)
[ https://issues.apache.org/jira/browse/MAPREDUCE-476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Philip Zeyliger updated MAPREDUCE-476: -- Attachment: MAPREDUCE-476-v9.patch Well-spotted, Tom. I've restored the missing test. extend DistributedCache to work locally (LocalJobRunner) Key: MAPREDUCE-476 URL: https://issues.apache.org/jira/browse/MAPREDUCE-476 Project: Hadoop Map/Reduce Issue Type: Improvement Reporter: sam rash Assignee: Philip Zeyliger Priority: Minor Attachments: HADOOP-2914-v1-full.patch, HADOOP-2914-v1-since-4041.patch, HADOOP-2914-v2.patch, HADOOP-2914-v3.patch, MAPREDUCE-476-20090814.1.txt, MAPREDUCE-476-20090818.txt, MAPREDUCE-476-v2-vs-v3.patch, MAPREDUCE-476-v2-vs-v3.try2.patch, MAPREDUCE-476-v2-vs-v4.txt, MAPREDUCE-476-v2.patch, MAPREDUCE-476-v3.patch, MAPREDUCE-476-v3.try2.patch, MAPREDUCE-476-v4-requires-MR711.patch, MAPREDUCE-476-v5-requires-MR711.patch, MAPREDUCE-476-v7.patch, MAPREDUCE-476-v8.patch, MAPREDUCE-476-v9.patch, MAPREDUCE-476.patch, v6-to-v7.patch The DistributedCache does not work locally when using the outlined recipe at http://hadoop.apache.org/core/docs/r0.16.0/api/org/apache/hadoop/filecache/DistributedCache.html Ideally, LocalJobRunner would take care of populating the JobConf and copying remote files to the local file sytem (http, assume hdfs = default fs = local fs when doing local development. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (MAPREDUCE-476) extend DistributedCache to work locally (LocalJobRunner)
[ https://issues.apache.org/jira/browse/MAPREDUCE-476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12747092#action_12747092 ] Philip Zeyliger commented on MAPREDUCE-476: --- Failing test is org.apache.hadoop.mapred.TestRecoveryManager.testRestartCount. I think that's failing all-over, not just here. -- Philip extend DistributedCache to work locally (LocalJobRunner) Key: MAPREDUCE-476 URL: https://issues.apache.org/jira/browse/MAPREDUCE-476 Project: Hadoop Map/Reduce Issue Type: Improvement Reporter: sam rash Assignee: Philip Zeyliger Priority: Minor Attachments: HADOOP-2914-v1-full.patch, HADOOP-2914-v1-since-4041.patch, HADOOP-2914-v2.patch, HADOOP-2914-v3.patch, MAPREDUCE-476-20090814.1.txt, MAPREDUCE-476-20090818.txt, MAPREDUCE-476-v2-vs-v3.patch, MAPREDUCE-476-v2-vs-v3.try2.patch, MAPREDUCE-476-v2-vs-v4.txt, MAPREDUCE-476-v2.patch, MAPREDUCE-476-v3.patch, MAPREDUCE-476-v3.try2.patch, MAPREDUCE-476-v4-requires-MR711.patch, MAPREDUCE-476-v5-requires-MR711.patch, MAPREDUCE-476-v7.patch, MAPREDUCE-476-v8.patch, MAPREDUCE-476-v9.patch, MAPREDUCE-476.patch, v6-to-v7.patch The DistributedCache does not work locally when using the outlined recipe at http://hadoop.apache.org/core/docs/r0.16.0/api/org/apache/hadoop/filecache/DistributedCache.html Ideally, LocalJobRunner would take care of populating the JobConf and copying remote files to the local file sytem (http, assume hdfs = default fs = local fs when doing local development. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (MAPREDUCE-476) extend DistributedCache to work locally (LocalJobRunner)
[ https://issues.apache.org/jira/browse/MAPREDUCE-476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Philip Zeyliger updated MAPREDUCE-476: -- Status: Open (was: Patch Available) extend DistributedCache to work locally (LocalJobRunner) Key: MAPREDUCE-476 URL: https://issues.apache.org/jira/browse/MAPREDUCE-476 Project: Hadoop Map/Reduce Issue Type: Improvement Reporter: sam rash Assignee: Philip Zeyliger Priority: Minor Attachments: HADOOP-2914-v1-full.patch, HADOOP-2914-v1-since-4041.patch, HADOOP-2914-v2.patch, HADOOP-2914-v3.patch, MAPREDUCE-476-20090814.1.txt, MAPREDUCE-476-20090818.txt, MAPREDUCE-476-v2-vs-v3.patch, MAPREDUCE-476-v2-vs-v3.try2.patch, MAPREDUCE-476-v2-vs-v4.txt, MAPREDUCE-476-v2.patch, MAPREDUCE-476-v3.patch, MAPREDUCE-476-v3.try2.patch, MAPREDUCE-476-v4-requires-MR711.patch, MAPREDUCE-476-v5-requires-MR711.patch, MAPREDUCE-476-v7.patch, MAPREDUCE-476.patch, v6-to-v7.patch The DistributedCache does not work locally when using the outlined recipe at http://hadoop.apache.org/core/docs/r0.16.0/api/org/apache/hadoop/filecache/DistributedCache.html Ideally, LocalJobRunner would take care of populating the JobConf and copying remote files to the local file sytem (http, assume hdfs = default fs = local fs when doing local development. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (MAPREDUCE-476) extend DistributedCache to work locally (LocalJobRunner)
[ https://issues.apache.org/jira/browse/MAPREDUCE-476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Philip Zeyliger updated MAPREDUCE-476: -- Status: Patch Available (was: Open) extend DistributedCache to work locally (LocalJobRunner) Key: MAPREDUCE-476 URL: https://issues.apache.org/jira/browse/MAPREDUCE-476 Project: Hadoop Map/Reduce Issue Type: Improvement Reporter: sam rash Assignee: Philip Zeyliger Priority: Minor Attachments: HADOOP-2914-v1-full.patch, HADOOP-2914-v1-since-4041.patch, HADOOP-2914-v2.patch, HADOOP-2914-v3.patch, MAPREDUCE-476-20090814.1.txt, MAPREDUCE-476-20090818.txt, MAPREDUCE-476-v2-vs-v3.patch, MAPREDUCE-476-v2-vs-v3.try2.patch, MAPREDUCE-476-v2-vs-v4.txt, MAPREDUCE-476-v2.patch, MAPREDUCE-476-v3.patch, MAPREDUCE-476-v3.try2.patch, MAPREDUCE-476-v4-requires-MR711.patch, MAPREDUCE-476-v5-requires-MR711.patch, MAPREDUCE-476-v7.patch, MAPREDUCE-476-v8.patch, MAPREDUCE-476.patch, v6-to-v7.patch The DistributedCache does not work locally when using the outlined recipe at http://hadoop.apache.org/core/docs/r0.16.0/api/org/apache/hadoop/filecache/DistributedCache.html Ideally, LocalJobRunner would take care of populating the JobConf and copying remote files to the local file sytem (http, assume hdfs = default fs = local fs when doing local development. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (MAPREDUCE-476) extend DistributedCache to work locally (LocalJobRunner)
[ https://issues.apache.org/jira/browse/MAPREDUCE-476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Philip Zeyliger updated MAPREDUCE-476: -- Attachment: MAPREDUCE-476-v8.patch
The test failure was spurious (org.apache.hadoop.mapred.TestRecoveryManager.testRestartCount is failing elsewhere as well, and has nothing to do with this patch). The FindBugs warnings, however, were reasonable. The new patch fixes the three that were reported, which are pasted below.
{quote}
Bad practice Warnings
Code Warning
DP org.apache.hadoop.mapreduce.filecache.TaskDistributedCacheManager.makeClassLoader(ClassLoader) creates a java.net.URLClassLoader classloader, which should be performed within a doPrivileged block
Bug type DP_CREATE_CLASSLOADER_INSIDE_DO_PRIVILEGED
In class org.apache.hadoop.mapreduce.filecache.TaskDistributedCacheManager
In method org.apache.hadoop.mapreduce.filecache.TaskDistributedCacheManager.makeClassLoader(ClassLoader)
In class java.net.URLClassLoader
At TaskDistributedCacheManager.java:[line 235]
RV org.apache.hadoop.mapred.TaskRunner.setupWorkDir(JobConf, File) ignores exceptional return value of java.io.File.mkdir()
Bug type RV_RETURN_VALUE_IGNORED_BAD_PRACTICE
In class org.apache.hadoop.mapred.TaskRunner
In method org.apache.hadoop.mapred.TaskRunner.setupWorkDir(JobConf, File)
Called method java.io.File.mkdir()
At TaskRunner.java:[line 630]
Performance Warnings
Code Warning
UrF Unread field: org.apache.hadoop.mapreduce.filecache.TaskDistributedCacheManager$CacheFile.localClassPath
{quote}
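For readers unfamiliar with the first two FindBugs warnings quoted above, the following is a minimal sketch of the usual remedies. It is not the code from MAPREDUCE-476-v8.patch; the class name FindBugsFixSketch and the helper ensureDir are hypothetical names chosen for illustration, and only standard JDK calls are used.

```java
import java.io.File;
import java.io.IOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.security.AccessController;
import java.security.PrivilegedAction;

// Hypothetical illustration; not taken from the Hadoop source tree.
class FindBugsFixSketch {

  /** DP warning: create the class loader inside a doPrivileged block. */
  @SuppressWarnings("removal") // AccessController is deprecated on recent JDKs
  static ClassLoader makeClassLoader(final URL[] urls, final ClassLoader parent) {
    return AccessController.doPrivileged(new PrivilegedAction<ClassLoader>() {
      @Override
      public ClassLoader run() {
        return new URLClassLoader(urls, parent);
      }
    });
  }

  /** RV warning: check the return value of mkdir() instead of ignoring it. */
  static void ensureDir(File dir) throws IOException {
    // mkdir() returns false both on failure and when the directory already
    // exists, so an existing directory is treated as success here.
    if (!dir.mkdir() && !dir.isDirectory()) {
      throw new IOException("Could not create directory: " + dir);
    }
  }
}
```

The third warning (an unread field) is normally resolved by either using or deleting the field, so it needs no example.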
[jira] Updated: (MAPREDUCE-476) extend DistributedCache to work locally (LocalJobRunner)
[ https://issues.apache.org/jira/browse/MAPREDUCE-476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Philip Zeyliger updated MAPREDUCE-476: -- Attachment: v6-to-v7.patch
[jira] Updated: (MAPREDUCE-476) extend DistributedCache to work locally (LocalJobRunner)
[ https://issues.apache.org/jira/browse/MAPREDUCE-476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Philip Zeyliger updated MAPREDUCE-476: -- Attachment: MAPREDUCE-476-v7.patch
Fixing the test failures, or so I hope. The workDir handling in LocalJobRunner was confused, so I've fixed that. In general, the symlinking is a bit of a complex mess: you can symlink individual entries with the foo#bar syntax, and, if you set the config variable, everything gets symlinked into the local working dir. The latter is used by streaming. It seems like that layer should be handled by streaming on its own, but, alas, it may be too late to change that. Note that I've attached v6-to-v7.patch to make it easier to see what the latest changes were. -- Philip
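To make the foo#bar convention mentioned above concrete, here is a small sketch of how a link name can be read from a cache URI using only java.net.URI. It is an illustration, not the DistributedCache implementation: the class CacheLinkNameSketch, the method linkName, the example host name, and the fall-back to the file name when no fragment is given are all assumptions.

```java
import java.net.URI;

// Hypothetical illustration; not taken from the Hadoop source tree.
class CacheLinkNameSketch {

  /**
   * Returns the name a symlink would get for the given cache URI:
   * the URI fragment (the part after '#') if present, otherwise
   * the last path component (an assumption made for this sketch).
   */
  static String linkName(String cacheUri) {
    URI uri = URI.create(cacheUri);
    String fragment = uri.getFragment();
    if (fragment != null && !fragment.isEmpty()) {
      return fragment;
    }
    String path = uri.getPath();
    return path.substring(path.lastIndexOf('/') + 1);
  }
}
```

For example, linkName("hdfs://namenode:8020/libs/foo.jar#bar") yields "bar", while the same URI without the fragment yields "foo.jar".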
[jira] Updated: (MAPREDUCE-476) extend DistributedCache to work locally (LocalJobRunner)
[ https://issues.apache.org/jira/browse/MAPREDUCE-476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Philip Zeyliger updated MAPREDUCE-476: -- Status: Open (was: Patch Available)
[jira] Created: (MAPREDUCE-903) Adding AVRO jar to eclipse classpath
Adding AVRO jar to eclipse classpath Key: MAPREDUCE-903 URL: https://issues.apache.org/jira/browse/MAPREDUCE-903 Project: Hadoop Map/Reduce Issue Type: Improvement Reporter: Philip Zeyliger Avro is missing from the eclipse classpath, which caused Eclipse to whine. Easy fix. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (MAPREDUCE-903) Adding AVRO jar to eclipse classpath
[ https://issues.apache.org/jira/browse/MAPREDUCE-903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Philip Zeyliger updated MAPREDUCE-903: -- Attachment: MAPREDUCE-903.patch This does not require a test---it's a configuration change for Eclipse. It's a one-line diff.
[jira] Updated: (MAPREDUCE-903) Adding AVRO jar to eclipse classpath
[ https://issues.apache.org/jira/browse/MAPREDUCE-903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Philip Zeyliger updated MAPREDUCE-903: -- Status: Patch Available (was: Open)
[jira] Updated: (MAPREDUCE-905) Add Eclipse launch tasks for MapReduce
[ https://issues.apache.org/jira/browse/MAPREDUCE-905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Philip Zeyliger updated MAPREDUCE-905: -- Status: Patch Available (was: Open) Add Eclipse launch tasks for MapReduce -- Key: MAPREDUCE-905 URL: https://issues.apache.org/jira/browse/MAPREDUCE-905 Project: Hadoop Map/Reduce Issue Type: Improvement Environment: Eclipse 3.5 Reporter: Philip Zeyliger Priority: Minor Attachments: MAPREDUCE-905.patch This is a revival of HADOOP-5911, but only for the MR project. Eclipse has a notion of run configuration, which encapsulates what's needed to run or debug an application. I use this quite a bit to start various Hadoop daemons in debug mode, with breakpoints set, to inspect state and what not. This is simply configuration, so no tests are provided. After running ant eclipse-files and refreshing your project, you should see entries in the Run pulldown. There's a template for testing a specific test, and also templates to run all the tests, the job tracker, and a task tracker. It's likely that some parameters need to be further tweaked to have the same behavior as ant test, but for most tests, this works. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (MAPREDUCE-476) extend DistributedCache to work locally (LocalJobRunner)
[ https://issues.apache.org/jira/browse/MAPREDUCE-476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12743257#action_12743257 ] Philip Zeyliger commented on MAPREDUCE-476: ---
Vinod, Thanks for updating the patch! Do you have an update to MAPREDUCE-711 that has the package move? I am trying to apply MAPREDUCE-711-20090709-mapreduce.1.txt and MAPREDUCE-476-20090814.1.txt to trunk, and I think there's a mismatch in the new filecache package name. -- Philip
[jira] Commented: (MAPREDUCE-157) Job History log file format is not friendly for external tools.
[ https://issues.apache.org/jira/browse/MAPREDUCE-157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12742714#action_12742714 ] Philip Zeyliger commented on MAPREDUCE-157: ---
Avro would force you into a schema, and I think having a schema is the only way to get stability in the format. Yes, there's probably overhead, but if we're using Avro for other things (i.e., all RPCs), we may as well fix those overheads when we get to them. (It may also be a net win to store the data in binary Avro format, and write an avrocat to deserialize into text before pushing to tools like awk, but I do understand the desire for a text format.) All that said, you have specific needs in mind here, and I'm mostly waxing poetical, so I'll certainly defer. -- Philip
Job History log file format is not friendly for external tools. --- Key: MAPREDUCE-157 URL: https://issues.apache.org/jira/browse/MAPREDUCE-157 Project: Hadoop Map/Reduce Issue Type: Bug Reporter: Owen O'Malley Assignee: Jothi Padmanabhan
Currently, parsing the job history logs with external tools is very difficult because of the format. The most critical problem is that newlines aren't escaped in the strings. That makes using tools like grep, sed, and awk very tricky.
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
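The core complaint in the issue description, unescaped newlines breaking line-oriented tools, can be illustrated with a short sketch. This is not the job history format or the fix that was eventually adopted; HistoryLineEscapeSketch and its backslash-escaping scheme are assumptions chosen only to show why escaping keeps one record on one line.

```java
// Hypothetical illustration; not taken from the Hadoop source tree.
class HistoryLineEscapeSketch {

  /** Escapes backslashes, newlines, and carriage returns so a value fits on one line. */
  static String escape(String value) {
    StringBuilder sb = new StringBuilder(value.length());
    for (char c : value.toCharArray()) {
      switch (c) {
        case '\\': sb.append("\\\\"); break;
        case '\n': sb.append("\\n"); break;
        case '\r': sb.append("\\r"); break;
        default: sb.append(c);
      }
    }
    return sb.toString();
  }

  /** Reverses escape(). */
  static String unescape(String escaped) {
    StringBuilder sb = new StringBuilder(escaped.length());
    for (int i = 0; i < escaped.length(); i++) {
      char c = escaped.charAt(i);
      if (c == '\\' && i + 1 < escaped.length()) {
        char next = escaped.charAt(++i);
        switch (next) {
          case 'n': sb.append('\n'); break;
          case 'r': sb.append('\r'); break;
          default: sb.append(next); // covers the escaped backslash
        }
      } else {
        sb.append(c);
      }
    }
    return sb.toString();
  }
}
```

With values escaped this way, each record occupies exactly one line, which is the property grep, sed, and awk rely on. A schema-based format such as the Avro JSON encoding discussed in the comments would provide the same guarantee along with typed fields.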
[jira] Commented: (MAPREDUCE-157) Job History log file format is not friendly for external tools.
[ https://issues.apache.org/jira/browse/MAPREDUCE-157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12741630#action_12741630 ] Philip Zeyliger commented on MAPREDUCE-157: ---
Would this be a good place to try the Avro serialization format in Hadoop proper? If text formatting is desired, AVRO-50 has a text format for Avro, which is JSON already. So you'd basically be implementing the same thing, but with the extra context of an Avro schema. -- Philip
[jira] Commented: (MAPREDUCE-157) Job History log file format is not friendly for external tools.
[ https://issues.apache.org/jira/browse/MAPREDUCE-157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12741634#action_12741634 ] Philip Zeyliger commented on MAPREDUCE-157: --- done On Mon, Aug 10, 2009 at 1:08 AM, Jothi Padmanabhan (JIRA)
[jira] Commented: (MAPREDUCE-157) Job History log file format is not friendly for external tools.
[ https://issues.apache.org/jira/browse/MAPREDUCE-157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12741641#action_12741641 ] Philip Zeyliger commented on MAPREDUCE-157: --- Ack, ignore that "done". I was in the wrong browser tab.
[jira] Commented: (MAPREDUCE-476) extend DistributedCache to work locally (LocalJobRunner)
[ https://issues.apache.org/jira/browse/MAPREDUCE-476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12739033#action_12739033 ] Philip Zeyliger commented on MAPREDUCE-476: ---
bq. I agree. A few of them are used to manage the Configuration object. (In my mind, we're serializing and de-serializing a set of requirements for the distributed cache into the text configuration, and doing so a bit haphazardly.) I was very tempted to remove all the ones that are only meant to be internal, but Tom advised me that I need to keep them deprecated for a version. Again, I think moving those methods into a more private place is a good task to do along with changing how JobClient calls into this stuff.
bq. +1. So are you planning to do in the next version or in this patch itself?
Next version.
[jira] Updated: (MAPREDUCE-476) extend DistributedCache to work locally (LocalJobRunner)
[ https://issues.apache.org/jira/browse/MAPREDUCE-476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Philip Zeyliger updated MAPREDUCE-476: -- Attachment: MAPREDUCE-476-v5-requires-MR711.patch Latest patch. Thanks Vinod for the review!
[jira] Commented: (MAPREDUCE-476) extend DistributedCache to work locally (LocalJobRunner)
[ https://issues.apache.org/jira/browse/MAPREDUCE-476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12737621#action_12737621 ] Philip Zeyliger commented on MAPREDUCE-476: ---
Hi Vinod,
Thanks for the ping; I got distracted by other things. And thanks again for the detailed review. My responses are below. I've generated a patch that shows the differences between v2 and v4, and also the full patch, in a state where it still depends on MAPREDUCE-711. Is there anything blocking MAPREDUCE-711 from being committed?
Also, sorry about the multiple uploads here. I had a very clever bug in there (caused by not thinking enough while resolving a merge conflict) that deleted the current working directory, recursively, in one of the tests. (TaskRunner is hard-coded to delete the current working directory, which is OK, since it's typically a child process; it is not OK for LocalJobRunner.)
I've run the relevant tests; the full tests take a while, so I'm running those in the background.
{quote}
$ for i in TestMRWithDistributedCache TestMiniMRLocalFS TestMiniMRDFSCaching TestTrackerDistributedCacheManager; do ant test -Dtestcase=$i > test-out-$i && echo $i good || echo $i bad; done
TestMRWithDistributedCache good
TestMiniMRLocalFS good
TestMiniMRDFSCaching good
TestTrackerDistributedCacheManager good
{quote}
bq. There is quite a bit of refactoring in this patch, though I find it really useful.
Yep. Having DistributedCache work locally is easy if you refactor the code a bit, so that's how I went at it.
bq. Please make sure that in the newly added code, lines aren't longer than 80 characters. For e.g, see DistributedCacheManager.newTaskHandle() method.
A handful of runs of git diff foo..bar | egrep '^\+\+\+|^\+ .{80}' has done the trick, I think. The tricky bit is always fixing only the lines I've changed, and not all the lines in a given file, to preserve history and keep reviewing sane.
bq. Just a thought, can the classes be better renamed to reflect their usage, something like TrackerDistributedCacheManager and TaskDistributedCacheManager?
I like those names better; thanks. Changed.
bq. DistributedCacheManager and DistributedCacheHandle: Explicitly state in javadoc that it is not a public interface
Done.
bq. This class should also have the variable number argument getLocalCache() methods so that the corresponding methods in DistributedCache can be deprecated. Also, each method in DistributedCache should call the corresponding method in DistributedCacheManager class.
I don't think I agree here. We can deprecate the getLocalCache methods in DistributedCache right away. They delegate to each other, and one of them delegates to TrackerDistributedCacheManager. Ideally, I'd remove these altogether: Hadoop internally does not use these methods with this patch, and there's no sensible reason why someone else would, but since they're public, they're getting deprecated. They are not being deprecated with a pointer to use something else; they're being deprecated so that you don't use them at all.
bq. DistributedCacheHandle CacheFile.makeCacheFiles()
bq. isClassPath can be renamed to shouldBePutInClasspath
Renamed to shouldBeAddedToClasspath.
bq. paths can be renamed to pathsToPutInClasspath.
Renamed to pathsToBeAddedToClasspath.
bq. Use .equals method at +150 if (cacheFile.type == CacheFile.FileType.ARCHIVE)
I believe that technically it doesn't matter. The JDK implementation of equals() on java.lang.Enum is final, and hardcoded to this==other. This is the only thing that makes sense, since there's only ever one instance of a given enum constant. I took an inaccurate look at the code base, and == is the more common option.
{quote}
# Inaccurate! Not a static analysis! Not even close! ;)
[1]doorstop:hadoop-mapreduce(140142)$ ack '\([a-zA-Z]*\.[a-zA-Z]*\.equals' src | wc -l
11
[0]doorstop:hadoop-mapreduce(140143)$ ack '\([a-zA-Z]*\.[a-zA-Z]* ==' src | wc -l
127
{quote}
If you feel strongly about this, I'm happy to change it, but I think == is more consistent.
bq. makeCacheFiles: boolean isArchive - FileType fileType
Done.
bq. I think it would be cleaner to return target instead of passing it as an argument.
Done.
bq. makeCacheFiles() method should be documented
Done.
bq. setup() This method is really useful, avoids a lot of code duplication!
Ok.
bq. Leave localTaskFile writing business back in TaskRunner itself. I think it is the task's responsibility, not the DistributedCacheHandle's
Good call; done.
bq. cacheSubdir can better be an argument to setup() method instead of passing it to the constructor.
Good idea; done.
bq. getClassPaths() : Document that it has to be called and useful only when is already invoked.
Done. I've made it throw an exception if it's called erroneously, since I could see that causing trouble for developers.
bq. TaskTracker.initialize() A new DistributedCacheManager is created every
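On the == versus .equals() question for enum constants raised in the comment above, the following sketch shows the behavior being described. The enum FileType here is a stand-in declared for illustration, not the CacheFile.FileType from the patch, and EnumEqualitySketch is a hypothetical class name.

```java
// Hypothetical illustration; FileType is a stand-in, not the enum from the patch.
class EnumEqualitySketch {

  enum FileType { REGULAR, ARCHIVE }

  /** Identity comparison; also safe when type is null. */
  static boolean isArchiveByIdentity(FileType type) {
    return type == FileType.ARCHIVE;
  }

  /** Equivalent for non-null input, because Enum.equals() is final and compares identity. */
  static boolean isArchiveByEquals(FileType type) {
    return FileType.ARCHIVE.equals(type);
  }
}
```

Both methods agree for every input, since each enum constant exists exactly once per class loader. The only practical difference is that calling type.equals(...) on a null reference throws a NullPointerException, whereas == does not.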
[jira] Updated: (MAPREDUCE-476) extend DistributedCache to work locally (LocalJobRunner)
[ https://issues.apache.org/jira/browse/MAPREDUCE-476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Philip Zeyliger updated MAPREDUCE-476: -- Attachment: MAPREDUCE-476-v4-requires-MR711.patch MAPREDUCE-476-v2-vs-v4.txt
[jira] Commented: (MAPREDUCE-476) extend DistributedCache to work locally (LocalJobRunner)
[ https://issues.apache.org/jira/browse/MAPREDUCE-476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12737385#action_12737385 ] Philip Zeyliger commented on MAPREDUCE-476: ---
Vinod, Yes. I've been hacking away at it today. Please ignore those last two updated diffs: while getting rid of some 80+ character lines, I fumbled some git stuff and produced bad patches. I'll be producing good ones after some more sanity checking either late today or tomorrow morning. -- Philip
[jira] Commented: (MAPREDUCE-476) extend DistributedCache to work locally (LocalJobRunner)
[ https://issues.apache.org/jira/browse/MAPREDUCE-476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12737391#action_12737391 ] Philip Zeyliger commented on MAPREDUCE-476: --- Never mind; I was trying to rush before leaving the office, and the tests fail here. Back tomorrow.
[jira] Commented: (MAPREDUCE-476) extend DistributedCache to work locally (LocalJobRunner)
[ https://issues.apache.org/jira/browse/MAPREDUCE-476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12732547#action_12732547 ] Philip Zeyliger commented on MAPREDUCE-476: --- Vinod, Thanks for your comments and thorough review. I'll take a closer look over the next couple of days and post a new patch.
[jira] Commented: (MAPREDUCE-711) Move Distributed Cache from Common to Map/Reduce
[ https://issues.apache.org/jira/browse/MAPREDUCE-711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12729692#action_12729692 ] Philip Zeyliger commented on MAPREDUCE-711: ---
Cool; I'll produce a new patch once you upload a new one here. Do consider changing the package name from filecache to distributedcache, since two names are more confusing than one. I think people who depended on the one-jar-to-rule-them-all (the pre-split world) will assume that they must depend on all three split jars if they don't want to worry about what ended up where. So I'm not sure you're breaking code by moving it into another jar any more than the project split already has. -- Philip
Move Distributed Cache from Common to Map/Reduce Key: MAPREDUCE-711 URL: https://issues.apache.org/jira/browse/MAPREDUCE-711 Project: Hadoop Map/Reduce Issue Type: Improvement Reporter: Owen O'Malley Assignee: Vinod K V Attachments: MAPREDUCE-711-20090709-common.txt, MAPREDUCE-711-20090709-mapreduce.1.txt, MAPREDUCE-711-20090709-mapreduce.txt, MAPREDUCE-711-20090710.txt
Distributed Cache logically belongs as part of map/reduce and not Common.
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (MAPREDUCE-476) extend DistributedCache to work locally (LocalJobRunner)
[ https://issues.apache.org/jira/browse/MAPREDUCE-476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Philip Zeyliger updated MAPREDUCE-476:
--
Attachment: MAPREDUCE-476-v2.patch

In light of MAPREDUCE-711, I generated a new patch. I applied MAPREDUCE-711-20090709-mapreduce.1.txt first, so this shouldn't be submitted to Hudson until after that gets checked in.

I generated the patch by applying

bq. cat HADOOP-2914-v3.patch | sed -e 's%src/core/%src/java/%g' | sed -e 's%src/mapred/%src/java/%g' | sed -e 's%src/test/core%src/test/mapred%g' | patch -p0

I had to clean up DistributedCache.java a tiny bit (there were 2 rejects) because some Javadoc links were removed in the project move; I've reinstated them. (I think they were removed because they pointed to MR from Common, but that's no longer an issue with MAPREDUCE-711.)
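The path rewrite in the sed pipeline quoted in this message can be tried on its own before anything is piped into patch. The following is only a minimal sketch: the sample diff header lines and the helper name rewrite_paths are hypothetical and are not taken from HADOOP-2914-v3.patch. It folds the three sed invocations into one and leaves out the final `patch -p0`, so the rewritten paths are printed instead of applied.

```shell
# Hypothetical diff header lines in the pre-split layout (illustration only;
# these are not lines from the real HADOOP-2914-v3.patch).
sample_patch='--- src/core/org/example/Foo.java
--- src/mapred/org/example/Bar.java
--- src/test/core/org/example/FooTest.java'

# The same three substitutions as in the message, combined into a single sed
# call. `%` is used as the delimiter so the slashes in the paths need no escaping.
rewrite_paths() {
  sed -e 's%src/core/%src/java/%g' \
      -e 's%src/mapred/%src/java/%g' \
      -e 's%src/test/core%src/test/mapred%g'
}

# Print the rewritten headers instead of feeding them to `patch -p0`.
rewritten=$(printf '%s\n' "$sample_patch" | rewrite_paths)
printf '%s\n' "$rewritten"
```

Both src/core/ and src/mapred/ map to src/java/, while src/test/core maps to src/test/mapred; the first expression does not touch src/test/core/ because that path does not contain the substring src/core/.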
[jira] Updated: (MAPREDUCE-476) extend DistributedCache to work locally (LocalJobRunner)
[ https://issues.apache.org/jira/browse/MAPREDUCE-476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Philip Zeyliger updated MAPREDUCE-476:
--
Attachment: MAPREDUCE-476.patch

Regenerated patch after project split. I used:

bq. cat /Users/philip/Downloads/HADOOP-2914-v3.patch | sed -e 's,src/mapred/,src/java/,' | patch -p0