[jira] [Commented] (MAPREDUCE-5912) Task.calculateOutputSize does not handle Windows files after MAPREDUCE-5196
[ https://issues.apache.org/jira/browse/MAPREDUCE-5912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14018694#comment-14018694 ]

Remus Rusanu commented on MAPREDUCE-5912:
-----------------------------------------

I also posted a patch that solves HADOOP-10663. If that one is accepted, this issue becomes obsolete.

Task.calculateOutputSize does not handle Windows files after MAPREDUCE-5196
----------------------------------------------------------------------------

                 Key: MAPREDUCE-5912
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5912
             Project: Hadoop Map/Reduce
          Issue Type: Bug
    Affects Versions: 3.0.0
            Reporter: Remus Rusanu
            Assignee: Remus Rusanu
             Fix For: 3.0.0
         Attachments: MAPREDUCE-5912.1.patch

The following change from MAPREDUCE-5196:

{code}
@@ -1098,8 +1120,8 @@ private long calculateOutputSize() throws IOException {
     if (isMapTask() && conf.getNumReduceTasks() > 0) {
       try {
         Path mapOutput = mapOutputFile.getOutputFile();
-        FileSystem localFS = FileSystem.getLocal(conf);
-        return localFS.getFileStatus(mapOutput).getLen();
+        FileSystem fs = mapOutput.getFileSystem(conf);
+        return fs.getFileStatus(mapOutput).getLen();
       } catch (IOException e) {
         LOG.warn("Could not find output size ", e);
       }
{code}

causes Windows local output files to be routed through HDFS:

{code}
2014-06-02 00:14:53,891 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : java.lang.IllegalArgumentException: Pathname /c:/Hadoop/Data/Hadoop/local/usercache/HadoopUser/appcache/application_1401693085139_0001/output/attempt_1401693085139_0001_m_00_0/file.out from c:/Hadoop/Data/Hadoop/local/usercache/HadoopUser/appcache/application_1401693085139_0001/output/attempt_1401693085139_0001_m_00_0/file.out is not a valid DFS filename.
	at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:187)
	at org.apache.hadoop.hdfs.DistributedFileSystem.access$000(DistributedFileSystem.java:101)
	at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1024)
	at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1020)
	at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
	at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1020)
	at org.apache.hadoop.mapred.Task.calculateOutputSize(Task.java:1124)
	at org.apache.hadoop.mapred.Task.sendLastUpdate(Task.java:1102)
	at org.apache.hadoop.mapred.Task.done(Task.java:1048)
{code}
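For context, the failure occurs because the map output path carries no scheme, so Path.getFileSystem(conf) resolves it against fs.defaultFS (HDFS on a cluster) instead of the local file system, whereas FileSystem.getLocal(conf) always returns the local FS. The snippet below is a minimal, illustrative sketch of that resolution difference; it is not part of the attached patch, and the path it uses is hypothetical.

{code}
// Illustrative only: shows how an unqualified Path resolves to the default
// file system, while FileSystem.getLocal() always yields the local one.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class OutputSizeProbe {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // On a cluster fs.defaultFS is typically hdfs://..., so an unqualified
    // local path like /c:/Hadoop/... is handed to DistributedFileSystem
    // and rejected, as in the stack trace above.
    Path mapOutput = new Path("/tmp/file.out");          // hypothetical path
    FileSystem resolved = mapOutput.getFileSystem(conf); // default FS unless the path has a scheme
    FileSystem local = FileSystem.getLocal(conf);        // always the local FS
    System.out.println("resolved: " + resolved.getUri());
    System.out.println("local:    " + local.getUri());
  }
}
{code}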
[jira] [Created] (MAPREDUCE-5914) Writables are not configured by framework
Abraham Elmahrek created MAPREDUCE-5914:
-------------------------------------------

             Summary: Writables are not configured by framework
                 Key: MAPREDUCE-5914
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5914
             Project: Hadoop Map/Reduce
          Issue Type: Bug
            Reporter: Abraham Elmahrek

Seeing the following exception:

{noformat}
java.lang.Exception: java.lang.NullPointerException
	at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:403)
Caused by: java.lang.NullPointerException
	at org.apache.sqoop.job.io.SqoopWritable.readFields(SqoopWritable.java:59)
	at org.apache.hadoop.io.WritableComparator.compare(WritableComparator.java:129)
	at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.compare(MapTask.java:1248)
	at org.apache.hadoop.util.QuickSort.fix(QuickSort.java:35)
	at org.apache.hadoop.util.QuickSort.sortInternal(QuickSort.java:87)
	at org.apache.hadoop.util.QuickSort.sort(QuickSort.java:63)
	at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1582)
	at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1467)
	at org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:699)
	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:769)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:339)
	at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:235)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
	at java.util.concurrent.FutureTask.run(FutureTask.java:262)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:744)
{noformat}

It turns out that WritableComparator does not configure the Writable objects it deserializes: https://github.com/apache/hadoop-common/blob/branch-2.3.0/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/WritableComparator.java. This happens during the sort phase of an MR job.
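One possible workaround, until the framework configures Writables itself, is to register a sort comparator that is itself Configurable and configures the key instances it deserializes. The sketch below is illustrative only, not the committed fix; MyConfiguredKey is a hypothetical stand-in for a Configurable key such as SqoopWritable.

{code}
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.conf.Configurable;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.DataInputBuffer;
import org.apache.hadoop.io.RawComparator;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.util.ReflectionUtils;

// Registered via job.setSortComparatorClass(ConfiguredKeyComparator.class);
// the framework instantiates it through ReflectionUtils, which calls setConf().
public class ConfiguredKeyComparator
    implements RawComparator<ConfiguredKeyComparator.MyConfiguredKey>, Configurable {

  private Configuration conf;
  private MyConfiguredKey left;
  private MyConfiguredKey right;
  private final DataInputBuffer buf1 = new DataInputBuffer();
  private final DataInputBuffer buf2 = new DataInputBuffer();

  @Override
  public void setConf(Configuration conf) {
    this.conf = conf;
    // Build configured key instances here, unlike WritableComparator in
    // branch-2.3.0, which creates its keys with a null Configuration.
    left = ReflectionUtils.newInstance(MyConfiguredKey.class, conf);
    right = ReflectionUtils.newInstance(MyConfiguredKey.class, conf);
  }

  @Override
  public Configuration getConf() { return conf; }

  @Override
  public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
    try {
      buf1.reset(b1, s1, l1);
      left.readFields(buf1);
      buf2.reset(b2, s2, l2);
      right.readFields(buf2);
      return compare(left, right);
    } catch (IOException e) {
      throw new RuntimeException("error deserializing keys for comparison", e);
    }
  }

  @Override
  public int compare(MyConfiguredKey a, MyConfiguredKey b) {
    return a.compareTo(b);
  }

  /** Hypothetical key whose readFields() needs a Configuration, like SqoopWritable. */
  public static class MyConfiguredKey
      implements WritableComparable<MyConfiguredKey>, Configurable {
    private Configuration conf;
    private final Text payload = new Text();

    @Override public void setConf(Configuration conf) { this.conf = conf; }
    @Override public Configuration getConf() { return conf; }
    @Override public void write(DataOutput out) throws IOException { payload.write(out); }

    @Override public void readFields(DataInput in) throws IOException {
      if (conf == null) {
        // Mirrors the reported NPE when nothing configured the key.
        throw new NullPointerException("key was never configured");
      }
      payload.readFields(in);
    }

    @Override public int compareTo(MyConfiguredKey other) {
      return payload.compareTo(other.payload);
    }
  }
}
{code}

With a comparator like this registered through job.setSortComparatorClass(), the sort phase deserializes keys through configured instances; the longer-term direction this JIRA points at is for the framework itself to pass the job Configuration through to the keys it creates.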
[jira] [Updated] (MAPREDUCE-1362) Pipes should be ported to the new mapreduce API
[ https://issues.apache.org/jira/browse/MAPREDUCE-1362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joe Mudd updated MAPREDUCE-1362:
--------------------------------
    Attachment: MAPREDUCE-1362.patch

Resync'd trunk patch.

Pipes should be ported to the new mapreduce API
-----------------------------------------------

                 Key: MAPREDUCE-1362
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1362
             Project: Hadoop Map/Reduce
          Issue Type: Improvement
          Components: pipes
            Reporter: Bassam Tabbara
         Attachments: MAPREDUCE-1362-trunk.patch, MAPREDUCE-1362.patch, MAPREDUCE-1362.patch

Pipes still uses the old mapred API. This prevents us from using Pipes with HBase's TableInputFormat, HRegionPartitioner, etc. Here is a rough proposal for how to accomplish this:

* Add a new package org.apache.hadoop.mapreduce.pipes that uses the new mapreduce API.
* The new pipes package will run side by side with the old one; the old one should get deprecated at some point.
* The wire protocol used between PipesMapper and PipesReducer and C++ programs must not change.
* bin/hadoop should support both pipes (old API) and pipes2 (new API).

Does this sound reasonable?
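For readers who have not compared the two APIs, the gap the port has to bridge looks roughly like the sketch below: the old org.apache.hadoop.mapred API hands mappers an OutputCollector and a Reporter, while the new org.apache.hadoop.mapreduce API routes both output and progress through a single Context, which is what wrappers in the proposed org.apache.hadoop.mapreduce.pipes package would have to expose to the unchanged C++ side. The class shown is purely illustrative and is not taken from the attached patches.

{code}
// Illustrative only: the shape of a mapper written against the new API.
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class NewApiIdentityMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // In the old API this would be output.collect(key, value) plus a separate
    // Reporter; in the new API both output and progress go through the context.
    context.write(key, value);
  }
}
{code}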
[jira] [Commented] (MAPREDUCE-1362) Pipes should be ported to the new mapreduce API
[ https://issues.apache.org/jira/browse/MAPREDUCE-1362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14019168#comment-14019168 ]

Joe Mudd commented on MAPREDUCE-1362:
-------------------------------------

I've rebuilt the patch against the latest trunk. The latest MAPREDUCE-1362.patch is ready for code review.

Pipes should be ported to the new mapreduce API
-----------------------------------------------

                 Key: MAPREDUCE-1362
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1362
             Project: Hadoop Map/Reduce
          Issue Type: Improvement
          Components: pipes
            Reporter: Bassam Tabbara
         Attachments: MAPREDUCE-1362-trunk.patch, MAPREDUCE-1362.patch, MAPREDUCE-1362.patch

Pipes still uses the old mapred API. This prevents us from using Pipes with HBase's TableInputFormat, HRegionPartitioner, etc. Here is a rough proposal for how to accomplish this:

* Add a new package org.apache.hadoop.mapreduce.pipes that uses the new mapreduce API.
* The new pipes package will run side by side with the old one; the old one should get deprecated at some point.
* The wire protocol used between PipesMapper and PipesReducer and C++ programs must not change.
* bin/hadoop should support both pipes (old API) and pipes2 (new API).

Does this sound reasonable?
[jira] [Commented] (MAPREDUCE-5907) Improve getSplits() performance for fs implementations that can utilize performance gains from recursive listing
[ https://issues.apache.org/jira/browse/MAPREDUCE-5907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14019332#comment-14019332 ]

Sumit Kumar commented on MAPREDUCE-5907:
----------------------------------------

[~ste...@apache.org] Seeking your attention to this JIRA. Thanks!

Improve getSplits() performance for fs implementations that can utilize performance gains from recursive listing
-----------------------------------------------------------------------------------------------------------------

                 Key: MAPREDUCE-5907
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5907
             Project: Hadoop Map/Reduce
          Issue Type: Improvement
          Components: client
    Affects Versions: 2.4.0
            Reporter: Sumit Kumar
            Assignee: Sumit Kumar
         Attachments: MAPREDUCE-5907-2.patch, MAPREDUCE-5907-3.patch, MAPREDUCE-5907.patch

FileInputFormat (both the mapreduce and mapred implementations) lists input recursively while calculating splits, but it does so level by level: to discover the files under /foo/bar it first lists /foo/bar to get the immediate children, then issues the same call on each immediate child, and so on. This does not scale well for object-store-based fs implementations such as s3 and swift, because every listStatus call becomes a web-service call to the backend. When a large number of files are considered for input, this makes the getSplits() call slow. This patch adds a new set of recursive listing APIs that gives fs implementations an opportunity to optimize. The behavior stays the same for other implementations (a default implementation is provided, so they do not have to implement anything new), while object-store-based implementations only need to pass the recursive flag as true (as shown in the patch) to improve listing performance.
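As a rough illustration of the idea (not the API introduced by the patch): the level-by-level walk below issues one listStatus() call per directory, while a single recursive request, expressed here with the existing FileSystem.listFiles(path, true), lets an object-store implementation answer the whole subtree in far fewer round trips.

{code}
// Sketch of the listing-strategy difference only; both methods used here
// (listStatus and listFiles) are existing FileSystem APIs.
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

public class ListingStyles {

  /** One list call per directory: this is what makes getSplits() slow on s3/swift. */
  static List<FileStatus> listLevelByLevel(FileSystem fs, Path dir) throws IOException {
    List<FileStatus> result = new ArrayList<>();
    for (FileStatus stat : fs.listStatus(dir)) {
      if (stat.isDirectory()) {
        result.addAll(listLevelByLevel(fs, stat.getPath()));
      } else {
        result.add(stat);
      }
    }
    return result;
  }

  /** A single recursive request that the fs implementation is free to optimize. */
  static List<LocatedFileStatus> listRecursively(FileSystem fs, Path dir) throws IOException {
    List<LocatedFileStatus> result = new ArrayList<>();
    RemoteIterator<LocatedFileStatus> it = fs.listFiles(dir, true);
    while (it.hasNext()) {
      result.add(it.next());
    }
    return result;
  }

  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    Path input = new Path(args[0]);
    FileSystem fs = input.getFileSystem(conf);
    System.out.println("level-by-level: " + listLevelByLevel(fs, input).size() + " files");
    System.out.println("recursive:      " + listRecursively(fs, input).size() + " files");
  }
}
{code}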