Partitioning Reducer Output
Hi, What's the best way to partition data generated from a Reducer into multiple directories in Hadoop 0.20.1? I was thinking of using MultipleTextOutputFormat, but that's not backward compatible with the other APIs in this version of Hadoop. Thanks, -Rakesh
RE: Partitioning Reducer Output
Thanks for the insights. My use case is more about sending the reducer output to subdirectories representing date partitions. For example, if the base reducer output directory is /hdfs/root/reducer/ and the reducer encounters two records, one timestamped with date 2010/01/01 and the other with date 2010/01/02, then the records should be written to files in the directories "/hdfs/root/reducer/2010/01/01" and "/hdfs/root/reducer/2010/01/02" respectively. MultipleTextOutputFormat was designed to support such use cases, but it's not ported to 0.20.1. I was hoping there is a workaround. Thanks, -Rakesh

Date: Mon, 5 Apr 2010 08:45:13 -0700 From: erez_k...@yahoo.com Subject: Re: Partitioning Reducer Output To: mapreduce-user@hadoop.apache.org
A partitioner can be used to control how keys are distributed across reducers (overriding the default hash(key) % num_of_reducers behavior). I think Rakesh is asking about having multiple "types" of output from a single map-reduce application. Each reducer has a tmp work directory on HDFS (pointed to by the jobconf property mapred.work.output.dir, or the env var mapred_work_output_dir if it is a streaming app). The contents of that folder for a reducer that completed successfully are moved to the actual output folder of the task. A reducer can create other files in that folder, and provided there are no name collisions between reducers (meaning the reducer number is appended to the file name), the output folder can contain multiple types of outputs, something like part-0 part-1 part-2 otherType-0 otherType-1 otherType-2, and later on these files can be moved around to other folders. Hope it helps, Erez Katz

--- On Mon, 4/5/10, David Rosenstrauch wrote: From: David Rosenstrauch Subject: Re: Partitioning Reducer Output To: mapreduce-user@hadoop.apache.org Date: Monday, April 5, 2010, 7:35 AM
On 04/02/2010 08:32 PM, rakesh kothari wrote:
> Hi,
> What's the best way to partition data generated from a Reducer into multiple directories in Hadoop 0.20.1? I was thinking of using MultipleTextOutputFormat, but that's not backward compatible with the other APIs in this version of Hadoop.
> Thanks,
> -Rakesh
Use a partitioner? http://hadoop.apache.org/common/docs/r0.20.1/api/org/apache/hadoop/mapreduce/Job.html#setPartitionerClass%28java.lang.Class%29
HTH, DR
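For reference, here is a minimal sketch of the side-file workaround Erez describes, written against the old mapred API that ships with 0.20.1: the reducer writes date-partitioned files under its task work output path, and the output committer promotes that directory tree into the job output directory when the task attempt succeeds. The class name, the extractDatePath() helper, and the tab-separated record layout are illustrative assumptions, not something taken from this thread.

import java.io.IOException;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class DatePartitionReducer extends MapReduceBase
    implements Reducer<Text, Text, Text, Text> {

  private FileSystem fs;
  private Path workDir;        // task-attempt work dir, promoted to the job output on commit
  private String taskSuffix;   // appended to file names to avoid collisions between reducers
  private final Map<String, FSDataOutputStream> openFiles =
      new HashMap<String, FSDataOutputStream>();

  public void configure(JobConf job) {
    try {
      fs = FileSystem.get(job);
      workDir = FileOutputFormat.getWorkOutputPath(job);
      taskSuffix = job.get("mapred.task.partition", "0");
    } catch (IOException e) {
      throw new RuntimeException("Unable to set up side-file output", e);
    }
  }

  public void reduce(Text key, Iterator<Text> values,
                     OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    while (values.hasNext()) {
      Text value = values.next();
      String datePath = extractDatePath(value);   // e.g. "2010/01/01"
      FSDataOutputStream out = openFiles.get(datePath);
      if (out == null) {
        // One file per (date, reducer) pair, e.g. .../2010/01/01/part-3
        out = fs.create(new Path(workDir, datePath + "/part-" + taskSuffix));
        openFiles.put(datePath, out);
      }
      out.write((key.toString() + "\t" + value.toString() + "\n").getBytes("UTF-8"));
    }
  }

  public void close() throws IOException {
    for (FSDataOutputStream out : openFiles.values()) {
      out.close();
    }
  }

  // Placeholder: derive a "yyyy/MM/dd" relative path from the record's timestamp field.
  private String extractDatePath(Text value) {
    return "2010/01/01";
  }
}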
MRUnit Download
Hi, This link: http://www.cloudera.com/hadoop-mrunit no longer points to MRUnit. Can someone please point out the location where I can get it? Does MRUnit support Hadoop 0.20.1? Thanks, -Rakesh
Hdfs Block Size
Is there a reason why the block size should be set to 2^N for some integer N? Does it help with block defragmentation, etc.? Thanks, -Rakesh
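As far as I know, the block size is just a per-file long parameter: powers of two (64 MB, 128 MB, ...) are convention rather than a requirement, and the main hard constraint is that it be a multiple of the checksum chunk size (io.bytes.per.checksum, 512 bytes by default). A small sketch showing the block size passed explicitly when creating a file; the path and sizes are made up:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // 96 MB: not a power of two, but still a multiple of the 512-byte checksum chunk.
    long blockSize = 96L * 1024 * 1024;
    short replication = 3;
    int bufferSize = conf.getInt("io.file.buffer.size", 4096);

    FSDataOutputStream out = fs.create(
        new Path("/tmp/blocksize-demo.txt"),  // illustrative path
        true,                                 // overwrite if it exists
        bufferSize,
        replication,
        blockSize);
    out.writeBytes("hello\n");
    out.close();
    fs.close();
  }
}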
RE: Failures in the reducers
Thanks Shrijeet. Yeah, sorry, both of these logs are from datanodes. Also, I don't get this error when I run my job on just 1 file (450 MB). I wonder why this happens in the reduce stage, since I have just 10 reducers and I don't see how those 256 connections are being opened. -Rakesh

Date: Tue, 12 Oct 2010 13:02:16 -0700 Subject: Re: Failures in the reducers From: shrij...@rocketfuel.com To: mapreduce-user@hadoop.apache.org
Rakesh, that error log looks like it belonged to a DataNode and not the NameNode. Anyways, try pumping the parameter named dfs.datanode.max.xcievers up (shoot for 512). This param belongs to core-site.xml. -Shrijeet

On Tue, Oct 12, 2010 at 12:53 PM, rakesh kothari wrote:
Hi, My MR job is processing gzipped files, each around 450 MB, and there are 24 of them. The file block size is 512 MB. This job is failing consistently in the reduce phase with the exception below. Any ideas how to troubleshoot this? Thanks, -Rakesh

Datanode logs:
INFO org.apache.hadoop.mapred.Merger: Down to the last merge-pass, with 10 segments left of total size: 408736960 bytes
2010-10-12 07:25:01,020 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink 10.185.13.61:50010
2010-10-12 07:25:01,021 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_-961587459095414398_368580
2010-10-12 07:25:07,206 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink 10.185.13.61:50010
2010-10-12 07:25:07,206 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_-7795697604292519140_368580
2010-10-12 07:27:05,526 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.EOFException
2010-10-12 07:27:05,527 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_-7687883740524807660_368625
2010-10-12 07:27:11,713 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.EOFException
2010-10-12 07:27:11,713 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_-5546440551650461919_368626
2010-10-12 07:27:17,898 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.EOFException
2010-10-12 07:27:17,898 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_-3894897742813130478_368628
2010-10-12 07:27:24,081 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.EOFException
2010-10-12 07:27:24,081 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_8687736970664350304_368652
2010-10-12 07:27:30,186 WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception: java.io.IOException: Unable to create new block.
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2812)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2076)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2262)
2010-10-12 07:27:30,186 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_8687736970664350304_368652 bad datanode[0] nodes == null
2010-10-12 07:27:30,186 WARN org.apache.hadoop.hdfs.DFSClient: Could not get block locations. Source file "/tmp/dartlog-json-serializer/20100929_/_temporary/_attempt_201010082153_0040_r_00_2/jp/dart-imp-json/2010/09/29/17/part-r-0.gz" - Aborting...
2010-10-12 07:27:30,196 WARN org.apache.hadoop.mapred.TaskTracker: Error running child java.io.EOFException
    at java.io.DataInputStream.readByte(DataInputStream.java:250)
    at org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:298)
    at org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:319)
    at org.apache.hadoop.io.Text.readString(Text.java:400)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:2868)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2793)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2076)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2262)
2010-10-12 07:27:30,199 INFO org.apache.hadoop.mapred.TaskRunner: Runnning cleanup for the task

Namenode is throwing the following exception:
2010-10-12 07:27:30,026 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block blk_-892355450837523222_368657 src: /10.43.102.69:42352 dest: /10.43.102.69:50010
2010-10-12 07:27:30,206 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock blk_-892355450837523222_368657 received exception java.io.EOFException
2010-10-12 07:27:30,206 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.43.102.69:50010, storageID=DS-859924705-10.43.102.69-50010-1271546912162, infoPort=8501, ipcPor
RE: Failures in the reducers
No. It just runs this job. It's a 7-node cluster with 3 mapper and 2 reducer slots per node.

Date: Tue, 12 Oct 2010 13:23:23 -0700 Subject: Re: Failures in the reducers From: shrij...@rocketfuel.com To: mapreduce-user@hadoop.apache.org
Is your cluster busy doing other things (while this job is running)?

On Tue, Oct 12, 2010 at 1:15 PM, rakesh kothari wrote:
Thanks Shrijeet. Yeah, sorry, both of these logs are from datanodes. Also, I don't get this error when I run my job on just 1 file (450 MB). I wonder why this happens in the reduce stage, since I have just 10 reducers and I don't see how those 256 connections are being opened. -Rakesh

Date: Tue, 12 Oct 2010 13:02:16 -0700 Subject: Re: Failures in the reducers From: shrij...@rocketfuel.com To: mapreduce-user@hadoop.apache.org
Rakesh, that error log looks like it belonged to a DataNode and not the NameNode. Anyways, try pumping the parameter named dfs.datanode.max.xcievers up (shoot for 512). This param belongs to core-site.xml. -Shrijeet
Accessing files from distributed cache
Hi, What's the way to access files copied to the distributed cache from the map tasks? For example, if I run my M/R job as $ hadoop jar my.jar -files hdfs://path/to/my/file.txt, how can I access file.txt in my map (or reduce) task? Thanks, -Rakesh
RE: Accessing files from distributed cache
I am using Hadoop 0.20.1. -Rakesh

From: rkothari_...@hotmail.com To: mapreduce-user@hadoop.apache.org Subject: Accessing files from distributed cache Date: Tue, 19 Oct 2010 13:03:04 -0700
Hi, What's the way to access files copied to the distributed cache from the map tasks? For example, if I run my M/R job as $ hadoop jar my.jar -files hdfs://path/to/my/file.txt, how can I access file.txt in my map (or reduce) task? Thanks, -Rakesh
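For what it's worth, here is a minimal sketch of one common way to read such a file from a task in 0.20.x, assuming the new-API Mapper and that -files was handled by GenericOptionsParser/ToolRunner; the class name and the "read one line" logic are illustrative only. Because -files also symlinks the file into the task's working directory, simply opening new File("file.txt") is another common approach.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CacheAwareMapper extends Mapper<LongWritable, Text, Text, Text> {

  private String cachedLine;

  protected void setup(Context context) throws IOException, InterruptedException {
    // Files shipped with -files are localized on each task node; their local
    // paths are exposed through the DistributedCache API.
    Path[] localFiles = DistributedCache.getLocalCacheFiles(context.getConfiguration());
    if (localFiles == null) {
      return;
    }
    for (Path p : localFiles) {
      if ("file.txt".equals(p.getName())) {
        BufferedReader reader = new BufferedReader(new FileReader(p.toString()));
        try {
          cachedLine = reader.readLine();   // illustrative: read something from the file
        } finally {
          reader.close();
        }
      }
    }
  }

  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // ... use cachedLine alongside each input record ...
    context.write(new Text(cachedLine == null ? "" : cachedLine), value);
  }
}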
Moving files in hdfs using API
Hi, Is "move" not supported in Hdfs ? I can't find any API for that. Looking at the source code for hadoop CLI it seems like it's implementing move by copying data from src to dest and deleting the src. This could be a time consuming operation. Thanks, -Rakesh
Mapper processing gzipped file
Hi, There is a gzipped file that needs to be processed by a map-only Hadoop job. If the size of this file is more than the space reserved for non-DFS use on the tasktracker host processing it, and it's a non-data-local map task, would the job eventually fail? Is the Hadoop jobtracker smart enough not to schedule the task on such nodes? Thanks, -Rakesh
mapred.local.dir cleanup
Hi, I am seeing lots of leftover directories, going back as far as 12 days, in the task trackers' "mapred.local.dir". These directories are for M/R task attempts. How do these directories end up directly in "mapred.local.dir"? From my understanding they should be under "mapred.local.dir/taskTracker/jobcache/job-Id/" and should be cleaned up once the job finishes (or after some interval). How can I enable automatic cleanup of these directories? A big chunk of these leftover directories were created around the same day/time I bounced my Hadoop cluster. Any pointers are highly appreciated. Thanks, -Rakesh
RE: mapred.local.dir cleanup
Any ideas on how "attempt*" directories are getting created directly under "mapred.local.dir"? Pointers to the relevant parts of the source code would help too. Thanks, -Rakesh

From: rkothari_...@hotmail.com To: mapreduce-user@hadoop.apache.org Subject: mapred.local.dir cleanup Date: Tue, 18 Jan 2011 17:20:04 -0800
Hi, I am seeing lots of leftover directories, going back as far as 12 days, in the task trackers' "mapred.local.dir". These directories are for M/R task attempts. How do these directories end up directly in "mapred.local.dir"? From my understanding they should be under "mapred.local.dir/taskTracker/jobcache/job-Id/" and should be cleaned up once the job finishes (or after some interval). How can I enable automatic cleanup of these directories? A big chunk of these leftover directories were created around the same day/time I bounced my Hadoop cluster. Any pointers are highly appreciated. Thanks, -Rakesh
JobTracker goes into seemingly infinite loop
Hi, I am using Hadoop 0.20.1. Recently we had a JobTracker outage because of the following: the JobTracker tries to write a file to HDFS, but its connection to the primary datanode gets disrupted. It then enters a retry loop that goes on for hours. I see the following messages in the jobtracker log:

2011-05-05 10:14:44,117 WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block blk_-3114565976339273197_13989812 java.net.SocketTimeoutException: 69000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.216.48.12:55432 remote=/10.216.241.26:50010]
    at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
    at java.io.DataInputStream.readFully(DataInputStream.java:178)
    at java.io.DataInputStream.readLong(DataInputStream.java:399)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2397)
2011-05-05 10:14:44,117 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_-3114565976339273197_13989812 bad datanode[0] 10.216.241.26:50010
2011-05-05 10:14:44,117 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_-3114565976339273197_13989812 in pipeline 10.216.241.26:50010, 10.193.31.55:50010, 10.193.31.54:50010: bad datanode 10.216.241.26:50010
2011-05-05 10:15:32,458 INFO org.apache.hadoop.hdfs.DFSClient: Could not complete file /progress/_logs/history/hadoop.jobtracker.com_1299948850437_job_201103121654_161356_user_myJob retrying...

The last message I see in the namenode log regarding this block is:

2011-05-05 10:15:27,208 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: commitBlockSynchronization(lastblock=blk_-3114565976339273197_13989812, newgenerationstamp=13989830, newlength=260096, newtargets=[10.193.31.54:50010], closeFile=false, deleteBlock=false)

This problem looks similar to what these guys experienced here: https://groups.google.com/a/cloudera.org/group/cdh-user/browse_thread/thread/f02a5a08e50de544 Any ideas? Thanks, -Rakesh
Failed vs Killed Tasks in Hadoop
Hi, Does "maps_failed" counter includes Tasks that were killed due to speculative execution ? Same with "reduces_faile" and Killed reduce tasks. Thanks, -Rakesh
EOFException when using LZO to compress map/reduce output
Hi, I am using LZO to compress my intermediate map outputs. These are the settings:

mapred.map.output.compression.codec = com.hadoop.compression.lzo.LzoCodec
pig.tmpfilecompression.codec = lzo

But I am consistently getting the following exception (I don't get this exception when I use "gz" as pig.tmpfilecompression.codec). Perhaps a bug? I am using Hadoop 0.20.2 and Pig 0.8.1.

java.io.EOFException
    at org.apache.hadoop.io.compress.BlockDecompressorStream.getCompressedData(BlockDecompressorStream.java:112)
    at org.apache.hadoop.io.compress.BlockDecompressorStream.decompress(BlockDecompressorStream.java:88)
    at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:74)
    at org.apache.hadoop.mapred.IFile$Reader.readData(IFile.java:328)
    at org.apache.hadoop.mapred.IFile$Reader.rejigData(IFile.java:358)
    at org.apache.hadoop.mapred.IFile$Reader.readNextBlock(IFile.java:342)
    at org.apache.hadoop.mapred.IFile$Reader.next(IFile.java:404)
    at org.apache.hadoop.mapred.Merger$Segment.next(Merger.java:220)
    at org.apache.hadoop.mapred.Merger$MergeQueue.adjustPriorityQueue(Merger.java:330)
    at org.apache.hadoop.mapred.Merger$MergeQueue.next(Merger.java:350)
    at org.apache.hadoop.mapreduce.ReduceContext.nextKeyValue(ReduceContext.java:117)
    at org.apache.hadoop.mapreduce.ReduceContext.nextKey(ReduceContext.java:92)
    at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:175)
    at org.apache.hadoop.mapred.Task$NewCombinerRunner.combine(Task.java:1217)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1500)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1116)
    at org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:512)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:585)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
    at org.apache.hadoop.mapred.Child.main(Child.java:170)

Thanks, -Rakesh
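For context, this is roughly how the map-output compression properties above are wired into a plain MapReduce driver in 0.20.x (a sketch, assuming the hadoop-lzo codec jar and native libraries are installed on every node); pig.tmpfilecompression and pig.tmpfilecompression.codec are the analogous Pig-level switches for its intermediate files.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class LzoMapOutputJob {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Compress intermediate map output with LZO (0.20.x property names).
    conf.setBoolean("mapred.compress.map.output", true);
    conf.set("mapred.map.output.compression.codec",
             "com.hadoop.compression.lzo.LzoCodec");

    Job job = new Job(conf, "lzo-map-output-example");
    // ... set mapper, reducer, input and output paths as usual ...
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}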