Partitioning Reducer Output

2010-04-02 Thread rakesh kothari

Hi,
 
What's the best way to partition data generated from the Reducer into multiple 
directories in Hadoop 0.20.1? I was thinking of using MultipleTextOutputFormat, 
but that's not backward compatible with other APIs in this version of Hadoop.
 
Thanks,
-Rakesh   

RE: Partitioning Reducer Output

2010-04-05 Thread rakesh kothari

Thanks for the insights.

My use case is more around sending the reducer output to subdirectories 
representing date partitions.

For example, if the base reducer output directory is /hdfs/root/reducer/ and the 
reducer encounters two records, one timestamped with date 2010/01/01 and the other 
with date 2010/01/02, then the records should be written to files in the directories 
"/hdfs/root/reducer/2010/01/01" and "/hdfs/root/reducer/2010/01/02" respectively.

MultipleTextOutputFormat was designed to support such use cases, but it's not 
ported to 0.20.1. I was hoping there is a workaround.

Thanks,
-Rakesh

Date: Mon, 5 Apr 2010 08:45:13 -0700
From: erez_k...@yahoo.com
Subject: Re: Partitioning Reducer Output
To: mapreduce-user@hadoop.apache.org

A partitioner can be used to control how keys are distributed across reducers 
(overriding the default 
hash(key)%num_of_reducers behavior)

I think Rakesh is asking about having multiple "types" of output from a single 
map-reduce application.

Each reducer has a temporary work directory on HDFS (pointed to by the jobconf 
property mapred.work.output.dir, or by the env var "mapred_work_output_dir" if it 
is a streaming app).
The contents of that folder for a reducer that completed successfully are moved to 
the actual output folder of the task.

A reducer can create other files in that folder, and provided there are no name 
collisions between reducers (e.g. if the reducer number is appended to the file 
name), the output folder can contain multiple types of outputs, something like

part-0
part-1
part-2
otherType-0
otherType-1
otherType-2

and later on these files can be moved around to other folders...
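
Following that approach, a minimal sketch with the new org.apache.hadoop.mapreduce 
API (the class name, the side-file name prefix and the key/value types below are 
made up, and this is untested against 0.20.1):

import java.io.IOException;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SideFileReducer extends Reducer<Text, Text, Text, NullWritable> {
  private FSDataOutputStream sideFile;

  @Override
  protected void setup(Context context) throws IOException, InterruptedException {
    // Task-local work dir, e.g. <output>/_temporary/_attempt_.../ ; files created
    // here are promoted to the job output directory when the task commits.
    Path workDir = FileOutputFormat.getWorkOutputPath(context);
    FileSystem fs = workDir.getFileSystem(context.getConfiguration());
    // Append the reducer number to avoid name collisions between reducers.
    int partition = context.getTaskAttemptID().getTaskID().getId();
    sideFile = fs.create(new Path(workDir, "otherType-" + partition));
  }

  @Override
  protected void reduce(Text key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    for (Text value : values) {
      sideFile.writeBytes(key + "\t" + value + "\n"); // extra "type" of output
      context.write(key, NullWritable.get());         // normal part-* output
    }
  }

  @Override
  protected void cleanup(Context context) throws IOException {
    sideFile.close();
  }
}

The date-partitioned layout from the earlier mail could be produced the same way: 
open one file per date under the work directory and move the files under 
/hdfs/root/reducer/yyyy/MM/dd afterwards.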

hope it helps,

  Erez Katz


--- On Mon, 4/5/10, David Rosenstrauch  wrote:

From: David Rosenstrauch 
Subject: Re: Partitioning Reducer Output
To: mapreduce-user@hadoop.apache.org
Date: Monday, April 5, 2010, 7:35 AM

On 04/02/2010 08:32 PM, rakesh kothari wrote:
>
> Hi,
>
> What's the best way to partition data generated from the Reducer into multiple
> directories in Hadoop 0.20.1? I was thinking of using MultipleTextOutputFormat,
> but that's not backward compatible with other APIs in this version of Hadoop.
>
> Thanks,
> -Rakesh 

Use a partitioner?

http://hadoop.apache.org/common/docs/r0.20.1/api/org/apache/hadoop/mapreduce/Job.html#setPartitionerClass%28java.lang.Class%29
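
For reference, a minimal sketch of a custom Partitioner against that API (the class 
name and the assumption that keys carry a yyyy/MM/dd prefix are made up; note that a 
partitioner only controls which reducer receives a key, not which directory the 
output lands in):

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class DatePartitioner extends Partitioner<Text, Text> {
  @Override
  public int getPartition(Text key, Text value, int numPartitions) {
    // Assumes keys begin with a yyyy/MM/dd prefix; route each date to a
    // deterministic reducer.
    String k = key.toString();
    String datePrefix = k.substring(0, Math.min(10, k.length()));
    return (datePrefix.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }
}

It would be registered with job.setPartitionerClass(DatePartitioner.class).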

HTH,

DR

  

MRUnit Download

2010-08-20 Thread rakesh kothari

Hi,

This link: http://www.cloudera.com/hadoop-mrunit no longer points to MRUnit. 
Can someone please point me to where I can get it?

Does MRUnit support Hadoop 0.20.1?

Thanks,
-Rakesh
  

Hdfs Block Size

2010-10-07 Thread rakesh kothari

Is there a reason why the block size should be set to 2^N for some integer N? 
Does it help with block defragmentation, etc.?

Thanks,
-Rakesh
  

RE: Failures in the reducers

2010-10-12 Thread rakesh kothari

Thanks Shrijeet. Yeah, sorry both of these logs are from datanodes.

Also, I don't get this error when I run my job on just 1 file (450 MB).

I wonder why this happens in the reduce stage, since I have just 10 reducers and 
I don't see how those 256 connections are being opened.

-Rakesh

Date: Tue, 12 Oct 2010 13:02:16 -0700
Subject: Re: Failures in the reducers
From: shrij...@rocketfuel.com
To: mapreduce-user@hadoop.apache.org

Rakesh, that error log looks like it belongs to the DataNode and not the NameNode. 
Anyway, try pumping up the parameter named dfs.datanode.max.xcievers (shoot for 
512). This param belongs to core-site.xml.

-Shrijeet
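
(For reference, dfs.datanode.max.xcievers is a datanode-side setting that is 
normally placed in hdfs-site.xml on each datanode and picked up after a datanode 
restart; a sketch of the entry with the value suggested above:)

<property>
  <name>dfs.datanode.max.xcievers</name>
  <value>512</value>
</property>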

On Tue, Oct 12, 2010 at 12:53 PM, rakesh kothari wrote:

Hi,

My MR Job is processing gzipped files each around 450 MB and there are 24 of 
them. File block size is 512 MB. 

This job is failing consistently in the reduce phase with the following 
exception (below). Any ideas on how to troubleshoot this?


Thanks,
-Rakesh

Datanode logs:



INFO org.apache.hadoop.mapred.Merger: Down to the last merge-pass, with 10 segments left of total size: 408736960 bytes

2010-10-12 07:25:01,020 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink 10.185.13.61:50010
2010-10-12 07:25:01,021 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_-961587459095414398_368580
2010-10-12 07:25:07,206 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink 10.185.13.61:50010
2010-10-12 07:25:07,206 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_-7795697604292519140_368580
2010-10-12 07:27:05,526 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.EOFException
2010-10-12 07:27:05,527 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_-7687883740524807660_368625
2010-10-12 07:27:11,713 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.EOFException
2010-10-12 07:27:11,713 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_-5546440551650461919_368626
2010-10-12 07:27:17,898 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.EOFException
2010-10-12 07:27:17,898 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_-3894897742813130478_368628
2010-10-12 07:27:24,081 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.EOFException
2010-10-12 07:27:24,081 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_8687736970664350304_368652
2010-10-12 07:27:30,186 WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception: java.io.IOException: Unable to create new block.
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2812)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2076)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2262)

2010-10-12 07:27:30,186 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_8687736970664350304_368652 bad datanode[0] nodes == null
2010-10-12 07:27:30,186 WARN org.apache.hadoop.hdfs.DFSClient: Could not get block locations. Source file "/tmp/dartlog-json-serializer/20100929_/_temporary/_attempt_201010082153_0040_r_00_2/jp/dart-imp-json/2010/09/29/17/part-r-0.gz" - Aborting...
2010-10-12 07:27:30,196 WARN org.apache.hadoop.mapred.TaskTracker: Error running child
java.io.EOFException
        at java.io.DataInputStream.readByte(DataInputStream.java:250)
        at org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:298)
        at org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:319)
        at org.apache.hadoop.io.Text.readString(Text.java:400)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:2868)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2793)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2076)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2262)
2010-10-12 07:27:30,199 INFO org.apache.hadoop.mapred.TaskRunner: Runnning cleanup for the task

Namenode is throwing following exception:

2010-10-12 07:27:30,026 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block blk_-892355450837523222_368657 src: /10.43.102.69:42352 dest: /10.43.102.69:50010
2010-10-12 07:27:30,206 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock blk_-892355450837523222_368657 received exception java.io.EOFException
2010-10-12 07:27:30,206 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.43.102.69:50010, storageID=DS-859924705-10.43.102.69-50010-1271546912162, infoPort=8501, ipcPor

RE: Failures in the reducers

2010-10-12 Thread rakesh kothari

No. It just runs this job. It's a 7-node cluster with 3 mapper and 2 reducer slots 
per node.

Date: Tue, 12 Oct 2010 13:23:23 -0700
Subject: Re: Failures in the reducers
From: shrij...@rocketfuel.com
To: mapreduce-user@hadoop.apache.org

Is your cluster busy doing other things? (while this job is running) 


Accessing files from distributed cache

2010-10-19 Thread rakesh kothari

Hi,

What's the way to access files copied to the distributed cache from the map tasks?

e.g.

If I run my M/R job as "$ hadoop jar my.jar -files hdfs://path/to/my/file.txt", 
how can I access file.txt in my map (or reduce) task?
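
For what it's worth, a minimal sketch of the usual approach (this assumes the job 
is launched through ToolRunner/GenericOptionsParser so that -files entries are 
symlinked into the task's working directory under their base names; the mapper 
class name is made up):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CacheAwareMapper extends Mapper<LongWritable, Text, Text, Text> {
  @Override
  protected void setup(Context context) throws IOException {
    // Option 1: the -files entry is available as ./file.txt in the task's
    // current working directory.
    BufferedReader reader = new BufferedReader(new FileReader("file.txt"));
    String line;
    while ((line = reader.readLine()) != null) {
      // ... load lookup data into memory ...
    }
    reader.close();

    // Option 2: resolve the local copies through the DistributedCache API:
    // org.apache.hadoop.fs.Path[] local =
    //     org.apache.hadoop.filecache.DistributedCache.getLocalCacheFiles(
    //         context.getConfiguration());
  }
}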


Thanks,
-Rakesh
  

RE: Accessing files from distributed cache

2010-10-19 Thread rakesh kothari


I am using Hadoop 0.20.1.

-Rakesh
  

Moving files in hdfs using API

2010-10-21 Thread rakesh kothari

Hi,

Is "move" not supported in Hdfs ? I can't find any API for that. Looking at the 
source code for hadoop CLI it seems like it's implementing move by copying data 
from src to dest and deleting the src. This could be a time consuming operation.
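
For reference, moves within a single HDFS instance are normally done with 
FileSystem.rename(), which is a namenode metadata operation and does not copy block 
data (the copy-then-delete path in the CLI applies to moves across filesystems, 
e.g. moveFromLocal). A minimal sketch with example paths:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsMove {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // rename() moves the file by updating namenode metadata; no data is copied.
    boolean ok = fs.rename(new Path("/user/rakesh/src/file.txt"),
                           new Path("/user/rakesh/dest/file.txt"));
    System.out.println("rename succeeded: " + ok);
    fs.close();
  }
}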



Thanks,
-Rakesh
  

Mapper processing gzipped file

2011-01-18 Thread rakesh kothari

Hi,

There is a gzipped file that needs to be processed by a map-only Hadoop job. If 
the size of this file is more than the space reserved for non-DFS use on the 
tasktracker host processing this file, and if it's a non-data-local map task, 
would this job eventually fail? Is the Hadoop jobtracker smart enough not to 
schedule the task on such nodes?

Thanks,
-Rakesh
  

mapred.local.dir cleanup

2011-01-18 Thread rakesh kothari

Hi,

I am seeing lots of leftover directories, going back as far as 12 days, in the 
task trackers' "mapred.local.dir". These directories are for M/R task attempts.

How do these directories end up directly in "mapred.local.dir"? From my 
understanding these directories should be under 
"mapred.local.dir/taskTracker/jobcache/job-Id/" and should be cleaned up once 
the job finishes (or after some interval). How can I enable automatic cleanup 
of these directories?

A big chunk of these leftover directories was created around the same day/time 
when I bounced my Hadoop cluster.

Any pointers are highly appreciated.

Thanks,
-Rakesh
  

RE: mapred.local.dir cleanup

2011-01-20 Thread rakesh kothari

Any ideas on how "attempt*" directories are getting created directly under 
"mapred.local.dir"? Pointers to the relevant parts of the source code would help 
too.

Thanks,
-Rakesh

  

JobTracker goes into seemingly infinite loop

2011-05-05 Thread rakesh kothari

Hi,

I am using Hadoop 0.20.1. Recently we had a JobTracker outage because of the 
following:

The JobTracker tries to write a file to HDFS, but its connection to the primary 
datanode gets disrupted. It then enters a retry loop that goes on for hours.

I see the following message in the jobtracker log:



2011-05-05 10:14:44,117 WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block blk_-3114565976339273197_13989812 java.net.SocketTimeoutException: 69000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.216.48.12:55432 remote=/10.216.241.26:50010]
        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
        at java.io.DataInputStream.readFully(DataInputStream.java:178)
        at java.io.DataInputStream.readLong(DataInputStream.java:399)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2397)

2011-05-05 10:14:44,117 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_-3114565976339273197_13989812 bad datanode[0] 10.216.241.26:50010

2011-05-05 10:14:44,117 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_-3114565976339273197_13989812 in pipeline 10.216.241.26:50010, 10.193.31.55:50010, 10.193.31.54:50010: bad datanode 10.216.241.26:50010

2011-05-05 10:15:32,458 INFO org.apache.hadoop.hdfs.DFSClient: Could not complete file /progress/_logs/history/hadoop.jobtracker.com_1299948850437_job_201103121654_161356_user_myJob retrying...

The last message that I see in namenode regarding this block is:

2011-05-05 10:15:27,208 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: commitBlockSynchronization(lastblock=blk_-3114565976339273197_13989812, newgenerationstamp=13989830, newlength=260096, newtargets=[10.193.31.54:50010], closeFile=false, deleteBlock=false)



This problem looks similar to what these guys experienced here: 
https://groups.google.com/a/cloudera.org/group/cdh-user/browse_thread/thread/f02a5a08e50de544

Any ideas?

Thanks,
-Rakesh


  

Failed vs Killed Tasks in Hadoop

2011-06-20 Thread rakesh kothari

Hi,

Does "maps_failed" counter includes Tasks that were killed due to speculative 
execution ?

Same with "reduces_faile" and Killed reduce tasks.

Thanks,
-Rakesh
  

EOFException when using LZO to compress map/reduce output

2011-08-14 Thread rakesh kothari

Hi,

I am using LZO to compress my intermediate map outputs.

These are the settings:
mapred.map.output.compression.codec  =  com.hadoop.compression.lzo.LzoCodec 
pig.tmpfilecompression.codec = lzo
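
For reference, the first setting corresponds to the standard MapReduce map-output 
compression properties; a minimal sketch of setting them in a plain MapReduce 
driver (the driver class is hypothetical, and it assumes the hadoop-lzo codec is 
installed and registered in io.compression.codecs on every node; the 
pig.tmpfilecompression* properties are Pig-specific and are set through Pig's 
configuration):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class LzoMapOutputDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Compress intermediate map output with the LZO codec.
    conf.setBoolean("mapred.compress.map.output", true);
    conf.set("mapred.map.output.compression.codec",
             "com.hadoop.compression.lzo.LzoCodec");
    Job job = new Job(conf, "lzo-map-output");
    // ... set mapper, reducer, input and output paths as usual ...
    // System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}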

But I am consistently getting the following exception (I don't get this 
exception when I use "gz" as pig.tmpfilecompression.codec). Perhaps a bug? I am 
using Hadoop 0.20.2 and Pig 0.8.1.

java.io.EOFException
        at org.apache.hadoop.io.compress.BlockDecompressorStream.getCompressedData(BlockDecompressorStream.java:112)
        at org.apache.hadoop.io.compress.BlockDecompressorStream.decompress(BlockDecompressorStream.java:88)
        at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:74)
        at org.apache.hadoop.mapred.IFile$Reader.readData(IFile.java:328)
        at org.apache.hadoop.mapred.IFile$Reader.rejigData(IFile.java:358)
        at org.apache.hadoop.mapred.IFile$Reader.readNextBlock(IFile.java:342)
        at org.apache.hadoop.mapred.IFile$Reader.next(IFile.java:404)
        at org.apache.hadoop.mapred.Merger$Segment.next(Merger.java:220)
        at org.apache.hadoop.mapred.Merger$MergeQueue.adjustPriorityQueue(Merger.java:330)
        at org.apache.hadoop.mapred.Merger$MergeQueue.next(Merger.java:350)
        at org.apache.hadoop.mapreduce.ReduceContext.nextKeyValue(ReduceContext.java:117)
        at org.apache.hadoop.mapreduce.ReduceContext.nextKey(ReduceContext.java:92)
        at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:175)
        at org.apache.hadoop.mapred.Task$NewCombinerRunner.combine(Task.java:1217)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1500)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1116)
        at org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:512)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:585)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
        at org.apache.hadoop.mapred.Child.main(Child.java:170)

Thanks,
-Rakesh