Re: Doubt from the book Definitive Guide
On Wed, Apr 4, 2012 at 10:02 PM, Prashant Kommireddi prash1...@gmail.com wrote:

> Hi Mohit,
>
> What would be the advantage? Reducers in most cases read data from all the mappers. In the case where mappers were to write to HDFS, a reducer would still need to read data from other datanodes across the cluster.

The only advantage I was thinking of was that in some cases reducers might be able to take advantage of data locality and avoid multiple HTTP calls, no? The data is written anyway, so the last merged file could go to HDFS instead of the local disk. I am new to Hadoop, so I am just asking questions to understand the rationale behind using the local disk for the final output.

> Prashant
>
> On Apr 4, 2012, at 9:55 PM, Mohit Anchlia mohitanch...@gmail.com wrote:
>
>> On Wed, Apr 4, 2012 at 8:42 PM, Harsh J ha...@cloudera.com wrote:
>>
>>> Hi Mohit,
>>>
>>> On Thu, Apr 5, 2012 at 5:26 AM, Mohit Anchlia mohitanch...@gmail.com wrote:
>>>
>>>> I am going through the chapter "How MapReduce Works" and have some confusion:
>>>>
>>>> 1) The description of the Mapper below says that reducers get the output file using an HTTP call. But the description under "The Reduce Side" doesn't specifically say whether it's copied using HTTP. So, first confusion: is the output copied from mapper to reducer, or from reducer to mapper? And second: is the call http:// or hdfs://?
>>>
>>> The flow is as simple as this:
>>>
>>> 1. For an M+R job, a map task completes after writing all its partitions down to the tasktracker's local filesystem (under the mapred.local.dir directories).
>>> 2. Reducers fetch completion locations from events at the JobTracker, and query the TaskTracker there to provide the specific partition they need, which is served over the TaskTracker's HTTP service (port 50060).
>>>
>>> So to clear things up: the map doesn't send its output to the reduce, nor does the reduce ask the actual map task. It is the tasktracker itself that acts as the bridge here. Note, however, that in Hadoop 2.0 the transfer via the ShuffleHandler is done over Netty connections, which is much faster and more reliable.
>>>
>>>> 2) My understanding was that mapper output gets written to HDFS, since I've seen part-m-0 files in HDFS. If mapper output is written to HDFS, then shouldn't reducers simply read it from HDFS instead of making HTTP calls to the tasktracker's location?
>>>
>>> A map-only job usually writes out to HDFS directly (no sorting is done, because no reducer is involved). If the job is a map+reduce one, the output is by default collected to the local filesystem for partitioning and sorting at the map end, and eventually grouping at the reduce end. Basically: data you want to send from mapper to reducer goes to the local FS so that multiple actions can be performed on it; other data may go directly to HDFS. Reducers are currently scheduled pretty randomly, but yes, their scheduling could be improved for certain scenarios.
>>>
>>> However, if you are suggesting that map partitions ought to be written to HDFS itself (with or without replication), I don't see performance improving. Note that the partitions aren't merely written but need to be sorted as well (at either end). Doing that requires the ability to spill frequently (because we don't have infinite memory to do it all in RAM), and doing such a thing on HDFS would only mean a slowdown.
>>
>> Thanks for clearing my doubts. In this case I was merely suggesting that if the mapper output (the final merged output, or the shuffle output) were stored in HDFS, then reducers could just retrieve it from HDFS instead of asking the tasktracker for it. Once the reducer threads have read it, they can continue to work locally.
>>
>>> I hope this helps clear some things up for you.
>>>
>>> --
>>> Harsh J
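Harsh's step 1 above, where a map task writes all its partitions down to local disk, can be made concrete with a small sketch. This is Python pseudocode, not Hadoop's actual Java implementation; Hadoop's default HashPartitioner computes (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks, which the sketch mimics with Python's built-in hash:

```python
# Minimal sketch (not Hadoop's Java code) of how a map task assigns each
# output record to a reduce partition, in the spirit of Hadoop's default
# HashPartitioner: partition = (hash(key) & MAX_INT) % num_reducers.
MAX_INT = 0x7FFFFFFF  # mask keeps the hash non-negative, as Hadoop does


def partition_for(key: str, num_reducers: int) -> int:
    """Return the reduce partition (0 .. num_reducers-1) for a key."""
    return (hash(key) & MAX_INT) % num_reducers


def partition_map_output(records, num_reducers):
    """Group (key, value) pairs into per-reducer partitions; a real map
    task would then sort each partition and write it under one of the
    mapred.local.dir directories on the local filesystem."""
    partitions = {p: [] for p in range(num_reducers)}
    for key, value in records:
        partitions[partition_for(key, num_reducers)].append((key, value))
    return partitions
```

Because every mapper uses the same partitioning function, all records with a given key, regardless of which map produced them, end up in the same numbered partition, which is exactly the per-reducer file the TaskTracker later serves over HTTP.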
Re: Doubt from the book Definitive Guide
On Thu, Apr 5, 2012 at 7:03 AM, Mohit Anchlia mohitanch...@gmail.com wrote:

> The only advantage I was thinking of was that in some cases reducers might be able to take advantage of data locality and avoid multiple HTTP calls, no? The data is written anyway, so the last merged file could go to HDFS instead of the local disk. I am new to Hadoop, so I am just asking questions to understand the rationale behind using the local disk for the final output.

So basically it's a tradeoff here: you get more replicas to copy from, but you have two more copies to write. Considering that that data is very short-lived, and that it doesn't need to be replicated (since if the machine fails, the maps are replayed anyway), writing two replicas that are potentially never used would be hurtful.

Regarding locality, it might make sense on a small cluster, but the more nodes you add, the smaller the chance of having local replicas for each block of data you're looking for.

J-D
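J-D's locality point can be made concrete with a back-of-the-envelope calculation. Under the simplifying assumption that a block's replicas land on distinct nodes chosen uniformly at random (real HDFS placement is rack-aware, so this only illustrates the trend), the chance that any given reducer's node holds a local replica works out to replicas/nodes:

```python
from math import comb


def p_local_replica(nodes: int, replicas: int = 3) -> float:
    """Probability that a given node holds at least one of a block's
    replicas, assuming (simplistically) the replicas land on distinct
    nodes chosen uniformly at random. Real HDFS placement is rack-aware,
    so this is only an illustration of the trend J-D describes."""
    if replicas >= nodes:
        return 1.0
    # P(no replica on our node) = C(nodes-1, replicas) / C(nodes, replicas),
    # which simplifies to (nodes - replicas) / nodes; so P(local) = replicas / nodes.
    return 1.0 - comb(nodes - 1, replicas) / comb(nodes, replicas)
```

With 3 replicas, a 10-node cluster gives each reducer a 30% chance of a local copy of any given map output block, while a 1000-node cluster gives only 0.3%, so the extra replicated writes would rarely pay off at scale.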
Doubt from the book Definitive Guide
I am going through the chapter "How MapReduce Works" and have some confusion:

1) The description of the Mapper below says that reducers get the output file using an HTTP call. But the description under "The Reduce Side" doesn't specifically say whether it's copied using HTTP. So, first confusion: is the output copied from mapper to reducer, or from reducer to mapper? And second: is the call http:// or hdfs://?

2) My understanding was that mapper output gets written to HDFS, since I've seen part-m-0 files in HDFS. If mapper output is written to HDFS, then shouldn't reducers simply read it from HDFS instead of making HTTP calls to the tasktracker's location?

--- from the book ---

Mapper: The output file's partitions are made available to the reducers over HTTP. The number of worker threads used to serve the file partitions is controlled by the tasktracker.http.threads property; this setting is per tasktracker, not per map task slot. The default of 40 may need increasing for large clusters running large jobs.

6.4.2. The Reduce Side: Let's turn now to the reduce part of the process. The map output file is sitting on the local disk of the tasktracker that ran the map task (note that although map outputs always get written to the local disk of the map tasktracker, reduce outputs may not be), but now it is needed by the tasktracker that is about to run the reduce task for the partition. Furthermore, the reduce task needs the map output for its particular partition from several map tasks across the cluster. The map tasks may finish at different times, so the reduce task starts copying their outputs as soon as each completes. This is known as the copy phase of the reduce task. The reduce task has a small number of copier threads so that it can fetch map outputs in parallel. The default is five threads, but this number can be changed by setting the mapred.reduce.parallel.copies property.
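The copy phase the book excerpt describes, a small pool of copier threads pulling this reducer's partition from whichever maps have finished, can be sketched roughly as follows. This is illustrative Python, not Hadoop code; fetch_partition is a hypothetical stand-in for the HTTP GET against the map-side TaskTracker:

```python
from concurrent.futures import ThreadPoolExecutor

# Illustrative sketch (not Hadoop's code) of the reduce-side copy phase:
# a pool of copier threads fetches this reducer's partition from every
# finished map task. The default pool size is 5, per the book's
# mapred.reduce.parallel.copies setting.
PARALLEL_COPIES = 5


def fetch_partition(map_id: int):
    """Hypothetical stand-in for an HTTP GET to the map-side TaskTracker
    (port 50060) asking for this reducer's partition of map `map_id`."""
    return (map_id, f"partition-data-from-map-{map_id}")


def copy_phase(finished_maps):
    """Fetch this reducer's partition from all finished maps in parallel,
    returning {map_id: partition_data}."""
    with ThreadPoolExecutor(max_workers=PARALLEL_COPIES) as pool:
        return dict(pool.map(fetch_partition, finished_maps))
```

In the real framework the set of finished maps grows over time (the reducer learns of completions from the JobTracker), so copies start as soon as each map finishes rather than waiting for all of them.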
Re: Doubt from the book Definitive Guide
Answers inline.

On Wed, Apr 4, 2012 at 4:56 PM, Mohit Anchlia mohitanch...@gmail.com wrote:

> I am going through the chapter "How MapReduce Works" and have some confusion:
>
> 1) The description of the Mapper below says that reducers get the output file using an HTTP call. But the description under "The Reduce Side" doesn't specifically say whether it's copied using HTTP. So, first confusion: is the output copied from mapper to reducer, or from reducer to mapper? And second: is the call http:// or hdfs://?

Map output is written to the local FS, not HDFS.

> 2) My understanding was that mapper output gets written to HDFS, since I've seen part-m-0 files in HDFS. If mapper output is written to HDFS, then shouldn't reducers simply read it from HDFS instead of making HTTP calls to the tasktracker's location?

Map output goes to HDFS only when no reducer is used.

> --- from the book ---
>
> Mapper: The output file's partitions are made available to the reducers over HTTP. The number of worker threads used to serve the file partitions is controlled by the tasktracker.http.threads property; this setting is per tasktracker, not per map task slot. The default of 40 may need increasing for large clusters running large jobs.
>
> 6.4.2. The Reduce Side: Let's turn now to the reduce part of the process. The map output file is sitting on the local disk of the tasktracker that ran the map task (note that although map outputs always get written to the local disk of the map tasktracker, reduce outputs may not be), but now it is needed by the tasktracker that is about to run the reduce task for the partition. Furthermore, the reduce task needs the map output for its particular partition from several map tasks across the cluster. The map tasks may finish at different times, so the reduce task starts copying their outputs as soon as each completes. This is known as the copy phase of the reduce task. The reduce task has a small number of copier threads so that it can fetch map outputs in parallel. The default is five threads, but this number can be changed by setting the mapred.reduce.parallel.copies property.
Re: Doubt from the book Definitive Guide
Hi Mohit,

On Thu, Apr 5, 2012 at 5:26 AM, Mohit Anchlia mohitanch...@gmail.com wrote:

> I am going through the chapter "How MapReduce Works" and have some confusion:
>
> 1) The description of the Mapper below says that reducers get the output file using an HTTP call. But the description under "The Reduce Side" doesn't specifically say whether it's copied using HTTP. So, first confusion: is the output copied from mapper to reducer, or from reducer to mapper? And second: is the call http:// or hdfs://?

The flow is as simple as this:

1. For an M+R job, a map task completes after writing all its partitions down to the tasktracker's local filesystem (under the mapred.local.dir directories).
2. Reducers fetch completion locations from events at the JobTracker, and query the TaskTracker there to provide the specific partition they need, which is served over the TaskTracker's HTTP service (port 50060).

So to clear things up: the map doesn't send its output to the reduce, nor does the reduce ask the actual map task. It is the tasktracker itself that acts as the bridge here. Note, however, that in Hadoop 2.0 the transfer via the ShuffleHandler is done over Netty connections, which is much faster and more reliable.

> 2) My understanding was that mapper output gets written to HDFS, since I've seen part-m-0 files in HDFS. If mapper output is written to HDFS, then shouldn't reducers simply read it from HDFS instead of making HTTP calls to the tasktracker's location?

A map-only job usually writes out to HDFS directly (no sorting is done, because no reducer is involved). If the job is a map+reduce one, the output is by default collected to the local filesystem for partitioning and sorting at the map end, and eventually grouping at the reduce end. Basically: data you want to send from mapper to reducer goes to the local FS so that multiple actions can be performed on it; other data may go directly to HDFS. Reducers are currently scheduled pretty randomly, but yes, their scheduling could be improved for certain scenarios.

However, if you are suggesting that map partitions ought to be written to HDFS itself (with or without replication), I don't see performance improving. Note that the partitions aren't merely written but need to be sorted as well (at either end). Doing that requires the ability to spill frequently (because we don't have infinite memory to do it all in RAM), and doing such a thing on HDFS would only mean a slowdown.

I hope this helps clear some things up for you.

--
Harsh J
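Harsh's spill argument can be illustrated with a minimal external sort: sort whatever fits in a fixed-size buffer, spill each sorted run, then stream-merge the runs. In Hadoop each spill is a cheap write to local disk; if every spill were a replicated HDFS write, that I/O cost would multiply. (Sketch only; in-memory lists stand in for local spill files.)

```python
import heapq
import itertools


def sort_with_spills(records, buffer_size):
    """External sort sketch: collect sorted runs of at most `buffer_size`
    records (each run standing in for one spill file written to the
    local mapred.local.dir), then do a single streaming merge pass
    over all runs, as the map side does before serving its partitions."""
    runs = []
    it = iter(records)
    while True:
        buf = list(itertools.islice(it, buffer_size))  # fill the in-memory buffer
        if not buf:
            break
        runs.append(sorted(buf))       # sort the buffer and "spill" it as one run
    return list(heapq.merge(*runs))    # merge phase: one pass over all spilled runs
```

The key property is that memory usage is bounded by the buffer size while the number of spill writes grows with the data, which is why those writes need to be as cheap as a local disk allows.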
Re: Doubt from the book Definitive Guide
On Wed, Apr 4, 2012 at 8:42 PM, Harsh J ha...@cloudera.com wrote:

> Hi Mohit,
>
> On Thu, Apr 5, 2012 at 5:26 AM, Mohit Anchlia mohitanch...@gmail.com wrote:
>
>> I am going through the chapter "How MapReduce Works" and have some confusion:
>>
>> 1) The description of the Mapper below says that reducers get the output file using an HTTP call. But the description under "The Reduce Side" doesn't specifically say whether it's copied using HTTP. So, first confusion: is the output copied from mapper to reducer, or from reducer to mapper? And second: is the call http:// or hdfs://?
>
> The flow is as simple as this:
>
> 1. For an M+R job, a map task completes after writing all its partitions down to the tasktracker's local filesystem (under the mapred.local.dir directories).
> 2. Reducers fetch completion locations from events at the JobTracker, and query the TaskTracker there to provide the specific partition they need, which is served over the TaskTracker's HTTP service (port 50060).
>
> So to clear things up: the map doesn't send its output to the reduce, nor does the reduce ask the actual map task. It is the tasktracker itself that acts as the bridge here. Note, however, that in Hadoop 2.0 the transfer via the ShuffleHandler is done over Netty connections, which is much faster and more reliable.
>
>> 2) My understanding was that mapper output gets written to HDFS, since I've seen part-m-0 files in HDFS. If mapper output is written to HDFS, then shouldn't reducers simply read it from HDFS instead of making HTTP calls to the tasktracker's location?
>
> A map-only job usually writes out to HDFS directly (no sorting is done, because no reducer is involved). If the job is a map+reduce one, the output is by default collected to the local filesystem for partitioning and sorting at the map end, and eventually grouping at the reduce end. Basically: data you want to send from mapper to reducer goes to the local FS so that multiple actions can be performed on it; other data may go directly to HDFS. Reducers are currently scheduled pretty randomly, but yes, their scheduling could be improved for certain scenarios.
>
> However, if you are suggesting that map partitions ought to be written to HDFS itself (with or without replication), I don't see performance improving. Note that the partitions aren't merely written but need to be sorted as well (at either end). Doing that requires the ability to spill frequently (because we don't have infinite memory to do it all in RAM), and doing such a thing on HDFS would only mean a slowdown.

Thanks for clearing my doubts. In this case I was merely suggesting that if the mapper output (the final merged output, or the shuffle output) were stored in HDFS, then reducers could just retrieve it from HDFS instead of asking the tasktracker for it. Once the reducer threads have read it, they can continue to work locally.

> I hope this helps clear some things up for you.
>
> --
> Harsh J
Re: Doubt from the book Definitive Guide
Hi Mohit,

What would be the advantage? Reducers in most cases read data from all the mappers. In the case where mappers were to write to HDFS, a reducer would still need to read data from other datanodes across the cluster.

Prashant

On Apr 4, 2012, at 9:55 PM, Mohit Anchlia mohitanch...@gmail.com wrote:

> On Wed, Apr 4, 2012 at 8:42 PM, Harsh J ha...@cloudera.com wrote:
>
>> Hi Mohit,
>>
>> On Thu, Apr 5, 2012 at 5:26 AM, Mohit Anchlia mohitanch...@gmail.com wrote:
>>
>>> I am going through the chapter "How MapReduce Works" and have some confusion:
>>>
>>> 1) The description of the Mapper below says that reducers get the output file using an HTTP call. But the description under "The Reduce Side" doesn't specifically say whether it's copied using HTTP. So, first confusion: is the output copied from mapper to reducer, or from reducer to mapper? And second: is the call http:// or hdfs://?
>>
>> The flow is as simple as this:
>>
>> 1. For an M+R job, a map task completes after writing all its partitions down to the tasktracker's local filesystem (under the mapred.local.dir directories).
>> 2. Reducers fetch completion locations from events at the JobTracker, and query the TaskTracker there to provide the specific partition they need, which is served over the TaskTracker's HTTP service (port 50060).
>>
>> So to clear things up: the map doesn't send its output to the reduce, nor does the reduce ask the actual map task. It is the tasktracker itself that acts as the bridge here. Note, however, that in Hadoop 2.0 the transfer via the ShuffleHandler is done over Netty connections, which is much faster and more reliable.
>>
>>> 2) My understanding was that mapper output gets written to HDFS, since I've seen part-m-0 files in HDFS. If mapper output is written to HDFS, then shouldn't reducers simply read it from HDFS instead of making HTTP calls to the tasktracker's location?
>>
>> A map-only job usually writes out to HDFS directly (no sorting is done, because no reducer is involved). If the job is a map+reduce one, the output is by default collected to the local filesystem for partitioning and sorting at the map end, and eventually grouping at the reduce end. Basically: data you want to send from mapper to reducer goes to the local FS so that multiple actions can be performed on it; other data may go directly to HDFS. Reducers are currently scheduled pretty randomly, but yes, their scheduling could be improved for certain scenarios.
>>
>> However, if you are suggesting that map partitions ought to be written to HDFS itself (with or without replication), I don't see performance improving. Note that the partitions aren't merely written but need to be sorted as well (at either end). Doing that requires the ability to spill frequently (because we don't have infinite memory to do it all in RAM), and doing such a thing on HDFS would only mean a slowdown.
>
> Thanks for clearing my doubts. In this case I was merely suggesting that if the mapper output (the final merged output, or the shuffle output) were stored in HDFS, then reducers could just retrieve it from HDFS instead of asking the tasktracker for it. Once the reducer threads have read it, they can continue to work locally.
>
>> I hope this helps clear some things up for you.
>>
>> --
>> Harsh J