Re: Doubt from the book Definitive Guide

2012-04-05 Thread Mohit Anchlia
On Wed, Apr 4, 2012 at 10:02 PM, Prashant Kommireddi prash1...@gmail.com wrote:

 Hi Mohit,

 What would be the advantage? Reducers in most cases read data from all
 the mappers. In the case where mappers were to write to HDFS, a
 reducer would still need to read data from other datanodes across
 the cluster.


The only advantage I was thinking of was that in some cases reducers might be
able to take advantage of data locality and avoid multiple HTTP calls, no?
The data is written anyway, so the last merged file could go to HDFS instead of
the local disk.
I am new to Hadoop, so I'm just asking questions to understand the rationale
behind using the local disk for the final output.

 Prashant

 On Apr 4, 2012, at 9:55 PM, Mohit Anchlia mohitanch...@gmail.com wrote:

  On Wed, Apr 4, 2012 at 8:42 PM, Harsh J ha...@cloudera.com wrote:
 
  Hi Mohit,
 
  On Thu, Apr 5, 2012 at 5:26 AM, Mohit Anchlia mohitanch...@gmail.com
  wrote:
   I am going through the chapter "How MapReduce Works" and have some
   confusion:
 
   1) The description of the Mapper below says that reducers get the output
   file using an HTTP call. But the description under "The Reduce Side"
   doesn't specifically say if it's copied using HTTP. So, first confusion:
   is the output copied from mapper -> reducer or from reducer -> mapper?
   And second, is the call http:// or hdfs://?
 
   The flow is as simple as this:
   1. For an M+R job, the map completes its task after writing all partitions
   down into the tasktracker's local filesystem (under the mapred.local.dir
   directories).
   2. Reducers fetch completion locations from events at the JobTracker, and
   query the TaskTracker there for the specific partition they need, which is
   done over the TaskTracker's HTTP service (port 50060).
 
  So to clear things up - map doesn't send it to reduce, nor does reduce
  ask the actual map task. It is the task tracker itself that makes the
  bridge here.
 
   Note, however, that in Hadoop 2.0 the transfer via the ShuffleHandler
   would be over Netty connections. This should be much faster and more
   reliable.
 
   2) My understanding was that mapper output gets written to HDFS, since
   I've seen part-m-0 files in HDFS. If mapper output is written to HDFS,
   then shouldn't reducers simply read it from HDFS instead of making HTTP
   calls to the tasktrackers' locations?
 
   A map-only job usually writes out to HDFS directly (no sorting is done,
   because no reducer is involved). If the job is a map+reduce one, the
   default output is collected to the local filesystem for partitioning and
   sorting at the map end, and eventually grouping at the reduce end.
   Basically: data you want to send from the mapper to the reducer goes to
   the local FS so multiple actions can be performed on it; other data may
   go directly to HDFS.
 
   Reducers are currently scheduled pretty randomly, but yes, their
   scheduling can be improved for certain scenarios. However, if you are
   suggesting that map partitions ought to be written to HDFS itself (with
   replication or without), I don't see performance improving. Note that
   the partitions aren't merely written but need to be sorted as well (at
   either end). Doing that requires the ability to spill frequently (because
   we don't have infinite memory to do it all in RAM), and doing such a
   thing on HDFS would only mean a slowdown.
 
   Thanks for clearing my doubts. In this case I was merely suggesting that
   if the mapper output (the final merged output, or the shuffle output) were
   stored in HDFS, then reducers could just retrieve it from HDFS instead of
   asking the tasktracker for it. Once the reducer threads read it, they can
   continue to work locally.
 
 
 
  I hope this helps clear some things up for you.
 
  --
  Harsh J
 



Re: Doubt from the book Definitive Guide

2012-04-05 Thread Jean-Daniel Cryans
On Thu, Apr 5, 2012 at 7:03 AM, Mohit Anchlia mohitanch...@gmail.com wrote:
 The only advantage I was thinking of was that in some cases reducers might be
 able to take advantage of data locality and avoid multiple HTTP calls, no?
 The data is written anyway, so the last merged file could go to HDFS instead of
 the local disk.
 I am new to Hadoop, so I'm just asking questions to understand the rationale
 behind using the local disk for the final output.

So basically it's a tradeoff here: you get more replicas to copy from,
but you have two more copies to write. Considering that that data is very
short-lived and that it doesn't need to be replicated (since if the
machine fails the maps are replayed anyway), it seems that writing two
replicas that are potentially unused would be hurtful.

Regarding locality, it might make sense on a small cluster, but the
more nodes you add, the smaller the chance of having local replicas for
each block of data you're looking for.
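
To make this tradeoff concrete, here is a small back-of-the-envelope sketch in
Java. The job size, cluster size, and replication factor are made-up numbers,
used only to illustrate the two points above: extra replicas multiply the
intermediate write volume, and the chance of a node-local replica shrinks
roughly as replicas/nodes once the cluster grows.

// Illustrative arithmetic only; all figures below are hypothetical.
public class ShuffleTradeoffSketch {
    public static void main(String[] args) {
        long intermediateBytes = 500L * 1024 * 1024 * 1024; // 500 GB of total map output (made up)
        int replication = 3;                                // typical HDFS replication factor
        int nodes = 200;                                    // hypothetical cluster size

        // Today: each map writes its partitions once, to its local disk.
        long localWrite = intermediateBytes;

        // If map output went to HDFS with replication 3, the same data is written 3x,
        // with two of those copies crossing the network before any reducer reads a byte.
        long hdfsWrite = intermediateBytes * replication;

        System.out.println("bytes written, local disk: " + localWrite);
        System.out.println("bytes written, HDFS x" + replication + ": " + hdfsWrite);

        // Locality: a reducer needs partitions from (nearly) every map. For any one
        // block, the chance that some replica landed on the reducer's own node is
        // roughly replication / nodes, which shrinks as the cluster grows.
        double pLocal = (double) replication / nodes;
        System.out.printf("approx. chance a given map output block is node-local: %.1f%%%n",
                pLocal * 100);
    }
}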

J-D


Doubt from the book Definitive Guide

2012-04-04 Thread Mohit Anchlia
I am going through the chapter "How MapReduce Works" and have some
confusion:

1) The description of the Mapper below says that reducers get the output file
using an HTTP call. But the description under "The Reduce Side" doesn't
specifically say if it's copied using HTTP. So, first confusion: is the output
copied from mapper -> reducer or from reducer -> mapper? And second, is the
call http:// or hdfs://?

2) My understanding was that mapper output gets written to HDFS, since I've
seen part-m-0 files in HDFS. If mapper output is written to HDFS, then
shouldn't reducers simply read it from HDFS instead of making HTTP calls to
the tasktrackers' locations?



- from the book ---
Mapper
The output file’s partitions are made available to the reducers over HTTP.
The number of worker threads used to serve the file partitions is
controlled by the tasktracker.http.threads property;
this setting is per tasktracker, not per map task slot. The default of 40
may need increasing for large clusters running large jobs.

The Reduce Side
Let’s turn now to the reduce part of the process. The map output file is
sitting on the local disk of the tasktracker that ran the map task
(note that although map outputs always get written to the local disk of the
map tasktracker, reduce outputs may not be), but now it is needed by the
tasktracker
that is about to run the reduce task for the partition. Furthermore, the
reduce task needs the map output for its particular partition from several
map tasks across the cluster.
The map tasks may finish at different times, so the reduce task starts
copying their outputs as soon as each completes. This is known as the copy
phase of the reduce task.
The reduce task has a small number of copier threads so that it can fetch
map outputs in parallel.
The default is five threads, but this number can be changed by setting the
mapred.reduce.parallel.copies property.
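
For readers who want to see where these two knobs live, here is a minimal
sketch against the Hadoop 1.x Java API, assuming the property names quoted
above from the book. Note that tasktracker.http.threads is a daemon-side
setting, so putting it in a job's configuration has no effect; it is only read
here to show the documented default.

import org.apache.hadoop.conf.Configuration;

public class CopyPhaseKnobs {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // Reduce-side knob: number of parallel copier threads per reduce task.
        // This is a per-job setting, so it can go into the job's configuration.
        conf.setInt("mapred.reduce.parallel.copies", 10);

        // Map-serving knob: worker threads the tasktracker uses to serve map
        // output partitions over HTTP. The book notes this is per tasktracker,
        // not per map slot, so it belongs in the tasktracker's mapred-site.xml.
        int serverThreads = conf.getInt("tasktracker.http.threads", 40);

        System.out.println("tasktracker.http.threads (default) = " + serverThreads);
        System.out.println("mapred.reduce.parallel.copies = "
                + conf.getInt("mapred.reduce.parallel.copies", 5));
    }
}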


Re: Doubt from the book Definitive Guide

2012-04-04 Thread Prashant Kommireddi
Answers inline.

On Wed, Apr 4, 2012 at 4:56 PM, Mohit Anchlia mohitanch...@gmail.com wrote:

 I am going through the chapter "How MapReduce Works" and have some
 confusion:

 1) The description of the Mapper below says that reducers get the output file
 using an HTTP call. But the description under "The Reduce Side" doesn't
 specifically say if it's copied using HTTP. So, first confusion: is the output
 copied from mapper -> reducer or from reducer -> mapper? And second, is the
 call http:// or hdfs://?


Map output is written to local FS, not HDFS.


 2) My understanding was that mapper output gets written to HDFS, since I've
 seen part-m-0 files in HDFS. If mapper output is written to HDFS, then
 shouldn't reducers simply read it from HDFS instead of making HTTP calls to
 the tasktrackers' locations?

 Map output is sent to HDFS when no reducer is used.
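
To illustrate that point, here is a minimal map-only job sketch against the
Hadoop 1.x-era mapreduce API; the class name and the input/output paths are
hypothetical. With zero reduce tasks, the identity mapper's output is written
by the output format straight to HDFS, which is where the part-m-* files come
from.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MapOnlyPassthrough {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "map-only passthrough"); // 1.x-era constructor
        job.setJarByClass(MapOnlyPassthrough.class);

        // The stock Mapper is an identity mapper: with the default TextInputFormat
        // it emits (byte offset, line) pairs unchanged.
        job.setMapperClass(Mapper.class);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        // Zero reducers: no partitioning, no sort, no shuffle. Each map task's
        // output goes straight to HDFS as part-m-NNNNN under the output path.
        job.setNumReduceTasks(0);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}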



 - from the book ---
 Mapper
 The output file’s partitions are made available to the reducers over HTTP.
 The number of worker threads used to serve the file partitions is
 controlled by the tasktracker.http.threads property;
 this setting is per tasktracker, not per map task slot. The default of 40
 may need increasing for large clusters running large jobs.

 The Reduce Side
 Let’s turn now to the reduce part of the process. The map output file is
 sitting on the local disk of the tasktracker that ran the map task
 (note that although map outputs always get written to the local disk of the
 map tasktracker, reduce outputs may not be), but now it is needed by the
 tasktracker
 that is about to run the reduce task for the partition. Furthermore, the
 reduce task needs the map output for its particular partition from several
 map tasks across the cluster.
 The map tasks may finish at different times, so the reduce task starts
 copying their outputs as soon as each completes. This is known as the copy
 phase of the reduce task.
 The reduce task has a small number of copier threads so that it can fetch
 map outputs in parallel.
 The default is five threads, but this number can be changed by setting the
 mapred.reduce.parallel.copies property.



Re: Doubt from the book Definitive Guide

2012-04-04 Thread Harsh J
Hi Mohit,

On Thu, Apr 5, 2012 at 5:26 AM, Mohit Anchlia mohitanch...@gmail.com wrote:
 I am going through the chapter "How MapReduce Works" and have some
 confusion:

 1) The description of the Mapper below says that reducers get the output file
 using an HTTP call. But the description under "The Reduce Side" doesn't
 specifically say if it's copied using HTTP. So, first confusion: is the output
 copied from mapper -> reducer or from reducer -> mapper? And second, is the
 call http:// or hdfs://?

The flow is as simple as this:
1. For an M+R job, the map completes its task after writing all partitions
down into the tasktracker's local filesystem (under the mapred.local.dir
directories).
2. Reducers fetch completion locations from events at the JobTracker, and
query the TaskTracker there for the specific partition they need, which is
done over the TaskTracker's HTTP service (port 50060).

So to clear things up - map doesn't send it to reduce, nor does reduce
ask the actual map task. It is the task tracker itself that makes the
bridge here.

Note, however, that in Hadoop 2.0 the transfer via the ShuffleHandler
would be over Netty connections. This should be much faster and more
reliable.
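
As a concrete (and heavily hedged) illustration of step 2, the sketch below
builds the kind of HTTP request a reduce-side copier issues against the
TaskTracker service on port 50060. The /mapOutput servlet path, its query
parameters, and every host/job/attempt identifier here are assumptions made
for illustration, not something confirmed in this thread.

import java.io.InputStream;
import java.net.URL;

public class MapOutputFetchSketch {
    public static void main(String[] args) throws Exception {
        // All of these identifiers are hypothetical placeholders.
        String ttHost = "tasktracker-07.example.com";
        String jobId = "job_201204040001_0042";
        String mapAttempt = "attempt_201204040001_0042_m_000003_0";
        int reducePartition = 5; // the partition this reduce task owns

        // Port 50060 is the default TaskTracker HTTP service mentioned above.
        // The /mapOutput path and parameter names mirror the 1.x MapOutputServlet
        // as I recall it; treat them as an assumption, not a documented API.
        URL url = new URL("http://" + ttHost + ":50060/mapOutput"
                + "?job=" + jobId + "&map=" + mapAttempt + "&reduce=" + reducePartition);

        try (InputStream in = url.openStream()) {
            byte[] buf = new byte[64 * 1024];
            long total = 0;
            for (int n; (n = in.read(buf)) != -1; ) {
                total += n; // a real copier would feed this into the shuffle's memory/disk merge
            }
            System.out.println("fetched " + total + " bytes of map output from " + ttHost);
        }
    }
}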

 2) My understanding was that mapper output gets written to HDFS, since I've
 seen part-m-0 files in HDFS. If mapper output is written to HDFS, then
 shouldn't reducers simply read it from HDFS instead of making HTTP calls to
 the tasktrackers' locations?

A map-only job usually writes out to HDFS directly (no sorting is done,
because no reducer is involved). If the job is a map+reduce one, the
default output is collected to the local filesystem for partitioning and
sorting at the map end, and eventually grouping at the reduce end.
Basically: data you want to send from the mapper to the reducer goes to
the local FS so multiple actions can be performed on it; other data may
go directly to HDFS.

Reducers are currently scheduled pretty randomly, but yes, their
scheduling can be improved for certain scenarios. However, if you are
suggesting that map partitions ought to be written to HDFS itself (with
replication or without), I don't see performance improving. Note that
the partitions aren't merely written but need to be sorted as well (at
either end). Doing that requires the ability to spill frequently (because
we don't have infinite memory to do it all in RAM), and doing such a
thing on HDFS would only mean a slowdown.
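
The frequent-spill point maps onto a few map-side buffer properties. The
sketch below uses the Hadoop 1.x names and defaults as I remember them
(io.sort.mb, io.sort.spill.percent, io.sort.factor); they are not cited in
this thread, so treat the exact names and defaults as assumptions.

import org.apache.hadoop.conf.Configuration;

public class MapSideSortBufferSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // In-memory buffer each map task sorts into before spilling to local disk.
        // Hadoop 1.x name/default as I recall them: io.sort.mb = 100 (MB).
        int sortMb = conf.getInt("io.sort.mb", 100);

        // Fill fraction at which a background thread starts spilling the buffer.
        float spillPercent = conf.getFloat("io.sort.spill.percent", 0.80f);

        // How many spill files/streams get merged at once at the end of the map
        // (and again during the reduce-side merge).
        int sortFactor = conf.getInt("io.sort.factor", 10);

        System.out.println("io.sort.mb = " + sortMb);
        System.out.println("io.sort.spill.percent = " + spillPercent);
        System.out.println("io.sort.factor = " + sortFactor);
    }
}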

I hope this helps clear some things up for you.

-- 
Harsh J


Re: Doubt from the book Definitive Guide

2012-04-04 Thread Mohit Anchlia
On Wed, Apr 4, 2012 at 8:42 PM, Harsh J ha...@cloudera.com wrote:

 Hi Mohit,

 On Thu, Apr 5, 2012 at 5:26 AM, Mohit Anchlia mohitanch...@gmail.com
 wrote:
  I am going through the chapter "How MapReduce Works" and have some
  confusion:
 
  1) The description of the Mapper below says that reducers get the output
  file using an HTTP call. But the description under "The Reduce Side"
  doesn't specifically say if it's copied using HTTP. So, first confusion:
  is the output copied from mapper -> reducer or from reducer -> mapper?
  And second, is the call http:// or hdfs://?

 The flow is as simple as this:
 1. For an M+R job, the map completes its task after writing all partitions
 down into the tasktracker's local filesystem (under the mapred.local.dir
 directories).
 2. Reducers fetch completion locations from events at the JobTracker, and
 query the TaskTracker there for the specific partition they need, which is
 done over the TaskTracker's HTTP service (port 50060).

 So to clear things up - map doesn't send it to reduce, nor does reduce
 ask the actual map task. It is the task tracker itself that makes the
 bridge here.

 Note, however, that in Hadoop 2.0 the transfer via the ShuffleHandler
 would be over Netty connections. This should be much faster and more
 reliable.

  2) My understanding was that mapper output gets written to HDFS, since I've
  seen part-m-0 files in HDFS. If mapper output is written to HDFS, then
  shouldn't reducers simply read it from HDFS instead of making HTTP calls
  to the tasktrackers' locations?

 A map-only job usually writes out to HDFS directly (no sorting is done,
 because no reducer is involved). If the job is a map+reduce one, the
 default output is collected to the local filesystem for partitioning and
 sorting at the map end, and eventually grouping at the reduce end.
 Basically: data you want to send from the mapper to the reducer goes to
 the local FS so multiple actions can be performed on it; other data may
 go directly to HDFS.

 Reducers are currently scheduled pretty randomly, but yes, their
 scheduling can be improved for certain scenarios. However, if you are
 suggesting that map partitions ought to be written to HDFS itself (with
 replication or without), I don't see performance improving. Note that
 the partitions aren't merely written but need to be sorted as well (at
 either end). Doing that requires the ability to spill frequently (because
 we don't have infinite memory to do it all in RAM), and doing such a
 thing on HDFS would only mean a slowdown.

 Thanks for clearing my doubts. In this case I was merely suggesting that
if the mapper output (the final merged output, or the shuffle output) were
stored in HDFS, then reducers could just retrieve it from HDFS instead of
asking the tasktracker for it. Once the reducer threads read it, they can
continue to work locally.



 I hope this helps clear some things up for you.

 --
 Harsh J



Re: Doubt from the book Definitive Guide

2012-04-04 Thread Prashant Kommireddi
Hi Mohit,

What would be the advantage? Reducers in most cases read data from all
the mappers. In the case where mappers were to write to HDFS, a
reducer would still need to read data from other datanodes across
the cluster.

Prashant

On Apr 4, 2012, at 9:55 PM, Mohit Anchlia mohitanch...@gmail.com wrote:

 On Wed, Apr 4, 2012 at 8:42 PM, Harsh J ha...@cloudera.com wrote:

 Hi Mohit,

 On Thu, Apr 5, 2012 at 5:26 AM, Mohit Anchlia mohitanch...@gmail.com
 wrote:
 I am going through the chapter "How MapReduce Works" and have some
 confusion:

 1) The description of the Mapper below says that reducers get the output
 file using an HTTP call. But the description under "The Reduce Side"
 doesn't specifically say if it's copied using HTTP. So, first confusion:
 is the output copied from mapper -> reducer or from reducer -> mapper?
 And second, is the call http:// or hdfs://?

 The flow is as simple as this:
 1. For an M+R job, the map completes its task after writing all partitions
 down into the tasktracker's local filesystem (under the mapred.local.dir
 directories).
 2. Reducers fetch completion locations from events at the JobTracker, and
 query the TaskTracker there for the specific partition they need, which is
 done over the TaskTracker's HTTP service (port 50060).

 So to clear things up - map doesn't send it to reduce, nor does reduce
 ask the actual map task. It is the task tracker itself that makes the
 bridge here.

 Note, however, that in Hadoop 2.0 the transfer via the ShuffleHandler
 would be over Netty connections. This should be much faster and more
 reliable.

 2) My understanding was that mapper output gets written to HDFS, since I've
 seen part-m-0 files in HDFS. If mapper output is written to HDFS, then
 shouldn't reducers simply read it from HDFS instead of making HTTP calls
 to the tasktrackers' locations?

 A map-only job usually writes out to HDFS directly (no sorting is done,
 because no reducer is involved). If the job is a map+reduce one, the
 default output is collected to the local filesystem for partitioning and
 sorting at the map end, and eventually grouping at the reduce end.
 Basically: data you want to send from the mapper to the reducer goes to
 the local FS so multiple actions can be performed on it; other data may
 go directly to HDFS.

 Reducers are currently scheduled pretty randomly, but yes, their
 scheduling can be improved for certain scenarios. However, if you are
 suggesting that map partitions ought to be written to HDFS itself (with
 replication or without), I don't see performance improving. Note that
 the partitions aren't merely written but need to be sorted as well (at
 either end). Doing that requires the ability to spill frequently (because
 we don't have infinite memory to do it all in RAM), and doing such a
 thing on HDFS would only mean a slowdown.

 Thanks for clearing my doubts. In this case I was merely suggesting that
 if the mapper output (the final merged output, or the shuffle output) were
 stored in HDFS, then reducers could just retrieve it from HDFS instead of
 asking the tasktracker for it. Once the reducer threads read it, they can
 continue to work locally.



 I hope this helps clear some things up for you.

 --
 Harsh J