Hello!
I've found the problem with the number of mappers. We run the M/R jobs
through Oozie, which apparently ignores our setting of the mapred.map.tasks
property. That property is used as a hint for computing the number of splits.
Quoting the Javadoc of TableInputFormatBase#getSplits (old API):
"Splits are created in number equal to the smallest between numSplits and
the number of {@link HRegion}s in the table. If the number of splits is
smaller than the number of {@link HRegion}s then splits are spanned across
multiple {@link HRegion}s and are grouped the most evenly possible. In the
case splits are uneven the bigger splits are placed first in the
{@link InputSplit} array."
By default, mapred.map.tasks is set to 2. Applying the above algorithm to
my scenario (together with the Oozie observation) gives
min(mapred.map.tasks=2, number_of_my_regions=32) = 2, which is the "magic"
number of mappers we saw.
We confirmed this behavior by implementing a Driver for the MR job and
setting mapred.map.tasks to, say, 40. The number of mappers is then
calculated correctly as 32.
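To illustrate the calculation, here is a minimal sketch in plain Java. It is
not the HBase source; it only models the rule from the Javadoc quoted above
(number of splits = min(hint, regions), regions grouped as evenly as possible,
bigger splits first). The class and method names (SplitCountSketch,
regionsPerSplit) are made up for this example.

```java
import java.util.Arrays;

public class SplitCountSketch {

    /**
     * Hypothetical helper: returns how many regions each split would cover,
     * following the rule described in the quoted Javadoc.
     *
     * @param numSplitsHint the hint, e.g. the value of mapred.map.tasks
     * @param numRegions    the number of regions in the table
     */
    static int[] regionsPerSplit(int numSplitsHint, int numRegions) {
        if (numSplitsHint < 1 || numRegions < 1) {
            throw new IllegalArgumentException("both arguments must be >= 1");
        }
        // The number of splits is the smaller of the hint and the region count.
        int realNumSplits = Math.min(numSplitsHint, numRegions);
        int base = numRegions / realNumSplits;      // regions every split gets
        int remainder = numRegions % realNumSplits; // leftover regions

        int[] sizes = new int[realNumSplits];
        for (int i = 0; i < realNumSplits; i++) {
            // The first 'remainder' splits take one extra region,
            // so the bigger splits come first in the array.
            sizes[i] = base + (i < remainder ? 1 : 0);
        }
        return sizes;
    }

    public static void main(String[] args) {
        // Hint left at the default of 2, table with 32 regions: 2 mappers.
        System.out.println(regionsPerSplit(2, 32).length);
        // Hint raised to 40 in the Driver: capped at the 32 regions.
        System.out.println(regionsPerSplit(40, 32).length);
        // An uneven case, to show the grouping.
        System.out.println(Arrays.toString(regionsPerSplit(5, 32)));
    }
}
```

With the hint at 2, each of the two splits spans 16 regions; with the hint at
40, every region gets its own split, which matches the 32 mappers we saw.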
Regards,
Florin
--- On Mon, 6/27/11, Florin P <[email protected]> wrote:
> From: Florin P <[email protected]>
> Subject: RE: Obtain many mappers (or regions)
> To: [email protected]
> Date: Monday, June 27, 2011, 8:46 AM
> Hi!
> Thank you for your response. As I said, it is a
> temporary table. This table holds the metadata for
> long-running processing tasks that we would like to trigger
> on the cluster (as map/reduce jobs) so that every machine
> takes some of those tasks.
> I have read the indicated chapter, and then I
> followed this scenario:
> 1. We loaded the small data set into the HBase table
> 2. From the HBase admin interface we triggered the split action
> 3. We saw that 32 new regions were created for that table
> 4. We ran a map/reduce job that counts the number of rows
> 5. Only two mappers were created
> What puzzles me is that only 2 mapper tasks were
> created, even though the indicated book states
> (quote) "
> When TableInputFormat, is used to source an HBase table in
> a MapReduce job, its splitter will make a map task for each
> region of the table. Thus, if there are 100 regions in the
> table, there will be 100 map-tasks for the job - regardless
> of how many column families are selected in the Scan.
> "
>
> Can you please explain why this happens? Did we miss some
> property configuration?
>
> Thank you.
> regards,
> Florin
> --- On Mon, 6/27/11, Doug Meil <[email protected]> wrote:
>
> > From: Doug Meil <[email protected]>
> > Subject: RE: Obtain many mappers (or regions)
> > To: "[email protected]" <[email protected]>
> > Date: Monday, June 27, 2011, 8:01 AM
> > Hi there-
> >
> > If you only have 100 rows I think that HBase might be
> > overkill.
> >
> > You probably want to start with this to get a background on
> > what HBase can do...
> > http://hbase.apache.org/book.html
> > .. there is a section on MapReduce with HBase as well.
> >
> > -----Original Message-----
> > From: Florin P [mailto:[email protected]]
> >
> > Sent: Monday, June 27, 2011 4:53 AM
> > To: [email protected]
> > Subject: Obtain many mappers (or regions)
> >
> > Hello!
> > I have the following scenario:
> > 1. A temporary HBase table with a small number of rows (approx. 100)
> > 2. A cluster with 2 machines on which I would like to crunch the
> > data contained in the rows
> >
> > I would like to create two mappers that will crunch the
> > data from the rows.
> > How can I achieve this?
> > A general question is:
> > how can we obtain many mappers to crunch a small quantity of data?
> >
> > Thank you.
> > Regards,
> > Florin
> >
>