Re: Reading in parallel from table's regions in MapReduce

Ioakim Perros Tue, 04 Sep 2012 10:15:34 -0700

Jerry, thank you very much for the links.

Regards,
Ioakim


On 09/04/2012 08:05 PM, Jerry Lam wrote:

Hi Ioakim:

Here is a list of links I would suggest you read (I know it is a lot to
read):
HBase Related:
- http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableMapReduceUtil.html
- http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/package-summary.html#package_description
- make sure to read the examples:
http://hbase.apache.org/book/mapreduce.example.html

Hadoop Related:
- http://wiki.apache.org/hadoop/JobTracker
- http://wiki.apache.org/hadoop/TaskTracker
- http://hadoop.apache.org/common/docs/r1.0.3/mapred_tutorial.html
- Some Configurations:
http://hadoop.apache.org/common/docs/r1.0.3/cluster_setup.html

HTH,

Jerry


On Tue, Sep 4, 2012 at 12:41 PM, Michael Segel <[email protected]> wrote:

I think the issue is that you are misinterpreting what you are seeing and
what Doug was trying to tell you...

The short, simple answer is that you're getting one split per region. Each
split is assigned to a specific mapper task, and that task will sequentially
walk through its region, finding the rows that match your scan request.

There is no lock or blocking.
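
To make the point above concrete, here is a minimal, HBase-free sketch of the scheduling behaviour. It assumes one simulated map task per region and models the cluster's free map slots as a thread pool. The class name SplitSimulation, the method peakConcurrency, the region names, and the 50 ms sleep are all hypothetical and stand in for real scans. It is an illustration of why tasks can appear to run one after another, not a description of how the jobtracker is implemented.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

public class SplitSimulation {

    /**
     * Runs one simulated map task per region on a pool of `mapSlots` threads
     * and returns the highest number of tasks seen running at the same time.
     */
    public static int peakConcurrency(List<String> regions, int mapSlots) throws Exception {
        ExecutorService slots = Executors.newFixedThreadPool(mapSlots);
        AtomicInteger running = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        List<Future<?>> tasks = new ArrayList<>();
        try {
            // One split (and therefore one task) per region.
            for (String region : regions) {
                tasks.add(slots.submit(() -> {
                    int now = running.incrementAndGet();
                    peak.accumulateAndGet(now, Math::max);
                    try {
                        Thread.sleep(50); // stands in for scanning this region's rows
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                    running.decrementAndGet();
                }));
            }
            // Wait for every task to finish before reading the peak.
            for (Future<?> task : tasks) {
                task.get();
            }
        } finally {
            slots.shutdown();
        }
        return peak.get();
    }

    public static void main(String[] args) throws Exception {
        List<String> regions = Arrays.asList("region-a", "region-b", "region-c", "region-d");
        System.out.println("1 slot:  peak = " + peakConcurrency(regions, 1));
        System.out.println("4 slots: peak = " + peakConcurrency(regions, 4));
    }
}
```

With a single slot the tasks can only run back to back, even though nothing locks the data. With more slots the same tasks are free to overlap.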

I think you really should actually read Lars George's book on HBase to get
a better understanding.

HTH

-Mike

On Sep 4, 2012, at 11:29 AM, Ioakim Perros <[email protected]> wrote:

Thank you very much for your response and for the excellent reference.

The thing is that I am running jobs in a distributed environment and,
beyond the TableMapReduceUtil settings, I have just set the scan's caching
to the number of rows I expect to retrieve in each map task, and the scan's
block caching to false (just as indicated in the MapReduce examples on
HBase's homepage).

I am not aware of such a job configuration (requesting the jobtracker to
execute more than one map task concurrently). Any other ideas?

Thank you again and regards,
ioakim


On 09/04/2012 06:59 PM, Jerry Lam wrote:

Hi Ioakim:

Sorry, your hypothesis doesn't make sense. I would suggest you read
"Learning HBase Internals" by Lars Hofhansl at
http://www.slideshare.net/cloudera/3-learning-h-base-internals-lars-hofhansl-salesforce-final
to understand how HBase locking works.

Regarding the issue you are facing, are you sure you configured the job
properly (i.e. requesting the jobtracker to have more than 1 mapper
execute)? If you are testing on a single machine, you probably need to
configure the number of tasktrackers per node as well to see more than 1
mapper execute on a single machine.
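
As a hedged illustration of the kind of setting meant here: in Hadoop 1.x the number of map tasks a single tasktracker runs at once is controlled by the mapred.tasktracker.map.tasks.maximum property in mapred-site.xml. The value 4 below is only an example, not a recommendation, and the tasktracker has to be restarted for a change to take effect.

```xml
<!-- mapred-site.xml on each tasktracker node (Hadoop 1.x) -->
<property>
  <!-- Maximum number of map tasks this tasktracker runs simultaneously -->
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <!-- Example value only; size it to the node's cores and memory -->
  <value>4</value>
</property>
```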

my $0.02

Jerry

On Tue, Sep 4, 2012 at 11:17 AM, Ioakim Perros <[email protected]> wrote:

Hello,

I would be grateful if someone could shed some light on the following:

Each M/R map task is reading data from a separate region of a table.
From the jobtracker's GUI, in the map completion graph, I notice that
although the mappers read different data, they read it sequentially,
as if the table had a lock that permits only one mapper at a time to
read data from any region.

Does this "lock" hypothesis make sense? Is there any way I could avoid
this useless delay?

Thanks in advance and regards,
Ioakim
