Re: Reading in parallel from table's regions in MapReduce

Ioakim Perros Tue, 04 Sep 2012 09:30:16 -0700

Thank you very much for your response and for the excellent reference.

The thing is that I am running jobs on a distributed environment and, beyond the TableMapReduceUtil settings, I have just set the scan's caching to the number of rows I expect to retrieve at each map task, and the scan's caching blocks feature to false (just as indicated in the MapReduce examples on HBase's homepage).
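
To make the effect of the caching value concrete, here is a rough sketch in plain Java. It assumes a simplified model in which each scanner call returns at most "caching" rows and ignores the calls that open and close the scanner; the class and method names are made up for illustration and are not part of the HBase API.

```java
public class ScanCachingSketch {

    // Simplified model (an assumption, not HBase code): every scanner call
    // returns at most `caching` rows, so reading `rows` rows needs about
    // ceil(rows / caching) calls.
    static long fetchCalls(long rows, int caching) {
        if (rows < 0 || caching <= 0) {
            throw new IllegalArgumentException("rows must be >= 0 and caching > 0");
        }
        return (rows + caching - 1) / caching; // integer ceiling division
    }

    public static void main(String[] args) {
        long rowsPerMapTask = 10_000; // hypothetical number of rows in one region
        System.out.println(fetchCalls(rowsPerMapTask, 1));      // 10000
        System.out.println(fetchCalls(rowsPerMapTask, 500));    // 20
        System.out.println(fetchCalls(rowsPerMapTask, 10_000)); // 1
    }
}
```

Under this model, setting the caching to the number of rows expected per map task brings the count down to about one call per task, at the cost of holding all those rows in the mapper's memory at once.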

I am not aware of such a job configuration (requesting the jobtracker to execute more than one map task concurrently). Any other ideas?


Thank you again and regards,
ioakim


On 09/04/2012 06:59 PM, Jerry Lam wrote:

Hi Ioakim:

Sorry, your hypothesis doesn't make sense. I would suggest you read
"Learning HBase Internals" by Lars Hofhansl at
http://www.slideshare.net/cloudera/3-learning-h-base-internals-lars-hofhansl-salesforce-final
to understand how HBase locking works.

Regarding the issue you are facing, are you sure you configured the job
properly (i.e. requesting the jobtracker to have more than 1 mapper to
execute)? If you are testing on a single machine, you probably need to
configure the number of tasktrackers per node as well to see more than 1
mapper execute on a single machine.
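
As a rough illustration of why a limit on concurrently running mappers looks
like a "lock", here is a toy model in plain Java. It is not Hadoop code: a
fixed thread pool stands in for the map slots available on a node, each dummy
task stands in for one mapper scanning one region, and all names are made up
for this sketch. In Hadoop 1.x the real per-tasktracker limit is set with the
mapred.tasktracker.map.tasks.maximum property.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class MapSlotSketch {

    // Runs `tasks` dummy "map tasks" on a pool of `slots` threads and returns
    // the highest number of tasks observed running at the same time.
    static int peakConcurrency(int slots, int tasks) {
        ExecutorService pool = Executors.newFixedThreadPool(slots);
        AtomicInteger running = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        for (int i = 0; i < tasks; i++) {
            pool.submit(() -> {
                int now = running.incrementAndGet();
                peak.accumulateAndGet(now, Math::max); // remember the maximum
                try {
                    Thread.sleep(100); // stand-in for scanning one region
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    running.decrementAndGet();
                }
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return peak.get();
    }

    public static void main(String[] args) {
        // One slot: the four tasks run one after another, so the peak is 1.
        System.out.println("1 slot,  4 tasks: peak = " + peakConcurrency(1, 4));
        // Four slots: the tasks can overlap, so the peak is typically 4.
        System.out.println("4 slots, 4 tasks: peak = " + peakConcurrency(4, 4));
    }
}
```

With a single slot the tasks finish strictly one after another even though
they touch different data, which is the same pattern a map completion graph
shows when only one mapper is allowed to run at a time.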

my $0.02

Jerry

On Tue, Sep 4, 2012 at 11:17 AM, Ioakim Perros <[email protected]> wrote:

Hello,

I would be grateful if someone could shed some light on the following:

Each M/R map task is reading data from a separate region of a table.
From the jobtracker's GUI, in the map completion graph, I notice that
although the mappers read different data, they read it sequentially,
as if the table had a lock that lets only one mapper at a time read
from any of its regions.

Does this "lock" hypothesis make sense? Is there any way I could avoid
this useless delay?

Thanks in advance and regards,
Ioakim
