Re: Batch Get performance degrades from within Mapreduce

Himanish Kushary Tue, 21 Feb 2012 15:38:37 -0800

Thanks for your response J-D.

*You're using it in a way that doesn't make sense to me as you use a SAN*,


[Himanish] - Is it recommended not to use a SAN with HBase? We are also
planning to move to dedicated SATA hard drives in the production
environment.

*don't put HBase on all nodes, have a tiny cluster... or is it just a
testbed?*

[Himanish] - You are right, it is a testbed, though even in production it
will be a small cluster of around 4-5 nodes.

*In order to do a more proper comparison I think you should run your unit
test on one of the machines and make sure it stores the data in
the SAN. This is one big wildcard here. Also get stats from that SAN.*

[Himanish] - What puzzles me is that the GETs fired from my laptop (unit
testing) have to go over the network, whereas the GETs fired from the
mapper in the M/R job are working on local data. Still, the M/R GETs
perform much worse than the unit test case.
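
One way to make the laptop and mapper numbers comparable is to time each
batch and normalise by the number of GETs in it. The following is only a
minimal sketch in Python, not HBase client code: `time_batch` and
`fire_batch` are hypothetical names, and `fire_batch` stands for whatever
function actually issues the batch GET in a given environment.

```python
import time
from typing import Callable, Dict, List


def time_batch(fire_batch: Callable[[List[str]], list],
               keys: List[str]) -> Dict[str, float]:
    """Time one batch call and report the cost per GET.

    `fire_batch` is a hypothetical callable that takes the row keys of one
    batch and returns one result per key.
    """
    start = time.perf_counter()
    results = fire_batch(keys)
    elapsed = time.perf_counter() - start
    return {
        "gets": len(keys),
        "results": len(results),
        "seconds": elapsed,
        # Avoid dividing by zero for an empty batch.
        "ms_per_get": 1000.0 * elapsed / len(keys) if keys else 0.0,
    }
```

Logging `ms_per_get` from both the unit test and the mapper would show
whether a single GET is slower inside the M/R job, or whether the job is
simply issuing many more GETs.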

Also, to make things a little clearer, this is the pseudocode of what I am
doing:

a) GET all rows from Table 1 into RESULTs
b) For each RESULT create a GET for fetching data from Table 2 and put it
into a GET list
c) Fire the batch GETs
d) Process the RESULTs from the GETs in step (c)

For M/R: Table 1 in step (a) is the input table to the M/R job. Steps (b),
(c) and (d) are executed within the mapper.
For unit testing: the test case takes care of all four steps above.
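
Steps (a) to (d) can be sketched in Python as below. This is an
illustration under stated assumptions, not the actual job: the two tables
are plain in-memory dicts rather than HBase tables, and `batch_get`,
`chunks`, `run`, the `ref` and `value` fields and `chunk_size` are all
hypothetical names. Splitting the GET list into bounded chunks is also an
assumption added for the sketch; the steps above only say that the batch
is fired.

```python
from typing import Dict, Iterator, List, Optional

Row = Dict[str, str]


def chunks(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` keys."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def batch_get(table: Dict[str, Row], keys: List[str]) -> List[Optional[Row]]:
    """Hypothetical stand-in for one batched GET round trip.

    Returns one result per key, in order; None when the row is missing.
    """
    return [table.get(key) for key in keys]


def run(table1: Dict[str, Row], table2: Dict[str, Row],
        chunk_size: int = 1000) -> List[str]:
    # (a) GET all rows from Table 1 into results.
    results1 = list(table1.values())

    # (b) For each result, create a GET (here: just the row key) for Table 2.
    get_keys = [row["ref"] for row in results1]

    # (c) Fire the batch GETs, one bounded chunk at a time (assumption).
    fetched: List[Optional[Row]] = []
    for keys in chunks(get_keys, chunk_size):
        fetched.extend(batch_get(table2, keys))

    # (d) Process the results, skipping rows that were not found.
    return [row["value"] for row in fetched if row is not None]
```

In the M/R case, `table1` would correspond to the rows handed to a mapper
and the rest of `run` to the mapper body; in the unit test, the whole
function runs in one process.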

Thanks
Himanish




On Tue, Feb 21, 2012 at 5:27 PM, Jean-Daniel Cryans <[email protected]> wrote:

> Something that strikes me in your answers is why have you chosen HBase
> for this? You're using it in a way that doesn't make sense to me as
> you use a SAN, don't put HBase on all nodes, have a tiny cluster... or
> is it just a testbed?
>
> In order to do a more proper comparison I think you should run your
> unit test on one of the machines and make sure it stores the data in
> the SAN. This is one big wildcard here. Also get stats from that SAN.
>
> J-D
>
> On Tue, Feb 21, 2012 at 12:31 PM, Himanish Kushary <[email protected]>
> wrote:
> > Extremely sorry for the posts, I was also trying to provide a little bit
> > more information on our environment
> >
> > - You say you have 1 region server and 3 datanodes. Is there an
> > intersection? If not, you miss out on enabling local reads and take a
> > big performance hit although if you didn't enable it for your unit
> > test then it's just something you might want to look at later. : The
> > region server is colocated with one of the datanodes out of the 3.
> >
> > - What's the machine that runs the unit test like? - The unit test is
> > running on my laptop (8 core/8 GB) through Eclipse.
> >
> > - How many disks per datanodes? JBOD SATA or fancier? - The datanode
> > directories are configured to point to SAN drives
> >
> > - Where are the mappers running? One task tracker per datanode? Or is it
> > per regionserver (eg 1)? - Yes, 1 TT per datanode. The server hosting the
> > regionserver also has a TT
> >
> > - You say you have 8 concurrent mappers running... so I don't know if
> > they are all on the same machine or not (see my previous question),
> > but since you have 7 regions it means by default you can only have 7
> > mappers running. Where's the 8th one coming from? - My mapreduce job
> > works off a table which has 8 regions. But from inside the mapper I fire
> > thousands of GETs to a different table which has 7 regions
> >
> > - When the MR job is running, how are the disks performing (via
> > iostat)? Again knowing whether or not the RS is colocated with a DN
> > would help at lot. - iostat on the regionserver during the MR shows
> >
> > Time: 03:27:15 PM
> > avg-cpu:  %user   %nice %system %iowait  %steal   %idle
> >          21.65    0.01    5.08    4.14    0.00   69.11
> >
> > Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
> > sda               3.13         0.02         0.02       2176       1983
> > sda1              0.00         0.00         0.00          8          0
> > sda2              3.12         0.02         0.02       2167       1983
> > sdb              52.03         2.21         0.44     289621      58030
> > dm-0              5.41         0.02         0.02       2167       1983
> > dm-1              0.00         0.00         0.00          0          0
> >
> >
> > - Is the data set the same in the unit test and in the MR test? - The
> > data set for the actual MR job is the same. The data set for the GETs
> > within the mapper is much, much larger than from the MR (120000 vs 2000
> > GETs)
> >
> > -- Thanks
> > Himanish
> >
> >
> >
> > On Tue, Feb 21, 2012 at 2:49 PM, Jean-Daniel Cryans <[email protected]>
> > wrote:
> >
> >> First a side comment: if you send an email to a mailing list like this
> >> one and didn't get any answer within a few hours, sending another one
> >> right away usually won't help. It's just bad etiquette.
> >>
> >> Now I'm reading over the whole thread and things are really not that
> >> clear to me.
> >>
> >> - You say you have 1 region server and 3 datanodes. Is there an
> >> intersection? If not, you miss out on enabling local reads and take a
> >> big performance hit although if you didn't enable it for your unit
> >> test then it's just something you might want to look at later.
> >>
> >> - What's the machine that runs the unit test like?
> >>
> >> - How many disks per datanodes? JBOD SATA or fancier?
> >>
> >> - Where are the mappers running? One task tracker per datanode? Or is
> >> it per regionserver (eg 1)?
> >>
> >> - You say you have 8 concurrent mappers running... so I don't know if
> >> they are all on the same machine or not (see my previous question),
> >> but since you have 7 regions it means by default you can only have 7
> >> mappers running. Where's the 8th one coming from?
> >>
> >> - When the MR job is running, how are the disks performing (via
> >> iostat)? Again knowing whether or not the RS is colocated with a DN
> >> would help at lot.
> >>
> >> - Is the data set the same in the unit test and in the MR test?
> >>
> >> Thx,
> >>
> >> J-D
> >>
> >> On Mon, Feb 20, 2012 at 5:42 PM, Himanish Kushary <[email protected]>
> >> wrote:
> >> > Could somebody help me figure out what the difference is while running
> >> > through map-reduce? Is it just the concurrency that is causing the
> >> > issue? Will increasing the number of region servers help?
> >> >
> >> > BTW, the master is also on the same server as the regionserver. Is it
> >> > just an environment issue, or is there some other configuration that
> >> > may improve the read performance from within the mapper?
> >> >
> >> > Thanks
> >> > Himanish
> >>
> >
> >
> >
> > --
> > Thanks & Regards
> > Himanish
>



-- 
Thanks & Regards
Himanish
