Re: EC2 Elastic MapReduce HBase install recommendations

Pal Konyves Sat, 11 May 2013 19:15:09 -0700

Hi,

I decided not to do any tuning, because my whole project is about
experimenting with HBase (it's a school project). However, it turned out that
my sample data generated lots of rowkey collisions: 4 million inserts
resulted in only about 5000 rows, although the column data differed.
When I changed my sample dataset to have no collisions in the rowkey,
performance increased by about a factor of 10. Why is that?
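
One way to spot this kind of collision before loading is to count the
distinct row keys the sample data produces. The following is only a minimal
Python sketch under stated assumptions: the record fields and the
`make_rowkey` scheme are hypothetical stand-ins, not the key design actually
used in this project.

```python
from collections import Counter


def make_rowkey(record):
    # Hypothetical key scheme (an assumption for illustration only):
    # user id plus the timestamp truncated to the hour.
    return f"{record['user']}-{record['ts'] // 3600}"


def rowkey_stats(records, key_fn=make_rowkey):
    """Return (total inserts, distinct row keys, three most frequent keys)."""
    counts = Counter(key_fn(r) for r in records)
    total = sum(counts.values())
    distinct = len(counts)
    return total, distinct, counts.most_common(3)


if __name__ == "__main__":
    # Synthetic sample: many records, but few distinct (user, hour) pairs.
    sample = [{"user": i % 5, "ts": i} for i in range(1000)]
    total, distinct, hottest = rowkey_stats(sample)
    print(f"{total} inserts map to {distinct} distinct row keys")
    print("most frequent keys:", hottest)
```

If the number of distinct keys is far below the number of inserts, most
writes are landing on the same few rows rather than creating new ones.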


Thanks,
Pal


On Thu, May 9, 2013 at 2:32 PM, Michel Segel <[email protected]> wrote:

> What I am saying is that by default, you get two mappers per node.
> x4large can run HBase with more mapred slots, so you will want to tune the
> defaults based on machine size. Not just mapred, but also HBase stuff too.
> You need to do this on startup of the EMR cluster though...
>
> Sent from a remote device. Please excuse any typos...
>
> Mike Segel
>
> On May 9, 2013, at 2:39 AM, Pal Konyves <[email protected]> wrote:
>
> > Principally I chose to use Amazon because they are supposedly high
> > performance, and, more importantly, HBase is already set up if I choose
> > it as an EMR workflow. I wanted to save the time of setting up the
> > cluster manually on EC2 instances.
> >
> > Are you saying I will reach higher performance when I set up HBase on
> > the cluster manually, instead of using the default Amazon HBase
> > distribution? Or is it worth tuning the Amazon distribution with a
> > bootstrap action? How long does it take to set up the cluster with HDFS
> > manually?
> >
> > I will also try larger instance types.
> >
> >
> > On Thu, May 9, 2013 at 6:47 AM, Michel Segel <[email protected]> wrote:
> >
> >> With respect to EMR, you can run HBase fairly easily.
> >> You can't run MapR with HBase on EMR; stick with Amazon's release.
> >>
> >> And you can run it but you will want to know your tuning parameters up
> >> front when you instantiate it.
> >>
> >>
> >>
> >> Sent from a remote device. Please excuse any typos...
> >>
> >> Mike Segel
> >>
> >> On May 8, 2013, at 9:04 PM, Andrew Purtell <[email protected]> wrote:
> >>
> >>> M7 is not Apache HBase, or any HBase. It is a proprietary NoSQL
> >>> datastore with (I gather) an Apache HBase compatible Java API.
> >>>
> >>> As for running HBase on EC2, we recently discussed some particulars; see
> >>> the latter part of this thread: http://search-hadoop.com/m/rI1HpK90gu
> >>> where I hijack it. I wouldn't recommend launching HBase as part of an
> >>> EMR flow unless you want to use it only for temporary random access
> >>> storage, in which case use m2.2xlarge/m2.4xlarge instance types.
> >>> Otherwise, set up a dedicated HBase backed storage service on high I/O
> >>> instance types. The fundamental issue is that IO performance on the EC2
> >>> platform is fair to poor.
> >>> I have also noticed a large difference in baseline block device latency
> >>> if using an old Amazon Linux AMI (< 2013) or the latest AMIs from this
> >>> year. Use the new ones, they cut the latency long tail in half. There
> >>> were some significant kernel level improvements I gather.
> >>>
> >>>
> >>> On Wed, May 8, 2013 at 10:42 AM, Marcos Luis Ortiz Valmaseda <
> >>> [email protected]> wrote:
> >>>
> >>>> I think that when you are talking about RMap, you are referring to
> >>>> MapR's distribution.
> >>>> I think that MapR's team released a very good version of its Hadoop
> >>>> distribution focused on HBase called M7. You can see its overview here:
> >>>> http://www.mapr.com/products/mapr-editions/m7-edition
> >>>>
> >>>> But this release was under beta testing, and I see that it's not
> >>>> included in the Amazon Marketplace yet:
> >>>> https://aws.amazon.com/marketplace/seller-profile?id=802b0a25-877e-4b57-9007-a3fd284815a5
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> 2013/5/7 Pal Konyves <[email protected]>
> >>>>
> >>>>> Hi,
> >>>>>
> >>>>> Has anyone got some recommendations about running HBase on EC2? I am
> >>>>> testing it, and so far I am very disappointed with it. I did not
> >>>>> change anything about the default 'Amazon distribution' installation.
> >>>>> It has one master node and two slave nodes, and write performance is
> >>>>> around 2500 small rows per sec at most, but I expected it to be way
> >>>>> better. Oh, and this is with batch put operations with autocommit
> >>>>> turned off, where each batch contains about 500-1000 rows... When I do
> >>>>> it with autocommit, it does not even reach 1000 rows per sec.
> >>>>>
> >>>>> All nodes were m1.large instances.
> >>>>>
> >>>>> Any experiences, suggestions? Is it worth trying the RMap distribution
> >>>>> instead of the Amazon one?
> >>>>>
> >>>>> Thanks,
> >>>>> Pal
> >>>>
> >>>>
> >>>>
> >>>> --
> >>>> Marcos Ortiz Valmaseda
> >>>> Product Manager at PDVSA
> >>>> http://about.me/marcosortiz
> >>>
> >>>
> >>>
> >>> --
> >>> Best regards,
> >>>
> >>>  - Andy
> >>>
> >>> Problems worthy of attack prove their worth by hitting back. - Piet Hein
> >>> (via Tom White)
> >>
>
