Hi, I decided not to do any tuning, because my whole project is about experimenting with HBase (it's a school project). However, it turned out that my sample data generated lots of rowkey collisions: 4 million inserts resulted in only about 5000 rows. The data differed in the columns, though. When I changed my sample dataset to have no rowkey collisions, performance increased by roughly a factor of 10. Why is that?
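
To make the collision effect concrete, here is a minimal sketch in plain Python. It is an in-memory model, not the HBase client API: the table is assumed to behave like a map from rowkey to a map of columns, so a put to an existing rowkey adds cells to that row instead of creating a new row. The helper names (`apply_puts`, `collision_report`, `make_puts`) and the `cf:` column names are made up for illustration, and the sizes are scaled down to keep the same ratio of about 800 puts per row as in my data (4 million inserts, about 5000 rows). It only measures how many rows a dataset produces; it does not explain the throughput difference.

```python
import random
from collections import defaultdict


def apply_puts(puts):
    """Model a table as {rowkey: {column: value}}.

    A put to a rowkey that already exists adds (or overwrites) a cell
    in that row; it never creates a second row with the same key.
    """
    table = defaultdict(dict)
    for rowkey, column, value in puts:
        table[rowkey][column] = value
    return table


def collision_report(puts):
    """Summarise how many distinct rows a list of puts collapses into."""
    table = apply_puts(puts)
    n_puts = len(puts)
    n_rows = len(table)
    return {
        "puts": n_puts,
        "rows": n_rows,
        "puts_per_row": n_puts / n_rows if n_rows else 0.0,
    }


def make_puts(n_puts, key_space, seed=0):
    """Generate puts whose rowkeys are drawn from a small key space.

    Each put uses a distinct (hypothetical) column name, so the data
    differ in the columns even when the rowkeys collide.
    """
    rng = random.Random(seed)
    return [
        (f"row-{rng.randrange(key_space):08d}", f"cf:c{i}", i)
        for i in range(n_puts)
    ]


# Colliding dataset: many puts, few possible rowkeys.
colliding = collision_report(make_puts(40_000, key_space=50))

# Collision-free dataset: one put per rowkey.
unique = collision_report(
    [(f"row-{i:08d}", "cf:c0", i) for i in range(40_000)]
)

print("colliding:", colliding)
print("unique:   ", unique)
```

Running a report like this on the generated sample before loading it shows whether the number of rows will match the number of inserts.
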
Thanks,
Pal

On Thu, May 9, 2013 at 2:32 PM, Michel Segel <[email protected]> wrote:

> What I am saying is that by default, you get two mappers per node.
> x4large can run HBase w more mapred slots, so you will want to tune the
> defaults based on machine size. Not just mapred, but also HBase stuff too.
> You need to do this on startup of EMR cluster though...
>
> Sent from a remote device. Please excuse any typos...
>
> Mike Segel
>
> On May 9, 2013, at 2:39 AM, Pal Konyves <[email protected]> wrote:
>
> > Principally I chose to use Amazon, because they are supposedly high
> > performance, and what more important is: HBase is already set up if I chose
> > it as an EMR Workflow. I wanted to save up the time setting up the cluster
> > manually on EC2 instances.
> >
> > Are you saying I will reach higher performance when I set up the HBase on
> > the cluster manually, instead of the default Amazon HBase distribution? Or
> > is it worth to tune the Amazon distribution with a bootstrap action? How
> > long does it take, to set up the cluster with HDFS manually?
> >
> > I will also try larger instance types.
> >
> >
> > On Thu, May 9, 2013 at 6:47 AM, Michel Segel <[email protected]> wrote:
> >
> >> With respect to EMR, you can run HBase fairly easily.
> >> You can't run MapR w HBase on EMR stick w Amazon's release.
> >>
> >> And you can run it but you will want to know your tuning parameters up
> >> front when you instantiate it.
> >>
> >> Sent from a remote device. Please excuse any typos...
> >>
> >> Mike Segel
> >>
> >> On May 8, 2013, at 9:04 PM, Andrew Purtell <[email protected]> wrote:
> >>
> >>> M7 is not Apache HBase, or any HBase. It is a proprietary NoSQL datastore
> >>> with (I gather) an Apache HBase compatible Java API.
> >>>
> >>> As for running HBase on EC2, we recently discussed some particulars, see
> >>> the latter part of this thread: http://search-hadoop.com/m/rI1HpK90gu where
> >>> I hijack it. I wouldn't recommend launching HBase as part of an EMR flow
> >>> unless you want to use it only for temporary random access storage, and in
> >>> which case use m2.2xlarge/m2.4xlarge instance types. Otherwise, set up a
> >>> dedicated HBase backed storage service on high I/O instance types. The
> >>> fundamental issue is IO performance on the EC2 platform is fair to poor.
> >>>
> >>> I have also noticed a large difference in baseline block device latency if
> >>> using an old Amazon Linux AMI (< 2013) or the latest AMIs from this year.
> >>> Use the new ones, they cut the latency long tail in half. There were some
> >>> significant kernel level improvements I gather.
> >>>
> >>>
> >>> On Wed, May 8, 2013 at 10:42 AM, Marcos Luis Ortiz Valmaseda
> >>> <[email protected]> wrote:
> >>>
> >>>> I think that you when you are talking about RMap, you are referring to
> >>>> MapR´s distribution.
> >>>> I think that MapR´s team released a very good version of its Hadoop
> >>>> distribution focused on HBase called M7. You can see its overview here:
> >>>> http://www.mapr.com/products/mapr-editions/m7-edition
> >>>>
> >>>> But this release was under beta testing, and I see that it´s not included
> >>>> in the Amazon Marketplace yet:
> >>>> https://aws.amazon.com/marketplace/seller-profile?id=802b0a25-877e-4b57-9007-a3fd284815a5
> >>>>
> >>>>
> >>>> 2013/5/7 Pal Konyves <[email protected]>
> >>>>
> >>>>> Hi,
> >>>>>
> >>>>> Has anyone got some recommendations about running HBase on EC2? I am
> >>>>> testing it, and so far I am very disappointed with it. I did not change
> >>>>> anything about the default 'Amazon distribution' installation. It has one
> >>>>> MasterNode and two slave nodes, and write performance is around 2500 small
> >>>>> rows per sec at most, but I expected it to be way better. Oh, and this is
> >>>>> with batch put operations with autocommit turned off, where each batch
> >>>>> containes about 500-1000 rows... When I do it with autocommit, it does not
> >>>>> even reach the 1000 rows per sec.
> >>>>>
> >>>>> Every nodes were m1.Large ones.
> >>>>>
> >>>>> Any experiences, suggestions? Is it worth to try the RMap distribution
> >>>>> instead of the amazon one?
> >>>>>
> >>>>> Thanks,
> >>>>> Pal
> >>>>
> >>>>
> >>>> --
> >>>> Marcos Ortiz Valmaseda
> >>>> Product Manager at PDVSA
> >>>> http://about.me/marcosortiz
> >>>
> >>>
> >>> --
> >>> Best regards,
> >>>
> >>> - Andy
> >>>
> >>> Problems worthy of attack prove their worth by hitting back. - Piet Hein
> >>> (via Tom White)
