Been doing lots of importing recently. There are two easy ways to get big performance boosts.
The first is HFileOutputFormat. It now works for loading into existing tables. I consistently see 10X+ performance this way versus the API.

If you must use the API, pre-create a bunch of regions for your table. You can avoid splits altogether this way. Splitting, region balancing, etc. all make data unavailable for periods of time. Avoiding this during an import is key to getting rid of all those outliers I'm sure you see.

I assume you have already played with client-side batching and all that? And writes are random?

Also, bump up your flush size if you're going to have so many families. HBASE-2375 will help when it gets finished. Until then, pre-create regions and avoid the churn. Or use bulk load.

JG

> -----Original Message-----
> From: Bradford Stephens [mailto:[email protected]]
> Sent: Wednesday, September 01, 2010 6:58 PM
> To: [email protected]
> Subject: Re: Slow Inserts on EC2 Cluster
>
> On the full data set (10 reducers), speeds are about 100k/minute (WAL
> Disabled). Still much slower than I'd like, but I'll take it over the
> former :)
>
> On Wed, Sep 1, 2010 at 5:59 PM, Ryan Rawson <[email protected]> wrote:
> > Yes exactly, column families have the same performance profile as
> > tables. 12 CF = 12 tables.
> >
> > -ryan
> >
> > On Wed, Sep 1, 2010 at 5:56 PM, Bradford Stephens
> > <[email protected]> wrote:
> >> Good call JD! We've gone from 20k inserts/minute to 200k. Much
> >> better! I still think it's slower than I'd want by about one OOM, but
> >> it's progress.
> >>
> >> Since we're populating 12 families, I guess we're seeking for 12 files
> >> on each write. Not pretty. I'll look at the customer and see if they
> >> really have any sparse data that would benefit from its own
> >> ColumnFamily. Probably not.
> >>
> >> Cheers,
> >> B
> >>
> >> On Wed, Sep 1, 2010 at 5:37 PM, Bradford Stephens
> >> <[email protected]> wrote:
> >>> Yeah, those families are all needed -- but I didn't realize the files
> >>> were so small.
> >>> That's odd -- and you're right, that'd certainly throw
> >>> it off. I'll merge them all and see if that helps.
> >>>
> >>> On Wed, Sep 1, 2010 at 5:24 PM, Jean-Daniel Cryans <[email protected]> wrote:
> >>>> Took a quick look at your RS log, it looks like you are using a lot of
> >>>> families and loading them pretty much at the same rate. Look at lines
> >>>> that start with:
> >>>>
> >>>> INFO org.apache.hadoop.hbase.regionserver.Store: Added ...
> >>>>
> >>>> And you will see that you are dumping very small files on the
> >>>> filesystem, on average 5MB, that together account for ~64MB which is
> >>>> the default flush size (and then it generates tons of compactions
> >>>> which makes it even worse). Do you really need all those families? Try
> >>>> merging them and see the difference.
> >>>>
> >>>> J-D
> >>>>
> >>>> On Wed, Sep 1, 2010 at 5:03 PM, Bradford Stephens
> >>>> <[email protected]> wrote:
> >>>>> 'allo,
> >>>>>
> >>>>> I changed the cluster from m1.large to c1.xlarge -- we're getting
> >>>>> about 4k inserts /node / minute instead of 2k. A small improvement,
> >>>>> but nowhere near what I'm used to, even from vague memories of old
> >>>>> clusters on EC2.
> >>>>>
> >>>>> I also stripped all the Cascading from my code and have a very basic
> >>>>> raw MR job -- we're basically reading raw text, splitting it into
> >>>>> fields, and adding those rows to HBase. About the simplest task you
> >>>>> could do.
> >>>>>
> >>>>> Ideas for next steps? What other info could I share?
> >>>>>
> >>>>> Cheers,
> >>>>> B
> >>>>>
> >>>>> On Wed, Sep 1, 2010 at 10:55 AM, Andrew Purtell <[email protected]> wrote:
> >>>>>>> From: Gary Helmling
> >>>>>>>
> >>>>>>> If you're using AMIs based on the latest Ubuntu (10.4),
> >>>>>>> there's a known kernel issue that seems to be causing
> >>>>>>> high loads while idle.
> >>>>>>> More info here:
> >>>>>>>
> >>>>>>> https://bugs.launchpad.net/ubuntu/+source/linux-ec2/+bug/574910
> >>>>>>
> >>>>>> Seems best to avoid using Lucid on EC2 for now, then.
> >>>>>>
> >>>>>> FYI, the EC2 scripts that I use build AMIs based on Amazon's old
> >>>>>> FC8 AMI (with updates). See http://github.com/apurtell/hbase-ec2
> >>>>>>
> >>>>>> - Andy
> >>>>>
> >>>>> --
> >>>>> Bradford Stephens,
> >>>>> Founder, Drawn to Scale
> >>>>> drawntoscalehq.com
> >>>>> 727.697.7528
> >>>>>
> >>>>> http://www.drawntoscalehq.com -- The intuitive, cloud-scale data
> >>>>> solution. Process, store, query, search, and serve all your data.
> >>>>>
> >>>>> http://www.roadtofailure.com -- The Fringes of Scalability, Social
> >>>>> Media, and Computer Science
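
For the "pre-create a bunch of regions" advice at the top of this message, here is a rough sketch of how the split points could be computed. It assumes row keys are spread roughly uniformly over the byte space (the "writes are random" case) and that a 4-byte prefix is enough to tell regions apart. The class and method names (RegionSplitKeys, uniformSplits, toHex) are made up for illustration and are not part of HBase.

```java
// Hypothetical helper: computes evenly spaced split keys for pre-creating regions.
// Assumes row keys are uniformly distributed over the first 4 bytes.
final class RegionSplitKeys {

    // Returns numRegions - 1 boundaries that divide the 4-byte key space evenly.
    static byte[][] uniformSplits(int numRegions) {
        if (numRegions < 2) {
            throw new IllegalArgumentException("need at least 2 regions");
        }
        long keySpace = 1L << 32; // number of distinct 4-byte prefixes
        byte[][] splits = new byte[numRegions - 1][];
        for (int i = 1; i < numRegions; i++) {
            long boundary = keySpace * i / numRegions;
            // Big-endian encoding so byte-wise ordering matches numeric ordering.
            splits[i - 1] = new byte[] {
                (byte) (boundary >>> 24),
                (byte) (boundary >>> 16),
                (byte) (boundary >>> 8),
                (byte) boundary
            };
        }
        return splits;
    }

    // Renders a key as lowercase hex for inspection.
    static String toHex(byte[] key) {
        StringBuilder sb = new StringBuilder();
        for (byte b : key) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        for (byte[] key : uniformSplits(4)) {
            System.out.println(toHex(key));
        }
    }
}
```

With 4 regions this gives the split keys 40000000, 80000000 and c0000000 (hex). In a real import, an array like this would be handed to the table-creation call that accepts split keys (HBaseAdmin.createTable(HTableDescriptor, byte[][]), as far as I recall; check it against your HBase version). If your keys are not uniformly distributed, sample them and pick boundaries from the sample instead.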
