Re: Read/Write Performance

Wayne Sun, 02 Jan 2011 18:05:25 -0800

LZO did not seem to work well with the 1GB region size. It was causing
pauses of several minutes, followed by 5 seconds of requests being processed
and then more pauses of 30+ seconds. (Is this GC, compaction, or splits? All
regions go to 0 requests except the Meta region, which has a few hundred
requests.) Once we went back to the default region size, the pauses seemed to
go away. I still see all nodes going to zero requests except the Meta table
region, but it only lasts for a few seconds at most. We also had the max
open files problem (which has now been fixed), so I don't know whether that
could have caused the pauses.
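
For anyone who wants to rule out the open files limit on their own nodes,
here is a minimal sketch using Python's standard resource module. The 32768
threshold is only an assumed value for illustration (it is not a number from
this thread), and the helper names are hypothetical.

```python
import resource

# Hypothetical threshold for illustration only; not a value from this thread.
ASSUMED_MIN_NOFILE = 32768


def nofile_limits():
    """Return the (soft, hard) open-file limits of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def is_limit_low(soft, threshold=ASSUMED_MIN_NOFILE):
    """Return True when the soft limit is finite and below the threshold."""
    if soft == resource.RLIM_INFINITY:
        return False
    return soft < threshold


if __name__ == "__main__":
    soft, hard = nofile_limits()
    state = "low" if is_limit_low(soft) else "ok"
    print(f"open files: soft={soft} hard={hard} ({state})")
```

The limits reported are those of the process running the script, so it only
says something about the daemons if it is run as the same user and in the
same environment that starts them.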


I have been loading for several days, and since fixing the max open files
problem everything has been smooth. We are up to 2+ TB compressed and almost 5,000
regions on 10 nodes. Lzop sure helps pack a lot of data into a small space.
The writes seem to hover in the 7.5-8k/node/sec range. This week we will
test the reads. So far so good.
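
As a rough back-of-the-envelope check on those numbers, here is a small
sketch. It treats "2+ TB" as exactly 2 TB and "almost 5,000" as 5,000, which
are simplifying assumptions, and the function names are made up for
illustration.

```python
NODES = 10
WRITES_PER_NODE = (7500, 8000)  # observed range, writes/sec/node
REGIONS = 5000                  # "almost 5,000", rounded (assumption)
DATA_TB = 2.0                   # "2+ TB compressed", rounded down (assumption)


def aggregate_writes(per_node_range, nodes):
    """Scale a (low, high) per-node write rate to the whole cluster."""
    low, high = per_node_range
    return (low * nodes, high * nodes)


def regions_per_node(regions, nodes):
    """Average number of regions carried by each node."""
    return regions / nodes


def avg_region_mb(total_tb, regions):
    """Average compressed data per region in MB (1 TB = 1024 * 1024 MB)."""
    return total_tb * 1024 * 1024 / regions


if __name__ == "__main__":
    low, high = aggregate_writes(WRITES_PER_NODE, NODES)
    print(f"cluster writes/sec: {low}-{high}")
    print(f"regions per node: {regions_per_node(REGIONS, NODES):.0f}")
    print(f"avg MB per region: {avg_region_mb(DATA_TB, REGIONS):.1f}")
```

With these inputs the cluster-wide rate comes out at 75,000-80,000
writes/sec and about 500 regions per node. The per-region average is only
indicative, since it depends on what the 2 TB figure actually includes.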

Thanks.


On Sun, Jan 2, 2011 at 8:11 PM, Sean Bigdatafun
<[email protected]> wrote:

> Has this cured the GC pause at all? I do not see why turning on LZO is
> relevant at all (I read your email, and it sounds like you saw pauses after
> LZO was turned on).
>
> BTW, are you using CMS on an 8GB heap JVM and experiencing a 4 minute
> pause? That sounds like a lot.
>
> On Thu, Dec 30, 2010 at 1:51 PM, Wayne <[email protected]> wrote:
>
> > Lesson learned...restart thrift servers *after* restarting hadoop+hbase.
> >
> > On Thu, Dec 30, 2010 at 3:39 PM, Wayne <[email protected]> wrote:
> >
> > > We have restarted with lzop compression, and now I am seeing some
> > > really long and frequent stop-the-world pauses of the entire cluster.
> > > The load requests for all regions go to zero except for the meta
> > > table region. No data batches are getting in (no loads are occurring)
> > > and everything seems frozen. It seems to last for 5+ seconds. Is this
> > > GC on the master or GC in the meta region? What could cause everything
> > > to stop for several seconds? It appears to happen on a recurring basis
> > > as well. I think we saw it before switching to lzo, but it seems much
> > > worse now (lasts longer and occurs more frequently).
> > >
> > > Thanks.
> > >
> > >
> > >
> > > On Thu, Dec 30, 2010 at 12:20 PM, Wayne <[email protected]> wrote:
> > >
> > >> HBase Version 0.89.20100924, r1001068 w/ 8GB heap
> > >>
> > >> I plan to run for 1 week straight maxed out. I am worried about GC
> > >> pauses, especially concurrent mode failures (does hbase/hadoop suffer
> > >> these under extended load?). What should I be looking for in the gc
> > >> log in terms of problem signs? The ParNews are quick but the CMS
> > >> concurrent marks are taking as much as 4 mins with an average of
> > >> 20-30 secs.
> > >>
> > >> Thanks.
> > >>
> > >>
> > >>
> > >> On Thu, Dec 30, 2010 at 12:00 PM, Stack <[email protected]> wrote:
> > >>
> > >>> Oh, what versions are you using?
> > >>> St.Ack
> > >>>
> > >>> On Thu, Dec 30, 2010 at 9:00 AM, Stack <[email protected]> wrote:
> > >>> > Keep going. Let it run longer.  Get the servers as loaded as you
> > >>> > think they'll be in production.  Make sure the perf numbers are
> > >>> > not because the cluster is 'fresh'.
> > >>> > St.Ack
> > >>> >
> > >>> > On Thu, Dec 30, 2010 at 5:51 AM, Wayne <[email protected]> wrote:
> > >>> >> We finally got our cluster up and running and write performance
> > >>> >> looks very good. We are getting sustained 8-10k writes/sec/node on
> > >>> >> a 10 node cluster from Python through thrift. These are values
> > >>> >> written to 3 CFs so actual hbase performance is 25-30k
> > >>> >> writes/sec/node. The nodes are currently disk i/o bound (40-50%
> > >>> >> utilization) but hopefully once we get lzop working this will go
> > >>> >> down. We have been running for 12 hours without a problem. We hope
> > >>> >> to get lzop going today and then load all through the long weekend.
> > >>> >>
> > >>> >> We plan to then test reads next week after we get some data in
> > >>> >> there. Looks good so far! Below are our settings in case there are
> > >>> >> some suggestions/concerns.
> > >>> >>
> > >>> >> Thanks for everyone's help. It is pretty exciting to get
> > >>> >> performance like this from the start.
> > >>> >>
> > >>> >>
> > >>> >> *Global*
> > >>> >>
> > >>> >> client.write.buffer = 10485760 (10MB = 5x default)
> > >>> >>
> > >>> >> optionalLogFlushInterval = 10000 (10 secs = 10x default)
> > >>> >>
> > >>> >> memstore.flush.size = 268435456 (256MB = 4x default)
> > >>> >>
> > >>> >> hregion.max.filesize = 1073741824 (1GB = 4x default)
> > >>> >>
> > >>> >> *Table*
> > >>> >>
> > >>> >> alter 'xxx', METHOD => 'table_att', DEFERRED_LOG_FLUSH => true
> > >>> >>
> > >>> >>
> > >>> >>
> > >>> >>
> > >>> >>
> > >>> >> On Wed, Dec 29, 2010 at 12:55 AM, Stack <[email protected]> wrote:
> > >>> >>
> > >>> >>> On Mon, Dec 27, 2010 at 11:47 AM, Wayne <[email protected]> wrote:
> > >>> >>> > All data is written to 3 CFs. Basically 2 of the CFs are
> > >>> >>> > secondary indexes (manually managed as normal CFs). It sounds
> > >>> >>> > like we should try hard to get as much out of thrift as we can
> > >>> >>> > before going to a lower level.
> > >>> >>>
> > >>> >>> Yes.
> > >>> >>>
> > >>> >>> > Writes need to be "fast enough", but reads are more important
> > >>> >>> > in the end (and are the reason we are switching from a
> > >>> >>> > different solution). The numbers you quoted below sound like
> > >>> >>> > they are in the ballpark of what we are looking to do.
> > >>> >>> >
> > >>> >>>
> > >>> >>> Even the tens per second that I threw in there to CMA?
> > >>> >>>
> > >>> >>> > Much of our data is cold, and we expect reads to be disk i/o
> > >>> >>> > based.
> > >>> >>>
> > >>> >>> OK.  FYI, we're not the best at this -- cache-miss cold reads --
> > >>> >>> what w/ a network hop in the way and currently we'll open a
> > >>> >>> socket per access.
> > >>> >>>
> > >>> >>> > Given this, is an 8GB heap a good place to start on the data
> > >>> >>> > nodes (24GB ram)? Is the block cache managed on its own
> > >>> >>> > (meaning it won't blow up causing OOM),
> > >>> >>>
> > >>> >>> It won't.  It's constrained.  Does our home-brewed sizeof.  By
> > >>> >>> default, it's 0.2 of total heap.  If you think cache will help,
> > >>> >>> you could go up from there: 0.4 or 0.5 of heap.
> > >>> >>>
> > >>> >>> > and if we do not use it (block cache) should we go even lower
> > >>> >>> > for the heap (we want to avoid CMF and long GC pauses)?
> > >>> >>>
> > >>> >>> If you are going to be doing cache-miss most of the time and cold
> > >>> >>> reads, then yes, you can do away with cache.
> > >>> >>>
> > >>> >>> In testing of 0.90.x I've been running w/ 1MB heaps with 1k
> > >>> >>> regions but this is me trying to break stuff.
> > >>> >>>
> > >>> >>> > Are there any timeouts we need to tweak to make the cluster
> > >>> >>> > more "accepting" of long GC pauses while under sustained load
> > >>> >>> > (7+ days of 10k inserts/sec/node)?
> > >>> >>> >
> > >>> >>>
> > >>> >>> If the zookeeper client times out, the regionserver will shut
> > >>> >>> itself down.  In 0.90.0RC2, the client session timeout is set
> > >>> >>> high -- 3 minutes.  If you time out on that, then that's pretty
> > >>> >>> extreme... something badly wrong I'd say.  Here are a few notes
> > >>> >>> on the config and others that you might want to twiddle (see
> > >>> >>> previous section on required configs... make sure you've got
> > >>> >>> those too):
> > >>> >>>
> > >>> >>>
> > >>> >>> http://people.apache.org/~stack/hbase-0.90.0-candidate-2/docs/important_configurations.html#recommended_configurations
> > >>> >>>
> > >>> >>>
> > >>> >>> > Does LZO compression speed up reads/writes where there is
> > >>> >>> > excess CPU to do the compression? I assume it would lower disk
> > >>> >>> > i/o but increase CPU a lot. Is data compressed on the initial
> > >>> >>> > write or only after compaction?
> > >>> >>> >
> > >>> >>>
> > >>> >>> LZO is pretty frictionless -- i.e. little CPU cost -- and yes,
> > >>> >>> usually helps speed things up (grab more in the one go).  What
> > >>> >>> size are your records?  You might want to mess w/ hfile block
> > >>> >>> sizes though the 64k default is usually good enough for all but
> > >>> >>> very small cell sizes.
> > >>> >>>
> > >>> >>>
> > >>> >>> > With the replication in the HDFS layer how are reads managed
> > >>> >>> > in terms of load balancing across region servers? Does HDFS
> > >>> >>> > know to spread multiple requests across the 3 region servers
> > >>> >>> > that contain the same data?
> > >>> >>>
> > >>> >>> You only read from one of the replicas, always the 'closest'.  If
> > >>> >>> the DFSClient has trouble getting the first of the replicas, it
> > >>> >>> moves on to the second, etc.
> > >>> >>>
> > >>> >>>
> > >>> >>> > For example with 10 data nodes if we have 50 concurrent readers
> > >>> >>> > with very "random" key requests we would expect to have 5 reads
> > >>> >>> > occurring on each data node at the same time. We plan to have a
> > >>> >>> > thrift server on each data node, so 5 concurrent readers will
> > >>> >>> > be connected to each thrift server at any given time (50 in
> > >>> >>> > aggregate across 10 nodes). We want to be sure everything is
> > >>> >>> > designed to evenly spread this load to avoid any possible
> > >>> >>> > hot-spots.
> > >>> >>> >
> > >>> >>>
> > >>> >>> This is different.  This is key design.  A thrift server will be
> > >>> >>> doing some subset of the key space.  If the requests are evenly
> > >>> >>> distributed over all of the key space, then you should be fine;
> > >>> >>> all thrift servers will be evenly loaded.  If not, then there
> > >>> >>> could be hot spots.
> > >>> >>>
> > >>> >>> We have a balancer that currently only counts regions per server,
> > >>> >>> not regions per server plus hits per region so it could be the
> > >>> >>> case that a server by chance ends up carrying all of the hot
> > >>> >>> regions.  HBase itself is not too smart dealing with this.  In
> > >>> >>> 0.90.0, there is a facility for manually moving regions -- i.e.
> > >>> >>> closing in current location and moving the region off to another
> > >>> >>> server w/ some outage while the move is happening (usually
> > >>> >>> seconds) -- or you could split the hot region manually and then
> > >>> >>> the daughters could be moved off to other servers... Primitive
> > >>> >>> for now but should be better in next HBase versions.
> > >>> >>>
> > >>> >>> Have you been able to test w/ your data and your query pattern?
> > >>> >>> That'll tell you way more than I ever could.
> > >>> >>>
> > >>> >>> Good luck,
> > >>> >>> St.Ack
> > >>> >>>
> > >>> >>>
> > >>> >>> >
> > >>> >>> >
> > >>> >>> > On Mon, Dec 27, 2010 at 1:49 PM, Stack <[email protected]> wrote:
> > >>> >>> >
> > >>> >>> >> On Fri, Dec 24, 2010 at 5:09 AM, Wayne <[email protected]> wrote:
> > >>> >>> >> > We are in the process of evaluating hbase in an effort to
> > >>> >>> >> > switch from a different nosql solution. Performance is of
> > >>> >>> >> > course an important part of our evaluation. We are a python
> > >>> >>> >> > shop and we are very worried that we can not get any real
> > >>> >>> >> > performance out of hbase using thrift (and must drop down to
> > >>> >>> >> > java). We are aware of the various lower level options for
> > >>> >>> >> > bulk insert or java based inserts with turning off WAL etc.
> > >>> >>> >> > but none of these are available to us in python so are not
> > >>> >>> >> > part of our evaluation.
> > >>> >>> >>
> > >>> >>> >> I can understand python for continuous updates from your
> > >>> >>> >> frontend or whatever but you might consider hacking up a bit of
> > >>> >>> >> java to make use of the bulk updater; you'll get upload rates
> > >>> >>> >> orders of magnitude beyond what you'd achieve going via the API
> > >>> >>> >> via python (or java for that matter).  You can also do
> > >>> >>> >> incremental updates using the bulk loader.
> > >>> >>> >>
> > >>> >>> >>
> > >>> >>> >> > We have a 10 node cluster (24gb, 6 x 1TB, 16 core) that we
> > >>> >>> >> > are setting up as data/region nodes, and we are looking for
> > >>> >>> >> > suggestions on configuration as well as benchmarks in terms
> > >>> >>> >> > of expectations of performance. Below are some specific
> > >>> >>> >> > questions. I realize there are a million factors that help
> > >>> >>> >> > determine specific performance numbers, so any examples of
> > >>> >>> >> > performance from running clusters would be great as examples
> > >>> >>> >> > of what can be done.
> > >>> >>> >>
> > >>> >>> >> Yeah, you have been around the block obviously. It's hard to
> > >>> >>> >> give out 'numbers' since so many different factors are involved.
> > >>> >>> >>
> > >>> >>> >>
> > >>> >>> >> > Again thrift seems to be our "problem" so non java based
> > >>> >>> >> > solutions are preferred (do any non java based shops run
> > >>> >>> >> > large scale hbase clusters?). Our total production cluster
> > >>> >>> >> > size is estimated to be 50TB.
> > >>> >>> >> >
> > >>> >>> >>
> > >>> >>> >> There are some substantial shops running non-java; e.g. the
> > >>> >>> >> yfrog folks go via REST, the mozilla fellas are python over
> > >>> >>> >> thrift, Stumbleupon is php over thrift.
> > >>> >>> >>
> > >>> >>> >> > Our data model is 3 CFs, one primary and 2 secondary indexes.
> > >>> >>> >> > All writes go to all 3 CFs and are grouped as a batch of row
> > >>> >>> >> > mutations which should avoid row locking issues.
> > >>> >>> >> >
> > >>> >>> >>
> > >>> >>> >> A write updates 3CFs and secondary indices?  That's an
> > >>> >>> >> expensive Put relatively.  You have to run w/ 3CFs?  It
> > >>> >>> >> facilitates fast querying?
> > >>> >>> >>
> > >>> >>> >>
> > >>> >>> >> > What heap size is recommended for master, and for region
> > >>> >>> >> > servers (24gb ram)?
> > >>> >>> >>
> > >>> >>> >> Master doesn't take much heap, at least not in the coming
> > >>> >>> >> 0.90.0 HBase (Is that what you intend to run)?
> > >>> >>> >>
> > >>> >>> >> The more RAM you give the regionservers, the more cache your
> > >>> >>> >> cluster will have.
> > >>> >>> >>
> > >>> >>> >> What's important to you: read or write times?
> > >>> >>> >>
> > >>> >>> >>
> > >>> >>> >> > What other settings can/should be tweaked in hbase to
> > >>> >>> >> > optimize performance (we have looked at the wiki page)?
> > >>> >>> >>
> > >>> >>> >> That's a good place to start.  Take a look through this
> > >>> >>> >> mailing list for others (It's time for a trawl of the mailing
> > >>> >>> >> list and then distilling the findings into a reedit of our perf
> > >>> >>> >> page).
> > >>> >>> >>
> > >>> >>> >> > What is a good batch size for writes? We will start with 10k
> > >>> >>> >> > values/batch.
> > >>> >>> >>
> > >>> >>> >> Start small with defaults.  Make sure it's all running smooth
> > >>> >>> >> first.  Then ratchet it up.
> > >>> >>> >>
> > >>> >>> >>
> > >>> >>> >> > How many concurrent writers/readers can a single data node
> > >>> >>> >> > handle with evenly distributed load? Are there settings
> > >>> >>> >> > specific to this?
> > >>> >>> >>
> > >>> >>> >> How many clients are you going to have writing to HBase?
> > >>> >>> >>
> > >>> >>> >>
> > >>> >>> >> > What is "very good" read/write latency for a single put/get
> > >>> >>> >> > in hbase using thrift?
> > >>> >>> >>
> > >>> >>> >> "Very Good" would be < a few milliseconds.
> > >>> >>> >>
> > >>> >>> >>
> > >>> >>> >> > What is "very good" read/write throughput per node in hbase
> > >>> >>> >> > using thrift?
> > >>> >>> >> >
> > >>> >>> >>
> > >>> >>> >> Thousands of ops per second per regionserver (Sorry, can't be
> > >>> >>> >> more specific than that).  If the Puts are multi-family +
> > >>> >>> >> updates on secondary indices, hundreds -- maybe even tens...
> > >>> >>> >> I'm not sure -- rather than thousands.
> > >>> >>> >>
> > >>> >>> >> > We are looking to get performance numbers in the range of 10k
> > >>> >>> >> > aggregate inserts/sec/node and read latency < 30ms/read with
> > >>> >>> >> > 3-4 concurrent readers/node. Can our expectations be met with
> > >>> >>> >> > hbase through thrift? Can they be met with hbase through
> > >>> >>> >> > java?
> > >>> >>> >> >
> > >>> >>> >>
> > >>> >>> >>
> > >>> >>> >> I wouldn't fixate on the thrift hop.  At SU we can do thousands
> > >>> >>> >> of ops a second per node np from PHP frontend over thrift.
> > >>> >>> >>
> > >>> >>> >> 10k inserts a second per node into single CF might be doable.
> > >>> >>> >> If into 3CFs, then you need to recalibrate your expectations
> > >>> >>> >> (I'd say).
> > >>> >>> >>
> > >>> >>> >> > Thanks in advance for any help, examples, or recommendations
> > >>> >>> >> > that you can provide!
> > >>> >>> >> >
> > >>> >>> >> Sorry, the above is light on recommendations (for reasons cited
> > >>> >>> >> by Ryan above -- smile).
> > >>> >>> >> St.Ack
> > >>> >>> >>
> > >>> >>> >
> > >>> >>>
> > >>> >>
> > >>> >
> > >>>
> > >>
> > >>
> > >
> >
>
>
>
> --
> --Sean
>
