Re: Read/Write Performance

Michael Russo Thu, 30 Dec 2010 06:57:57 -0800

That's great to hear, thanks for sharing your results and configuration.

How many concurrent writer processes are you running?


Thanks,
Michael

On Thu, Dec 30, 2010 at 8:51 AM, Wayne <[email protected]> wrote:

> We finally got our cluster up and running and write performance looks very
> good. We are getting sustained 8-10k writes/sec/node on a 10 node cluster
> from Python through thrift. These are values written to 3 CFs so actual
> hbase performance is 25-30k writes/sec/node. The nodes are currently disk
> i/o bound (40-50% utilization) but hopefully once we get lzop working this
> will go down. We have been running for 12 hours without a problem. We hope
> to get lzop going today and then load all through the long weekend.
>
> We plan to then test reads next week after we get some data in there. Looks
> good so far! Below are our settings in case there are some
> suggestions/concerns.
>
> Thanks for everyone's help. It is pretty exciting to get performance like
> this from the start.
>
>
> *Global*
>
> client.write.buffer = 10485760 (10MB = 5x default)
>
> optionalLogFlushInterval = 10000 (10 secs = 10x default)
>
> memstore.flush.size = 268435456 (256MB = 4x default)
>
> hregion.max.filesize = 1073741824 (1GB = 4x default)
>
> *Table*
>
> alter 'xxx', METHOD => 'table_att', DEFERRED_LOG_FLUSH => true
>
>
>
>
>
> On Wed, Dec 29, 2010 at 12:55 AM, Stack <[email protected]> wrote:
>
> > On Mon, Dec 27, 2010 at 11:47 AM, Wayne <[email protected]> wrote:
> > > All data is written to 3 CFs. Basically 2 of the CFs are secondary
> > indexes
> > > (manually managed as normal CFs). It sounds like we should try hard to
> > get
> > > as much out of thrift as we can before going to a lower level.
> >
> > Yes.
> >
> > Writes need
> > > to be "fast enough", but reads are more important in the end (and are
> the
> > > reason we are switching from a different solution). The numbers you
> > quoted
> > > below sound like they are in the ballpark of what we are looking to do.
> > >
> >
> > Even the tens per second that I threw in there to CMA?
> >
> > > Much of our data is cold, and we expect reads to be disk i/o based.
> >
> > OK.  FYI, we're not the best at this -- cache-miss cold reads -- what
> > w/ a network hop in the way and currently we'll open a socket per
> > access.
> >
> > > Given
> > > this is 8GB heap a good place to start on the data nodes (24GB ram)? Is
> > the
> > > block cache managed on its own (being it won't blow up causing OOM),
> >
> > It won't.  Its constrained.  Does our home-brewed sizeof.  Default,
> > its 0.2 of total heap.  If you think cache will help, you could go up
> > from there.  0.4 or 0.5 of heap.
> >
> > > and if
> > > we do not use it (block cache) should we go even lower for the heap (we
> > want
> > > to avoid CMF and long GC pauses)?
> >
> > If you are going to be doing cache-miss most of the time and cold
> > reads, then yes, you can do away with cache.
> >
> > In testing of 0.90.x I've been running w/ 1MB heaps with 1k regions
> > but this is my trying to break stuff.
> >
> > > Are there any timeouts we need to tweak to
> > > make the cluster more "accepting" of long GC pauses while under
> sustained
> > > load (7+ days of 10k/inserts/sec/node)?
> > >
> >
> > If zookeeper client timesout, the regionserver will shut itself down.
> > In 0.90.0RC2, the client sessionout is set high -- 3 minutes.  If you
> > timeout that, then thats pretty extreme... something badly wrong I'd
> > say.  Heres' a few notes on the config and others that you might want
> > to twiddle (see previous section on required configs... make sure
> > you've got those too):
> >
> >
> http://people.apache.org/~stack/hbase-0.90.0-candidate-2/docs/important_configurations.html#recommended_configurations
> <
> http://people.apache.org/%7Estack/hbase-0.90.0-candidate-2/docs/important_configurations.html#recommended_configurations
> >
> >
> >
> > > Does LZO compression speed up reads/writes where there is excess CPU to
> > do
> > > the compression? I assume it would lower disk i/o but increase CPU a
> lot.
> > Is
> > > data compressed on the initial write or only after compaction?
> > >
> >
> > LZO is pretty frictionless -- i.e. little CPU cost -- and yes, usually
> > helps speed things up (grab more in the one go).  What size are your
> > records?  You might want to mess w/ hfile block sizes though the 64k
> > default is usually good enough for all but very small cell sizes.
> >
> >
> > > With the replication in the HDFS layer how are reads managed in terms
> of
> > > load balancing across region servers? Does HDFS know to spread multiple
> > > requests across the 3 region servers that contain the same data?
> >
> > You only read from one of the replicas, always the 'closest'.  If the
> > DFSClient has trouble getting the first of the replicas, it moves on
> > to the second, etc.
> >
> >
> > > For example
> > > with 10 data nodes if we have 50 concurrent readers with very "random"
> > key
> > > requests we would expect to have 5 reads occurring on each data node at
> > the
> > > same time. We plan to have a thrift server on each data node, so 5
> > > concurrent readers will be connected to each thrift server at any given
> > time
> > > (50 in aggregate across 10 nodes). We want to be sure everything is
> > designed
> > > to evenly spread this load to avoid any possible hot-spots.
> > >
> >
> > This is different.  This is key design.  A thrift server will be doing
> > some subset of the key space.  If the requests are evenly distributed
> > over all of the key space, then you should be fine; all thrift servers
> > will be evenly loaded.  If not, then there could be hot spots.
> >
> > We have a balancer that currently only counts regions per server, not
> > regions per server plus hits per region so it could be the case that a
> > server by chance ends up carrying all of the hot regions.  HBase
> > itself is not too smart dealing with this.  In 0.90.0, there is
> > facility for manually moving regions -- i.e. closing in current
> > location and moving the region off to another server w/ some outage
> > while the move is happening (usually seconds) -- or you could split
> > the hot region manually and then the daughters could be moved off to
> > other servers... Primitive for now but should be better in next HBase
> > versions.
> >
> > Have you been able to test w/ your data and your query pattern?
> > That'll tell you way more than I ever could.
> >
> > Good luck,
> > St.Ack
> >
> >
> > >
> > >
> > > On Mon, Dec 27, 2010 at 1:49 PM, Stack <[email protected]> wrote:
> > >
> > >> On Fri, Dec 24, 2010 at 5:09 AM, Wayne <[email protected]> wrote:
> > >> > We are in the process of evaluating hbase in an effort to switch
> from
> > a
> > >> > different nosql solution. Performance is of course an important part
> > of
> > >> our
> > >> > evaluation. We are a python shop and we are very worried that we can
> > not
> > >> get
> > >> > any real performance out of hbase using thrift (and must drop down
> to
> > >> java).
> > >> > We are aware of the various lower level options for bulk insert or
> > java
> > >> > based inserts with turning off WAL etc. but none of these are
> > available
> > >> to
> > >> > us in python so are not part of our evaluation.
> > >>
> > >> I can understand python for continuous updates from your frontend or
> > >> whatever but you might consider hacking up a bit of java to make us of
> > >> the bulk updater; you'll get upload rates orders of magnitude beyond
> > >> what you'd achieve going via the API via python (or java for that
> > >> matter).  You can also do incremental updates using the bulk loader.
> > >>
> > >>
> > >> We have a 10 node cluster
> > >> > (24gb, 6 x 1TB, 16 core) that we setting up as data/region nodes,
> and
> > we
> > >> are
> > >> > looking for suggestions on configuration as well as benchmarks in
> > terms
> > >> of
> > >> > expectations of performance. Below are some specific questions. I
> > realize
> > >> > there are a million factors that help determine specific performance
> > >> > numbers, so any examples of performance from running clusters would
> be
> > >> great
> > >> > as examples of what can be done.
> > >>
> > >> Yeah, you have been around the block obviously. Its hard to give out
> > >> 'numbers' since so many different factors involved.
> > >>
> > >>
> > >> Again thrift seems to be our "problem" so
> > >> > non java based solutions are preferred (do any non java based shops
> > run
> > >> > large scale hbase clusters?). Our total production cluster size is
> > >> estimated
> > >> > to be 50TB.
> > >> >
> > >>
> > >> There are some substantial shops running non-java; e.g. the yfrog
> > >> folks go via REST, the mozilla fellas are python over thrift,
> > >> Stumbleupon is php over thrift.
> > >>
> > >> > Our data model is 3 CFs, one primary and 2 secondary indexes. All
> > writes
> > >> go
> > >> > to all 3 CFs and are grouped as a batch of row mutations which
> should
> > >> avoid
> > >> > row locking issues.
> > >> >
> > >>
> > >> A write updates 3CFs and secondary indices?  Thats an expensive Put
> > >> relatively.  You have to run w/ 3CFs?  It facilitates fast querying?
> > >>
> > >>
> > >> > What heap size is recommended for master, and for region servers
> (24gb
> > >> ram)?
> > >>
> > >> Master doesn't take much heap, at least not in the coming 0.90.0 HBase
> > >> (Is that what you intend to run)?
> > >>
> > >> The more RAM you give the regionservers, the more cache your cluster
> > will
> > >> have.
> > >>
> > >> Whats important to you read or write times?
> > >>
> > >>
> > >> > What other settings can/should be tweaked in hbase to optimize
> > >> performance
> > >> > (we have looked at the wiki page)?
> > >>
> > >> Thats a good place to start.  Take a look through this mailing list
> > >> for others (Its time for a trawl of mailing list and then distilling
> > >> the findings into a reedit of our perf page).
> > >>
> > >> > What is a good batch size for writes? We will start with 10k
> > >> values/batch.
> > >>
> > >> Start small with defaults.  Make sure its all running smooth first.
> > >> Then rachet it up.
> > >>
> > >>
> > >> > How many concurrent writers/readers can a single data node handle
> with
> > >> > evenly distributed load? Are there settings specific to this?
> > >>
> > >> How many clients you going to have writing HBase?
> > >>
> > >>
> > >> > What is "very good" read/write latency for a single put/get in hbase
> > >> using
> > >> > thrift?
> > >>
> > >> "Very Good" would be < a few milliseconds.
> > >>
> > >>
> > >> > What is "very good" read/write throughput per node in hbase using
> > thrift?
> > >> >
> > >>
> > >> Thousands of ops per second per regionserver (Sorry, can't be more
> > >> specific than that).  If the Puts are multi-family + updates on
> > >> secondary indices, hundreds -- maybe even tens... I'm not sure --
> > >> rather than thousands.
> > >>
> > >> > We are looking to get performance numbers in the range of 10k
> > aggregate
> > >> > inserts/sec/node and read latency < 30ms/read with 3-4 concurrent
> > >> > readers/node. Can our expectations be met with hbase through thrift?
> > Can
> > >> > they be met with hbase through java?
> > >> >
> > >>
> > >>
> > >> I wouldn't fixate on the thrift hop.  At SU we can do thousands of ops
> > >> a second per node np from PHP frontend over thrift.
> > >>
> > >> 10k inserts a second per node into single CF might be doable.  If into
> > >> 3CFs, then you need to recalibrate your expectations (I'd say).
> > >>
> > >> > Thanks in advance for any help, examples, or recommendations that
> you
> > can
> > >> > provide!
> > >> >
> > >> Sorry, the above is light on recommendations (for reasons cited by
> > >> Ryan above -- smile).
> > >> St.Ack
> > >>
> > >
> >
>

Re: Read/Write Performance

Reply via email to