Lesson learned...restart thrift servers *after* restarting hadoop+hbase.

On Thu, Dec 30, 2010 at 3:39 PM, Wayne <[email protected]> wrote:

> We have restarted with lzop compression, and now I am seeing some really
> long and frequent stop-the-world pauses of the entire cluster. The load
> requests for all regions go to zero except for the meta table region. No
> data batches are getting in (no loads are occurring) and everything seems
> frozen. It seems to last for 5+ seconds. Is this GC on the master or GC in
> the meta region? What could cause everything to stop for several seconds? It
> appears to happen on a recurring basis as well. I think we saw it before
> switching to lzo but it seems much worse now (lasts longer and occurs more
> frequently).
>
> Thanks.
>
>
>
> On Thu, Dec 30, 2010 at 12:20 PM, Wayne <[email protected]> wrote:
>
>> HBase Version 0.89.20100924, r1001068 w/ 8GB heap
>>
>> I plan to run for 1 week straight maxed out. I am worried about GC pauses,
>> especially concurrent mode failures (does hbase/hadoop suffer these under
>> extended load?). What should I be looking for in the gc log in terms of
>> problem signs? The ParNews are quick but the CMS concurrent marks are taking
>> as much as 4 mins with an average of 20-30 secs.
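The ParNew / CMS concurrent-mark numbers above come from the JVM GC log. As a
point of reference, here is a minimal sketch of how CMS plus GC logging is
commonly turned on in conf/hbase-env.sh (illustrative flags and paths only,
not the configuration actually used in this thread):

    # conf/hbase-env.sh -- illustrative regionserver GC settings (assumed,
    # not the poster's actual config)
    export HBASE_OPTS="$HBASE_OPTS \
      -XX:+UseParNewGC -XX:+UseConcMarkSweepGC \
      -XX:CMSInitiatingOccupancyFraction=70 \
      -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
      -Xloggc:/var/log/hbase/gc-regionserver.log"

In that log, 'promotion failed' and 'concurrent mode failure' entries are the
usual warning signs: they mean CMS lost the race to free the old generation
and the JVM fell back to a long stop-the-world full collection.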
>>
>> Thanks.
>>
>>
>>
>> On Thu, Dec 30, 2010 at 12:00 PM, Stack <[email protected]> wrote:
>>
>>> Oh, what versions are you using?
>>> St.Ack
>>>
>>> On Thu, Dec 30, 2010 at 9:00 AM, Stack <[email protected]> wrote:
>>> > Keep going. Let it run longer.  Get the servers as loaded as you think
>>> > they'll be in production.  Make sure the perf numbers are not just because
>>> > the cluster is 'fresh'.
>>> > St.Ack
>>> >
>>> > On Thu, Dec 30, 2010 at 5:51 AM, Wayne <[email protected]> wrote:
>>> >> We finally got our cluster up and running and write performance looks
>>> >> very good. We are getting sustained 8-10k writes/sec/node on a 10 node
>>> >> cluster from Python through thrift. These are values written to 3 CFs so
>>> >> actual hbase performance is 25-30k writes/sec/node. The nodes are
>>> >> currently disk i/o bound (40-50% utilization) but hopefully once we get
>>> >> lzop working this will go down. We have been running for 12 hours
>>> >> without a problem. We hope to get lzop going today and then load all
>>> >> through the long weekend.
>>> >>
>>> >> We plan to then test reads next week after we get some data in there.
>>> >> Looks good so far! Below are our settings in case there are some
>>> >> suggestions/concerns.
>>> >>
>>> >> Thanks for everyone's help. It is pretty exciting to get performance
>>> >> like this from the start.
>>> >>
>>> >>
>>> >> *Global*
>>> >>
>>> >> client.write.buffer = 10485760 (10MB = 5x default)
>>> >>
>>> >> optionalLogFlushInterval = 10000 (10 secs = 10x default)
>>> >>
>>> >> memstore.flush.size = 268435456 (256MB = 4x default)
>>> >>
>>> >> hregion.max.filesize = 1073741824 (1GB = 4x default)
>>> >>
>>> >> *Table*
>>> >>
>>> >> alter 'xxx', METHOD => 'table_att', DEFERRED_LOG_FLUSH => true
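The shorthand names in that list presumably correspond to hbase-site.xml
properties; a sketch of what the equivalent file could look like (the mapping
to property names is an assumption based on the 0.89/0.90-era defaults quoted
in parentheses above):

    <!-- hbase-site.xml: assumed mapping of the settings listed above -->
    <property>
      <name>hbase.client.write.buffer</name>
      <value>10485760</value>    <!-- 10MB, 5x the 2MB default -->
    </property>
    <property>
      <name>hbase.regionserver.optionallogflushinterval</name>
      <value>10000</value>       <!-- 10s deferred WAL flush interval -->
    </property>
    <property>
      <name>hbase.hregion.memstore.flush.size</name>
      <value>268435456</value>   <!-- 256MB memstore flush threshold -->
    </property>
    <property>
      <name>hbase.hregion.max.filesize</name>
      <value>1073741824</value>  <!-- 1GB region split threshold -->
    </property>

The DEFERRED_LOG_FLUSH alter shown above is what actually puts the table on
the deferred (interval-based) WAL flush.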
>>> >>
>>> >>
>>> >>
>>> >>
>>> >>
>>> >> On Wed, Dec 29, 2010 at 12:55 AM, Stack <[email protected]> wrote:
>>> >>
>>> >>> On Mon, Dec 27, 2010 at 11:47 AM, Wayne <[email protected]> wrote:
>>> >>> > All data is written to 3 CFs. Basically 2 of the CFs are secondary
>>> >>> > indexes (manually managed as normal CFs). It sounds like we should
>>> >>> > try hard to get as much out of thrift as we can before going to a
>>> >>> > lower level.
>>> >>>
>>> >>> Yes.
>>> >>>
>>> >>> > Writes need to be "fast enough", but reads are more important in the
>>> >>> > end (and are the reason we are switching from a different solution).
>>> >>> > The numbers you quoted below sound like they are in the ballpark of
>>> >>> > what we are looking to do.
>>> >>> >
>>> >>>
>>> >>> Even the tens per second that I threw in there to CMA?
>>> >>>
>>> >>> > Much of our data is cold, and we expect reads to be disk i/o based.
>>> >>>
>>> >>> OK.  FYI, we're not the best at this -- cache-miss cold reads -- what
>>> >>> w/ a network hop in the way and currently we'll open a socket per
>>> >>> access.
>>> >>>
>>> >>> > Given this, is 8GB heap a good place to start on the data nodes
>>> >>> > (24GB ram)? Is the block cache managed on its own (meaning it won't
>>> >>> > blow up causing OOM),
>>> >>>
>>> >>> It won't.  It's constrained.  It does our home-brewed sizeof.  By default,
>>> >>> it's 0.2 of total heap.  If you think cache will help, you could go up
>>> >>> from there.  0.4 or 0.5 of heap.
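The 0.2-of-heap figure is the block cache fraction; bumping it is a single
property (the value here is only an illustration):

    <!-- hbase-site.xml: give the block cache 40% of the regionserver heap -->
    <property>
      <name>hfile.block.cache.size</name>
      <value>0.4</value>
    </property>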
>>> >>>
>>> >>> > and if we do not use it (block cache) should we go even lower for
>>> >>> > the heap (we want to avoid CMF and long GC pauses)?
>>> >>>
>>> >>> If you are going to be doing cache-miss most of the time and cold
>>> >>> reads, then yes, you can do away with cache.
>>> >>>
>>> >>> In testing of 0.90.x I've been running w/ 1MB heaps with 1k regions
>>> >>> but this is my trying to break stuff.
>>> >>>
>>> >>> > Are there any timeouts we need to tweak to make the cluster more
>>> >>> > "accepting" of long GC pauses while under sustained load (7+ days of
>>> >>> > 10k inserts/sec/node)?
>>> >>> >
>>> >>>
>>> >>> If the zookeeper client times out, the regionserver will shut itself down.
>>> >>> In 0.90.0RC2, the client session timeout is set high -- 3 minutes.  If you
>>> >>> hit that timeout, then that's pretty extreme... something badly wrong I'd
>>> >>> say.  Here are a few notes on the config and others that you might want
>>> >>> to twiddle (see previous section on required configs... make sure
>>> >>> you've got those too):
>>> >>>
>>> >>>
>>> http://people.apache.org/~stack/hbase-0.90.0-candidate-2/docs/important_configurations.html#recommended_configurations
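The session timeout Stack mentions is likewise one property; 180000 ms is the
3-minute value he cites (sketch only):

    <!-- hbase-site.xml: zookeeper session timeout, in milliseconds -->
    <property>
      <name>zookeeper.session.timeout</name>
      <value>180000</value>
    </property>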
>>> >>>
>>> >>>
>>> >>> > Does LZO compression speed up reads/writes where there is excess CPU
>>> >>> > to do the compression? I assume it would lower disk i/o but increase
>>> >>> > CPU a lot. Is data compressed on the initial write or only after
>>> >>> > compaction?
>>> >>> >
>>> >>>
>>> >>> LZO is pretty frictionless -- i.e. little CPU cost -- and yes, usually
>>> >>> helps speed things up (grab more in the one go).  What size are your
>>> >>> records?  You might want to mess w/ hfile block sizes though the 64k
>>> >>> default is usually good enough for all but very small cell sizes.
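Compression and the hfile block size are per-column-family attributes set from
the HBase shell; a sketch for a fresh table (family names here are
placeholders, and LZO has to be installed on every node first -- for an
existing table the same attributes go through alter while the table is
disabled):

    create 'xxx', {NAME => 'd',   COMPRESSION => 'LZO', BLOCKSIZE => 65536},
                  {NAME => 'ix1', COMPRESSION => 'LZO'},
                  {NAME => 'ix2', COMPRESSION => 'LZO'}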
>>> >>>
>>> >>>
>>> >>> > With the replication in the HDFS layer how are reads managed in terms
>>> >>> > of load balancing across region servers? Does HDFS know to spread
>>> >>> > multiple requests across the 3 region servers that contain the same
>>> >>> > data?
>>> >>>
>>> >>> You only read from one of the replicas, always the 'closest'.  If the
>>> >>> DFSClient has trouble getting the first of the replicas, it moves on
>>> >>> to the second, etc.
>>> >>>
>>> >>>
>>> >>> > For example with 10 data nodes if we have 50 concurrent readers with
>>> >>> > very "random" key requests we would expect to have 5 reads occurring
>>> >>> > on each data node at the same time. We plan to have a thrift server on
>>> >>> > each data node, so 5 concurrent readers will be connected to each
>>> >>> > thrift server at any given time (50 in aggregate across 10 nodes). We
>>> >>> > want to be sure everything is designed to evenly spread this load to
>>> >>> > avoid any possible hot-spots.
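For concreteness, a minimal sketch of one such reader in Python against the
thrift gateway on its local node (the generated-module, table, and key names
are assumptions, not taken from this thread):

    from thrift.transport import TSocket, TTransport
    from thrift.protocol import TBinaryProtocol
    from hbase import Hbase   # gen-py module built from Hbase.thrift (assumed name)

    # Each reader process talks to the thrift server running on its own node.
    transport = TTransport.TBufferedTransport(TSocket.TSocket('localhost', 9090))
    client = Hbase.Client(TBinaryProtocol.TBinaryProtocol(transport))
    transport.open()

    # getRow returns a list of TRowResult (empty if the row does not exist).
    for result in client.getRow('xxx', 'some-row-key'):
        cells = dict((col, cell.value) for col, cell in result.columns.items())

    transport.close()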
>>> >>> >
>>> >>>
>>> >>> This is different.  This is key design.  A thrift server will be doing
>>> >>> some subset of the key space.  If the requests are evenly distributed
>>> >>> over all of the key space, then you should be fine; all thrift servers
>>> >>> will be evenly loaded.  If not, then there could be hot spots.
>>> >>>
>>> >>> We have a balancer that currently only counts regions per server, not
>>> >>> regions per server plus hits per region so it could be the case that a
>>> >>> server by chance ends up carrying all of the hot regions.  HBase
>>> >>> itself is not too smart dealing with this.  In 0.90.0, there is
>>> >>> facility for manually moving regions -- i.e. closing in current
>>> >>> location and moving the region off to another server w/ some outage
>>> >>> while the move is happening (usually seconds) -- or you could split
>>> >>> the hot region manually and then the daughters could be moved off to
>>> >>> other servers... Primitive for now but should be better in next HBase
>>> >>> versions.
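The manual controls Stack describes surface as shell commands in 0.90; roughly
as below (the region and server names are placeholders -- the real encoded
region name and 'host,port,startcode' come from the master web UI):

    split 'xxx'                                       # or a specific region name
    move 'ENCODED_REGION_NAME', 'HOST,PORT,STARTCODE'

Treat this as a sketch of the idea rather than exact syntax for any particular
release.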
>>> >>>
>>> >>> Have you been able to test w/ your data and your query pattern?
>>> >>> That'll tell you way more than I ever could.
>>> >>>
>>> >>> Good luck,
>>> >>> St.Ack
>>> >>>
>>> >>>
>>> >>> >
>>> >>> >
>>> >>> > On Mon, Dec 27, 2010 at 1:49 PM, Stack <[email protected]> wrote:
>>> >>> >
>>> >>> >> On Fri, Dec 24, 2010 at 5:09 AM, Wayne <[email protected]> wrote:
>>> >>> >> > We are in the process of evaluating hbase in an effort to switch
>>> >>> >> > from a different nosql solution. Performance is of course an
>>> >>> >> > important part of our evaluation. We are a python shop and we are
>>> >>> >> > very worried that we can not get any real performance out of hbase
>>> >>> >> > using thrift (and must drop down to java). We are aware of the
>>> >>> >> > various lower level options for bulk insert or java based inserts
>>> >>> >> > with turning off WAL etc. but none of these are available to us in
>>> >>> >> > python so are not part of our evaluation.
>>> >>> >>
>>> >>> >> I can understand python for continuous updates from your frontend or
>>> >>> >> whatever but you might consider hacking up a bit of java to make use
>>> >>> >> of the bulk updater; you'll get upload rates orders of magnitude
>>> >>> >> beyond what you'd achieve going via the API via python (or java for
>>> >>> >> that matter).  You can also do incremental updates using the bulk
>>> >>> >> loader.
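For reference, the bulk loader Stack mentions is driven from the command line
in the 0.90-era tarball; roughly as below (paths, table and column names are
placeholders, the jar name depends on the release, and the option names should
be checked against the bulk-load docs for the version in use):

    # MapReduce job that writes HFiles directly instead of using the write path
    hadoop jar hbase-0.90.0.jar importtsv \
      -Dimporttsv.columns=HBASE_ROW_KEY,d:v \
      -Dimporttsv.bulk.output=/tmp/bulkout xxx /user/wayne/input
    # Hand the generated HFiles over to the regionservers
    hadoop jar hbase-0.90.0.jar completebulkload /tmp/bulkout xxx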
>>> >>> >>
>>> >>> >>
>>> >>> >> > We have a 10 node cluster (24gb, 6 x 1TB, 16 core) that we are
>>> >>> >> > setting up as data/region nodes, and we are looking for suggestions
>>> >>> >> > on configuration as well as benchmarks in terms of expectations of
>>> >>> >> > performance. Below are some specific questions. I realize there are
>>> >>> >> > a million factors that help determine specific performance numbers,
>>> >>> >> > so any examples of performance from running clusters would be great
>>> >>> >> > as examples of what can be done.
>>> >>> >>
>>> >>> >> Yeah, you have been around the block obviously. It's hard to give
>>> >>> >> out 'numbers' since so many different factors are involved.
>>> >>> >>
>>> >>> >>
>>> >>> >> > Again thrift seems to be our "problem" so non java based solutions
>>> >>> >> > are preferred (do any non java based shops run large scale hbase
>>> >>> >> > clusters?). Our total production cluster size is estimated to be
>>> >>> >> > 50TB.
>>> >>> >> >
>>> >>> >>
>>> >>> >> There are some substantial shops running non-java; e.g. the yfrog
>>> >>> >> folks go via REST, the mozilla fellas are python over thrift,
>>> >>> >> Stumbleupon is php over thrift.
>>> >>> >>
>>> >>> >> > Our data model is 3 CFs, one primary and 2 secondary indexes. All
>>> >>> >> > writes go to all 3 CFs and are grouped as a batch of row mutations
>>> >>> >> > which should avoid row locking issues.
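A minimal sketch of what such a batched write can look like from Python over
thrift, with the primary row and the two manually maintained index rows sent
in one round trip (family/qualifier names, the generated-module name, and the
index-key scheme are all assumptions, not details from this thread):

    from thrift.transport import TSocket, TTransport
    from thrift.protocol import TBinaryProtocol
    from hbase import Hbase                      # gen-py module (assumed name)
    from hbase.ttypes import Mutation, BatchMutation

    transport = TTransport.TBufferedTransport(TSocket.TSocket('localhost', 9090))
    client = Hbase.Client(TBinaryProtocol.TBinaryProtocol(transport))
    transport.open()

    def row_batch(row_key, value, idx1_key, idx2_key):
        # One BatchMutation per row: the primary CF plus the two secondary-index
        # CFs (whose rows point back at the primary key).
        return [
            BatchMutation(row=row_key,  mutations=[Mutation(column='d:v',   value=value)]),
            BatchMutation(row=idx1_key, mutations=[Mutation(column='ix1:v', value=row_key)]),
            BatchMutation(row=idx2_key, mutations=[Mutation(column='ix2:v', value=row_key)]),
        ]

    # Accumulate many rows, then ship them in a single mutateRows call.
    pending = []
    for i in range(10000):
        key = 'row-%08d' % i
        pending.extend(row_batch(key, 'payload', 'ix1-%08d' % i, 'ix2-%08d' % i))
    client.mutateRows('xxx', pending)

    transport.close()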
>>> >>> >> >
>>> >>> >>
>>> >>> >> A write updates 3CFs and secondary indices?  That's an expensive Put
>>> >>> >> relatively.  You have to run w/ 3CFs?  It facilitates fast querying?
>>> >>> >>
>>> >>> >>
>>> >>> >> > What heap size is recommended for master, and for region servers
>>> >>> >> > (24gb ram)?
>>> >>> >>
>>> >>> >> Master doesn't take much heap, at least not in the coming 0.90.0
>>> >>> >> HBase (Is that what you intend to run)?
>>> >>> >>
>>> >>> >> The more RAM you give the regionservers, the more cache your cluster
>>> >>> >> will have.
>>> >>> >>
>>> >>> >> What's important to you, read or write times?
>>> >>> >>
>>> >>> >>
>>> >>> >> > What other settings can/should be tweaked in hbase to optimize
>>> >>> >> > performance (we have looked at the wiki page)?
>>> >>> >>
>>> >>> >> That's a good place to start.  Take a look through this mailing list
>>> >>> >> for others (it's time for a trawl of the mailing list and then
>>> >>> >> distilling the findings into a re-edit of our perf page).
>>> >>> >>
>>> >>> >> > What is a good batch size for writes? We will start with 10k
>>> >>> >> > values/batch.
>>> >>> >>
>>> >>> >> Start small with defaults.  Make sure it's all running smoothly
>>> >>> >> first.  Then ratchet it up.
>>> >>> >>
>>> >>> >>
>>> >>> >> > How many concurrent writers/readers can a single data node handle
>>> >>> >> > with evenly distributed load? Are there settings specific to this?
>>> >>> >>
>>> >>> >> How many clients are you going to have writing to HBase?
>>> >>> >>
>>> >>> >>
>>> >>> >> > What is "very good" read/write latency for a single put/get in
>>> >>> >> > hbase using thrift?
>>> >>> >>
>>> >>> >> "Very Good" would be < a few milliseconds.
>>> >>> >>
>>> >>> >>
>>> >>> >> > What is "very good" read/write throughput per node in hbase using
>>> >>> >> > thrift?
>>> >>> >> >
>>> >>> >>
>>> >>> >> Thousands of ops per second per regionserver (Sorry, can't be more
>>> >>> >> specific than that).  If the Puts are multi-family + updates on
>>> >>> >> secondary indices, hundreds -- maybe even tens... I'm not sure --
>>> >>> >> rather than thousands.
>>> >>> >>
>>> >>> >> > We are looking to get performance numbers in the range of 10k
>>> >>> >> > aggregate inserts/sec/node and read latency < 30ms/read with 3-4
>>> >>> >> > concurrent readers/node. Can our expectations be met with hbase
>>> >>> >> > through thrift? Can they be met with hbase through java?
>>> >>> >> >
>>> >>> >>
>>> >>> >>
>>> >>> >> I wouldn't fixate on the thrift hop.  At SU we can do thousands of
>>> >>> >> ops a second per node, no problem, from a PHP frontend over thrift.
>>> >>> >>
>>> >>> >> 10k inserts a second per node into a single CF might be doable.  If
>>> >>> >> into 3CFs, then you need to recalibrate your expectations (I'd say).
>>> >>> >>
>>> >>> >> > Thanks in advance for any help, examples, or recommendations that
>>> >>> >> > you can provide!
>>> >>> >> >
>>> >>> >> Sorry, the above is light on recommendations (for reasons cited by
>>> >>> >> Ryan above -- smile).
>>> >>> >> St.Ack
>>> >>> >>
>>> >>> >
>>> >>>
>>> >>
>>> >
>>>
>>
>>
>
