That's great to hear, thanks for sharing your results and configuration. How many concurrent writer processes are you running?
Thanks, Michael On Thu, Dec 30, 2010 at 8:51 AM, Wayne <[email protected]> wrote: > We finally got our cluster up and running and write performance looks very > good. We are getting sustained 8-10k writes/sec/node on a 10 node cluster > from Python through thrift. These are values written to 3 CFs so actual > hbase performance is 25-30k writes/sec/node. The nodes are currently disk > i/o bound (40-50% utilization) but hopefully once we get lzop working this > will go down. We have been running for 12 hours without a problem. We hope > to get lzop going today and then load all through the long weekend. > > We plan to then test reads next week after we get some data in there. Looks > good so far! Below are our settings in case there are some > suggestions/concerns. > > Thanks for everyone's help. It is pretty exciting to get performance like > this from the start. > > > *Global* > > client.write.buffer = 10485760 (10MB = 5x default) > > optionalLogFlushInterval = 10000 (10 secs = 10x default) > > memstore.flush.size = 268435456 (256MB = 4x default) > > hregion.max.filesize = 1073741824 (1GB = 4x default) > > *Table* > > alter 'xxx', METHOD => 'table_att', DEFERRED_LOG_FLUSH => true > > > > > > On Wed, Dec 29, 2010 at 12:55 AM, Stack <[email protected]> wrote: > > > On Mon, Dec 27, 2010 at 11:47 AM, Wayne <[email protected]> wrote: > > > All data is written to 3 CFs. Basically 2 of the CFs are secondary > > indexes > > > (manually managed as normal CFs). It sounds like we should try hard to > > get > > > as much out of thrift as we can before going to a lower level. > > > > Yes. > > > > Writes need > > > to be "fast enough", but reads are more important in the end (and are > the > > > reason we are switching from a different solution). The numbers you > > quoted > > > below sound like they are in the ballpark of what we are looking to do. > > > > > > > Even the tens per second that I threw in there to CMA? > > > > > Much of our data is cold, and we expect reads to be disk i/o based. > > > > OK. FYI, we're not the best at this -- cache-miss cold reads -- what > > w/ a network hop in the way and currently we'll open a socket per > > access. > > > > > Given > > > this is 8GB heap a good place to start on the data nodes (24GB ram)? Is > > the > > > block cache managed on its own (being it won't blow up causing OOM), > > > > It won't. Its constrained. Does our home-brewed sizeof. Default, > > its 0.2 of total heap. If you think cache will help, you could go up > > from there. 0.4 or 0.5 of heap. > > > > > and if > > > we do not use it (block cache) should we go even lower for the heap (we > > want > > > to avoid CMF and long GC pauses)? > > > > If you are going to be doing cache-miss most of the time and cold > > reads, then yes, you can do away with cache. > > > > In testing of 0.90.x I've been running w/ 1MB heaps with 1k regions > > but this is my trying to break stuff. > > > > > Are there any timeouts we need to tweak to > > > make the cluster more "accepting" of long GC pauses while under > sustained > > > load (7+ days of 10k/inserts/sec/node)? > > > > > > > If zookeeper client timesout, the regionserver will shut itself down. > > In 0.90.0RC2, the client sessionout is set high -- 3 minutes. If you > > timeout that, then thats pretty extreme... something badly wrong I'd > > say. Heres' a few notes on the config and others that you might want > > to twiddle (see previous section on required configs... make sure > > you've got those too): > > > > > http://people.apache.org/~stack/hbase-0.90.0-candidate-2/docs/important_configurations.html#recommended_configurations > < > http://people.apache.org/%7Estack/hbase-0.90.0-candidate-2/docs/important_configurations.html#recommended_configurations > > > > > > > > > Does LZO compression speed up reads/writes where there is excess CPU to > > do > > > the compression? I assume it would lower disk i/o but increase CPU a > lot. > > Is > > > data compressed on the initial write or only after compaction? > > > > > > > LZO is pretty frictionless -- i.e. little CPU cost -- and yes, usually > > helps speed things up (grab more in the one go). What size are your > > records? You might want to mess w/ hfile block sizes though the 64k > > default is usually good enough for all but very small cell sizes. > > > > > > > With the replication in the HDFS layer how are reads managed in terms > of > > > load balancing across region servers? Does HDFS know to spread multiple > > > requests across the 3 region servers that contain the same data? > > > > You only read from one of the replicas, always the 'closest'. If the > > DFSClient has trouble getting the first of the replicas, it moves on > > to the second, etc. > > > > > > > For example > > > with 10 data nodes if we have 50 concurrent readers with very "random" > > key > > > requests we would expect to have 5 reads occurring on each data node at > > the > > > same time. We plan to have a thrift server on each data node, so 5 > > > concurrent readers will be connected to each thrift server at any given > > time > > > (50 in aggregate across 10 nodes). We want to be sure everything is > > designed > > > to evenly spread this load to avoid any possible hot-spots. > > > > > > > This is different. This is key design. A thrift server will be doing > > some subset of the key space. If the requests are evenly distributed > > over all of the key space, then you should be fine; all thrift servers > > will be evenly loaded. If not, then there could be hot spots. > > > > We have a balancer that currently only counts regions per server, not > > regions per server plus hits per region so it could be the case that a > > server by chance ends up carrying all of the hot regions. HBase > > itself is not too smart dealing with this. In 0.90.0, there is > > facility for manually moving regions -- i.e. closing in current > > location and moving the region off to another server w/ some outage > > while the move is happening (usually seconds) -- or you could split > > the hot region manually and then the daughters could be moved off to > > other servers... Primitive for now but should be better in next HBase > > versions. > > > > Have you been able to test w/ your data and your query pattern? > > That'll tell you way more than I ever could. > > > > Good luck, > > St.Ack > > > > > > > > > > > > > On Mon, Dec 27, 2010 at 1:49 PM, Stack <[email protected]> wrote: > > > > > >> On Fri, Dec 24, 2010 at 5:09 AM, Wayne <[email protected]> wrote: > > >> > We are in the process of evaluating hbase in an effort to switch > from > > a > > >> > different nosql solution. Performance is of course an important part > > of > > >> our > > >> > evaluation. We are a python shop and we are very worried that we can > > not > > >> get > > >> > any real performance out of hbase using thrift (and must drop down > to > > >> java). > > >> > We are aware of the various lower level options for bulk insert or > > java > > >> > based inserts with turning off WAL etc. but none of these are > > available > > >> to > > >> > us in python so are not part of our evaluation. > > >> > > >> I can understand python for continuous updates from your frontend or > > >> whatever but you might consider hacking up a bit of java to make us of > > >> the bulk updater; you'll get upload rates orders of magnitude beyond > > >> what you'd achieve going via the API via python (or java for that > > >> matter). You can also do incremental updates using the bulk loader. > > >> > > >> > > >> We have a 10 node cluster > > >> > (24gb, 6 x 1TB, 16 core) that we setting up as data/region nodes, > and > > we > > >> are > > >> > looking for suggestions on configuration as well as benchmarks in > > terms > > >> of > > >> > expectations of performance. Below are some specific questions. I > > realize > > >> > there are a million factors that help determine specific performance > > >> > numbers, so any examples of performance from running clusters would > be > > >> great > > >> > as examples of what can be done. > > >> > > >> Yeah, you have been around the block obviously. Its hard to give out > > >> 'numbers' since so many different factors involved. > > >> > > >> > > >> Again thrift seems to be our "problem" so > > >> > non java based solutions are preferred (do any non java based shops > > run > > >> > large scale hbase clusters?). Our total production cluster size is > > >> estimated > > >> > to be 50TB. > > >> > > > >> > > >> There are some substantial shops running non-java; e.g. the yfrog > > >> folks go via REST, the mozilla fellas are python over thrift, > > >> Stumbleupon is php over thrift. > > >> > > >> > Our data model is 3 CFs, one primary and 2 secondary indexes. All > > writes > > >> go > > >> > to all 3 CFs and are grouped as a batch of row mutations which > should > > >> avoid > > >> > row locking issues. > > >> > > > >> > > >> A write updates 3CFs and secondary indices? Thats an expensive Put > > >> relatively. You have to run w/ 3CFs? It facilitates fast querying? > > >> > > >> > > >> > What heap size is recommended for master, and for region servers > (24gb > > >> ram)? > > >> > > >> Master doesn't take much heap, at least not in the coming 0.90.0 HBase > > >> (Is that what you intend to run)? > > >> > > >> The more RAM you give the regionservers, the more cache your cluster > > will > > >> have. > > >> > > >> Whats important to you read or write times? > > >> > > >> > > >> > What other settings can/should be tweaked in hbase to optimize > > >> performance > > >> > (we have looked at the wiki page)? > > >> > > >> Thats a good place to start. Take a look through this mailing list > > >> for others (Its time for a trawl of mailing list and then distilling > > >> the findings into a reedit of our perf page). > > >> > > >> > What is a good batch size for writes? We will start with 10k > > >> values/batch. > > >> > > >> Start small with defaults. Make sure its all running smooth first. > > >> Then rachet it up. > > >> > > >> > > >> > How many concurrent writers/readers can a single data node handle > with > > >> > evenly distributed load? Are there settings specific to this? > > >> > > >> How many clients you going to have writing HBase? > > >> > > >> > > >> > What is "very good" read/write latency for a single put/get in hbase > > >> using > > >> > thrift? > > >> > > >> "Very Good" would be < a few milliseconds. > > >> > > >> > > >> > What is "very good" read/write throughput per node in hbase using > > thrift? > > >> > > > >> > > >> Thousands of ops per second per regionserver (Sorry, can't be more > > >> specific than that). If the Puts are multi-family + updates on > > >> secondary indices, hundreds -- maybe even tens... I'm not sure -- > > >> rather than thousands. > > >> > > >> > We are looking to get performance numbers in the range of 10k > > aggregate > > >> > inserts/sec/node and read latency < 30ms/read with 3-4 concurrent > > >> > readers/node. Can our expectations be met with hbase through thrift? > > Can > > >> > they be met with hbase through java? > > >> > > > >> > > >> > > >> I wouldn't fixate on the thrift hop. At SU we can do thousands of ops > > >> a second per node np from PHP frontend over thrift. > > >> > > >> 10k inserts a second per node into single CF might be doable. If into > > >> 3CFs, then you need to recalibrate your expectations (I'd say). > > >> > > >> > Thanks in advance for any help, examples, or recommendations that > you > > can > > >> > provide! > > >> > > > >> Sorry, the above is light on recommendations (for reasons cited by > > >> Ryan above -- smile). > > >> St.Ack > > >> > > > > > >
