On Fri, Dec 24, 2010 at 5:09 AM, Wayne <[email protected]> wrote:
> We are in the process of evaluating hbase in an effort to switch from a
> different nosql solution. Performance is of course an important part of our
> evaluation. We are a python shop and we are very worried that we can not get
> any real performance out of hbase using thrift (and must drop down to java).
> We are aware of the various lower level options for bulk insert or java
> based inserts with turning off WAL etc. but none of these are available to
> us in python so are not part of our evaluation.
I can understand python for continuous updates from your frontend or
whatever, but you might consider hacking up a bit of java to make use of the
bulk loader; you'll get upload rates orders of magnitude beyond what you'd
achieve going via the API from python (or java, for that matter). You can
also do incremental updates using the bulk loader.

> We have a 10 node cluster
> (24gb, 6 x 1TB, 16 core) that we are setting up as data/region nodes, and we are
> looking for suggestions on configuration as well as benchmarks in terms of
> expectations of performance. Below are some specific questions. I realize
> there are a million factors that help determine specific performance
> numbers, so any examples of performance from running clusters would be great
> as examples of what can be done.

Yeah, you have obviously been around the block. It's hard to give out
'numbers' since so many different factors are involved.

> Again thrift seems to be our "problem" so
> non java based solutions are preferred (do any non java based shops run
> large scale hbase clusters?). Our total production cluster size is estimated
> to be 50TB.

There are some substantial shops running non-java; e.g. the yfrog folks go
via REST, the mozilla fellas are python over thrift, and Stumbleupon is php
over thrift.

> Our data model is 3 CFs, one primary and 2 secondary indexes. All writes go
> to all 3 CFs and are grouped as a batch of row mutations which should avoid
> row locking issues.

A write updates 3 CFs plus secondary indices? That's a relatively expensive
Put. Do you have to run with 3 CFs? Does it facilitate fast querying?

> What heap size is recommended for master, and for region servers (24gb ram)?

The master doesn't take much heap, at least not in the coming 0.90.0 HBase
(is that what you intend to run?). The more RAM you give the regionservers,
the more cache your cluster will have. Which is more important to you, read
or write times?

> What other settings can/should be tweaked in hbase to optimize performance
> (we have looked at the wiki page)?

That's a good place to start. Take a look through this mailing list for
others (it's time for a trawl of the mailing list and a distilling of the
findings into a re-edit of our perf page).

> What is a good batch size for writes? We will start with 10k values/batch.

Start small with defaults. Make sure it's all running smoothly first. Then
ratchet it up.

> How many concurrent writers/readers can a single data node handle with
> evenly distributed load? Are there settings specific to this?

How many clients are you going to have writing to HBase?

> What is "very good" read/write latency for a single put/get in hbase using
> thrift?

"Very good" would be less than a few milliseconds.

> What is "very good" read/write throughput per node in hbase using thrift?

Thousands of ops per second per regionserver (sorry, can't be more specific
than that). If the Puts are multi-family plus updates on secondary indices,
then hundreds -- maybe even tens, I'm not sure -- rather than thousands.

> We are looking to get performance numbers in the range of 10k aggregate
> inserts/sec/node and read latency < 30ms/read with 3-4 concurrent
> readers/node. Can our expectations be met with hbase through thrift? Can
> they be met with hbase through java?

I wouldn't fixate on the thrift hop. At SU we do thousands of ops a second
per node, no problem, from a PHP frontend over thrift. 10k inserts a second
per node into a single CF might be doable. If it's into 3 CFs, then you need
to recalibrate your expectations (I'd say).
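For what it's worth, here is a rough, untested sketch of what the batched
write path might look like from python against the thrift gateway. The table
name, column families, and batch size are made up for illustration, and the
exact method signatures depend on the Hbase.thrift IDL you generate your
bindings from, so treat it as a shape rather than a recipe:

  # Sketch only: batched multi-CF writes over the HBase thrift gateway.
  # 'mytable' and the families 'data', 'idx1', 'idx2' are placeholders.
  from thrift.transport import TSocket, TTransport
  from thrift.protocol import TBinaryProtocol
  from hbase import Hbase                    # module generated from Hbase.thrift
  from hbase.ttypes import Mutation, BatchMutation

  transport = TTransport.TBufferedTransport(TSocket.TSocket('thrift-gw', 9090))
  client = Hbase.Client(TBinaryProtocol.TBinaryProtocol(transport))
  transport.open()

  BATCH_SIZE = 1000  # start small; ratchet up once it all runs smoothly

  def write_rows(rows):
      """rows is an iterable of (row_key, value) pairs."""
      batch = []
      for row_key, value in rows:
          mutations = [Mutation(column='data:v', value=value),
                       Mutation(column='idx1:v', value=value),  # the two index CFs:
                       Mutation(column='idx2:v', value=value)]  # triple the write work
          batch.append(BatchMutation(row=row_key, mutations=mutations))
          if len(batch) >= BATCH_SIZE:
              client.mutateRows('mytable', batch)  # one thrift round trip per batch
              batch = []
      if batch:
          client.mutateRows('mytable', batch)

  write_rows(('row-%08d' % i, 'value-%d' % i) for i in range(10000))
  transport.close()

The point of mutateRows with BatchMutation is that all three CF updates for
a row travel together, and many rows share a single round trip, so the
thrift hop gets amortized across the batch.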
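And on the read side, if you want to see where you land relative to the "few
milliseconds" number, something in this vein (again untested, reusing the
client from the sketch above, row keys made up) gives you a quick latency
distribution as seen from python:

  # Sketch only: time single-row reads through the same thrift client.
  import time

  latencies = []
  for i in range(1000):
      start = time.time()
      client.getRow('mytable', 'row-%08d' % i)  # returns a list of TRowResult
      latencies.append((time.time() - start) * 1000.0)

  latencies.sort()
  print('median %.1fms, 99th %.1fms' % (latencies[len(latencies) // 2],
                                        latencies[int(len(latencies) * 0.99)]))

Remember that number includes the thrift gateway hop as well as the
regionserver itself, which is the client-side latency you actually care about.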
> Thanks in advance for any help, examples, or recommendations that you can
> provide!

Sorry, the above is light on recommendations (for reasons cited by Ryan
above -- smile).

St.Ack
