On Fri, Oct 7, 2011 at 8:58 PM, Stack <[email protected]> wrote:
> On Fri, Oct 7, 2011 at 12:43 PM, Anthony Urso <[email protected]> wrote:
>> We have a use case that will require a ten to twenty EC2 node HBase
>> cluster to take several hundred million rows of input from a larger
>> number of EMR instances in daily bursts, and then serve those rows via
>> low latency random reads, say on the order of 300 or so rows per
>> second. Before we start coding, I thought it best to ask the experts
>> for their advice.
>>
>> 1) Is this something that HBase will be able to handle gracefully?
>
> You might have some chance if you were not on EC2.
>

Is that because of the slow disk I/O?

> Any chance of caching working? Are the reads totally random or will
> there be 'hot' areas? If so, you might have some hope.
>

Hopefully. Do you mean external caching like memcache or OS-level disk caching?

>
>> 2) Does anyone have any pointers on how to tune HBase for performance
>> and stability under this load?
>
> See performance section on book up on hbase.org (though there should
> probably be EC2 caveats...)

TY.

>
>> 3) Would HBase perform better under this sort of load on twelve large
>> EC2 instances, six xlarge or three xxlarge?
>>
>
> The more nodes the better. And if those nodes are not virtualized,
> better still. But then there is the network and if its saturated....
>
>
> Can you run some tests before you start coding?

Good idea.

> St.Ack
>
