> Also, do you think if I query using rowkey instead of hbase time stamp..it
> would not kick off that many tasks..
> since region server knows the exact locations?

I don't see how you could do that in a scalable way, unless you really only
have to query a few rows (less than a million).

> I've a 10 node cluster each with 36 gig..I've allocated 4gig for HBase Region
> Servers..master.jsp
> reports used heap is less than half on each region server.

This is Java, so the reported heap doesn't mean much... the garbage collector
doesn't collect aggressively since that would be awfully inefficient.

> I've close to 800 regions total..Guess it needs to kick off a jvm to see if
> data exists
> in all regions..

It does, and like you said the mappers take only a few minutes, so optimizing
that part of the job is useless until you make your reducers faster.

So regarding the speed of inserts (this seems to be the real issue if what you
said about the write buffer is true), I'd be interested in:

1) seeing your reducer's code (strip whatever you have that's business
   specific), and
2) seeing some monitoring data while the job is running (if you don't have
   any, get ganglia in there).

Inserts could be slow for many reasons apart from bad API usage, such as
cluster misconfiguration, a sub-optimal insertion pattern (the classic being
having only 1 region taking all the writes), etc.

J-D
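
To make the "only 1 region" insertion pattern concrete, here is a minimal toy
sketch in plain Java. It does not call HBase at all: regions are modeled as a
sorted list of start keys, and a row is assigned to the region with the
greatest start key that is <= the row key. The class name, the helper names,
the four split points and the hex-prefix ("salting") scheme are all
assumptions made up for this illustration, not something taken from the job
discussed in this thread. Java string ordering stands in for HBase's byte
ordering, which only matches for ASCII keys like these.

```java
import java.util.Arrays;
import java.util.TreeMap;

// Toy model only: nothing in this class talks to HBase.
class RegionHotspotSketch {

    // Count how many row keys fall into each region. A region is identified
    // by its start key; the first region must start at "" so every key has
    // a home.
    static int[] countPerRegion(String[] regionStartKeys, String[] rowKeys) {
        TreeMap<String, Integer> regionIndex = new TreeMap<>();
        for (int i = 0; i < regionStartKeys.length; i++) {
            regionIndex.put(regionStartKeys[i], i);
        }
        int[] counts = new int[regionStartKeys.length];
        for (String key : rowKeys) {
            // floorEntry: the region whose start key is the greatest one <= key
            counts[regionIndex.floorEntry(key).getValue()]++;
        }
        return counts;
    }

    // Hypothetical monotonically increasing keys, e.g. zero-padded counters
    // or timestamps. They all share the same leading characters.
    static String[] sequentialKeys(int n) {
        String[] keys = new String[n];
        for (int i = 0; i < n; i++) {
            keys[i] = String.format("%010d", 1_000_000 + i);
        }
        return keys;
    }

    // Illustrative workaround (an assumption, not the poster's schema):
    // prefix each key with one hex digit derived from its hash so that
    // consecutive keys sort into different parts of the key space.
    static String[] saltedKeys(String[] keys) {
        String[] salted = new String[keys.length];
        for (int i = 0; i < keys.length; i++) {
            int bucket = Math.floorMod(keys[i].hashCode(), 16);
            salted[i] = Integer.toHexString(bucket) + "-" + keys[i];
        }
        return salted;
    }

    public static void main(String[] args) {
        // Four made-up regions split on the first hex digit of the key.
        String[] starts = {"", "4", "8", "c"};
        String[] seq = sequentialKeys(1000);
        System.out.println("sequential: "
                + Arrays.toString(countPerRegion(starts, seq)));
        System.out.println("salted:     "
                + Arrays.toString(countPerRegion(starts, saltedKeys(seq))));
    }
}
```

With the sequential keys every insert lands in the first region, so a single
region server takes all the writes no matter how many regions the table has.
With the prefixed keys the same rows spread over all four regions. The price
is that a range scan over the original key order then has to be issued once
per prefix, so whether this is a good trade depends on how the data is read.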