>  Also, do you think if I query using rowkey instead of hbase time stamp..it 
> would not kick off that many tasks..
> since region server knows the exact locations?

I don't see how you could do that in a scalable way, unless you really
only have to query a few rows (fewer than a million).

> I've a 10 node cluster each with 36 gig..I've allocated 4gig for HBase Region 
> Servers..master.jsp
> reports used heap is less than half on each region server.
>

This is Java, so the reported heap doesn't mean much: the number
includes garbage that hasn't been collected yet, because the garbage
collector doesn't collect aggressively (that would be awfully
inefficient).
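
As a rough illustration of why that number is hard to read, here is a
minimal sketch using only the standard java.lang.Runtime API. The
class name HeapSnapshot is made up for this example and has nothing to
do with HBase or master.jsp. "Used" is simply total minus free, so it
counts dead objects until a collection actually runs.

```java
// Hypothetical helper for illustration only; not part of HBase.
final class HeapSnapshot {

    // "Used" heap as the JVM reports it: total minus free.
    // This includes objects that are already garbage but not yet collected.
    static long usedBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    public static void main(String[] args) {
        // Allocate about 64 MB and keep a reference to it.
        byte[][] junk = new byte[64][];
        for (int i = 0; i < junk.length; i++) {
            junk[i] = new byte[1 << 20];
        }
        long whileReferenced = usedBytes();

        // Drop the reference. The memory is now garbage, but the reported
        // figure will not necessarily change until the collector runs.
        junk = null;
        long afterDrop = usedBytes();

        // System.gc() is only a hint; the JVM is free to ignore it.
        System.gc();
        long afterGcHint = usedBytes();

        System.out.printf("while referenced: %d MB%n", whileReferenced >> 20);
        System.out.printf("after drop:       %d MB%n", afterDrop >> 20);
        System.out.printf("after gc hint:    %d MB%n", afterGcHint >> 20);
    }
}
```

The exact numbers depend on the JVM, the collector and the heap
settings, so treat the output as indicative only.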

>
>  I've close to 800 regions total..Guess it needs to kick off a jvm to see if 
> data exists
> in all regions..


It does, and like you said, the mappers take only a few minutes, so
optimizing that part of the job is useless until you make your
reducers faster.

So regarding the speed of inserts (this seems to be the real issue if
what you said about the write buffer is true), I'd be interested in 1)
seeing your reducer's code (strip whatever you have that's business
specific) and 2) seeing some monitoring data from while the job is
running (if you don't have any, get Ganglia in there). Inserts could
be slow for many reasons apart from bad API usage, such as cluster
misconfiguration, a sub-optimal insertion pattern (the classic being
having only 1 region), etc.
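
To make the insertion-pattern point concrete: if all your keys sort
next to each other (sequential ids, timestamps), every write lands in
the same region no matter how many servers you have. One common
workaround is to prefix each key with a small hash-derived bucket so
that consecutive keys spread out. The sketch below is plain Java with
no HBase calls; SaltedRowKey, bucket() and salt() are names I made up,
and I'm assuming at most 100 buckets so the prefix stays two digits.
Whether this applies to you depends on your key design, which I
haven't seen.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of the HBase API.
final class SaltedRowKey {

    // Map a key to a bucket in [0, buckets). floorMod keeps the result
    // non-negative even when hashCode() is negative.
    static int bucket(String key, int buckets) {
        return Math.floorMod(key.hashCode(), buckets);
    }

    // Prefix the key with its two-digit bucket, e.g. "NN-<key>".
    // Assumes buckets <= 100 so the prefix width is constant.
    static byte[] salt(String key, int buckets) {
        String salted = String.format("%02d-%s", bucket(key, buckets), key);
        return salted.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        int buckets = 10;
        int[] counts = new int[buckets];

        // Sequential keys would normally all hit the same region.
        for (int i = 0; i < 1000; i++) {
            counts[bucket("row-" + i, buckets)]++;
        }

        // Print how many of the 1000 keys fell into each bucket.
        for (int b = 0; b < buckets; b++) {
            System.out.printf("bucket %02d: %d keys%n", b, counts[b]);
        }
    }
}
```

The trade-off is on the read side: a range scan over the original key
order now has to fan out over every bucket, so this only pays off if
write throughput is really what's hurting you.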

J-D
