I have an issue with my HBase cluster. We have a 4-node HBase/Hadoop cluster
(32 GB RAM and 6 TB of disk per node), managed with the Cloudera
distribution. I have a single tweets table in which we store the tweets, one
tweet per row (it currently has millions of rows).

Now I am trying to run a Java batch job (not a MapReduce job) which does the following:

   1. Open a scanner over the tweets table and read the tweets one after
   another. Scanner caching is set to 128 rows, since higher values lead to
   ScannerTimeoutExceptions. Only the first 10k rows are scanned.
   2. For each tweet, extract the URLs it contains
   (linkcolfamily:urlvalue) and open another scanner over the tweets table
   to find who else shared that link. This involves fetching the rows having
   that URL from the entire table (not just the first 10k rows).
   3. Do the same as in step 2 for hashtags
   (hashtagcolfamily:hashtagvalue).
   4. Run steps 1-3 in parallel across approximately 7-8 threads. This
   number may grow much higher (into the thousands) later.
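The outer loop above might look roughly like this. This is only a minimal sketch against the classic pre-1.0 HBase client API; the table and column names are taken from the description, and the findOtherSharers helper is a hypothetical stand-in for the inner scan of step 2:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class TweetBatch {
    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "tweets");

        Scan scan = new Scan();
        scan.setCaching(128);  // larger values caused ScannerTimeoutExceptions
        scan.addFamily(Bytes.toBytes("linkcolfamily"));

        ResultScanner scanner = table.getScanner(scan);
        try {
            int rows = 0;
            for (Result tweet : scanner) {
                if (++rows > 10000) break;  // only the first 10k rows
                byte[] url = tweet.getValue(
                        Bytes.toBytes("linkcolfamily"), Bytes.toBytes("urlvalue"));
                if (url != null) {
                    // Step 2: a second scan over the *whole* table for this URL.
                    // This nested full scan per tweet is what makes the job so
                    // expensive; a filter or secondary index would help here.
                    findOtherSharers(table, url);
                }
            }
        } finally {
            scanner.close();
            table.close();
        }
    }

    // Hypothetical helper: scans the entire table for rows sharing this URL.
    static void findOtherSharers(HTable table, byte[] url) throws IOException {
        // ...
    }
}
```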


When I run this batch I hit the GC issue described here:
http://www.cloudera.com/blog/2011/02/avoiding-full-gcs-in-hbase-with-memstore-local-allocation-buffers-part-1/
I then enabled the MSLAB feature and changed the GC settings by adding the
-XX:+UseParNewGC and -XX:+UseConcMarkSweepGC JVM flags.
Even after doing this, I am still running into all kinds of IOExceptions
and SocketTimeoutExceptions.
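For reference, this is how I understand the changes are made (the MSLAB property name is from the Cloudera post above; exact file locations may differ under a Cloudera-managed setup):

```xml
<!-- hbase-site.xml: enable MSLAB (HBase 0.90.1+) -->
<property>
  <name>hbase.hregion.memstore.mslab.enabled</name>
  <value>true</value>
</property>
```

and the GC flags go into the region server's JVM options, e.g. via HBASE_OPTS in hbase-env.sh: `export HBASE_OPTS="-XX:+UseParNewGC -XX:+UseConcMarkSweepGC"`.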

This Java batch has approximately 7 * 2 = 14 scanners open at any point in
time, and I am still running into all kinds of trouble. I am wondering
whether HBase can support thousands of parallel scanners when I need to
scale.
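Since each open scanner holds server-side resources and a lease, my current workaround is to cap client-side concurrency rather than open a scanner per task. A plain-JDK sketch of that pattern (the per-URL scan is stubbed out with a placeholder that just returns the URL length):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class BoundedScans {
    // Submit one task per URL, but never run more than maxConcurrent at once,
    // so at most maxConcurrent scanners would be open at any point in time.
    static List<Integer> runAll(List<String> urls, int maxConcurrent)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(maxConcurrent);
        List<Future<Integer>> futures = new ArrayList<>();
        for (final String url : urls) {
            futures.add(pool.submit(new Callable<Integer>() {
                public Integer call() {
                    // Placeholder for the real per-URL scan work.
                    return url.length();
                }
            }));
        }
        List<Integer> results = new ArrayList<>();
        for (Future<Integer> f : futures) {
            results.add(f.get());  // waits for each task to finish
        }
        pool.shutdown();
        return results;
    }

    public static void main(String[] args) throws Exception {
        List<String> urls = new ArrayList<>();
        urls.add("http://a");
        urls.add("http://bb");
        System.out.println(runAll(urls, 2));  // prints [8, 9]
    }
}
```

With this shape, "thousands of tasks" does not mean thousands of simultaneous scanners; the pool size bounds how many are open concurrently.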

It would be great to know whether thousands (or even millions) of scanners
can be opened in parallel against HBase efficiently.

Thanks
Narendra
