I have an issue with my HBase cluster. We run a four-node HBase/Hadoop cluster (32 GB RAM and 6 TB of disk per node), managed through the Cloudera distribution. We have a single tweets table that stores one tweet per row; it currently holds millions of rows.
I am trying to run a Java batch (not a MapReduce job) that does the following:

1. Open a scanner over the tweets table and read the tweets one after another. I set scanner caching to 128 rows, because higher caching values lead to ScannerTimeoutExceptions. I scan over the first 10k rows only.

2. For each tweet, extract the URLs it contains (linkcolfamily:urlvalue) and open another scanner over the tweets table to find who else shared each link. This means fetching matching rows from the entire table, not just the first 10k rows.

3. Do the same as in step 2 for hashtags (hashtagcolfamily:hashtagvalue).

4. Run steps 1-3 in parallel across approximately 7-8 threads. This number could be much higher (even thousands) later.

When I ran this batch, I hit the GC issue described here: http://www.cloudera.com/blog/2011/02/avoiding-full-gcs-in-hbase-with-memstore-local-allocation-buffers-part-1/

I then turned on the MSLAB feature and changed the GC settings by adding the -XX:+UseParNewGC and -XX:+UseConcMarkSweepGC JVM flags. Even after doing this, I keep running into all kinds of IOExceptions and SocketTimeoutExceptions. The batch has only about 7*2 = 14 scanners open at any point in time, and it still runs into trouble. I am wondering whether HBase can support thousands of parallel scanners when I need to scale. It would be great to know whether thousands (or millions) of scanners can be opened in parallel with HBase efficiently.

Thanks,
Narendra
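P.S. For reference, here is a minimal, HBase-free sketch of the batch's structure: a fixed thread pool where each worker "scans" a slice of the first 10k rows and performs one secondary lookup per row. The in-memory map stands in for the tweets table, and all data in it is made up for illustration. In the real batch, each worker instead builds a Scan, calls scan.setCaching(128), and obtains a ResultScanner from the HTable for both the primary pass and each secondary lookup.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ScanSketch {
    // Stub "tweets table": row key -> the URL found in that tweet.
    // Hypothetical data; the real table is scanned via the HBase client API.
    static final Map<Integer, String> tweets = new HashMap<>();
    static {
        for (int i = 0; i < 10_000; i++) {
            tweets.put(i, "http://example.com/" + (i % 100));
        }
    }

    // Stand-in for the secondary scanner: who else shared this URL?
    // In the real batch this would be another Scan over the whole table.
    static int lookupsForUrl(String url) {
        int hits = 0;
        for (String u : tweets.values()) {
            if (u.equals(url)) hits++;
        }
        return hits;
    }

    public static void main(String[] args) throws Exception {
        int threads = 8; // the batch's 7-8 worker threads
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<Integer>> futures = new ArrayList<>();
        int slice = 10_000 / threads;
        for (int t = 0; t < threads; t++) {
            final int start = t * slice;
            final int end = (t == threads - 1) ? 10_000 : start + slice;
            futures.add(pool.submit(() -> {
                int total = 0;
                // Primary "scanner" over this worker's slice of the first 10k rows.
                for (int row = start; row < end; row++) {
                    // Secondary "scanner" per row: find all rows sharing the URL.
                    total += lookupsForUrl(tweets.get(row));
                }
                return total;
            }));
        }
        int grand = 0;
        for (Future<Integer> f : futures) grand += f.get();
        pool.shutdown();
        System.out.println("secondary lookups returned " + grand + " rows");
    }
}
```

Even in this toy form, the cost structure is visible: one secondary lookup per scanned row means the work grows with (rows scanned) x (table size), which is where my scanner count and timeouts come from.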
