So in your step 2 you have the following:
FOREACH row IN TABLE alpha:
    SELECT something
    FROM TABLE alpha
    WHERE alpha.url = row.url
Right?
And you are wondering why you are getting timeouts?
...
...
And how long does it take to do a full table scan? ;-)
(there's more, but that's the first thing you should see...)
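To make the cost concrete, here's a back-of-the-envelope sketch in plain Java (no HBase involved, and the table sizes are hypothetical): scanning the whole table once per outer row means the total number of row reads grows as outer rows times table rows.

```java
// Illustrative only: counts how many row reads the nested-scan pattern costs.
public class NestedScanCost {
    public static long reads(long tableRows, long outerRows) {
        // For each of the outer rows, step 2 scans the entire table again.
        return outerRows * tableRows;
    }

    public static void main(String[] args) {
        long tableRows = 5_000_000L; // hypothetical: "millions of rows"
        long outerRows = 10_000L;    // the first 10k rows scanned in step 1
        // 10,000 outer rows * 5,000,000 table rows = 50 billion row reads.
        System.out.println(NestedScanCost.reads(tableRows, outerRows));
    }
}
```

At those numbers the scanners time out long before the batch finishes.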
Try creating a second table that inverts the URL/key pair, so that for each
URL you have a set of your alpha table's row keys.
Then you have the following...
FOREACH row IN TABLE alpha:
    FETCH key-set FROM beta
    WHERE beta.rowkey = alpha.url
Note I use FETCH to signify that you should get a single row in response.
Does this make sense?
(Your second table is actually an index of the URL column in your first table.)
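A minimal sketch of that inverted index using plain in-memory maps (illustrative only -- in HBase, beta would be a real table keyed by URL, with the alpha row keys stored as column qualifiers or cell values, and the names here are made up):

```java
import java.util.*;

// Sketch: beta maps each URL to the set of alpha row keys that contain it,
// so "who shared this URL" becomes one keyed fetch instead of a table scan.
public class UrlIndexSketch {
    // alpha: rowkey -> URL found in that tweet (simplified to one URL per row)
    static Map<String, String> alpha = new LinkedHashMap<>();
    // beta: URL -> set of alpha row keys (the inverted index)
    static Map<String, Set<String>> beta = new HashMap<>();

    static void put(String rowKey, String url) {
        alpha.put(rowKey, url);
        // Maintain the index on every write to alpha.
        beta.computeIfAbsent(url, k -> new TreeSet<>()).add(rowKey);
    }

    static Set<String> whoShared(String url) {
        // FETCH: a single keyed lookup, no scan.
        return beta.getOrDefault(url, Collections.emptySet());
    }

    public static void main(String[] args) {
        put("tweet1", "http://example.com/a");
        put("tweet2", "http://example.com/b");
        put("tweet3", "http://example.com/a");
        System.out.println(whoShared("http://example.com/a")); // [tweet1, tweet3]
    }
}
```

In HBase you would populate beta at write time (or with a one-off MapReduce over alpha), and your step 2 turns into a Get on beta rather than a Scan over alpha.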
HTH
Sent from a remote device. Please excuse any typos...
Mike Segel
On Apr 19, 2012, at 5:43 AM, Narendra yadala <[email protected]> wrote:
> I have an issue with my HBase cluster. We have a 4 node HBase/Hadoop (4*32
> GB RAM and 4*6 TB disk space) cluster. We are using Cloudera distribution
> for maintaining our cluster. I have a single tweets table in which we store
> the tweets, one tweet per row (it has millions of rows currently).
>
> Now I run a Java batch job (not a MapReduce job) which does the following:
>
> 1. Open a scanner over the tweets table and read the tweets one after
> another. I set scanner caching to 128 rows, because higher values lead
> to ScannerTimeoutExceptions. I scan over only the first 10k rows.
> 2. For each tweet, extract URLs (linkcolfamily:urlvalue) that are there
> in that tweet and open another scanner over the tweets table to see who
> else shared that link. This involves getting rows having that URL from the
> entire table (not first 10k rows).
> 3. Do similar stuff as in step 2 for hashtags
> (hashtagcolfamily:hashtagvalue).
> 4. Do steps 1-3 in parallel for approximately 7-8 threads. This number
> can be higher (thousands also) later.
>
>
> When I ran this batch, I hit the GC issue described here:
> http://www.cloudera.com/blog/2011/02/avoiding-full-gcs-in-hbase-with-memstore-local-allocation-buffers-part-1/
> Then I tried to turn on the MSLAB feature and changed the GC settings by
> specifying -XX:+UseParNewGC and -XX:+UseConcMarkSweepGC JVM flags.
> Even after doing this, I am running into all kinds of IOExceptions
> and SocketTimeoutExceptions.
>
> This Java batch has approximately 7*2 (14) scanners open at any point in
> time, and I am still running into all kinds of trouble. I am wondering
> whether I can have thousands of parallel scanners with HBase when I need to
> scale.
>
> It would be great to know whether I can open thousands/millions of scanners
> in parallel with HBase efficiently.
>
> Thanks
> Narendra