Narendra,

Are you trying to solve a real problem, or is this a class project?
Your solution doesn't scale. It's a non-starter. 130 seconds per iteration times roughly 1 million iterations is how long? About 130 million seconds, which is ~36,000 hours, or over 4 years to complete. (The numbers are rough, but you get the idea...) And that's assuming your table is static and doesn't change.

I didn't even ask whether you were attempting any sort of server-side filtering, which would reduce the amount of data you send back to the client, because it's a moot point. Finer tuning is also moot.

So you insert a row into one table and then do n^2 operations to pull the data back out. The better solution is to insert the data into 2 tables, where you then have to do only 2n operations to get the same results. That's per thread, btw. So if you were running 10 threads, you would have 10n^2 operations versus 20n operations to get the same result set. For a million-row table that's 1*10^13 versus 2*10^7.

I don't believe I mentioned anything about HBase's internals, and this solution works for any NoSQL database.

Sent from a remote device. Please excuse any typos...

Mike Segel

On Apr 19, 2012, at 7:03 AM, Narendra yadala <[email protected]> wrote:

> Hi Michel
>
> Yes, that is exactly what I do in step 2. I am aware of the reason for the scanner timeout exceptions. It is the time between two consecutive invocations of the next call on a specific scanner object. I increased the scanner timeout to 10 min on the region server and still I keep seeing the timeouts, so I reduced my scanner cache to 128.
>
> A full table scan takes 130 seconds and there are 2.2 million rows in the table as of now. Each row is around 2 KB in size. I measured the time for the full table scan by issuing the `count` command from the hbase shell.
>
> I kind of understood the fix that you are suggesting, but do I need to change the table structure to fix this problem? All I do is an n^2 operation, and even that fails with 10 different types of exceptions. It is mildly annoying that I need to know all the low-level storage details of HBase to do such a simple operation. And this is happening for just 14 parallel scanners. I am wondering what would happen when there are thousands of parallel scanners.
>
> Please let me know if there is any configuration param change which would fix this issue.
>
> Thanks a lot
> Narendra
>
> On Thu, Apr 19, 2012 at 4:40 PM, Michel Segel <[email protected]> wrote:
>
>> So in your step 2 you have the following:
>>
>>   FOREACH row IN TABLE alpha:
>>     SELECT something
>>     FROM TABLE alpha
>>     WHERE alpha.url = row.url
>>
>> Right?
>> And you are wondering why you are getting timeouts?
>> ...
>> ...
>> And how long does it take to do a full table scan? ;-)
>> (There's more, but that's the first thing you should see...)
>>
>> Try creating a second table where you invert the URL and key pair, such that for each URL you have a set of your alpha table's keys.
>>
>> Then you have the following...
>>
>>   FOREACH row IN TABLE alpha:
>>     FETCH key-set FROM beta
>>     WHERE beta.rowkey = alpha.url
>>
>> Note I use FETCH to signify that you should get a single row in response.
>>
>> Does this make sense?
>> (Your second table is actually an index of the URL column in your first table.)
>>
>> HTH
>>
>> Sent from a remote device. Please excuse any typos...
>>
>> Mike Segel
>>
>> On Apr 19, 2012, at 5:43 AM, Narendra yadala <[email protected]> wrote:
>>
>>> I have an issue with my HBase cluster. We have a 4-node HBase/Hadoop cluster (4*32 GB RAM and 4*6 TB disk space). We are using the Cloudera distribution for maintaining our cluster. I have a single tweets table in which we store the tweets, one tweet per row (it has millions of rows currently).
>>>
>>> Now I try to run a Java batch (not a map reduce job) which does the following:
>>>
>>> 1. Open a scanner over the tweets table and read the tweets one after another. I set scanner caching to 128 rows, as higher scanner caching is leading to ScannerTimeoutExceptions. I scan over the first 10k rows only.
>>> 2. For each tweet, extract the URLs (linkcolfamily:urlvalue) that are in that tweet and open another scanner over the tweets table to see who else shared that link. This involves getting rows having that URL from the entire table (not just the first 10k rows).
>>> 3. Do similar stuff as in step 2 for hashtags (hashtagcolfamily:hashtagvalue).
>>> 4. Do steps 1-3 in parallel for approximately 7-8 threads. This number can be higher (thousands also) later.
>>>
>>> When I run this batch I hit the GC issue described here:
>>> http://www.cloudera.com/blog/2011/02/avoiding-full-gcs-in-hbase-with-memstore-local-allocation-buffers-part-1/
>>> Then I tried to turn on the MSLAB feature and changed the GC settings by specifying the -XX:+UseParNewGC and -XX:+UseConcMarkSweepGC JVM flags. Even after doing this, I am running into all kinds of IOExceptions and SocketTimeoutExceptions.
>>>
>>> This Java batch has approximately 7*2 (14) scanners open at any point in time, and still I am running into all kinds of trouble. I am wondering whether I can have thousands of parallel scanners with HBase when I need to scale.
>>>
>>> It would be great to know whether I can open thousands/millions of scanners in parallel with HBase efficiently.
>>>
>>> Thanks
>>> Narendra
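
[Editor's note: a minimal sketch of the write side of the two-table pattern described above, using the HBase Java client API of that era. Whenever a tweet row is written, the same code also writes the tweet's row key into an index row keyed by the URL, so the index row accumulates the set of tweet keys for that URL. The table name (tweet_url_index), the column family (keys), and the helper method are hypothetical, invented for illustration; only the pattern is from the thread.]

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class UrlIndexWriter {

    // Hypothetical index schema: row key = URL, one column per tweet key.
    private static final byte[] KEYS_FAM = Bytes.toBytes("keys");

    // Call this from wherever the tweet row itself is written, so both
    // tables are kept in step (the "insert into 2 tables" part of 2n).
    public static void indexTweet(HTable urlIndex, String tweetRowKey,
                                  Iterable<String> urlsInTweet) throws Exception {
        for (String url : urlsInTweet) {
            Put put = new Put(Bytes.toBytes(url));            // index row key = the URL
            put.add(KEYS_FAM, Bytes.toBytes(tweetRowKey),     // qualifier = tweet row key
                    Bytes.toBytes(""));                       // value unused
            urlIndex.put(put);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable urlIndex = new HTable(conf, "tweet_url_index"); // hypothetical table name
        indexTweet(urlIndex, "tweet#12345",
                   java.util.Arrays.asList("http://example.com/a"));
        urlIndex.close();
    }
}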

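[Editor's note: and a sketch of the read side, replacing the inner full-table scan with a single Get against that index, which is the 2n instead of n^2. The outer scan keeps caching at 128, the value Narendra already settled on to stay inside the scanner lease timeout. The table names, the keys family, and the assumption of one URL per tweet under linkcolfamily:urlvalue are illustrative rather than the actual schema; the hashtag lookup in step 3 would get the same treatment with its own index table.]

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class WhoElseSharedIt {

    private static final byte[] LINK_FAM = Bytes.toBytes("linkcolfamily"); // from the original post
    private static final byte[] URL_QUAL = Bytes.toBytes("urlvalue");      // from the original post
    private static final byte[] KEYS_FAM = Bytes.toBytes("keys");          // hypothetical index family

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable tweets   = new HTable(conf, "tweets");            // hypothetical table name
        HTable urlIndex = new HTable(conf, "tweet_url_index");   // hypothetical table name

        Scan scan = new Scan();
        scan.setCaching(128);       // small enough batches per next() to beat the lease timeout
        scan.addFamily(LINK_FAM);   // only ship the columns this batch actually needs

        ResultScanner scanner = tweets.getScanner(scan);
        try {
            int seen = 0;
            for (Result tweet : scanner) {
                if (++seen > 10000) break;                      // first 10k rows only, as in the batch
                byte[] url = tweet.getValue(LINK_FAM, URL_QUAL);
                if (url == null) continue;
                // Single-row lookup against the index instead of a second full scan.
                Get get = new Get(url);
                get.addFamily(KEYS_FAM);
                Result keySet = urlIndex.get(get);
                // keySet's qualifiers are the row keys of every tweet that shared this URL.
            }
        } finally {
            scanner.close();
            tweets.close();
            urlIndex.close();
        }
    }
}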