No problem. One of the hardest things to do is to stay open to other design ideas and not become wedded to one.
I think once you get that working, you can start to look at your cluster.

On Apr 19, 2012, at 1:26 PM, Narendra yadala wrote:

> Michael,
>
> I will do the redesign and build the index. Thanks a lot for the insights.
>
> Narendra
>
> On Thu, Apr 19, 2012 at 9:56 PM, Michael Segel <michael_se...@hotmail.com> wrote:
>
>> Narendra,
>>
>> I think you are still missing the point.
>> 130 seconds to scan the table, per iteration.
>> Even if you have only 10K rows, that's 130 * 10^4 = 1.3 * 10^6 seconds, ~361 hours.
>>
>> Compare that to 10K rows where you then select a single row in your
>> sub-select that has a list of all of the associated rows.
>> You can then do n get()s based on the data in the index (if the data
>> wasn't in the index itself).
>>
>> Assuming that the data was in the index, that's one get(), which is
>> sub-second. Just to keep things simple, assume 1 second.
>> That's 10K seconds vs 1.3 million seconds (~3 hours vs 361 hours).
>> Actually it's more like 10 ms per get(), so it's 100 seconds to run your
>> code (2 minutes or so).
>>
>> Also, since you're doing less work, you put less strain on the system.
>>
>> Look, you're asking for help, yet you're fighting to maintain a bad design.
>> Building the index table shouldn't take you more than a day to think
>> through, design, and implement.
>>
>> So you tell me: 2 minutes vs 361 hours. Which would you choose?
>>
>> HTH
>>
>> -Mike
>>
>> On Apr 19, 2012, at 10:04 AM, Narendra yadala wrote:
>>
>>> Michael,
>>>
>>> Thanks for the response. This is a real problem and not a class project.
>>> The boxes themselves cost 9k ;)
>>>
>>> I think there is some difference in understanding of the problem. The
>>> table has 2m rows, but I am looking at the latest 10k rows only in the
>>> outer for loop. Only in the inner for loop am I trying to get all rows
>>> that contain the URL given by the row in the outer for loop. The pseudo
>>> code is like this (all scanners have a caching of 128):
>>>
>>> ResultScanner outerScanner = tweetTable.getScanner(new Scan()); // gets the entire row
>>> for (int index = 0; index < 10000; index++) {
>>>   Result tweet = outerScanner.next();
>>>   if (tweet == null) break;
>>>   NavigableMap<byte[], byte[]> linkFamilyMap =
>>>       tweet.getFamilyMap(Bytes.toBytes("link"));
>>>   // assuming only one link is there in the tweet
>>>   String url = Bytes.toString(linkFamilyMap.firstKey());
>>>   Scan linkScan = new Scan();
>>>   // get only the matching link column
>>>   linkScan.addColumn(Bytes.toBytes("link"), Bytes.toBytes(url));
>>>   ResultScanner linkScanner = tweetTable.getScanner(linkScan);
>>>   // ideally this inner for loop takes ~2 sec per scan
>>>   for (Result linkResult = linkScanner.next(); linkResult != null;
>>>       linkResult = linkScanner.next()) {
>>>     // do something with the link
>>>   }
>>>   linkScanner.close();
>>>
>>>   // do a similar for loop for hashtags
>>> }
>>>
>>> Each of my inner for loops takes around 20 seconds (or more, depending
>>> on the number of rows returned by that particular scanner) for each of
>>> the 10k rows that I am processing, and this also triggers a lot of GC
>>> in turn. So it is 10000 * 40 seconds (4 days) for each thread. But the
>>> problem is that the batch process crashes before completion, throwing
>>> IOExceptions, SocketTimeoutExceptions, and sometimes OutOfMemoryErrors.
>>>
>>> I will definitely take the much more elegant approach that you
>>> mentioned eventually. I just wanted to get to the core of the issue
>>> before choosing the solution.
>>>
>>> Thanks again.
>>> Narendra
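For reference, the single-get() lookup argued for above might look like the following minimal sketch, written against the 0.92-era HBase client API. The url_index table, its tweets column family, and the example URL are hypothetical stand-ins: the assumed design is an index whose row key is the URL and whose column qualifiers are the row keys of the tweets that contain it.

    import java.io.IOException;
    import java.util.NavigableMap;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class UrlIndexLookup {

        public static void main(String[] args) throws IOException {
            Configuration conf = HBaseConfiguration.create();
            // Hypothetical index table: row key = URL, one qualifier in
            // the "tweets" family per tweet row key containing that URL.
            HTable index = new HTable(conf, "url_index");
            try {
                Get get = new Get(Bytes.toBytes("http://example.com/some/link"));
                get.addFamily(Bytes.toBytes("tweets"));
                // One point read instead of a 130-second full table scan.
                Result result = index.get(get);
                NavigableMap<byte[], byte[]> tweetKeys =
                        result.getFamilyMap(Bytes.toBytes("tweets"));
                if (tweetKeys != null) {
                    for (byte[] tweetRowKey : tweetKeys.keySet()) {
                        // Each qualifier is the row key of a tweet that
                        // shared the URL; do a get() on the tweets table
                        // if the index row lacks the data you need.
                        System.out.println(Bytes.toString(tweetRowKey));
                    }
                }
            } finally {
                index.close();
            }
        }
    }

If the tweet data needed by the batch is denormalized into the index row itself, the follow-up get()s on the tweets table disappear as well, which is the sub-second case described in the message above.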
>>> On Thu, Apr 19, 2012 at 7:42 PM, Michel Segel <michael_se...@hotmail.com> wrote:
>>>
>>>> Narendra,
>>>>
>>>> Are you trying to solve a real problem, or is this a class project?
>>>>
>>>> Your solution doesn't scale. It's a non-starter. 130 seconds per
>>>> iteration times 1 million iterations is how long? 130 million seconds,
>>>> which is ~36,000 hours, or over 4 years to complete.
>>>> (The numbers are rough, but you get the idea...)
>>>>
>>>> That's assuming that your table is static and doesn't change.
>>>>
>>>> I didn't even ask if you were attempting any sort of server-side
>>>> filtering, which would reduce the amount of data you send back to the
>>>> client, because it's a moot point.
>>>>
>>>> Finer tuning is also moot.
>>>>
>>>> So you insert a row in one table. You then do n^2 operations to pull
>>>> out data. The better solution is to insert data into 2 tables, where
>>>> you then have to do 2n operations to get the same results. That's per
>>>> thread, btw. So if you were running 10 threads, you would have 10n^2
>>>> operations versus 20n operations to get the same result set.
>>>>
>>>> A million-row table... that's 1*10^13 vs 2*10^7.
>>>>
>>>> I don't believe I mentioned anything about HBase's internals, and this
>>>> solution works for any NoSQL database.
>>>>
>>>> Sent from a remote device. Please excuse any typos...
>>>>
>>>> Mike Segel
>>>>
>>>> On Apr 19, 2012, at 7:03 AM, Narendra yadala <narendra.yad...@gmail.com> wrote:
>>>>
>>>>> Hi Michel,
>>>>>
>>>>> Yes, that is exactly what I do in step 2. I am aware of the reason
>>>>> for the scanner timeout exceptions: it is the time between two
>>>>> consecutive invocations of the next() call on a specific scanner
>>>>> object. I increased the scanner timeout to 10 min on the region
>>>>> server and still I keep seeing the timeouts, so I reduced my scanner
>>>>> caching to 128.
>>>>>
>>>>> A full table scan takes 130 seconds, and there are 2.2 million rows
>>>>> in the table as of now. Each row is around 2 KB in size. I measured
>>>>> the time for the full table scan by issuing the `count` command from
>>>>> the hbase shell.
>>>>>
>>>>> I kind of understood the fix that you are specifying, but do I need
>>>>> to change the table structure to fix this problem? All I do is an n^2
>>>>> operation, and even that fails with 10 different types of exceptions.
>>>>> It is mildly annoying that I need to know all the low-level storage
>>>>> details of HBase to do such a simple operation. And this is happening
>>>>> for just 14 parallel scanners. I am wondering what would happen when
>>>>> there are thousands of parallel scanners.
>>>>>
>>>>> Please let me know if there is any configuration param change which
>>>>> would fix this issue.
>>>>>
>>>>> Thanks a lot
>>>>> Narendra
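A note on the timeout mechanics discussed above: the scanner lease on the region server expires based on the gap between two next() calls, and each next() fetches one caching-sized batch, so the per-batch time budget is roughly caching multiplied by the per-row processing time. Below is a minimal sketch of the client-side knob, with illustrative values only; hbase.regionserver.lease.period is the relevant server-side property in this era and lives in hbase-site.xml on the region servers.

    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class ScanSettings {

        // Illustrative numbers: with the default 60s lease and ~20 ms of
        // client work per row, caching much above ~3000 rows per next()
        // risks a ScannerTimeoutException; 128 leaves ample headroom.
        public static Scan linkScan(String url) {
            Scan scan = new Scan();
            scan.setCaching(128); // rows fetched per next() RPC
            // Restrict the scan to the one column being matched.
            scan.addColumn(Bytes.toBytes("link"), Bytes.toBytes(url));
            return scan;
        }
    }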
>>>>> On Thu, Apr 19, 2012 at 4:40 PM, Michel Segel <michael_se...@hotmail.com> wrote:
>>>>>
>>>>>> So in your step 2 you have the following:
>>>>>>
>>>>>>   FOREACH row IN TABLE alpha:
>>>>>>     SELECT something
>>>>>>     FROM TABLE alpha
>>>>>>     WHERE alpha.url = row.url
>>>>>>
>>>>>> Right?
>>>>>> And you are wondering why you are getting timeouts?
>>>>>> ...
>>>>>> And how long does it take to do a full table scan? ;-)
>>>>>> (There's more, but that's the first thing you should see...)
>>>>>>
>>>>>> Try creating a second table where you invert the URL and key pair,
>>>>>> such that for each URL you have a set of your alpha table's keys.
>>>>>>
>>>>>> Then you have the following...
>>>>>>
>>>>>>   FOREACH row IN TABLE alpha:
>>>>>>     FETCH key-set FROM beta
>>>>>>     WHERE beta.rowkey = alpha.url
>>>>>>
>>>>>> Note I use FETCH to signify that you should get a single row in
>>>>>> response.
>>>>>>
>>>>>> Does this make sense?
>>>>>> (Your second table is actually an index of the URL column in your
>>>>>> first table.)
>>>>>>
>>>>>> HTH
>>>>>>
>>>>>> Sent from a remote device. Please excuse any typos...
>>>>>>
>>>>>> Mike Segel
>>>>>>
>>>>>> On Apr 19, 2012, at 5:43 AM, Narendra yadala <narendra.yad...@gmail.com> wrote:
>>>>>>
>>>>>>> I have an issue with my HBase cluster. We have a 4-node
>>>>>>> HBase/Hadoop cluster (4*32 GB RAM and 4*6 TB disk space). We are
>>>>>>> using the Cloudera distribution for maintaining our cluster. I have
>>>>>>> a single tweets table in which we store the tweets, one tweet per
>>>>>>> row (it has millions of rows currently).
>>>>>>>
>>>>>>> Now I try to run a Java batch (not a map reduce) which does the
>>>>>>> following:
>>>>>>>
>>>>>>> 1. Open a scanner over the tweets table and read the tweets one
>>>>>>> after another. I set scanner caching to 128 rows, as higher scanner
>>>>>>> caching leads to ScannerTimeoutExceptions. I scan over the first
>>>>>>> 10k rows only.
>>>>>>> 2. For each tweet, extract the URLs (linkcolfamily:urlvalue) that
>>>>>>> are in that tweet and open another scanner over the tweets table to
>>>>>>> see who else shared that link. This involves getting rows having
>>>>>>> that URL from the entire table (not just the first 10k rows).
>>>>>>> 3. Do the same as in step 2 for hashtags
>>>>>>> (hashtagcolfamily:hashtagvalue).
>>>>>>> 4. Do steps 1-3 in parallel in approximately 7-8 threads. This
>>>>>>> number can be higher (thousands also) later.
>>>>>>>
>>>>>>> When I run this batch I get the GC issue which is described here:
>>>>>>> http://www.cloudera.com/blog/2011/02/avoiding-full-gcs-in-hbase-with-memstore-local-allocation-buffers-part-1/
>>>>>>> Then I tried to turn on the MSLAB feature and changed the GC
>>>>>>> settings by specifying the -XX:+UseParNewGC and
>>>>>>> -XX:+UseConcMarkSweepGC JVM flags. Even after doing this, I am
>>>>>>> running into all kinds of IOExceptions and SocketTimeoutExceptions.
>>>>>>>
>>>>>>> This Java batch keeps approximately 7*2 (14) scanners open at any
>>>>>>> point in time, and still I am running into all kinds of trouble. I
>>>>>>> am wondering whether I can have thousands of parallel scanners with
>>>>>>> HBase when I need to scale.
>>>>>>>
>>>>>>> It would be great to know whether I can open thousands/millions of
>>>>>>> scanners in parallel with HBase efficiently.
>>>>>>>
>>>>>>> Thanks
>>>>>>> Narendra
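To close the loop on the index-table suggestion that settles this thread: keeping the index current means writing two rows per tweet, roughly as in the sketch below, again against the 0.92-era client API. The table, family, and qualifier names are hypothetical, not the actual schema from the thread.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class TweetWriter {

        private final HTable tweets;
        private final HTable urlIndex;

        public TweetWriter(Configuration conf) throws IOException {
            this.tweets = new HTable(conf, "tweets");
            this.urlIndex = new HTable(conf, "url_index");
        }

        /** Stores the tweet, then indexes it under the URL it contains. */
        public void write(String tweetId, String body, String url)
                throws IOException {
            Put tweet = new Put(Bytes.toBytes(tweetId));
            tweet.add(Bytes.toBytes("content"), Bytes.toBytes("body"),
                    Bytes.toBytes(body));
            tweet.add(Bytes.toBytes("link"), Bytes.toBytes(url),
                    Bytes.toBytes(url));
            tweets.put(tweet);

            // Index row: key = URL, one qualifier per tweet row key.
            // Finding everyone who shared a URL becomes a single get() on
            // this row instead of a scan of the whole tweets table.
            Put index = new Put(Bytes.toBytes(url));
            index.add(Bytes.toBytes("tweets"), Bytes.toBytes(tweetId),
                    Bytes.toBytes(System.currentTimeMillis())); // presence is the fact
            urlIndex.put(index);
        }
    }

The two puts are not atomic, so a crash between them can leave a tweet unindexed; since HBase offers no cross-table transactions, a periodic repair scan is the usual way to live with that.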