Hey,

That sounds interesting - maybe you could tell us why your system performs
better? The default TableInputFormat just creates N map tasks, one for each
region, which are all roughly the same data size.

What do you do?

-ryan
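
A rough sketch of what the caller-side "array of Scans" setup described below
might look like; the row-key ranges, the "data" family name, and the caching
value are invented placeholders, not taken from the Socorro code:

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class MultiRangeScans {
      // Placeholder key ranges; a real caller would derive these from its
      // own row-key design so the ranges end up roughly equal in size.
      static final String[][] RANGES = {
          {"00000000", "20000000"},
          {"20000000", "40000000"},
          {"40000000", "60000000"},
      };

      static List<Scan> buildScans() {
        List<Scan> scans = new ArrayList<Scan>();
        for (String[] range : RANGES) {
          Scan scan = new Scan();
          scan.setStartRow(Bytes.toBytes(range[0]));
          scan.setStopRow(Bytes.toBytes(range[1]));
          scan.addFamily(Bytes.toBytes("data"));
          scan.setCaching(2);          // small client cache; see the thread below
          scan.setCacheBlocks(false);  // long scans get little from the block cache
          scans.add(scan);
        }
        return scans;
      }
    }
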
On Mon, Jul 26, 2010 at 3:29 PM, Xavier Stevens <[email protected]> wrote:
> We have something that might interest you.
>
> http://socorro.googlecode.com/svn/trunk/analysis/src/java/org/apache/hadoop/hbase/mapreduce/
>
> We haven't fully tested everything yet, so don't blame us if something
> goes wrong. It's basically the exact same as TableInputFormat except it
> takes an array of Scans to be used for row-key ranges. It requires the
> caller to set up the Scan array, since the caller should have the best
> knowledge of its own row-key structure.
>
> Preliminary results for us reduced a 15-minute job to under 2 minutes.
>
> Cheers,
>
> -Xavier
>
> On 7/26/10 3:16 PM, Vidhyashankar Venkataraman wrote:
>> I did not use a TableInputFormat: I ran my own scans on specific ranges
>> (for more control over tuning the ranges, and for the ease with which I
>> can run a Hadoop streaming job)..
>>
>> 1 MB is the HFile block size, not the HDFS block size..
>> I increased it since I didn't care too much about random read performance..
>> The HDFS block size is the default value... (I have a related question,
>> then: does the HFile block size influence only the size of the index and
>> the efficiency of random reads? I don't see an effect on scans, though)...
>>
>> I had previously run 5 tasks per machine and at 20 rows, but that resulted
>> in scanner expiries (UnknownScannerException) and DFS socket timeouts.. So
>> that's why I reduced the number of tasks.. Let me decrease the number of
>> rows and see..
>>
>> Just to make sure: the client uses ZooKeeper only for obtaining ROOT
>> whenever it performs scans, right? So scans shouldn't face any master/ZK
>> bottlenecks when we scale up wrt the number of nodes, am I right?
>>
>> Thank you
>> Vidhya
>>
>> On 7/26/10 3:01 PM, "Ryan Rawson" <[email protected]> wrote:
>>
>> Hey,
>>
>> A few questions:
>>
>> - Sharded scan: are you not using TableInputFormat?
>> - 1 MB block size - what block size? You probably shouldn't set the
>>   HDFS block size to 1 MB; it just causes more NN traffic.
>> - Tests a year ago indicated that the HFile block size really didn't
>>   improve speed once you went beyond 64 KB or so.
>> - Run more maps per machine... one map task per disk, probably?
>> - Try setting the client cache to an in-between level, 2-6 perhaps.
>>
>> Let us know about those other questions and we can go from there.
>> -ryan
>>
>> On Mon, Jul 26, 2010 at 2:43 PM, Vidhyashankar Venkataraman
>> <[email protected]> wrote:
>>> I am trying to assess the performance of scans on a 100 TB db on 180
>>> nodes running HBase 0.20.5..
>>>
>>> I run a sharded scan (each map task runs a scan on a specific range;
>>> speculative execution is turned off so that there is no duplication of
>>> tasks) on a fully compacted table...
>>>
>>> 1 MB block size, block cache enabled.. Max of 2 tasks per node.. Each
>>> row is 30 KB in size: 1 big column family with just one field..
>>> The region lease timeout is set to an hour.. And I don't get any socket
>>> timeout exceptions, so I have not reassigned the write socket timeout...
>>>
>>> I ran experiments on the following cases:
>>>
>>> 1. The client-level cache is set to 1 (the default: got the number using
>>> getCaching): the MR tasks take around 13 hours to finish on average,
>>> which gives around 13.17 MBps per node. The worst case is 34 hours (to
>>> finish the entire job)...
>>> 2. Client cache set to 20 rows: this is much worse than the previous
>>> case: we get a super-low 1 MBps per node...
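
A minimal sketch of where that client-side cache value gets set, against the
0.20-era client API; the table and family names here are placeholders:

    import java.io.IOException;

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class CachedScan {
      public static void main(String[] args) throws IOException {
        HBaseConfiguration conf = new HBaseConfiguration();
        HTable table = new HTable(conf, "bigtable");   // placeholder table name

        Scan scan = new Scan();
        scan.addFamily(Bytes.toBytes("data"));         // placeholder family name
        // With ~30 KB rows, caching=20 means roughly 600 KB shipped per next()
        // call; an in-between value (2-6, as suggested above) keeps each RPC
        // small while still amortizing round trips.
        scan.setCaching(4);

        ResultScanner scanner = table.getScanner(scan);
        try {
          for (Result row : scanner) {
            // process the row
          }
        } finally {
          scanner.close();
        }
      }
    }
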
>>>
>>> Question: Should I set it to a value such that the block size is a
>>> multiple of the above cache size? Or set the cache size to a much lower
>>> value?
>>>
>>> I find that these numbers are much less than the ones I get when it's
>>> running with just a few nodes..
>>>
>>> Can you guys help me with this problem?
>>>
>>> Thank you
>>> Vidhya
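
On the block-size side of that question: the HFile block size is a
per-column-family attribute on the table, set in bytes, independent of the
client-side scanner cache (which counts rows). A minimal sketch of setting it
back around the 64 KB mark suggested earlier, with placeholder table and
family names:

    import java.io.IOException;

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.util.Bytes;

    public class CreateTableWithSmallBlocks {
      public static void main(String[] args) throws IOException {
        HBaseConfiguration conf = new HBaseConfiguration();
        HBaseAdmin admin = new HBaseAdmin(conf);

        // Placeholder names; 64 KB follows the earlier suggestion that
        // larger HFile blocks mostly stop paying off beyond that point.
        HTableDescriptor desc = new HTableDescriptor("bigtable");
        HColumnDescriptor family = new HColumnDescriptor(Bytes.toBytes("data"));
        family.setBlocksize(64 * 1024);
        desc.addFamily(family);
        admin.createTable(desc);
      }
    }
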
