For the region servers that went down, do you find the following in their logs?
2010-06-17 22:22:47,414 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 55448ms, ten times longer than scheduled: 3000

That warning means the JVM was paused for a long stretch (usually a long garbage collection), which would also explain the ZooKeeper session timeouts you saw right before the shutdowns.

On Wed, Jul 28, 2010 at 10:30 AM, Vidhyashankar Venkataraman <[email protected]> wrote:
> I reran my Scans on a 180 node cluster (with 160 RS's/data nodes/task
> trackers) using TableInputFormat with block cache disabled and I got the
> following results/observations:
>
> 1. 90% of the job completed in roughly 5 and a half hours, which comes to
> 29 MBps per node (counting 160 RS's).. Is this reasonable? Can I get better
> results?
>
> 2. The last 10% of the job has still been running for more than 3
> hours.. I am trying to find some common thread of reasoning why this is
> happening.
>
> 3. I tried disabling the block cache in the MR job (I was using
> TableMapReduceUtil: see code snippet below) and through the hbase shell, just to
> change the default options (disable the table, then change the setting using alter)..
>
>            conf.set(TableInputFormat.SCAN_CACHEBLOCKS, "false");
>            Job job = new Job(conf, "Scan" + "_" + tableName);
>            ..........
>            Scan scan = new Scan();
>            scan.addColumns("bigColumn");
>            scan.setCaching(10);
>            scan.setCacheBlocks(false);   // Setting it again.. I guess this will override the previous setting..
>            // Second argument is the table name.
>            TableMapReduceUtil.initTableMapperJob(tableName, scan,
>                MapperHB.class, Text.class, NullWritable.class, job);
>
> But I still get LruBlockCache stats like the following ones: Is this okay?
>
> 2010-07-27 14:13:40,852 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes: Total=1950.0773MB (2044804168), Free=449.7541MB (471601336), Max=2399.8313MB (2516405504), Counts: Blocks=1914, Access=1304510, Hit=1914, Miss=1302596, Evictions=3300, Evicted=788515, Ratios: Hit Ratio=0.1467217574827373%, Miss Ratio=99.8532772064209%, Evicted/Run=238.94393920898438
> 2010-07-27 14:14:40,860 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes: Total=1950.0773MB (2044804168), Free=449.7541MB (471601336), Max=2399.8313MB (2516405504), Counts: Blocks=1914, Access=1306103, Hit=1914, Miss=1304189, Evictions=3300, Evicted=788515, Ratios: Hit Ratio=0.14654280385002494%, Miss Ratio=99.85345602035522%, Evicted/Run=238.94393920898438
> 2010-07-27 14:15:40,853 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes: Total=1950.0773MB (2044804168), Free=449.7541MB (471601336), Max=2399.8313MB (2516405504), Counts: Blocks=1914, Access=1307783, Hit=1914, Miss=1305869, Evictions=3300, Evicted=788515, Ratios: Hit Ratio=0.14635456027463078%, Miss Ratio=99.8536467552185%, Evicted/Run=238.94393920898438
> ................
> .................
> 2010-07-27 22:45:40,853 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes: Total=1950.0773MB (2044804168), Free=449.7541MB (471601336), Max=2399.8313MB (2516405504), Counts: Blocks=1914, Access=1348427, Hit=1914, Miss=1346513, Evictions=3300, Evicted=788515, Ratios: Hit Ratio=0.14194316463544965%, Miss Ratio=99.85805749893188%, Evicted/Run=238.94393920898438
> 2010-07-27 22:46:40,854 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes: Total=1950.0773MB (2044804168), Free=449.7541MB (471601336), Max=2399.8313MB (2516405504), Counts: Blocks=1914, Access=1348427, Hit=1914, Miss=1346513, Evictions=3300, Evicted=788515, Ratios: Hit Ratio=0.14194316463544965%, Miss Ratio=99.85805749893188%, Evicted/Run=238.94393920898438
> 2010-07-27 22:47:40,853 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes: Total=1950.0773MB (2044804168), Free=449.7541MB (471601336), Max=2399.8313MB (2516405504), Counts: Blocks=1914, Access=1348427, Hit=1914, Miss=1346513, Evictions=3300, Evicted=788515, Ratios: Hit Ratio=0.14194316463544965%, Miss Ratio=99.85805749893188%, Evicted/Run=238.94393920898438
> 2010-07-27 22:48:40,853 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes: Total=1950.0773MB (2044804168), Free=449.7541MB (471601336), Max=2399.8313MB (2516405504), Counts: Blocks=1914, Access=1348427, Hit=1914, Miss=1346513, Evictions=3300, Evicted=788515, Ratios: Hit Ratio=0.14194316463544965%, Miss Ratio=99.85805749893188%, Evicted/Run=238.94393920898438
> 2010-07-27 22:49:40,876 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes: Total=1950.0773MB (2044804168), Free=449.7541MB (471601336), Max=2399.8313MB (2516405504), Counts: Blocks=1914, Access=1348659, Hit=1914, Miss=1346745, Evictions=3300, Evicted=788515, Ratios: Hit Ratio=0.14191875234246254%, Miss Ratio=99.8580813407898%, Evicted/Run=238.94393920898438
> 2010-07-27 22:50:41,140 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes: Total=1950.0773MB (2044804168), Free=449.7541MB (471601336), Max=2399.8313MB (2516405504), Counts: Blocks=1914, Access=1349089, Hit=1914, Miss=1347175, Evictions=3300, Evicted=788515, Ratios: Hit Ratio=0.1418735133484006%, Miss Ratio=99.85812902450562%, Evicted/Run=238.94393920898438
>
> 4. A small number of region servers (8 out of 150) crashed (more of a
> graceful shutdown) while performing the scan, and I saw these messages just
> before the shutdown started:
>
> 2010-07-27 22:52:57,448 WARN org.apache.zookeeper.ClientCnxn: Exception closing session 0x2a0190c221002c to sun.nio.ch.selectionkeyi...@286e4365
> java.io.IOException: TIMED OUT
>         at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:906)
> 2010-07-27 22:52:57,454 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes: Total=1950.0773MB (2044804168), Free=449.7541MB (471601336), Max=2399.8313MB (2516405504), Counts: Blocks=1914, Access=1349486, Hit=1914, Miss=1347572, Evictions=3300, Evicted=788515, Ratios: Hit Ratio=0.14183177845552564%, Miss Ratio=99.85817074775696%, Evicted/Run=238.94393920898438
> 2010-07-27 22:52:57,454 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes: Total=1950.0773MB (2044804168), Free=449.7541MB (471601336), Max=2399.8313MB (2516405504), Counts: Blocks=1914, Access=1349486, Hit=1914, Miss=1347572, Evictions=3300, Evicted=788515, Ratios: Hit Ratio=0.14183177845552564%, Miss Ratio=99.85817074775696%, Evicted/Run=238.94393920898438
> 2010-07-27 22:52:57,548 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server Responder, call next(-30527588457879943, 30) from 98.136.191.88:33914: output error
> 2010-07-27 22:52:57,548 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server Responder, call next(-7358804533352185527, 30) from 98.136.191.88:33963: output error
> 2010-07-27 22:52:57,602 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server Responder, call next(-3402818986992405350, 30) from 98.136.191.88:33927: output error
> 2010-07-27 22:52:57,603 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server Responder, call next(-6635348671997204137, 30) from 98.136.191.88:33916: output error
> 2010-07-27 22:52:57,603 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event, state: Disconnected, type: None, path: null
> 2010-07-27 22:52:57,603 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 21 on 60020 caught: java.nio.channels.ClosedChannelException
> ....................
> .....................
> 2010-07-27 22:52:58,059 WARN org.apache.hadoop.hbase.regionserver.HRegionServer: Attempt=1 org.apache.hadoop.hbase.Leases$LeaseStillHeldException
>
> Thanks in advance,
> Vidhya
>
> On 7/26/10 2:43 PM, "Vidhyashankar Venkataraman" <[email protected]> wrote:
>
> I am trying to assess the performance of Scans on a 100 TB db on 180 nodes
> running HBase 0.20.5..
>
> I run a sharded scan (each Map task runs a scan on a specific range;
> speculative execution is turned off so that there is no duplication of
> tasks) on a fully compacted table...
>
> 1 MB block size, block cache enabled.. Max of 2 tasks per node.. Each row
> is 30 KB in size: 1 big column family with just one field..
> Region lease timeout is set to an hour.. And I don't get any socket timeout
> exceptions, so I have not reassigned the write socket timeout...
>
> I ran experiments on the following cases:
>
> 1. The client-level cache is set to 1 (the default: got the number using
> getCaching): The MR tasks take around 13 hours to finish on average,
> which gives around 13.17 MBps per node. The worst case is 34 hours (to
> finish the entire job)...
> 2. Client cache set to 20 rows: this is much worse than the previous
> case: we get around a super low 1 MBps per node...
>
> Question: Should I set it to a value such that the block size is a
> multiple of the above-mentioned cache size? Or set the cache size to a much
> lower value?
>
> I find that these numbers are much lower than the ones I get when it's
> running with just a few nodes..
>
> Can you guys help me with this problem?
>
> Thank you
> Vidhya
>
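Regarding the snippet in your first mail, here is a minimal sketch of how I would wire it up (the ScanJob and ScanMapper class names are hypothetical stand-ins for your job and MapperHB; this assumes the 0.20.x org.apache.hadoop.hbase.mapreduce API). As far as I can tell, initTableMapperJob hands the Scan you pass in to the TableInputFormat via the job configuration, so scan.setCacheBlocks(false) is the setting that actually takes effect and the separate conf.set(TableInputFormat.SCAN_CACHEBLOCKS, "false") should be redundant:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class ScanJob {

  // Hypothetical mapper standing in for MapperHB: it only emits the row key,
  // so the cost of the job is dominated by the scan itself.
  static class ScanMapper extends TableMapper<Text, NullWritable> {
    @Override
    protected void map(ImmutableBytesWritable row, Result value, Context context)
        throws IOException, InterruptedException {
      context.write(new Text(row.get()), NullWritable.get());
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new HBaseConfiguration();  // 0.20.x-style construction
    String tableName = args[0];

    Scan scan = new Scan();
    scan.addColumns("bigColumn");   // the one wide column family
    scan.setCaching(10);            // rows fetched per next() RPC; tune against the 30 KB row size
    scan.setCacheBlocks(false);     // don't pollute the region server block cache

    Job job = new Job(conf, "Scan_" + tableName);
    job.setJarByClass(ScanJob.class);
    job.setOutputFormatClass(NullOutputFormat.class);  // scan only, no output written

    // The Scan object is handed to the TableInputFormat here; its caching and
    // cacheBlocks settings are the ones the scanners opened on the region
    // servers end up using.
    TableMapReduceUtil.initTableMapperJob(tableName, scan,
        ScanMapper.class, Text.class, NullWritable.class, job);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

This is only a sketch of the configuration, not a tuned job; you would run it as usual with "hadoop jar" and the table name as the single argument, and adjust setCaching() against your row size and region server lease period.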
