If a task tracker runs on such nodes, please reduce the maximum number of mappers / reducers.
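For example, on nodes that also host a region server, something like the following in mapred-site.xml caps the per-node task slots (these are the standard Hadoop 0.20 property names; the values are only illustrative and the TaskTracker has to be restarted for them to take effect):

    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>2</value>
    </property>
    <property>
      <name>mapred.tasktracker.reduce.tasks.maximum</name>
      <value>1</value>
    </property>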
On Wed, Jul 28, 2010 at 11:03 AM, Vidhyashankar Venkataraman <[email protected]> wrote:

Yeah, I got it in some:

2010-07-27 22:53:31,703 WARN org.apache.zookeeper.ClientCnxn: Exception closing session 0x32a0190c21e0030 to sun.nio.ch.selectionkeyi...@7971f189
java.io.IOException: TIMED OUT
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:906)
2010-07-27 22:53:31,937 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes: Total=1847.005MB (1936725152), Free=552.8263MB (579680352), Max=2399.8313MB (2516405504), Counts: Blocks=1811, Access=1332343, Hit=1811, Miss=1330532, Evictions=3177, Evicted=759144, Ratios: Hit Ratio=0.13592595933005214%, Miss Ratio=99.86407160758972%, Evicted/Run=238.949951171875
2010-07-27 22:53:31,936 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server Responder, call next(-8452685834063022439, 30) from 98.136.191.84:57851: output error
2010-07-27 22:53:31,827 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server Responder, call next(-5892917023728878793, 30) from 98.136.191.84:57817: output error
2010-07-27 22:53:31,998 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server Responder, call next(-299811806597554323, 30) from 98.136.191.84:57840: output error
2010-07-27 22:53:31,822 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server Responder, call next(-2595212217395389635, 30) from 98.136.191.84:57829: output error
2010-07-27 22:53:31,718 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 141975ms, ten times longer than scheduled: 10000
2010-07-27 22:53:32,144 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 39 on 60020 caught: java.nio.channels.ClosedChannelException

Vidhya


On 7/28/10 10:51 AM, "Ted Yu" <[email protected]> wrote:

For the region servers that went down, do you find the following in their logs?

2010-06-17 22:22:47,414 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 55448ms, ten times longer than scheduled: 3000


On Wed, Jul 28, 2010 at 10:30 AM, Vidhyashankar Venkataraman <[email protected]> wrote:

I reran my scans on a 180-node cluster (with 160 RS's / data nodes / task trackers) using TableInputFormat with the block cache disabled, and I got the following results / observations:

1. 90% of the job completed in roughly five and a half hours, which comes to 29 MBps per node (counting 160 RS's). Is this reasonable? Can I get better results?

2. The last 10% of the job has still been running for more than 3 hours. I am trying to find some common thread of reasoning for why this is happening.

3. I tried disabling the block cache in the MR job (I was using TableMapReduceUtil: see the code snippet below) and also through the hbase shell, just to change the default options (disable the table and change the column family settings with alter):

    conf.set(TableInputFormat.SCAN_CACHEBLOCKS, "false");
    Job job = new Job(conf, "Scan" + "_" + tableName);
    ..........
    Scan scan = new Scan();
    scan.addColumns("bigColumn");
    scan.setCaching(10);
    scan.setCacheBlocks(false);  // Setting it again; I guess this will override the previous setting.
    // First argument is the table name.
    TableMapReduceUtil.initTableMapperJob(tableName, scan, MapperHB.class, Text.class, NullWritable.class, job);

But I still get LruBlockCache stats like the following ones: is this okay?

2010-07-27 14:13:40,852 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes: Total=1950.0773MB (2044804168), Free=449.7541MB (471601336), Max=2399.8313MB (2516405504), Counts: Blocks=1914, Access=1304510, Hit=1914, Miss=1302596, Evictions=3300, Evicted=788515, Ratios: Hit Ratio=0.1467217574827373%, Miss Ratio=99.8532772064209%, Evicted/Run=238.94393920898438
2010-07-27 14:14:40,860 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes: Total=1950.0773MB (2044804168), Free=449.7541MB (471601336), Max=2399.8313MB (2516405504), Counts: Blocks=1914, Access=1306103, Hit=1914, Miss=1304189, Evictions=3300, Evicted=788515, Ratios: Hit Ratio=0.14654280385002494%, Miss Ratio=99.85345602035522%, Evicted/Run=238.94393920898438
2010-07-27 14:15:40,853 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes: Total=1950.0773MB (2044804168), Free=449.7541MB (471601336), Max=2399.8313MB (2516405504), Counts: Blocks=1914, Access=1307783, Hit=1914, Miss=1305869, Evictions=3300, Evicted=788515, Ratios: Hit Ratio=0.14635456027463078%, Miss Ratio=99.8536467552185%, Evicted/Run=238.94393920898438
................
.................
2010-07-27 22:45:40,853 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes: Total=1950.0773MB (2044804168), Free=449.7541MB (471601336), Max=2399.8313MB (2516405504), Counts: Blocks=1914, Access=1348427, Hit=1914, Miss=1346513, Evictions=3300, Evicted=788515, Ratios: Hit Ratio=0.14194316463544965%, Miss Ratio=99.85805749893188%, Evicted/Run=238.94393920898438
2010-07-27 22:46:40,854 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes: Total=1950.0773MB (2044804168), Free=449.7541MB (471601336), Max=2399.8313MB (2516405504), Counts: Blocks=1914, Access=1348427, Hit=1914, Miss=1346513, Evictions=3300, Evicted=788515, Ratios: Hit Ratio=0.14194316463544965%, Miss Ratio=99.85805749893188%, Evicted/Run=238.94393920898438
2010-07-27 22:47:40,853 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes: Total=1950.0773MB (2044804168), Free=449.7541MB (471601336), Max=2399.8313MB (2516405504), Counts: Blocks=1914, Access=1348427, Hit=1914, Miss=1346513, Evictions=3300, Evicted=788515, Ratios: Hit Ratio=0.14194316463544965%, Miss Ratio=99.85805749893188%, Evicted/Run=238.94393920898438
2010-07-27 22:48:40,853 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes: Total=1950.0773MB (2044804168), Free=449.7541MB (471601336), Max=2399.8313MB (2516405504), Counts: Blocks=1914, Access=1348427, Hit=1914, Miss=1346513, Evictions=3300, Evicted=788515, Ratios: Hit Ratio=0.14194316463544965%, Miss Ratio=99.85805749893188%, Evicted/Run=238.94393920898438
2010-07-27 22:49:40,876 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes: Total=1950.0773MB (2044804168), Free=449.7541MB (471601336), Max=2399.8313MB (2516405504), Counts: Blocks=1914, Access=1348659, Hit=1914, Miss=1346745, Evictions=3300, Evicted=788515, Ratios: Hit Ratio=0.14191875234246254%, Miss Ratio=99.8580813407898%, Evicted/Run=238.94393920898438
2010-07-27 22:50:41,140 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes: Total=1950.0773MB (2044804168), Free=449.7541MB (471601336), Max=2399.8313MB (2516405504), Counts: Blocks=1914, Access=1349089, Hit=1914, Miss=1347175, Evictions=3300, Evicted=788515, Ratios: Hit Ratio=0.1418735133484006%, Miss Ratio=99.85812902450562%, Evicted/Run=238.94393920898438
4. A small number of region servers (8 out of 150) crash (more of a graceful shutdown) while performing the scan, and I saw these messages just before the shutdown started:

2010-07-27 22:52:57,448 WARN org.apache.zookeeper.ClientCnxn: Exception closing session 0x2a0190c221002c to sun.nio.ch.selectionkeyi...@286e4365
java.io.IOException: TIMED OUT
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:906)
2010-07-27 22:52:57,454 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes: Total=1950.0773MB (2044804168), Free=449.7541MB (471601336), Max=2399.8313MB (2516405504), Counts: Blocks=1914, Access=1349486, Hit=1914, Miss=1347572, Evictions=3300, Evicted=788515, Ratios: Hit Ratio=0.14183177845552564%, Miss Ratio=99.85817074775696%, Evicted/Run=238.94393920898438
2010-07-27 22:52:57,454 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes: Total=1950.0773MB (2044804168), Free=449.7541MB (471601336), Max=2399.8313MB (2516405504), Counts: Blocks=1914, Access=1349486, Hit=1914, Miss=1347572, Evictions=3300, Evicted=788515, Ratios: Hit Ratio=0.14183177845552564%, Miss Ratio=99.85817074775696%, Evicted/Run=238.94393920898438
2010-07-27 22:52:57,548 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server Responder, call next(-30527588457879943, 30) from 98.136.191.88:33914: output error
2010-07-27 22:52:57,548 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server Responder, call next(-7358804533352185527, 30) from 98.136.191.88:33963: output error
2010-07-27 22:52:57,602 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server Responder, call next(-3402818986992405350, 30) from 98.136.191.88:33927: output error
2010-07-27 22:52:57,603 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server Responder, call next(-6635348671997204137, 30) from 98.136.191.88:33916: output error
2010-07-27 22:52:57,603 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event, state: Disconnected, type: None, path: null
2010-07-27 22:52:57,603 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 21 on 60020 caught: java.nio.channels.ClosedChannelException
....................
.....................
2010-07-27 22:52:58,059 WARN org.apache.hadoop.hbase.regionserver.HRegionServer: Attempt=1 org.apache.hadoop.hbase.Leases$LeaseStillHeldException

Thanks in advance,
Vidhya


On 7/26/10 2:43 PM, "Vidhyashankar Venkataraman" <[email protected]> wrote:

I am trying to assess the performance of scans on a 100 TB db on 180 nodes running HBase 0.20.5.

I run a sharded scan (each map task runs a scan on a specific range; speculative execution is turned off so that there is no duplication of tasks) on a fully compacted table.

1 MB block size, block cache enabled. Max of 2 tasks per node. Each row is 30 KB in size: 1 big column family with just one field.
Region lease timeout is set to an hour. And I don't get any socket timeout exceptions, so I have not reassigned the write socket timeout.
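(For reference, a minimal sketch of how those two knobs are usually applied, in the same fragment style as the snippet earlier in the thread; the property names below are the stock Hadoop 0.20 / HBase 0.20 keys as I understand them, and the values are only illustrative:)

    // Each map task owns one key range, so turn speculative execution off
    // to avoid scanning the same range twice.
    conf.setBoolean("mapred.map.tasks.speculative.execution", false);

    // The scanner lease ("region lease timeout" above) is a region-server-side
    // setting: hbase.regionserver.lease.period in hbase-site.xml on each RS,
    // e.g. 3600000 ms for a one-hour lease (the default is 60000 ms).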
I ran experiments on the following cases:

1. The client-level cache is set to 1 (the default: I got the number using getCaching): the MR tasks take around 13 hours to finish on average, which gives around 13.17 MBps per node. The worst case is 34 hours (to finish the entire job).
2. Client cache set to 20 rows: this is much worse than the previous case: we get a super-low 1 MBps per node.

Question: Should I set it to a value such that the block size is a multiple of the above-said cache size? Or set the cache size to a much lower value?

I find that these numbers are much lower than the ones I get when it's running with just a few nodes.

Can you guys help me with this problem?

Thank you
Vidhya
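A minimal, self-contained sketch of the kind of scan job discussed in this thread (the class and mapper names are placeholders, "bigColumn" and the ~30 KB row size come from the description above, and the caching value is only illustrative):

    import java.io.IOException;

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

    public class FullScanJob {

      /** Trivial map-only task: reads every row and emits just the row key. */
      static class ScanMapper extends TableMapper<Text, NullWritable> {
        @Override
        protected void map(ImmutableBytesWritable row, Result value, Context context)
            throws IOException, InterruptedException {
          context.write(new Text(row.get()), NullWritable.get());
        }
      }

      public static void main(String[] args) throws Exception {
        String tableName = args[0];                       // table to scan, passed on the command line
        HBaseConfiguration conf = new HBaseConfiguration();

        Scan scan = new Scan();
        scan.addColumns("bigColumn");                     // the single wide column family
        scan.setCacheBlocks(false);                       // full scan: don't churn the RS block cache
        // Rows are ~30 KB, so caching=n means roughly n * 30 KB per next() RPC:
        // caching=1 is one RPC per row; caching=10 is ~300 KB per round trip.
        scan.setCaching(10);

        Job job = new Job(conf, "Scan_" + tableName);
        job.setJarByClass(FullScanJob.class);
        TableMapReduceUtil.initTableMapperJob(tableName, scan, ScanMapper.class,
            Text.class, NullWritable.class, job);
        job.setNumReduceTasks(0);                         // map-only job
        job.setOutputFormatClass(NullOutputFormat.class); // nothing to write out
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

As far as I can tell from the 0.20.x TableMapReduceUtil, the Scan passed to initTableMapperJob is serialized into the job configuration and takes precedence over the individual TableInputFormat.SCAN_* keys, so scan.setCaching() / scan.setCacheBlocks() are the values that actually reach the region servers; the per-RPC payload implied by setCaching() is the trade-off behind cases 1 and 2 above.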
