If a task tracker runs on such nodes, please reduce the maximum number of mappers / reducers.
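For example, on nodes that also host a region server, something like the following in mapred-site.xml caps the per-node task slots (these are the standard Hadoop 0.20 property names; the values are only illustrative and the TaskTracker has to be restarted for them to take effect):

    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>2</value>
    </property>
    <property>
      <name>mapred.tasktracker.reduce.tasks.maximum</name>
      <value>1</value>
    </property>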
On Wed, Jul 28, 2010 at 11:03 AM, Vidhyashankar Venkataraman <[email protected]> wrote:

Yeah, I got it in some:

2010-07-27 22:53:31,703 WARN org.apache.zookeeper.ClientCnxn: Exception closing session 0x32a0190c21e0030 to sun.nio.ch.selectionkeyi...@7971f189
java.io.IOException: TIMED OUT
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:906)
2010-07-27 22:53:31,937 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes: Total=1847.005MB (1936725152), Free=552.8263MB (579680352), Max=2399.8313MB (2516405504), Counts: Blocks=1811, Access=1332343, Hit=1811, Miss=1330532, Evictions=3177, Evicted=759144, Ratios: Hit Ratio=0.13592595933005214%, Miss Ratio=99.86407160758972%, Evicted/Run=238.949951171875
2010-07-27 22:53:31,936 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server Responder, call next(-8452685834063022439, 30) from 98.136.191.84:57851: output error
2010-07-27 22:53:31,827 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server Responder, call next(-5892917023728878793, 30) from 98.136.191.84:57817: output error
2010-07-27 22:53:31,998 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server Responder, call next(-299811806597554323, 30) from 98.136.191.84:57840: output error
2010-07-27 22:53:31,822 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server Responder, call next(-2595212217395389635, 30) from 98.136.191.84:57829: output error
2010-07-27 22:53:31,718 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 141975ms, ten times longer than scheduled: 10000
2010-07-27 22:53:32,144 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 39 on 60020 caught: java.nio.channels.ClosedChannelException

Vidhya


On 7/28/10 10:51 AM, "Ted Yu" <[email protected]> wrote:

For the region servers that went down, do you find the following in their logs?

2010-06-17 22:22:47,414 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 55448ms, ten times longer than scheduled: 3000


On Wed, Jul 28, 2010 at 10:30 AM, Vidhyashankar Venkataraman <[email protected]> wrote:

I reran my scans on a 180-node cluster (with 160 RS's / data nodes / task trackers) using TableInputFormat with the block cache disabled, and I got the following results / observations:

1. 90% of the job completed in roughly five and a half hours, which comes to 29 MBps per node (counting 160 RS's). Is this reasonable? Can I get better results?

2. The last 10% of the job has still been running for more than 3 hours. I am trying to find some common thread of reasoning for why this is happening.

3. I tried disabling the block cache in the MR job (I was using TableMapReduceUtil: see the code snippet below) and also through the hbase shell, just to change the default options (disable the table and change the column family settings with alter):

    conf.set(TableInputFormat.SCAN_CACHEBLOCKS, "false");
    Job job = new Job(conf, "Scan" + "_" + tableName);
    ..........
    Scan scan = new Scan();
    scan.addColumns("bigColumn");
    scan.setCaching(10);
    scan.setCacheBlocks(false);  // Setting it again; I guess this will override the previous setting.
    // First argument is the table name.
    TableMapReduceUtil.initTableMapperJob(tableName, scan, MapperHB.class, Text.class, NullWritable.class, job);

But I still get LruBlockCache stats like the following ones: is this okay?

2010-07-27 14:13:40,852 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes: Total=1950.0773MB (2044804168), Free=449.7541MB (471601336), Max=2399.8313MB (2516405504), Counts: Blocks=1914, Access=1304510, Hit=1914, Miss=1302596, Evictions=3300, Evicted=788515, Ratios: Hit Ratio=0.1467217574827373%, Miss Ratio=99.8532772064209%, Evicted/Run=238.94393920898438
2010-07-27 14:14:40,860 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes: Total=1950.0773MB (2044804168), Free=449.7541MB (471601336), Max=2399.8313MB (2516405504), Counts: Blocks=1914, Access=1306103, Hit=1914, Miss=1304189, Evictions=3300, Evicted=788515, Ratios: Hit Ratio=0.14654280385002494%, Miss Ratio=99.85345602035522%, Evicted/Run=238.94393920898438
2010-07-27 14:15:40,853 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes: Total=1950.0773MB (2044804168), Free=449.7541MB (471601336), Max=2399.8313MB (2516405504), Counts: Blocks=1914, Access=1307783, Hit=1914, Miss=1305869, Evictions=3300, Evicted=788515, Ratios: Hit Ratio=0.14635456027463078%, Miss Ratio=99.8536467552185%, Evicted/Run=238.94393920898438
................
.................
2010-07-27 22:45:40,853 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes: Total=1950.0773MB (2044804168), Free=449.7541MB (471601336), Max=2399.8313MB (2516405504), Counts: Blocks=1914, Access=1348427, Hit=1914, Miss=1346513, Evictions=3300, Evicted=788515, Ratios: Hit Ratio=0.14194316463544965%, Miss Ratio=99.85805749893188%, Evicted/Run=238.94393920898438
2010-07-27 22:46:40,854 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes: Total=1950.0773MB (2044804168), Free=449.7541MB (471601336), Max=2399.8313MB (2516405504), Counts: Blocks=1914, Access=1348427, Hit=1914, Miss=1346513, Evictions=3300, Evicted=788515, Ratios: Hit Ratio=0.14194316463544965%, Miss Ratio=99.85805749893188%, Evicted/Run=238.94393920898438
2010-07-27 22:47:40,853 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes: Total=1950.0773MB (2044804168), Free=449.7541MB (471601336), Max=2399.8313MB (2516405504), Counts: Blocks=1914, Access=1348427, Hit=1914, Miss=1346513, Evictions=3300, Evicted=788515, Ratios: Hit Ratio=0.14194316463544965%, Miss Ratio=99.85805749893188%, Evicted/Run=238.94393920898438
2010-07-27 22:48:40,853 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes: Total=1950.0773MB (2044804168), Free=449.7541MB (471601336), Max=2399.8313MB (2516405504), Counts: Blocks=1914, Access=1348427, Hit=1914, Miss=1346513, Evictions=3300, Evicted=788515, Ratios: Hit Ratio=0.14194316463544965%, Miss Ratio=99.85805749893188%, Evicted/Run=238.94393920898438
2010-07-27 22:49:40,876 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes: Total=1950.0773MB (2044804168), Free=449.7541MB (471601336), Max=2399.8313MB (2516405504), Counts: Blocks=1914, Access=1348659, Hit=1914, Miss=1346745, Evictions=3300, Evicted=788515, Ratios: Hit Ratio=0.14191875234246254%, Miss Ratio=99.8580813407898%, Evicted/Run=238.94393920898438
2010-07-27 22:50:41,140 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes: Total=1950.0773MB (2044804168), Free=449.7541MB (471601336), Max=2399.8313MB (2516405504), Counts: Blocks=1914, Access=1349089, Hit=1914, Miss=1347175, Evictions=3300, Evicted=788515, Ratios: Hit Ratio=0.1418735133484006%, Miss Ratio=99.85812902450562%, Evicted/Run=238.94393920898438
4. A small number of region servers (8 out of 150) crash (more of a graceful shutdown) while performing the scan, and I saw these messages just before the shutdown started:

2010-07-27 22:52:57,448 WARN org.apache.zookeeper.ClientCnxn: Exception closing session 0x2a0190c221002c to sun.nio.ch.selectionkeyi...@286e4365
java.io.IOException: TIMED OUT
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:906)
2010-07-27 22:52:57,454 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes: Total=1950.0773MB (2044804168), Free=449.7541MB (471601336), Max=2399.8313MB (2516405504), Counts: Blocks=1914, Access=1349486, Hit=1914, Miss=1347572, Evictions=3300, Evicted=788515, Ratios: Hit Ratio=0.14183177845552564%, Miss Ratio=99.85817074775696%, Evicted/Run=238.94393920898438
2010-07-27 22:52:57,454 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes: Total=1950.0773MB (2044804168), Free=449.7541MB (471601336), Max=2399.8313MB (2516405504), Counts: Blocks=1914, Access=1349486, Hit=1914, Miss=1347572, Evictions=3300, Evicted=788515, Ratios: Hit Ratio=0.14183177845552564%, Miss Ratio=99.85817074775696%, Evicted/Run=238.94393920898438
2010-07-27 22:52:57,548 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server Responder, call next(-30527588457879943, 30) from 98.136.191.88:33914: output error
2010-07-27 22:52:57,548 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server Responder, call next(-7358804533352185527, 30) from 98.136.191.88:33963: output error
2010-07-27 22:52:57,602 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server Responder, call next(-3402818986992405350, 30) from 98.136.191.88:33927: output error
2010-07-27 22:52:57,603 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server Responder, call next(-6635348671997204137, 30) from 98.136.191.88:33916: output error
2010-07-27 22:52:57,603 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event, state: Disconnected, type: None, path: null
2010-07-27 22:52:57,603 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 21 on 60020 caught: java.nio.channels.ClosedChannelException
....................
.....................
2010-07-27 22:52:58,059 WARN org.apache.hadoop.hbase.regionserver.HRegionServer: Attempt=1 org.apache.hadoop.hbase.Leases$LeaseStillHeldException

Thanks in advance,
Vidhya


On 7/26/10 2:43 PM, "Vidhyashankar Venkataraman" <[email protected]> wrote:

I am trying to assess the performance of scans on a 100 TB db on 180 nodes running HBase 0.20.5.

I run a sharded scan (each map task runs a scan on a specific range; speculative execution is turned off so that there is no duplication of tasks) on a fully compacted table.

1 MB block size, block cache enabled. Max of 2 tasks per node. Each row is 30 KB in size: 1 big column family with just one field.
Region lease timeout is set to an hour. And I don't get any socket timeout exceptions, so I have not reassigned the write socket timeout.
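(For reference, a minimal sketch of how those two knobs are usually applied, in the same fragment style as the snippet earlier in the thread; the property names below are the stock Hadoop 0.20 / HBase 0.20 keys as I understand them, and the values are only illustrative:)

    // Each map task owns one key range, so turn speculative execution off
    // to avoid scanning the same range twice.
    conf.setBoolean("mapred.map.tasks.speculative.execution", false);

    // The scanner lease ("region lease timeout" above) is a region-server-side
    // setting: hbase.regionserver.lease.period in hbase-site.xml on each RS,
    // e.g. 3600000 ms for a one-hour lease (the default is 60000 ms).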
I ran experiments on the following cases:

1. The client-level cache is set to 1 (the default: I got the number using getCaching): the MR tasks take around 13 hours to finish on average, which gives around 13.17 MBps per node. The worst case is 34 hours (to finish the entire job).
2. Client cache set to 20 rows: this is much worse than the previous case: we get a super-low 1 MBps per node.

Question: Should I set it to a value such that the block size is a multiple of the above-said cache size? Or set the cache size to a much lower value?

I find that these numbers are much lower than the ones I get when it's running with just a few nodes.

Can you guys help me with this problem?

Thank you
Vidhya
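A minimal, self-contained sketch of the kind of scan job discussed in this thread (the class and mapper names are placeholders, "bigColumn" and the ~30 KB row size come from the description above, and the caching value is only illustrative):

    import java.io.IOException;

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

    public class FullScanJob {

      /** Trivial map-only task: reads every row and emits just the row key. */
      static class ScanMapper extends TableMapper<Text, NullWritable> {
        @Override
        protected void map(ImmutableBytesWritable row, Result value, Context context)
            throws IOException, InterruptedException {
          context.write(new Text(row.get()), NullWritable.get());
        }
      }

      public static void main(String[] args) throws Exception {
        String tableName = args[0];                       // table to scan, passed on the command line
        HBaseConfiguration conf = new HBaseConfiguration();

        Scan scan = new Scan();
        scan.addColumns("bigColumn");                     // the single wide column family
        scan.setCacheBlocks(false);                       // full scan: don't churn the RS block cache
        // Rows are ~30 KB, so caching=n means roughly n * 30 KB per next() RPC:
        // caching=1 is one RPC per row; caching=10 is ~300 KB per round trip.
        scan.setCaching(10);

        Job job = new Job(conf, "Scan_" + tableName);
        job.setJarByClass(FullScanJob.class);
        TableMapReduceUtil.initTableMapperJob(tableName, scan, ScanMapper.class,
            Text.class, NullWritable.class, job);
        job.setNumReduceTasks(0);                         // map-only job
        job.setOutputFormatClass(NullOutputFormat.class); // nothing to write out
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

As far as I can tell from the 0.20.x TableMapReduceUtil, the Scan passed to initTableMapperJob is serialized into the job configuration and takes precedence over the individual TableInputFormat.SCAN_* keys, so scan.setCaching() / scan.setCacheBlocks() are the values that actually reach the region servers; the per-RPC payload implied by setCaching() is the trade-off behind cases 1 and 2 above.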
