I reran my scans on a 180-node cluster (with 160 RegionServers / DataNodes / TaskTrackers)
using TableInputFormat with the block cache disabled, and I got the following
results/observations:
1. 90% of the job completed in roughly five and a half hours, which comes to about 29
MBps per node (counting 160 RegionServers). Is this reasonable? Can I get better
results?
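For reference, here is how that 29 MBps figure works out, as a back-of-the-envelope
check (it assumes the ~100 TB table size from my earlier mail and 160 RegionServers):

    // Back-of-the-envelope check of the ~29 MBps/node number.
    double scannedGB = 0.9 * 100 * 1024;           // 90% of a ~100 TB table, in GB
    double perNodeGB = scannedGB / 160;            // ~576 GB scanned per RegionServer
    double seconds   = 5.5 * 3600;                 // 5.5 hours of wall-clock time
    double mbps      = perNodeGB * 1024 / seconds; // ~29.8 MB/s per node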
2. The last 10% of the job has now been running for more than 3 hours. I am trying
to find a common thread that explains why this is happening.
3. I tried disabling the block cache both in the MR job (I was using
TableMapReduceUtil; see the code snippet below) and through the hbase shell, by
disabling the table, changing the column family's block-cache option with alter,
and re-enabling it:
    conf.set(TableInputFormat.SCAN_CACHEBLOCKS, "false");
    Job job = new Job(conf, "Scan_" + tableName);
    ..........
    Scan scan = new Scan();
    scan.addColumns("bigColumn");
    scan.setCaching(10);
    scan.setCacheBlocks(false); // Setting it again; I assume this overrides the previous setting.
    // First argument is the table name.
    TableMapReduceUtil.initTableMapperJob(tableName, scan,
            MapperHB.class, Text.class, NullWritable.class, job);
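For what it's worth, the shell-side change corresponds roughly to the following
client-API sketch (an illustration only, assuming the 0.20-era
HBaseAdmin/HColumnDescriptor calls, not the exact commands I ran). It turns the
block cache off at the column-family level rather than per Scan:

    import java.io.IOException;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.util.Bytes;

    public class DisableFamilyBlockCache {
        public static void main(String[] args) throws IOException {
            String tableName = args[0];              // table name passed on the command line
            HBaseConfiguration conf = new HBaseConfiguration();
            HBaseAdmin admin = new HBaseAdmin(conf);

            // Fetch the live descriptor so the family's other settings are preserved.
            HTableDescriptor desc = new HTable(conf, tableName).getTableDescriptor();
            HColumnDescriptor family = desc.getFamily(Bytes.toBytes("bigColumn"));
            family.setBlockCacheEnabled(false);

            // The table has to be offline while the column family is altered.
            admin.disableTable(tableName);
            admin.modifyColumn(tableName, "bigColumn", family);
            admin.enableTable(tableName);
        }
    }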
But I still see LruBlockCache stats like the following ones. Is this okay?
2010-07-27 14:13:40,852 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes: Total=1950.0773MB (2044804168), Free=449.7541MB (471601336), Max=2399.8313MB (2516405504), Counts: Blocks=1914, Access=1304510, Hit=1914, Miss=1302596, Evictions=3300, Evicted=788515, Ratios: Hit Ratio=0.1467217574827373%, Miss Ratio=99.8532772064209%, Evicted/Run=238.94393920898438
2010-07-27 14:14:40,860 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes: Total=1950.0773MB (2044804168), Free=449.7541MB (471601336), Max=2399.8313MB (2516405504), Counts: Blocks=1914, Access=1306103, Hit=1914, Miss=1304189, Evictions=3300, Evicted=788515, Ratios: Hit Ratio=0.14654280385002494%, Miss Ratio=99.85345602035522%, Evicted/Run=238.94393920898438
2010-07-27 14:15:40,853 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes: Total=1950.0773MB (2044804168), Free=449.7541MB (471601336), Max=2399.8313MB (2516405504), Counts: Blocks=1914, Access=1307783, Hit=1914, Miss=1305869, Evictions=3300, Evicted=788515, Ratios: Hit Ratio=0.14635456027463078%, Miss Ratio=99.8536467552185%, Evicted/Run=238.94393920898438
................
.................
2010-07-27 22:45:40,853 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache:
Cache Stats: Sizes: Total=1950.0773MB (2044804168), Free=449.7541MB
(471601336), Max=2399.8313MB (2516405504), Counts: Blocks=1914, Access=1348427,
Hit=1914, Miss=1346513, Evictions=3300, Evicted=788515, Ratios: Hit
Ratio=0.14194316463544965%, Miss Ratio=99.85805749893188%,
Evicted/Run=238.94393920898438
2010-07-27 22:46:40,854 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache:
Cache Stats: Sizes: Total=1950.0773MB (2044804168), Free=449.7541MB
(471601336), Max=2399.8313MB (2516405504), Counts: Blocks=1914, Access=1348427,
Hit=1914, Miss=1346513, Evictions=3300, Evicted=788515, Ratios: Hit
Ratio=0.14194316463544965%, Miss Ratio=99.85805749893188%,
Evicted/Run=238.94393920898438
2010-07-27 22:47:40,853 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache:
Cache Stats: Sizes: Total=1950.0773MB (2044804168), Free=449.7541MB
(471601336), Max=2399.8313MB (2516405504), Counts: Blocks=1914, Access=1348427,
Hit=1914, Miss=1346513, Evictions=3300, Evicted=788515, Ratios: Hit
Ratio=0.14194316463544965%, Miss Ratio=99.85805749893188%,
Evicted/Run=238.94393920898438
2010-07-27 22:48:40,853 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache:
Cache Stats: Sizes: Total=1950.0773MB (2044804168), Free=449.7541MB
(471601336), Max=2399.8313MB (2516405504), Counts: Blocks=1914, Access=1348427,
Hit=1914, Miss=1346513, Evictions=3300, Evicted=788515, Ratios: Hit
Ratio=0.14194316463544965%, Miss Ratio=99.85805749893188%,
Evicted/Run=238.94393920898438
2010-07-27 22:49:40,876 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache:
Cache Stats: Sizes: Total=1950.0773MB (2044804168), Free=449.7541MB
(471601336), Max=2399.8313MB (2516405504), Counts: Blocks=1914, Access=1348659,
Hit=1914, Miss=1346745, Evictions=3300, Evicted=788515, Ratios: Hit
Ratio=0.14191875234246254%, Miss Ratio=99.8580813407898%,
Evicted/Run=238.94393920898438
2010-07-27 22:50:41,140 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache:
Cache Stats: Sizes: Total=1950.0773MB (2044804168), Free=449.7541MB
(471601336), Max=2399.8313MB (2516405504), Counts: Blocks=1914, Access=1349089,
Hit=1914, Miss=1347175, Evictions=3300, Evicted=788515, Ratios: Hit
Ratio=0.1418735133484006%, Miss Ratio=99.85812902450562%,
Evicted/Run=238.94393920898438
4. A small number of region servers (8 out of 150) crashed (more of a graceful
shutdown) while performing the scan, and I saw these messages just before the
shutdown started:
2010-07-27 22:52:57,448 WARN org.apache.zookeeper.ClientCnxn: Exception closing
session 0x2a0190c221002c to sun.nio.ch.selectionkeyi...@286e4365
java.io.IOException: TIMED OUT
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:906)
2010-07-27 22:52:57,454 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache:
Cache Stats: Sizes: Total=1950.0773MB (2044804168), Free=449.7541MB
(471601336), Max=2399.8313MB (2516405504), Counts: Blocks=1914, Access=1349486,
Hit=1914, Miss=1347572, Evictions=3300, Evicted=788515, Ratios: Hit
Ratio=0.14183177845552564%, Miss Ratio=99.85817074775696%,
Evicted/Run=238.94393920898438
2010-07-27 22:52:57,454 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache:
Cache Stats: Sizes: Total=1950.0773MB (2044804168), Free=449.7541MB
(471601336), Max=2399.8313MB (2516405504), Counts: Blocks=1914, Access=1349486,
Hit=1914, Miss=1347572, Evictions=3300, Evicted=788515, Ratios: Hit
Ratio=0.14183177845552564%, Miss Ratio=99.85817074775696%,
Evicted/Run=238.94393920898438
2010-07-27 22:52:57,548 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server
Responder, call next(-30527588457879943, 30) from 98.136.191.88:33914: output
error
2010-07-27 22:52:57,548 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server
Responder, call next(-7358804533352185527, 30) from 98.136.191.88:33963: output
error
2010-07-27 22:52:57,602 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server
Responder, call next(-3402818986992405350, 30) from 98.136.191.88:33927: output
error
2010-07-27 22:52:57,603 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server
Responder, call next(-6635348671997204137, 30) from 98.136.191.88:33916: output
error
2010-07-27 22:52:57,603 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event, state:
Disconnected, type: None, path: null
2010-07-27 22:52:57,603 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
handler 21 on 60020 caught: java.nio.channels.ClosedChannelException
....................
.....................
2010-07-27 22:52:58,059 WARN
org.apache.hadoop.hbase.regionserver.HRegionServer: Attempt=1
org.apache.hadoop.hbase.Leases$LeaseStillHeldException
Thanks in advance,
Vidhya
On 7/26/10 2:43 PM, "Vidhyashankar Venkataraman" <[email protected]> wrote:
I am trying to assess the performance of scans on a 100 TB database on 180 nodes
running HBase 0.20.5.
I run a sharded scan on a fully compacted table: each map task runs a scan over a
specific key range, and speculative execution is turned off so that no range is
scanned twice.
1 MB block size, block cache enabled, and a maximum of 2 tasks per node. Each row is
30 KB in size: one big column family with just one field.
The region lease timeout is set to an hour, and since I don't get any socket timeout
exceptions I have not changed the write socket timeout.
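For context, each map task's scan has roughly the following shape (a sketch only;
shardStartKey and shardStopKey are placeholders for the task's key range, not the
real variable names):

    // Rough shape of the per-task scan: one key range per map task.
    Scan scan = new Scan();
    scan.addColumns("bigColumn");                    // the single big column family
    scan.setStartRow(Bytes.toBytes(shardStartKey));  // this task's shard start (placeholder)
    scan.setStopRow(Bytes.toBytes(shardStopKey));    // this task's shard end, exclusive (placeholder)
    scan.setCaching(1);                              // client-side scanner caching under test (1 or 20)
    scan.setCacheBlocks(true);                       // block cache was enabled in these runs

    // In the job configuration, speculative execution is turned off so that
    // no key range is scanned by two attempts:
    conf.setBoolean("mapred.map.tasks.speculative.execution", false);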
I ran experiments on the following cases:
1. Client-side scanner caching set to 1 (the default; I got the number using
getCaching()): the MR tasks take around 13 hours to finish on average, which gives
around 13.17 MBps per node. The worst case is 34 hours (to finish the entire job).
2. Client-side scanner caching set to 20 rows: this is much worse than the previous
case; we get a super-low ~1 MBps per node.
Question: should I set it to a value such that the block size is a multiple of that
cache size? Or should I set the cache size to a much lower value?
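For reference, the raw quantities involved, using the row and block sizes above
(just back-of-the-envelope numbers, not measurements):

    // Sizing with 30 KB rows and 1 MB HFile blocks (numbers from the setup above).
    long rowBytes     = 30L * 1024;             // ~30 KB per row
    long blockBytes   = 1L * 1024 * 1024;       // 1 MB block size
    long rowsPerBlock = blockBytes / rowBytes;  // ~34 rows per block
    long perRpcAt1    = 1  * rowBytes;          // caching=1  -> ~30 KB returned per next() RPC
    long perRpcAt20   = 20 * rowBytes;          // caching=20 -> ~600 KB returned per next() RPC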
I find that these numbers are much lower than the ones I get when running on just a
few nodes.
Can you guys help me with this problem?
Thank you
Vidhya