I reran my scans on a 180-node cluster (with 160 RegionServers / DataNodes / TaskTrackers)
using TableInputFormat with the block cache disabled, and I got the following
results/observations:
1. 90% of the job completed in roughly five and a half hours, which comes to about 29
MBps per node (counting 160 RegionServers). Is this reasonable? Can I get better
results?
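For reference, here is how that 29 MBps figure works out, as a back-of-the-envelope
check (it assumes the ~100 TB table size from my earlier mail and 160 RegionServers):

    // Back-of-the-envelope check of the ~29 MBps/node number.
    double scannedGB = 0.9 * 100 * 1024;           // 90% of a ~100 TB table, in GB
    double perNodeGB = scannedGB / 160;            // ~576 GB scanned per RegionServer
    double seconds   = 5.5 * 3600;                 // 5.5 hours of wall-clock time
    double mbps      = perNodeGB * 1024 / seconds; // ~29.8 MB/s per node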
2. The last 10% of the job has now been running for more than 3 hours. I am trying
to find a common thread that explains why this is happening.
3. I tried disabling the block cache both in the MR job (I was using
TableMapReduceUtil; see the code snippet below) and through the hbase shell, by
disabling the table, changing the column family's block-cache option with alter,
and re-enabling it:
    conf.set(TableInputFormat.SCAN_CACHEBLOCKS, "false");
    Job job = new Job(conf, "Scan_" + tableName);
    ..........
    Scan scan = new Scan();
    scan.addColumns("bigColumn");
    scan.setCaching(10);
    scan.setCacheBlocks(false); // Setting it again; I assume this overrides the previous setting.
    // First argument is the table name.
    TableMapReduceUtil.initTableMapperJob(tableName, scan,
            MapperHB.class, Text.class, NullWritable.class, job);
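For what it's worth, the shell-side change corresponds roughly to the following
client-API sketch (an illustration only, assuming the 0.20-era
HBaseAdmin/HColumnDescriptor calls, not the exact commands I ran). It turns the
block cache off at the column-family level rather than per Scan:

    import java.io.IOException;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.util.Bytes;

    public class DisableFamilyBlockCache {
        public static void main(String[] args) throws IOException {
            String tableName = args[0];              // table name passed on the command line
            HBaseConfiguration conf = new HBaseConfiguration();
            HBaseAdmin admin = new HBaseAdmin(conf);

            // Fetch the live descriptor so the family's other settings are preserved.
            HTableDescriptor desc = new HTable(conf, tableName).getTableDescriptor();
            HColumnDescriptor family = desc.getFamily(Bytes.toBytes("bigColumn"));
            family.setBlockCacheEnabled(false);

            // The table has to be offline while the column family is altered.
            admin.disableTable(tableName);
            admin.modifyColumn(tableName, "bigColumn", family);
            admin.enableTable(tableName);
        }
    }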
But I still see LruBlockCache stats like the following ones. Is this okay?
2010-07-27 14:13:40,852 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes: Total=1950.0773MB (2044804168), Free=449.7541MB (471601336), Max=2399.8313MB (2516405504), Counts: Blocks=1914, Access=1304510, Hit=1914, Miss=1302596, Evictions=3300, Evicted=788515, Ratios: Hit Ratio=0.1467217574827373%, Miss Ratio=99.8532772064209%, Evicted/Run=238.94393920898438
2010-07-27 14:14:40,860 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes: Total=1950.0773MB (2044804168), Free=449.7541MB (471601336), Max=2399.8313MB (2516405504), Counts: Blocks=1914, Access=1306103, Hit=1914, Miss=1304189, Evictions=3300, Evicted=788515, Ratios: Hit Ratio=0.14654280385002494%, Miss Ratio=99.85345602035522%, Evicted/Run=238.94393920898438
2010-07-27 14:15:40,853 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats: Sizes: Total=1950.0773MB (2044804168), Free=449.7541MB (471601336), Max=2399.8313MB (2516405504), Counts: Blocks=1914, Access=1307783, Hit=1914, Miss=1305869, Evictions=3300, Evicted=788515, Ratios: Hit Ratio=0.14635456027463078%, Miss Ratio=99.8536467552185%, Evicted/Run=238.94393920898438
................
.................
2010-07-27 22:45:40,853 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache:
Cache Stats: Sizes: Total=1950.0773MB (2044804168), Free=449.7541MB
(471601336), Max=2399.8313MB (2516405504), Counts: Blocks=1914, Access=1348427,
Hit=1914, Miss=1346513, Evictions=3300, Evicted=788515, Ratios: Hit
Ratio=0.14194316463544965%, Miss Ratio=99.85805749893188%,
Evicted/Run=238.94393920898438
2010-07-27 22:46:40,854 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache:
Cache Stats: Sizes: Total=1950.0773MB (2044804168), Free=449.7541MB
(471601336), Max=2399.8313MB (2516405504), Counts: Blocks=1914, Access=1348427,
Hit=1914, Miss=1346513, Evictions=3300, Evicted=788515, Ratios: Hit
Ratio=0.14194316463544965%, Miss Ratio=99.85805749893188%,
Evicted/Run=238.94393920898438
2010-07-27 22:47:40,853 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache:
Cache Stats: Sizes: Total=1950.0773MB (2044804168), Free=449.7541MB
(471601336), Max=2399.8313MB (2516405504), Counts: Blocks=1914, Access=1348427,
Hit=1914, Miss=1346513, Evictions=3300, Evicted=788515, Ratios: Hit
Ratio=0.14194316463544965%, Miss Ratio=99.85805749893188%,
Evicted/Run=238.94393920898438
2010-07-27 22:48:40,853 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache:
Cache Stats: Sizes: Total=1950.0773MB (2044804168), Free=449.7541MB
(471601336), Max=2399.8313MB (2516405504), Counts: Blocks=1914, Access=1348427,
Hit=1914, Miss=1346513, Evictions=3300, Evicted=788515, Ratios: Hit
Ratio=0.14194316463544965%, Miss Ratio=99.85805749893188%,
Evicted/Run=238.94393920898438
2010-07-27 22:49:40,876 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache:
Cache Stats: Sizes: Total=1950.0773MB (2044804168), Free=449.7541MB
(471601336), Max=2399.8313MB (2516405504), Counts: Blocks=1914, Access=1348659,
Hit=1914, Miss=1346745, Evictions=3300, Evicted=788515, Ratios: Hit
Ratio=0.14191875234246254%, Miss Ratio=99.8580813407898%,
Evicted/Run=238.94393920898438
2010-07-27 22:50:41,140 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache:
Cache Stats: Sizes: Total=1950.0773MB (2044804168), Free=449.7541MB
(471601336), Max=2399.8313MB (2516405504), Counts: Blocks=1914, Access=1349089,
Hit=1914, Miss=1347175, Evictions=3300, Evicted=788515, Ratios: Hit
Ratio=0.1418735133484006%, Miss Ratio=99.85812902450562%,
Evicted/Run=238.94393920898438
4. A small number of region servers (8 out of 150) crashed (more of a graceful
shutdown) while performing the scan, and I saw these messages just before the
shutdown started:
2010-07-27 22:52:57,448 WARN org.apache.zookeeper.ClientCnxn: Exception closing
session 0x2a0190c221002c to sun.nio.ch.selectionkeyi...@286e4365
java.io.IOException: TIMED OUT
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:906)
2010-07-27 22:52:57,454 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache:
Cache Stats: Sizes: Total=1950.0773MB (2044804168), Free=449.7541MB
(471601336), Max=2399.8313MB (2516405504), Counts: Blocks=1914, Access=1349486,
Hit=1914, Miss=1347572, Evictions=3300, Evicted=788515, Ratios: Hit
Ratio=0.14183177845552564%, Miss Ratio=99.85817074775696%,
Evicted/Run=238.94393920898438
2010-07-27 22:52:57,454 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache:
Cache Stats: Sizes: Total=1950.0773MB (2044804168), Free=449.7541MB
(471601336), Max=2399.8313MB (2516405504), Counts: Blocks=1914, Access=1349486,
Hit=1914, Miss=1347572, Evictions=3300, Evicted=788515, Ratios: Hit
Ratio=0.14183177845552564%, Miss Ratio=99.85817074775696%,
Evicted/Run=238.94393920898438
2010-07-27 22:52:57,548 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server
Responder, call next(-30527588457879943, 30) from 98.136.191.88:33914: output
error
2010-07-27 22:52:57,548 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server
Responder, call next(-7358804533352185527, 30) from 98.136.191.88:33963: output
error
2010-07-27 22:52:57,602 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server
Responder, call next(-3402818986992405350, 30) from 98.136.191.88:33927: output
error
2010-07-27 22:52:57,603 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server
Responder, call next(-6635348671997204137, 30) from 98.136.191.88:33916: output
error
2010-07-27 22:52:57,603 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event, state:
Disconnected, type: None, path: null
2010-07-27 22:52:57,603 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server
handler 21 on 60020 caught: java.nio.channels.ClosedChannelException
....................
.....................
2010-07-27 22:52:58,059 WARN
org.apache.hadoop.hbase.regionserver.HRegionServer: Attempt=1
org.apache.hadoop.hbase.Leases$LeaseStillHeldException
Thanks in advance,
Vidhya
On 7/26/10 2:43 PM, "Vidhyashankar Venkataraman" <[email protected]> wrote:
I am trying to assess the performance of scans on a 100 TB database on 180 nodes
running HBase 0.20.5.
I run a sharded scan on a fully compacted table: each map task runs a scan over a
specific key range, and speculative execution is turned off so that no range is
scanned twice.
1 MB block size, block cache enabled, and a maximum of 2 tasks per node. Each row is
30 KB in size: one big column family with just one field.
The region lease timeout is set to an hour, and since I don't get any socket timeout
exceptions I have not changed the write socket timeout.
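For context, each map task's scan has roughly the following shape (a sketch only;
shardStartKey and shardStopKey are placeholders for the task's key range, not the
real variable names):

    // Rough shape of the per-task scan: one key range per map task.
    Scan scan = new Scan();
    scan.addColumns("bigColumn");                    // the single big column family
    scan.setStartRow(Bytes.toBytes(shardStartKey));  // this task's shard start (placeholder)
    scan.setStopRow(Bytes.toBytes(shardStopKey));    // this task's shard end, exclusive (placeholder)
    scan.setCaching(1);                              // client-side scanner caching under test (1 or 20)
    scan.setCacheBlocks(true);                       // block cache was enabled in these runs

    // In the job configuration, speculative execution is turned off so that
    // no key range is scanned by two attempts:
    conf.setBoolean("mapred.map.tasks.speculative.execution", false);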
I ran experiments on the following cases:
1. Client-side scanner caching set to 1 (the default; I got the number using
getCaching()): the MR tasks take around 13 hours to finish on average, which gives
around 13.17 MBps per node. The worst case is 34 hours (to finish the entire job).
2. Client-side scanner caching set to 20 rows: this is much worse than the previous
case; we get a super-low ~1 MBps per node.
Question: should I set it to a value such that the block size is a multiple of that
cache size? Or should I set the cache size to a much lower value?
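For reference, the raw quantities involved, using the row and block sizes above
(just back-of-the-envelope numbers, not measurements):

    // Sizing with 30 KB rows and 1 MB HFile blocks (numbers from the setup above).
    long rowBytes     = 30L * 1024;             // ~30 KB per row
    long blockBytes   = 1L * 1024 * 1024;       // 1 MB block size
    long rowsPerBlock = blockBytes / rowBytes;  // ~34 rows per block
    long perRpcAt1    = 1  * rowBytes;          // caching=1  -> ~30 KB returned per next() RPC
    long perRpcAt20   = 20 * rowBytes;          // caching=20 -> ~600 KB returned per next() RPC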
I find that these numbers are much lower than the ones I get when running on just a
few nodes.
Can you guys help me with this problem?
Thank you
Vidhya