Hi everyone,
I've launched a toy Hadoop cluster on EC2 using Whirr with the
following settings: http://pastebin.com/QpBBhjnb. As you can see, I've
modified the default HDFS settings, and I do find them in the
hdfs-site.xml on each of my cluster nodes:
<configuration>
  <property>
    <name>dfs.block.size</name>
    <value>134217728</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.handler.count</name>
    <value>40</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/data/hadoop/hdfs/data</value>
  </property>
  <property>
    <name>dfs.datanode.du.reserved</name>
    <value>1073741824</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/data/hadoop/hdfs/name</value>
  </property>
  <property>
    <name>fs.checkpoint.dir</name>
    <value>/data/hadoop/hdfs/secondary</value>
  </property>
</configuration>
The problem is that these settings seem to be ignored. Take a look at
the output of an ls command:
-rw-r--r-- 3 noiano supergroup 100014074 2012-02-03 11:42 /user/noiano/commoncr/1262851185117_0.arc.gz
-rw-r--r-- 3 noiano supergroup 100006118 2012-02-03 11:42 /user/noiano/commoncr/1262851189779_0.arc.gz
-rw-r--r-- 3 noiano supergroup 100006615 2012-02-03 11:43 /user/noiano/commoncr/1262851195054_0.arc.gz
The replication factor is 3, not 1, but there's more. fsck reports:
/user/noiano/commoncr/1262851185117_0.arc.gz 100014074 bytes, 2 block(s):
Under replicated blk_-8776132147475805574_1192. Target Replicas is 3 but found 2 replica(s).
Under replicated blk_-7884936399692653360_1197. Target Replicas is 3 but found 2 replica(s).
0. blk_-8776132147475805574_1192 len=67108864 repl=2
1. blk_-7884936399692653360_1197 len=32905210 repl=2
/user/noiano/commoncr/1262851189779_0.arc.gz 100006118 bytes, 2 block(s):
Under replicated blk_-2551924706579916650_1199. Target Replicas is 3 but found 2 replica(s).
Under replicated blk_3881085958984927530_1202. Target Replicas is 3 but found 2 replica(s).
0. blk_-2551924706579916650_1199 len=67108864 repl=2
1. blk_3881085958984927530_1202 len=32897254 repl=2
/user/noiano/commoncr/1262851195054_0.arc.gz 100006615 bytes, 2 block(s):
Under replicated blk_8331213014551445027_1204. Target Replicas is 3 but found 2 replica(s).
Under replicated blk_-8642619382276868802_1204. Target Replicas is 3 but found 2 replica(s).
0. blk_8331213014551445027_1204 len=67108864 repl=2
1. blk_-8642619382276868802_1204 len=32897751 repl=2
Status: HEALTHY
Total size: 14326036391 B
Total dirs: 1
Total files: 144
Total blocks (validated): 287 (avg. block size 49916503 B)
Minimally replicated blocks: 287 (100.0 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 287 (100.0 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 1
Average block replication: 2.0
Corrupt blocks: 0
Missing replicas: 287 (50.0 %)
Number of data-nodes: 2
Number of racks: 1
FSCK ended at Fri Feb 03 11:49:53 UTC 2012 in 36 milliseconds
So even though each file is smaller than the configured block size of
134217728 bytes, it occupies two blocks, and the first block is always
exactly 67108864 bytes (64 MB). That's really weird, don't you think?
This is probably the reason why distcp takes ages to complete...
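To double-check the numbers, here is a small Python sketch of the block arithmetic. It assumes the files were written with a 64 MB block size, which is what the len=67108864 in the fsck output suggests; block_lengths is just a helper name I made up for illustration, not anything from Hadoop.

```python
# Assumption: 64 MB is the block size actually used when the files were written
# (inferred from len=67108864 in the fsck output above).
OBSERVED_BLOCK = 64 * 1024 * 1024   # 67108864
CONFIGURED_BLOCK = 134217728        # dfs.block.size from my hdfs-site.xml


def block_lengths(file_size, block_size):
    """Split a file size into the per-block lengths one would expect to see."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])


file_size = 100014074  # 1262851185117_0.arc.gz

# Two blocks, matching the fsck lengths for this file.
print(block_lengths(file_size, OBSERVED_BLOCK))
# A single block, which is what I expected from my configuration.
print(block_lengths(file_size, CONFIGURED_BLOCK))
```

With a 64 MB block size the split comes out as 67108864 + 32905210 bytes, exactly what fsck shows, whereas the configured 128 MB would give a single block.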
Any ideas?
Thank you
Marco Didonna