That's really strange. Are you copying the files into the cluster with
distcp? If so, you may be affected by this issue:
https://issues.apache.org/jira/browse/HADOOP-1506

Have you tried putting a single file with hadoop fs -put ...? Do you see the same issue?
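
As a rough sanity check on the fsck output quoted below, the block lengths can be compared against what each block size would produce. This is only a sketch: `split_into_blocks` is a hypothetical helper, and the 64 MiB figure is an assumption about the stock default block size, not something read from your configuration.

```python
MIB = 1024 * 1024


def split_into_blocks(file_size, block_size):
    """Return the block lengths a file of file_size bytes would be
    split into if written with the given block size (hypothetical helper)."""
    full_blocks, remainder = divmod(file_size, block_size)
    return [block_size] * full_blocks + ([remainder] if remainder else [])


file_size = 100014074        # first file in the quoted fsck output
configured = 134217728       # dfs.block.size from the quoted hdfs-site.xml
assumed_default = 64 * MIB   # assumption: default block size, 67108864 bytes

# With the configured 128 MiB block size the file would fit in one block.
print(split_into_blocks(file_size, configured))

# With a 64 MiB block size it needs two blocks.
print(split_into_blocks(file_size, assumed_default))
```

If the second result lines up with the len= values fsck reports (67108864 and 32905210 for the first file), the files were probably written with a 64 MiB block size, which would suggest that whatever wrote them did not pick up the 128 MiB setting.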

On Fri, Feb 3, 2012 at 1:59 PM, Marco Didonna <[email protected]> wrote:

> Hi everyone,
> I've launched a toy hadoop cluster using whirr on EC2 with the
> following settings: http://pastebin.com/QpBBhjnb. As you can see I've
> modified the default hdfs settings and I find them in the
> hdfs-site.xml on each of my cluster nodes:
>
> <configuration>
>  <property>
>    <name>dfs.block.size</name>
>    <value>134217728</value>
>  </property>
>  <property>
>    <name>dfs.replication</name>
>    <value>1</value>
>  </property>
>  <property>
>    <name>dfs.namenode.handler.count</name>
>    <value>40</value>
>  </property>
>  <property>
>    <name>dfs.data.dir</name>
>    <value>/data/hadoop/hdfs/data</value>
>  </property>
>  <property>
>    <name>dfs.datanode.du.reserved</name>
>    <value>1073741824</value>
>  </property>
>  <property>
>    <name>dfs.name.dir</name>
>    <value>/data/hadoop/hdfs/name</value>
>  </property>
>  <property>
>    <name>fs.checkpoint.dir</name>
>    <value>/data/hadoop/hdfs/secondary</value>
>  </property>
> </configuration>
>
> The problem is that these settings seem to be ignored. Take a look at
> the output of an ls command:
>
>
> -rw-r--r--   3 noiano supergroup  100014074 2012-02-03 11:42
> /user/noiano/commoncr/1262851185117_0.arc.gz
> -rw-r--r--   3 noiano supergroup  100006118 2012-02-03 11:42
> /user/noiano/commoncr/1262851189779_0.arc.gz
> -rw-r--r--   3 noiano supergroup  100006615 2012-02-03 11:43
> /user/noiano/commoncr/1262851195054_0.arc.gz
>
> The replication factor is 3 instead of the configured 1, but there's more: fsck reports
>
> /user/noiano/commoncr/1262851185117_0.arc.gz 100014074 bytes, 2
> block(s):  Under replicated blk_-8776132147475805574_1192. Target
> Replicas is 3 but found 2 replica(s).
>  Under replicated blk_-7884936399692653360_1197. Target Replicas is 3
> but found 2 replica(s).
> 0. blk_-8776132147475805574_1192 len=67108864 repl=2
> 1. blk_-7884936399692653360_1197 len=32905210 repl=2
>
> /user/noiano/commoncr/1262851189779_0.arc.gz 100006118 bytes, 2
> block(s):  Under replicated blk_-2551924706579916650_1199. Target
> Replicas is 3 but found 2 replica(s).
>  Under replicated blk_3881085958984927530_1202. Target Replicas is 3
> but found 2 replica(s).
> 0. blk_-2551924706579916650_1199 len=67108864 repl=2
> 1. blk_3881085958984927530_1202 len=32897254 repl=2
>
> /user/noiano/commoncr/1262851195054_0.arc.gz 100006615 bytes, 2
> block(s):  Under replicated blk_8331213014551445027_1204. Target
> Replicas is 3 but found 2 replica(s).
>  Under replicated blk_-8642619382276868802_1204. Target Replicas is 3
> but found 2 replica(s).
> 0. blk_8331213014551445027_1204 len=67108864 repl=2
> 1. blk_-8642619382276868802_1204 len=32897751 repl=2
>
> Status: HEALTHY
>  Total size:    14326036391 B
>  Total dirs:    1
>  Total files:   144
>  Total blocks (validated):      287 (avg. block size 49916503 B)
>  Minimally replicated blocks:   287 (100.0 %)
>  Over-replicated blocks:        0 (0.0 %)
>  Under-replicated blocks:       287 (100.0 %)
>  Mis-replicated blocks:         0 (0.0 %)
>  Default replication factor:    1
>  Average block replication:     2.0
>  Corrupt blocks:                0
>  Missing replicas:              287 (50.0 %)
>  Number of data-nodes:          2
>  Number of racks:               1
> FSCK ended at Fri Feb 03 11:49:53 UTC 2012 in 36 milliseconds
>
> So even though each file is smaller than the configured block size of
> 134217728 bytes, it occupies two blocks. That's really weird, don't you
> think? This is probably the reason why distcp takes ages to complete...
>
> Any ideas?
>
> Thank you
>
> Marco Didonna
>
