That's really strange. Are you putting the files in the cluster with distcp? Maybe you are affected by this: https://issues.apache.org/jira/browse/HADOOP-1506
Have you tried to put a single file with hadoop fs -put ... ? Same issues?

On Fri, Feb 3, 2012 at 1:59 PM, Marco Didonna <[email protected]> wrote:
> Hi everyone,
> I've launched a toy hadoop cluster using whirr on EC2 with the
> following settings: http://pastebin.com/QpBBhjnb. As you can see I've
> modified the default hdfs settings and I find them in the
> hdfs-site.xml on each of my cluster nodes:
>
> <configuration>
> <property>
> <name>dfs.block.size</name>
> <value>134217728</value>
> </property>
> <property>
> <name>dfs.replication</name>
> <value>1</value>
> </property>
> <property>
> <name>dfs.namenode.handler.count</name>
> <value>40</value>
> </property>
> <property>
> <name>dfs.data.dir</name>
> <value>/data/hadoop/hdfs/data</value>
> </property>
> <property>
> <name>dfs.datanode.du.reserved</name>
> <value>1073741824</value>
> </property>
> <property>
> <name>dfs.name.dir</name>
> <value>/data/hadoop/hdfs/name</value>
> </property>
> <property>
> <name>fs.checkpoint.dir</name>
> <value>/data/hadoop/hdfs/secondary</value>
> </property>
> </configuration>
>
> The problem is that these settings seems to be ignored, take a look at
> a ls command output:
>
>
> -rw-r--r-- 3 noiano supergroup 100014074 2012-02-03 11:42
> /user/noiano/commoncr/1262851185117_0.arc.gz
> -rw-r--r-- 3 noiano supergroup 100006118 2012-02-03 11:42
> /user/noiano/commoncr/1262851189779_0.arc.gz
> -rw-r--r-- 3 noiano supergroup 100006615 2012-02-03 11:43
> /user/noiano/commoncr/1262851195054_0.arc.gz
>
> Replication factor is 3, but there's more: FSCK reports
>
> /user/noiano/commoncr/1262851185117_0.arc.gz 100014074 bytes, 2
> block(s): Under replicated blk_-8776132147475805574_1192. Target
> Replicas is 3 but found 2 replica(s).
> Under replicated blk_-7884936399692653360_1197. Target Replicas is 3
> but found 2 replica(s).
> 0. blk_-8776132147475805574_1192 len=67108864 repl=2
> 1. blk_-7884936399692653360_1197 len=32905210 repl=2
>
> /user/noiano/commoncr/1262851189779_0.arc.gz 100006118 bytes, 2
> block(s): Under replicated blk_-2551924706579916650_1199. Target
> Replicas is 3 but found 2 replica(s).
> Under replicated blk_3881085958984927530_1202. Target Replicas is 3
> but found 2 replica(s).
> 0. blk_-2551924706579916650_1199 len=67108864 repl=2
> 1. blk_3881085958984927530_1202 len=32897254 repl=2
>
> /user/noiano/commoncr/1262851195054_0.arc.gz 100006615 bytes, 2
> block(s): Under replicated blk_8331213014551445027_1204. Target
> Replicas is 3 but found 2 replica(s).
> Under replicated blk_-8642619382276868802_1204. Target Replicas is 3
> but found 2 replica(s).
> 0. blk_8331213014551445027_1204 len=67108864 repl=2
> 1. blk_-8642619382276868802_1204 len=32897751 repl=2
>
> Status: HEALTHY
> Total size: 14326036391 B
> Total dirs: 1
> Total files: 144
> Total blocks (validated): 287 (avg. block size 49916503 B)
> Minimally replicated blocks: 287 (100.0 %)
> Over-replicated blocks: 0 (0.0 %)
> Under-replicated blocks: 287 (100.0 %)
> Mis-replicated blocks: 0 (0.0 %)
> Default replication factor: 1
> Average block replication: 2.0
> Corrupt blocks: 0
> Missing replicas: 287 (50.0 %)
> Number of data-nodes: 2
> Number of racks: 1
> FSCK ended at Fri Feb 03 11:49:53 UTC 2012 in 36 milliseconds
>
> So even if a single file is less than the blocksize 134217728, it
> occupies two blocks. That's really weird, don't you think? This is
> probably the reason why distcp takes ages to complete...
>
> Any ideas?
>
> Thank you
>
> Marco Didonna
>
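
For what it's worth, the block lengths in the fsck output line up with a 64 MB block size (67108864 bytes), not with the configured 134217728. Here is a rough sketch of the arithmetic. The helper name split_into_blocks is made up for illustration, and it assumes a file is simply cut into fixed-size blocks with a shorter final block:

```python
def split_into_blocks(file_size, block_size):
    """Return the block lengths a file of file_size bytes would occupy."""
    full_blocks, remainder = divmod(file_size, block_size)
    return [block_size] * full_blocks + ([remainder] if remainder else [])


FILE_SIZE = 100014074    # size of the first file in the ls/fsck output
CONFIGURED = 134217728   # dfs.block.size from hdfs-site.xml
OBSERVED = 67108864      # len= of block 0 in the fsck output

# With the observed block size the file needs two blocks,
# with the configured one it would fit in a single block.
print(split_into_blocks(FILE_SIZE, OBSERVED))
print(split_into_blocks(FILE_SIZE, CONFIGURED))
```

The first result matches the two len= values fsck reports for that file, so the files look like they were written with a 64 MB block size rather than the value in hdfs-site.xml.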
