I've tried using put and the issue vanishes... I guess it's a nasty distcp issue. Even if I invoke distcp as follows

hadoop distcp -pbr s3n://aws-publicdatasets/common-crawl/crawl-002/2010/01/06/0/1262850335774_0.arc.gz /user/noiano

the replication factor is preserved (thank God) but the block size isn't, so this file is split into two blocks instead of one.

Any known workaround?

Thanks for your help

MD

On 3 February 2012 13:38, Andrei Savu <[email protected]> wrote:
> That's really strange. Are you putting the files in the cluster with distcp?
> Maybe you are affected by this:
> https://issues.apache.org/jira/browse/HADOOP-1506
>
> Have you tried to put a single file with hadoop fs -put ... ? Same issues?
>
> On Fri, Feb 3, 2012 at 1:59 PM, Marco Didonna <[email protected]> wrote:
>>
>> Hi everyone,
>> I've launched a toy hadoop cluster using whirr on EC2 with the
>> following settings: http://pastebin.com/QpBBhjnb. As you can see I've
>> modified the default hdfs settings and I find them in the
>> hdfs-site.xml on each of my cluster nodes:
>>
>> <configuration>
>>   <property>
>>     <name>dfs.block.size</name>
>>     <value>134217728</value>
>>   </property>
>>   <property>
>>     <name>dfs.replication</name>
>>     <value>1</value>
>>   </property>
>>   <property>
>>     <name>dfs.namenode.handler.count</name>
>>     <value>40</value>
>>   </property>
>>   <property>
>>     <name>dfs.data.dir</name>
>>     <value>/data/hadoop/hdfs/data</value>
>>   </property>
>>   <property>
>>     <name>dfs.datanode.du.reserved</name>
>>     <value>1073741824</value>
>>   </property>
>>   <property>
>>     <name>dfs.name.dir</name>
>>     <value>/data/hadoop/hdfs/name</value>
>>   </property>
>>   <property>
>>     <name>fs.checkpoint.dir</name>
>>     <value>/data/hadoop/hdfs/secondary</value>
>>   </property>
>> </configuration>
>>
>> The problem is that these settings seem to be ignored; take a look at
>> the output of an ls command:
>>
>>
>> -rw-r--r-- 3 noiano supergroup 100014074 2012-02-03 11:42
>> /user/noiano/commoncr/1262851185117_0.arc.gz
>> -rw-r--r-- 3 noiano supergroup 100006118 2012-02-03 11:42
>> /user/noiano/commoncr/1262851189779_0.arc.gz
>> -rw-r--r-- 3 noiano supergroup 100006615 2012-02-03 11:43
>> /user/noiano/commoncr/1262851195054_0.arc.gz
>>
>> Replication factor is 3, but there's more: FSCK reports
>>
>> /user/noiano/commoncr/1262851185117_0.arc.gz 100014074 bytes, 2
>> block(s): Under replicated blk_-8776132147475805574_1192. Target
>> Replicas is 3 but found 2 replica(s).
>> Under replicated blk_-7884936399692653360_1197. Target Replicas is 3
>> but found 2 replica(s).
>> 0. blk_-8776132147475805574_1192 len=67108864 repl=2
>> 1. blk_-7884936399692653360_1197 len=32905210 repl=2
>>
>> /user/noiano/commoncr/1262851189779_0.arc.gz 100006118 bytes, 2
>> block(s): Under replicated blk_-2551924706579916650_1199. Target
>> Replicas is 3 but found 2 replica(s).
>> Under replicated blk_3881085958984927530_1202. Target Replicas is 3
>> but found 2 replica(s).
>> 0. blk_-2551924706579916650_1199 len=67108864 repl=2
>> 1. blk_3881085958984927530_1202 len=32897254 repl=2
>>
>> /user/noiano/commoncr/1262851195054_0.arc.gz 100006615 bytes, 2
>> block(s): Under replicated blk_8331213014551445027_1204. Target
>> Replicas is 3 but found 2 replica(s).
>> Under replicated blk_-8642619382276868802_1204. Target Replicas is 3
>> but found 2 replica(s).
>> 0. blk_8331213014551445027_1204 len=67108864 repl=2
>> 1. blk_-8642619382276868802_1204 len=32897751 repl=2
>>
>> Status: HEALTHY
>> Total size: 14326036391 B
>> Total dirs: 1
>> Total files: 144
>> Total blocks (validated): 287 (avg. block size 49916503 B)
>> Minimally replicated blocks: 287 (100.0 %)
>> Over-replicated blocks: 0 (0.0 %)
>> Under-replicated blocks: 287 (100.0 %)
>> Mis-replicated blocks: 0 (0.0 %)
>> Default replication factor: 1
>> Average block replication: 2.0
>> Corrupt blocks: 0
>> Missing replicas: 287 (50.0 %)
>> Number of data-nodes: 2
>> Number of racks: 1
>> FSCK ended at Fri Feb 03 11:49:53 UTC 2012 in 36 milliseconds
>>
>> So even though a single file is smaller than the block size of
>> 134217728 bytes, it occupies two blocks. That's really weird, don't
>> you think? This is probably the reason why distcp takes ages to
>> complete...
>>
>> Any ideas?
>>
>> Thank you
>>
>> Marco Didonna
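
The block lengths in the fsck report can be checked with a little arithmetic. What follows is only a sketch in Python: the helper name split_into_blocks is made up for illustration and is not part of Hadoop. It assumes a file is simply cut into full blocks plus one remainder block. Under that assumption the reported lengths match a 64 MB (67108864 byte) block size, not the 134217728 bytes configured in hdfs-site.xml.

```python
def split_into_blocks(file_size, block_size):
    """Return the block lengths for a file of file_size bytes.

    Hypothetical helper for illustration only: it assumes the file is cut
    into full blocks of block_size bytes plus one final partial block.
    """
    full_blocks, remainder = divmod(file_size, block_size)
    return [block_size] * full_blocks + ([remainder] if remainder else [])


MB = 1024 * 1024

# File sizes taken from the ls output quoted in this thread.
for size in (100014074, 100006118, 100006615):
    print(size, split_into_blocks(size, 64 * MB), split_into_blocks(size, 128 * MB))
```

With 64 MB blocks each file yields two blocks (for example 67108864 + 32905210 for the first file, as fsck reports); with the configured 128 MB each would fit in a single block.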
