I've tried using put and the issue vanishes, so I guess it's a nasty
distcp issue. It persists even if I invoke distcp as follows:

hadoop distcp -pbr \
  s3n://aws-publicdatasets/common-crawl/crawl-002/2010/01/06/0/1262850335774_0.arc.gz \
  /user/noiano

The replication factor is preserved (thank God) but the block size
isn't, so this file is stored as two blocks instead of one. Is there a
known workaround?
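
For what it's worth, the block lengths in the fsck output quoted below look
consistent with a 64 MB block size (67108864 bytes) being applied instead of
the configured 128 MB. Here is a minimal sketch of that arithmetic; the helper
name block_layout is made up for illustration and is not part of Hadoop, and
the file size is the first one from the fsck report:

```python
def block_layout(file_size, block_size):
    """Split a file of file_size bytes into consecutive block lengths."""
    full_blocks, remainder = divmod(file_size, block_size)
    return [block_size] * full_blocks + ([remainder] if remainder else [])


FILE_SIZE = 100014074    # 1262851185117_0.arc.gz, from the fsck report
CONFIGURED = 134217728   # dfs.block.size set in hdfs-site.xml (128 MB)
OBSERVED = 67108864      # length of block 0 reported by fsck (64 MB)

# With the block size fsck actually shows, the file needs two blocks.
print(block_layout(FILE_SIZE, OBSERVED))    # [67108864, 32905210]

# With the configured block size, the same file would fit in one block.
print(block_layout(FILE_SIZE, CONFIGURED))  # [100014074]
```

The two-block result matches the len= values fsck prints, which suggests the
128 MB setting is simply not reaching the copy, rather than the files being
split for some other reason.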

Thanks for your help

MD

On 3 February 2012 13:38, Andrei Savu <[email protected]> wrote:
> That's really strange. Are you putting the files in the cluster with distcp?
> Maybe you are affected by this:
> https://issues.apache.org/jira/browse/HADOOP-1506
>
> Have you tried putting a single file with hadoop fs -put ... ? Same issue?
>
> On Fri, Feb 3, 2012 at 1:59 PM, Marco Didonna <[email protected]> wrote:
>>
>> Hi everyone,
>> I've launched a toy hadoop cluster using whirr on EC2 with the
>> following settings: http://pastebin.com/QpBBhjnb. As you can see I've
>> modified the default hdfs settings and I find them in the
>> hdfs-site.xml on each of my cluster nodes:
>>
>> <configuration>
>>  <property>
>>    <name>dfs.block.size</name>
>>    <value>134217728</value>
>>  </property>
>>  <property>
>>    <name>dfs.replication</name>
>>    <value>1</value>
>>  </property>
>>  <property>
>>    <name>dfs.namenode.handler.count</name>
>>    <value>40</value>
>>  </property>
>>  <property>
>>    <name>dfs.data.dir</name>
>>    <value>/data/hadoop/hdfs/data</value>
>>  </property>
>>  <property>
>>    <name>dfs.datanode.du.reserved</name>
>>    <value>1073741824</value>
>>  </property>
>>  <property>
>>    <name>dfs.name.dir</name>
>>    <value>/data/hadoop/hdfs/name</value>
>>  </property>
>>  <property>
>>    <name>fs.checkpoint.dir</name>
>>    <value>/data/hadoop/hdfs/secondary</value>
>>  </property>
>> </configuration>
>>
>> The problem is that these settings seem to be ignored. Take a look at
>> the output of an ls command:
>>
>>
>> -rw-r--r--   3 noiano supergroup  100014074 2012-02-03 11:42
>> /user/noiano/commoncr/1262851185117_0.arc.gz
>> -rw-r--r--   3 noiano supergroup  100006118 2012-02-03 11:42
>> /user/noiano/commoncr/1262851189779_0.arc.gz
>> -rw-r--r--   3 noiano supergroup  100006615 2012-02-03 11:43
>> /user/noiano/commoncr/1262851195054_0.arc.gz
>>
>> Replication factor is 3, but there's more: FSCK reports
>>
>> /user/noiano/commoncr/1262851185117_0.arc.gz 100014074 bytes, 2
>> block(s):  Under replicated blk_-8776132147475805574_1192. Target
>> Replicas is 3 but found 2 replica(s).
>>  Under replicated blk_-7884936399692653360_1197. Target Replicas is 3
>> but found 2 replica(s).
>> 0. blk_-8776132147475805574_1192 len=67108864 repl=2
>> 1. blk_-7884936399692653360_1197 len=32905210 repl=2
>>
>> /user/noiano/commoncr/1262851189779_0.arc.gz 100006118 bytes, 2
>> block(s):  Under replicated blk_-2551924706579916650_1199. Target
>> Replicas is 3 but found 2 replica(s).
>>  Under replicated blk_3881085958984927530_1202. Target Replicas is 3
>> but found 2 replica(s).
>> 0. blk_-2551924706579916650_1199 len=67108864 repl=2
>> 1. blk_3881085958984927530_1202 len=32897254 repl=2
>>
>> /user/noiano/commoncr/1262851195054_0.arc.gz 100006615 bytes, 2
>> block(s):  Under replicated blk_8331213014551445027_1204. Target
>> Replicas is 3 but found 2 replica(s).
>>  Under replicated blk_-8642619382276868802_1204. Target Replicas is 3
>> but found 2 replica(s).
>> 0. blk_8331213014551445027_1204 len=67108864 repl=2
>> 1. blk_-8642619382276868802_1204 len=32897751 repl=2
>>
>> Status: HEALTHY
>>  Total size:    14326036391 B
>>  Total dirs:    1
>>  Total files:   144
>>  Total blocks (validated):      287 (avg. block size 49916503 B)
>>  Minimally replicated blocks:   287 (100.0 %)
>>  Over-replicated blocks:        0 (0.0 %)
>>  Under-replicated blocks:       287 (100.0 %)
>>  Mis-replicated blocks:         0 (0.0 %)
>>  Default replication factor:    1
>>  Average block replication:     2.0
>>  Corrupt blocks:                0
>>  Missing replicas:              287 (50.0 %)
>>  Number of data-nodes:          2
>>  Number of racks:               1
>> FSCK ended at Fri Feb 03 11:49:53 UTC 2012 in 36 milliseconds
>>
>> So even though a single file is smaller than the block size (134217728
>> bytes), it occupies two blocks. That's really weird, don't you think?
>> This is probably the reason why distcp takes ages to complete...
>>
>> Any ideas?
>>
>> Thank you
>>
>> Marco Didonna
>
>
