On Fri, Feb 3, 2012 at 6:28 AM, Marco Didonna <[email protected]> wrote:
> I've tried using put and the issue vanishes... I guess it's a nasty
> distcp issue. Even when I invoke distcp as follows
>
> hadoop distcp -pbr
> s3n://aws-publicdatasets/common-crawl/crawl-002/2010/01/06/0/1262850335774_0.arc.gz
> /user/noiano
>
> The replication factor is preserved (thank God) but the block size
> isn't: the file ends up as two blocks instead of one. Any known
> workaround?

You could try

hadoop distcp -D dfs.block.size=134217728 ...

Or set this in your client-side Hadoop configuration, since block size
is normally picked up from the client when writing files.
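In code, the same thing looks roughly like this (an untested sketch
against the 1.x API; the output path is just for illustration). The
point is that the block size and replication for a new file come from
the writing client's Configuration, not from the datanodes:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class ClientBlockSize {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      // Block size is a per-file parameter chosen by the writing
      // client; the datanodes' hdfs-site.xml is not consulted.
      conf.setLong("dfs.block.size", 134217728L); // 128 MB
      conf.setInt("dfs.replication", 1);

      FileSystem fs = FileSystem.get(conf);

      // Every file created through this FileSystem instance now
      // uses the 128 MB block size set above.
      FSDataOutputStream out = fs.create(new Path("/user/noiano/demo.bin"));
      out.writeBytes("hello");
      out.close();

      // Sanity check: the defaults this client will actually use.
      System.out.println("default block size:  " + fs.getDefaultBlockSize());
      System.out.println("default replication: " + fs.getDefaultReplication());
    }
  }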

S3 doesn't have the concept of block size, so it's not surprising that
it isn't preserved by distcp.
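If you want to check what a copied file actually got written with, the
block size is recorded per file, so something like this (again an
untested sketch, using one of the paths from your listing) will show it:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileStatus;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class CheckBlockSize {
    public static void main(String[] args) throws Exception {
      FileSystem fs = FileSystem.get(new Configuration());
      FileStatus st = fs.getFileStatus(
          new Path("/user/noiano/commoncr/1262851185117_0.arc.gz"));
      // The block size and replication this file was created with.
      System.out.println("block size:  " + st.getBlockSize());
      System.out.println("replication: " + st.getReplication());
    }
  }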

Cheers,
Tom

>
> Thanks for your help
>
> MD
>
> On 3 February 2012 13:38, Andrei Savu <[email protected]> wrote:
>> That's really strange. Are you putting the files in the cluster with distcp?
>> Maybe you are affected by this:
>> https://issues.apache.org/jira/browse/HADOOP-1506
>>
>> Have you tried putting a single file with hadoop fs -put ...? Same issue?
>>
>> On Fri, Feb 3, 2012 at 1:59 PM, Marco Didonna <[email protected]> wrote:
>>>
>>> Hi everyone,
>>> I've launched a toy Hadoop cluster on EC2 using Whirr with the
>>> following settings: http://pastebin.com/QpBBhjnb. As you can see,
>>> I've modified the default HDFS settings, and they do appear in the
>>> hdfs-site.xml on each of my cluster nodes:
>>>
>>> <configuration>
>>>  <property>
>>>    <name>dfs.block.size</name>
>>>    <value>134217728</value>
>>>  </property>
>>>  <property>
>>>    <name>dfs.replication</name>
>>>    <value>1</value>
>>>  </property>
>>>  <property>
>>>    <name>dfs.namenode.handler.count</name>
>>>    <value>40</value>
>>>  </property>
>>>  <property>
>>>    <name>dfs.data.dir</name>
>>>    <value>/data/hadoop/hdfs/data</value>
>>>  </property>
>>>  <property>
>>>    <name>dfs.datanode.du.reserved</name>
>>>    <value>1073741824</value>
>>>  </property>
>>>  <property>
>>>    <name>dfs.name.dir</name>
>>>    <value>/data/hadoop/hdfs/name</value>
>>>  </property>
>>>  <property>
>>>    <name>fs.checkpoint.dir</name>
>>>    <value>/data/hadoop/hdfs/secondary</value>
>>>  </property>
>>> </configuration>
>>>
>>> The problem is that these settings seem to be ignored; take a look
>>> at the output of an ls command:
>>>
>>>
>>> -rw-r--r--   3 noiano supergroup  100014074 2012-02-03 11:42
>>> /user/noiano/commoncr/1262851185117_0.arc.gz
>>> -rw-r--r--   3 noiano supergroup  100006118 2012-02-03 11:42
>>> /user/noiano/commoncr/1262851189779_0.arc.gz
>>> -rw-r--r--   3 noiano supergroup  100006615 2012-02-03 11:43
>>> /user/noiano/commoncr/1262851195054_0.arc.gz
>>>
>>> The replication factor is 3, but there's more: fsck reports
>>>
>>> /user/noiano/commoncr/1262851185117_0.arc.gz 100014074 bytes, 2
>>> block(s):  Under replicated blk_-8776132147475805574_1192. Target
>>> Replicas is 3 but found 2 replica(s).
>>>  Under replicated blk_-7884936399692653360_1197. Target Replicas is 3
>>> but found 2 replica(s).
>>> 0. blk_-8776132147475805574_1192 len=67108864 repl=2
>>> 1. blk_-7884936399692653360_1197 len=32905210 repl=2
>>>
>>> /user/noiano/commoncr/1262851189779_0.arc.gz 100006118 bytes, 2
>>> block(s):  Under replicated blk_-2551924706579916650_1199. Target
>>> Replicas is 3 but found 2 replica(s).
>>>  Under replicated blk_3881085958984927530_1202. Target Replicas is 3
>>> but found 2 replica(s).
>>> 0. blk_-2551924706579916650_1199 len=67108864 repl=2
>>> 1. blk_3881085958984927530_1202 len=32897254 repl=2
>>>
>>> /user/noiano/commoncr/1262851195054_0.arc.gz 100006615 bytes, 2
>>> block(s):  Under replicated blk_8331213014551445027_1204. Target
>>> Replicas is 3 but found 2 replica(s).
>>>  Under replicated blk_-8642619382276868802_1204. Target Replicas is 3
>>> but found 2 replica(s).
>>> 0. blk_8331213014551445027_1204 len=67108864 repl=2
>>> 1. blk_-8642619382276868802_1204 len=32897751 repl=2
>>>
>>> Status: HEALTHY
>>>  Total size:    14326036391 B
>>>  Total dirs:    1
>>>  Total files:   144
>>>  Total blocks (validated):      287 (avg. block size 49916503 B)
>>>  Minimally replicated blocks:   287 (100.0 %)
>>>  Over-replicated blocks:        0 (0.0 %)
>>>  Under-replicated blocks:       287 (100.0 %)
>>>  Mis-replicated blocks:         0 (0.0 %)
>>>  Default replication factor:    1
>>>  Average block replication:     2.0
>>>  Corrupt blocks:                0
>>>  Missing replicas:              287 (50.0 %)
>>>  Number of data-nodes:          2
>>>  Number of racks:               1
>>> FSCK ended at Fri Feb 03 11:49:53 UTC 2012 in 36 milliseconds
>>>
>>> So even though each file is smaller than the configured block size
>>> of 134217728 bytes, it occupies two blocks. In fact, the first block
>>> of each file is exactly 67108864 bytes, the 64 MB default, so my
>>> setting is clearly not being applied. That's really weird, don't you
>>> think? This is probably why distcp takes ages to complete...
>>>
>>> Any ideas?
>>>
>>> Thank you
>>>
>>> Marco Didonna
>>
>>
