On 3 February 2012 17:31, Tom White <[email protected]> wrote:
> On Fri, Feb 3, 2012 at 6:28 AM, Marco Didonna <[email protected]> wrote:
>> I've tried using put and the issue vanishes...I guess it's a distcp
>> nasty issue. Even if I invoke distcp as follows
>>
>> hadoop distcp -pbr
>> s3n://aws-publicdatasets/common-crawl/crawl-002/2010/01/06/0/1262850335774_0.arc.gz
>> /user/noiano
>>
>> The replication factor is preserved (thank God) but the block size
>> isn't: the file ends up as two blocks instead of one. Is there any
>> known workaround?
>
> You could try
>
> hadoop distcp -D dfs.block.size=134217728 ...
>
> Or set this in your client-side Hadoop configuration, since block size
> is normally picked up from the client when writing files.
>
> S3 doesn't have the concept of block size, so it's not surprising that
> it isn't preserved by distcp.
>
> Cheers,
> Tom
>
>>

That's exactly what I did, but I also added the -pr option in order to
preserve the replication factor on the destination FS.
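For the record, the combined invocation would look something like the
sketch below (the source path, destination, and 128 MB block size are
taken from earlier in this thread; adjust them for your own files):

```shell
# Copy a single object from S3 into HDFS via distcp.
# -D dfs.block.size=134217728 forces a 128 MB block size client-side,
#   since S3 has no block-size concept for distcp to preserve.
# -pr preserves the replication factor on the destination FS.
hadoop distcp \
  -D dfs.block.size=134217728 \
  -pr \
  s3n://aws-publicdatasets/common-crawl/crawl-002/2010/01/06/0/1262850335774_0.arc.gz \
  /user/noiano
```

Note that the -D generic option must come before the other distcp
arguments for it to be picked up by the tool.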

Thank you

MD
