On 3 February 2012 17:31, Tom White <[email protected]> wrote:
> On Fri, Feb 3, 2012 at 6:28 AM, Marco Didonna <[email protected]> wrote:
>> I've tried using put and the issue vanishes... I guess it's a nasty
>> distcp issue. Even if I invoke distcp as follows
>>
>>   hadoop distcp -pbr \
>>       s3n://aws-publicdatasets/common-crawl/crawl-002/2010/01/06/0/1262850335774_0.arc.gz \
>>       /user/noiano
>>
>> the replication factor is preserved (thank God), but the block size
>> isn't: so this file is two blocks instead of one. Any known
>> workaround?
>
> You could try
>
>   hadoop distcp -D dfs.block.size=134217728 ...
>
> or set this in your client-side Hadoop configuration, since block size
> is normally picked up from the client when writing files.
>
> S3 doesn't have the concept of block size, so it's not surprising that
> it isn't preserved by distcp.
>
> Cheers,
> Tom
That's exactly what I did, but I also added the -pr option in order to preserve the replication factor on the destination FS. Thank you,

MD
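For reference, the combined invocation described above would look roughly like this (a sketch only: the paths are the ones from the thread, the "..." in Tom's reply stands for them, and 134217728 bytes is 128 MB; run it against your own cluster and paths):

```shell
# Force a 128 MB block size on the client side (since S3 carries no
# block-size metadata) and preserve the replication factor with -pr.
hadoop distcp \
  -D dfs.block.size=134217728 \
  -pr \
  s3n://aws-publicdatasets/common-crawl/crawl-002/2010/01/06/0/1262850335774_0.arc.gz \
  /user/noiano
```

Note that the -D generic option must appear before the source and destination arguments so that it is picked up by the tool's option parser.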
