If you have lots of small files, distcp should handle that well -- it's
supposed to distribute the transfer of files across the nodes in your
cluster. Conductor looks interesting if you're trying to distribute the
transfer of single, large file(s)... right?
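For what it's worth, here's a rough sketch (Python, shelling out to
distcp) of what that copy could look like -- the paths and bucket are
made up, and it assumes the hadoop binary and an S3 connector are
available on the cluster:

import subprocess

# Hypothetical source/destination. distcp launches a MapReduce job that
# spreads the per-file copies across the worker nodes.
src = "hdfs:///user/hadoop/results"
dst = "s3n://my-bucket/results"  # or s3a:// on newer Hadoop builds

# -m caps the number of concurrent map tasks (parallel copiers).
subprocess.check_call(["hadoop", "distcp", "-m", "20", src, dst])

With thousands of 6K-100K files the per-file overhead dominates, so the
parallelism is what should buy you the speedup.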
--
Chris Miller

On Wed, Mar 16, 2016 at 4:43 AM, Andy Davidson <[email protected]> wrote:

> Hi Frank
>
> We have thousands of small files; each file is between 6K and maybe 100K.
>
> Conductor looks interesting.
>
> Andy
>
> From: Frank Austin Nothaft <[email protected]>
> Date: Tuesday, March 15, 2016 at 11:59 AM
> To: Andrew Davidson <[email protected]>
> Cc: "user @spark" <[email protected]>
> Subject: Re: newbie HDFS S3 best practices
>
> Hard to say with #1 without knowing your application's characteristics;
> for #2, we use conductor <https://github.com/BD2KGenomics/conductor>
> with IAM roles, .boto/.aws/credentials files.
>
> Frank Austin Nothaft
> [email protected]
> [email protected]
> 202-340-0466
>
> On Mar 15, 2016, at 11:45 AM, Andy Davidson <[email protected]> wrote:
>
> We use the spark-ec2 script to create AWS clusters as needed (we do not
> use AWS EMR).
>
> 1. Will we get better performance if we copy data to HDFS before we
> run, instead of reading directly from S3?
>
> 2. What is a good way to move results from HDFS to S3?
>
> It seems like there are many ways to bulk copy to S3. Many of them
> require that we explicitly use the
> AWS_ACCESS_KEY_ID:AWS_SECRET_ACCESS_KEY@ pair in the URL. This seems
> like a bad idea?
>
> What would you recommend?
>
> Thanks
>
> Andy
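Re: the credentials question in the quoted thread -- embedding
AWS_ACCESS_KEY_ID:AWS_SECRET_ACCESS_KEY@ in the URL is indeed best
avoided. The keys can instead be set on the Hadoop configuration, or
omitted entirely if the EC2 instances carry an IAM role (as Frank
suggests). A minimal PySpark sketch, assuming the s3a connector; the
bucket and key values are placeholders:

from pyspark import SparkContext

sc = SparkContext(appName="s3-read-sketch")

# Option 1: rely on the instance's IAM role -- no keys in code or URLs.
# Option 2: put the keys on the Hadoop configuration, not in the path:
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.access.key", "YOUR_ACCESS_KEY")  # placeholder
hadoop_conf.set("fs.s3a.secret.key", "YOUR_SECRET_KEY")  # placeholder

# Paths now carry no credentials, for reads or writes alike.
rdd = sc.textFile("s3a://my-bucket/input/*")       # hypothetical bucket
rdd.saveAsTextFile("s3a://my-bucket/output/run1")  # hypothetical prefix

The property names differ per connector (fs.s3n.awsAccessKeyId /
fs.s3n.awsSecretAccessKey for s3n), so match them to whichever scheme
your Hadoop build ships.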
