If you have lots of small files, distcp should handle that well -- it's
supposed to distribute the transfer of files across the nodes in your
cluster. Conductor looks interesting if you're trying to distribute the
transfer of single, large file(s)... right?
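For what it's worth, here's a rough sketch (Python, shelling out to
distcp) of what that copy could look like -- the paths and bucket are
made up, and it assumes the hadoop binary and an S3 connector are
available on the cluster:

import subprocess

# Hypothetical source/destination. distcp launches a MapReduce job that
# spreads the per-file copies across the worker nodes.
src = "hdfs:///user/hadoop/results"
dst = "s3n://my-bucket/results"  # or s3a:// on newer Hadoop builds

# -m caps the number of concurrent map tasks (parallel copiers).
subprocess.check_call(["hadoop", "distcp", "-m", "20", src, dst])

With thousands of 6K-100K files the per-file overhead dominates, so the
parallelism is what should buy you the speedup.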
--
Chris Miller

On Wed, Mar 16, 2016 at 4:43 AM, Andy Davidson <[email protected]> wrote:

> Hi Frank
>
> We have thousands of small files; each file is between 6K and maybe 100K.
>
> Conductor looks interesting.
>
> Andy
>
> From: Frank Austin Nothaft <[email protected]>
> Date: Tuesday, March 15, 2016 at 11:59 AM
> To: Andrew Davidson <[email protected]>
> Cc: "user @spark" <[email protected]>
> Subject: Re: newbie HDFS S3 best practices
>
> Hard to say with #1 without knowing your application's characteristics;
> for #2, we use conductor <https://github.com/BD2KGenomics/conductor>
> with IAM roles, .boto/.aws/credentials files.
>
> Frank Austin Nothaft
> [email protected]
> [email protected]
> 202-340-0466
>
> On Mar 15, 2016, at 11:45 AM, Andy Davidson <[email protected]> wrote:
>
> We use the spark-ec2 script to create AWS clusters as needed (we do not
> use AWS EMR).
>
> 1. Will we get better performance if we copy data to HDFS before we
> run, instead of reading directly from S3?
>
> 2. What is a good way to move results from HDFS to S3?
>
> It seems like there are many ways to bulk copy to S3. Many of them
> require that we explicitly use the
> AWS_ACCESS_KEY_ID:AWS_SECRET_ACCESS_KEY@ pair in the URL. This seems
> like a bad idea?
>
> What would you recommend?
>
> Thanks
>
> Andy
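Re: the credentials question in the quoted thread -- embedding
AWS_ACCESS_KEY_ID:AWS_SECRET_ACCESS_KEY@ in the URL is indeed best
avoided. The keys can instead be set on the Hadoop configuration, or
omitted entirely if the EC2 instances carry an IAM role (as Frank
suggests). A minimal PySpark sketch, assuming the s3a connector; the
bucket and key values are placeholders:

from pyspark import SparkContext

sc = SparkContext(appName="s3-read-sketch")

# Option 1: rely on the instance's IAM role -- no keys in code or URLs.
# Option 2: put the keys on the Hadoop configuration, not in the path:
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.access.key", "YOUR_ACCESS_KEY")  # placeholder
hadoop_conf.set("fs.s3a.secret.key", "YOUR_SECRET_KEY")  # placeholder

# Paths now carry no credentials, for reads or writes alike.
rdd = sc.textFile("s3a://my-bucket/input/*")       # hypothetical bucket
rdd.saveAsTextFile("s3a://my-bucket/output/run1")  # hypothetical prefix

The property names differ per connector (fs.s3n.awsAccessKeyId /
fs.s3n.awsSecretAccessKey for s3n), so match them to whichever scheme
your Hadoop build ships.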
