Re: newbie HDFS S3 best practices

2016-03-16 Thread Chris Miller
, 2016 at 11:59 AM
> To: Andrew Davidson
> Cc: "user @spark"
> Subject: Re: newbie HDFS S3 best practices
>
> Hard to say with #1 without knowing your application’s characteristics;
> for #2, we use conductor <https://github.com/BD2KGenomics/conductor>

Re: newbie HDFS S3 best practices

2016-03-15 Thread Andy Davidson
Hi Frank,

We have thousands of small files. Each file is between 6 KB and maybe 100 KB. Conductor looks interesting.

Andy

From: Frank Austin Nothaft
Date: Tuesday, March 15, 2016 at 11:59 AM
To: Andrew Davidson
Cc: "user @spark"
Subject: Re: newbie HDFS S3 best practices

> Har
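To put "thousands of small files" in perspective, here is a rough back-of-the-envelope sketch in Python. The file count of 5000 and the 128 MiB target partition size are assumptions chosen for illustration; only the 6 KB to 100 KB per-file range comes from the message above. The helper name `estimate_partitions` is hypothetical.

```python
import math

KIB = 1024
MIB = 1024 * KIB


def estimate_partitions(total_bytes, target_partition_bytes=128 * MIB):
    """Number of partitions of roughly the target size needed for total_bytes.

    Always returns at least 1. The 128 MiB default is an assumed target,
    not a value taken from the thread.
    """
    return max(1, math.ceil(total_bytes / target_partition_bytes))


num_files = 5000                     # hypothetical stand-in for "thousands"
low_total = num_files * 6 * KIB      # every file at the small end of the range
high_total = num_files * 100 * KIB   # every file at the large end of the range

for label, total in (("low", low_total), ("high", high_total)):
    print(label, round(total / MIB, 1), "MiB ->", estimate_partitions(total), "partition(s)")
```

Under these assumptions the whole dataset fits in a handful of partitions, so per-file overhead is likely to matter more than raw bytes read. In Spark, a count like this could be passed to something such as `rdd.coalesce(n)`, though whether that helps depends on the application.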

Re: newbie HDFS S3 best practices

2016-03-15 Thread Frank Austin Nothaft
Hard to say with #1 without knowing your application’s characteristics; for #2, we use conductor with IAM roles, .boto/.aws/credentials files.

Frank Austin Nothaft
fnoth...@berkeley.edu
fnoth...@eecs.berkeley.edu
202-340-0466

> On Mar 15, 2016, at 11:

newbie HDFS S3 best practices

2016-03-15 Thread Andy Davidson
We use the spark-ec2 script to create AWS clusters as needed (we do not use AWS EMR).

1. Will we get better performance if we copy data to HDFS before we run, instead of reading directly from S3?
2. What is a good way to move results from HDFS to S3? It seems like there are many ways to bulk copy
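For question 2, one of the many bulk-copy options is Hadoop's built-in `distcp` tool (the replies in this thread use conductor instead). Below is a minimal Python sketch that only assembles the command line. The paths, the bucket name, and the `s3a://` scheme are assumptions for illustration, and the helper name `distcp_command` is hypothetical. It also assumes the cluster already has S3 credentials configured.

```python
import shlex


def distcp_command(src, dst, overwrite=False):
    """Build the argv list for a `hadoop distcp` copy from src to dst.

    This only constructs the command; it does not run it.
    """
    argv = ["hadoop", "distcp"]
    if overwrite:
        # -overwrite is a standard distcp option that replaces existing targets
        argv.append("-overwrite")
    argv += [src, dst]
    return argv


# Hypothetical source and destination, not taken from the thread.
cmd = distcp_command("hdfs:///user/andy/results", "s3a://example-bucket/results")
print(" ".join(shlex.quote(part) for part in cmd))
```

On a cluster node the list could be handed to `subprocess.run(cmd, check=True)`. Building the command as a list rather than a single string avoids shell-quoting problems with unusual paths.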