On 12 Oct 2016, at 10:49, Aseem Bansal <asmbans...@gmail.com> wrote:
> Hi
>
> I want to read a CSV from one bucket, do some processing, and write to a different bucket. I know how to set the S3 credentials:
>
>     jssc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", YOUR_ACCESS_KEY)
>     jssc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", YOUR_SECRET_KEY)
>
> But the problem is that Spark is lazy. So if I do the following:
>
>   * set credentials 1
>   * read the input CSV
>   * do some processing
>   * set credentials 2
>   * write the result CSV
>
> then there is a chance that, due to laziness, the input CSV will be read using credentials 2. A workaround is to cache the intermediate result, but if there is not enough storage the CSV may still be re-read. So how should I handle this situation?

1. Use s3a as your destination.
2. Play with the bucket configurations so you don't need separate accounts to work with them.
3. Buffer output to local HDFS and then copy it in (DistCp?).

I'd actually go for #3, as S3 isn't ideal as a direct destination of work.
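For concreteness, here is a minimal Scala sketch of the caching workaround described in the question. Bucket names, the environment-variable credential lookup, and the processing step are all hypothetical. Note that persisting only reduces the risk: if an executor is lost, Spark can still recompute partitions by re-reading the source, at which point the wrong credentials would be in force.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.storage.StorageLevel

    val spark = SparkSession.builder().appName("CrossBucketCsv").getOrCreate()
    val hadoopConf = spark.sparkContext.hadoopConfiguration

    // Hypothetical credentials, pulled from the environment for illustration.
    val srcAccessKey = sys.env("SRC_AWS_ACCESS_KEY")
    val srcSecretKey = sys.env("SRC_AWS_SECRET_KEY")
    val dstAccessKey = sys.env("DST_AWS_ACCESS_KEY")
    val dstSecretKey = sys.env("DST_AWS_SECRET_KEY")

    // Credentials 1: the source bucket.
    hadoopConf.set("fs.s3n.awsAccessKeyId", srcAccessKey)
    hadoopConf.set("fs.s3n.awsSecretAccessKey", srcSecretKey)

    val input = spark.read.option("header", "true").csv("s3n://source-bucket/input.csv")
    val result = input // ... some processing here ...

    // Persist to memory and local disk, then force a full pass with count(),
    // so the S3 read actually happens before the credentials are swapped.
    result.persist(StorageLevel.MEMORY_AND_DISK)
    result.count()

    // Credentials 2: the destination bucket.
    hadoopConf.set("fs.s3n.awsAccessKeyId", dstAccessKey)
    hadoopConf.set("fs.s3n.awsSecretAccessKey", dstSecretKey)
    result.write.option("header", "true").csv("s3n://dest-bucket/output/")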
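If the Hadoop version underneath is recent enough, options 1 and 2 can be combined: s3a supports per-bucket configuration (added around Hadoop 2.8), so each bucket resolves its own keys, both sets of credentials are live at once, and laziness stops mattering. A sketch, reusing the variables from above (bucket names still hypothetical):

    // Per-bucket s3a credentials: no global credential swap is needed.
    hadoopConf.set("fs.s3a.bucket.source-bucket.access.key", srcAccessKey)
    hadoopConf.set("fs.s3a.bucket.source-bucket.secret.key", srcSecretKey)
    hadoopConf.set("fs.s3a.bucket.dest-bucket.access.key", dstAccessKey)
    hadoopConf.set("fs.s3a.bucket.dest-bucket.secret.key", dstSecretKey)

    spark.read.option("header", "true").csv("s3a://source-bucket/input.csv")
      // ... processing ...
      .write.option("header", "true").csv("s3a://dest-bucket/output/")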
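And for option 3, the buffered write is just an ordinary save to local HDFS followed by an out-of-band copy; the HDFS path here is hypothetical:

    // Option 3: land the output on local HDFS first ...
    result.write.option("header", "true").csv("hdfs:///tmp/job-output/")
    // ... then copy it up to S3 outside the job, e.g. with DistCp:
    //   hadoop distcp hdfs:///tmp/job-output s3a://dest-bucket/output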