On 12 Oct 2016, at 10:49, Aseem Bansal <asmbans...@gmail.com> wrote:
> Hi
>
> I want to read a CSV from one bucket, do some processing, and write to a different bucket. I know how to set the S3 credentials:
>
>     jssc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", YOUR_ACCESS_KEY)
>     jssc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", YOUR_SECRET_KEY)
>
> But the problem is that Spark is lazy. So if I do the following:
>
>   * set credentials 1
>   * read the input CSV
>   * do some processing
>   * set credentials 2
>   * write the result CSV
>
> then there is a chance that, due to laziness, the input CSV will be read using credentials 2. A workaround is to cache the intermediate result, but if there is not enough storage the CSV may still be re-read. So how should I handle this situation?

1. Use s3a as your destination.
2. Play with the bucket configurations so you don't need separate accounts to work with them.
3. Buffer output to local HDFS and then copy it in (DistCp?).

I'd actually go for #3, as S3 isn't ideal as a direct destination of work.
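For concreteness, here is a minimal Scala sketch of the caching workaround described in the question. Bucket names, the environment-variable credential lookup, and the processing step are all hypothetical. Note that persisting only reduces the risk: if an executor is lost, Spark can still recompute partitions by re-reading the source, at which point the wrong credentials would be in force.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.storage.StorageLevel

    val spark = SparkSession.builder().appName("CrossBucketCsv").getOrCreate()
    val hadoopConf = spark.sparkContext.hadoopConfiguration

    // Hypothetical credentials, pulled from the environment for illustration.
    val srcAccessKey = sys.env("SRC_AWS_ACCESS_KEY")
    val srcSecretKey = sys.env("SRC_AWS_SECRET_KEY")
    val dstAccessKey = sys.env("DST_AWS_ACCESS_KEY")
    val dstSecretKey = sys.env("DST_AWS_SECRET_KEY")

    // Credentials 1: the source bucket.
    hadoopConf.set("fs.s3n.awsAccessKeyId", srcAccessKey)
    hadoopConf.set("fs.s3n.awsSecretAccessKey", srcSecretKey)

    val input = spark.read.option("header", "true").csv("s3n://source-bucket/input.csv")
    val result = input // ... some processing here ...

    // Persist to memory and local disk, then force a full pass with count(),
    // so the S3 read actually happens before the credentials are swapped.
    result.persist(StorageLevel.MEMORY_AND_DISK)
    result.count()

    // Credentials 2: the destination bucket.
    hadoopConf.set("fs.s3n.awsAccessKeyId", dstAccessKey)
    hadoopConf.set("fs.s3n.awsSecretAccessKey", dstSecretKey)
    result.write.option("header", "true").csv("s3n://dest-bucket/output/")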
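If the Hadoop version underneath is recent enough, options 1 and 2 can be combined: s3a supports per-bucket configuration (added around Hadoop 2.8), so each bucket resolves its own keys, both sets of credentials are live at once, and laziness stops mattering. A sketch, reusing the variables from above (bucket names still hypothetical):

    // Per-bucket s3a credentials: no global credential swap is needed.
    hadoopConf.set("fs.s3a.bucket.source-bucket.access.key", srcAccessKey)
    hadoopConf.set("fs.s3a.bucket.source-bucket.secret.key", srcSecretKey)
    hadoopConf.set("fs.s3a.bucket.dest-bucket.access.key", dstAccessKey)
    hadoopConf.set("fs.s3a.bucket.dest-bucket.secret.key", dstSecretKey)

    spark.read.option("header", "true").csv("s3a://source-bucket/input.csv")
      // ... processing ...
      .write.option("header", "true").csv("s3a://dest-bucket/output/")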
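And for option 3, the buffered write is just an ordinary save to local HDFS followed by an out-of-band copy; the HDFS path here is hypothetical:

    // Option 3: land the output on local HDFS first ...
    result.write.option("header", "true").csv("hdfs:///tmp/job-output/")
    // ... then copy it up to S3 outside the job, e.g. with DistCp:
    //   hadoop distcp hdfs:///tmp/job-output s3a://dest-bucket/output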