Re: S3 read/write from PySpark

2020-08-11 Thread Stephen Coy
Hi there, Also for the benefit of others, if you attempt to use any version of Hadoop > 3.2.0 (such as 3.2.1), you will need to update the version of Google Guava used by Apache Spark to that consumed by Hadoop. Hadoop 3.2.1 requires guava-27.0-jre.jar. The latest is guava-29.0-jre.jar, which a…
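A rough sketch of what pinning Guava at session startup could look like in PySpark (the coordinate and version below are illustrative and should be checked against your Hadoop release):

    from pyspark.sql import SparkSession

    # Sketch: pull in the Guava release that Hadoop 3.2.1 consumes (27.0-jre,
    # per the message above) rather than relying on an older bundled copy.
    spark = (
        SparkSession.builder
        .appName("guava-pin-example")  # hypothetical app name
        .config("spark.jars.packages", "com.google.guava:guava:27.0-jre")
        .getOrCreate()
    )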

Re: S3 read/write from PySpark

2020-08-06 Thread Daniel Stojanov
Hi, Thanks for your help. Problem solved, but I thought I should add something in case this problem is encountered by others. Both responses are correct; BasicAWSCredentialsProvider is gone, but simply making the substitution leads to the traceback just below. java.lang.NoSuchMethodError: 'void c…
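Reading between the two replies, a sketch of the combined fix (substitute the renamed provider and align Guava with Hadoop at the same time); versions are illustrative:

    from pyspark.sql import SparkSession

    # Sketch only: swapping the provider alone triggered the NoSuchMethodError
    # above, so Guava is aligned with Hadoop in the same session config. The
    # spark.hadoop.* prefix forwards the setting to the Hadoop configuration.
    spark = (
        SparkSession.builder
        .config("spark.jars.packages", "com.google.guava:guava:27.0-jre")
        .config("spark.hadoop.fs.s3a.aws.credentials.provider",
                "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider")
        .getOrCreate()
    )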

Re: S3 read/write from PySpark

2020-08-06 Thread Stephen Coy
Hi Daniel, It looks like …BasicAWSCredentialsProvider has become org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider. However, the way that the username and password are provided appears to have changed, so you will probably need to look into that. Cheers, Steve C On 6 Aug 2020, at 11:15 a…
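For illustration, the S3A properties that pair with SimpleAWSCredentialsProvider, assuming an existing SparkSession named spark (both key values are placeholders):

    # Sketch: SimpleAWSCredentialsProvider reads static keys from the standard
    # fs.s3a.* properties on the Hadoop configuration.
    hadoop_conf = spark._jsc.hadoopConfiguration()
    hadoop_conf.set("fs.s3a.aws.credentials.provider",
                    "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider")
    hadoop_conf.set("fs.s3a.access.key", "YOUR_ACCESS_KEY")  # placeholder
    hadoop_conf.set("fs.s3a.secret.key", "YOUR_SECRET_KEY")  # placeholder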

Re: S3 read/write from PySpark

2020-08-05 Thread German Schiavon
Hey, I think *BasicAWSCredentialsProvider* is no longer supported by Hadoop. I couldn't find it in the master branch, but I could in the 2.8 branch. Maybe that's why it works with Hadoop 2.7. I use *TemporaryAWSCredentialsProvider*. Hope it helps. On Thu, 6 Aug 2020 at 03:16, Daniel Stojanov wrote: > Hi,…
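A sketch of the TemporaryAWSCredentialsProvider variant, again assuming an existing SparkSession named spark; this provider expects STS session credentials, and every value below is a placeholder:

    # Sketch: TemporaryAWSCredentialsProvider needs a session token in addition
    # to the access/secret key pair.
    hadoop_conf = spark._jsc.hadoopConfiguration()
    hadoop_conf.set("fs.s3a.aws.credentials.provider",
                    "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider")
    hadoop_conf.set("fs.s3a.access.key", "YOUR_ACCESS_KEY")        # placeholder
    hadoop_conf.set("fs.s3a.secret.key", "YOUR_SECRET_KEY")        # placeholder
    hadoop_conf.set("fs.s3a.session.token", "YOUR_SESSION_TOKEN")  # placeholder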

S3 read/write from PySpark

2020-08-05 Thread Daniel Stojanov
Hi, I am trying to read/write files to S3 from PySpark. The procedure I have used is to download Spark and start PySpark with the hadoop-aws, guava, and aws-java-sdk-bundle packages. The versions are explicitly specified by looking up the exact dependency versions on Maven. Allowing dependencies to b…
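For context, a minimal sketch of the setup described above; every version, bucket, and path below is illustrative and should be checked against the exact dependency tree on Maven:

    from pyspark.sql import SparkSession

    # Sketch: pin hadoop-aws, the AWS SDK bundle it depends on, and Guava to
    # mutually consistent versions (the ones below assume a Hadoop 3.2.1 build).
    packages = ",".join([
        "org.apache.hadoop:hadoop-aws:3.2.1",
        "com.amazonaws:aws-java-sdk-bundle:1.11.375",
        "com.google.guava:guava:27.0-jre",
    ])
    spark = (
        SparkSession.builder
        .appName("s3-read-write")  # hypothetical app name
        .config("spark.jars.packages", packages)
        .getOrCreate()
    )

    # Hypothetical bucket and paths, accessed through the s3a:// filesystem.
    df = spark.read.csv("s3a://some-bucket/input.csv", header=True)
    df.write.mode("overwrite").parquet("s3a://some-bucket/output/")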