[jira] [Updated] (SPARK-21374) Reading globbed paths from S3 into DF doesn't work if filesystem caching is disabled
[ https://issues.apache.org/jira/browse/SPARK-21374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shixiong Zhu updated SPARK-21374:
---------------------------------
    Fix Version/s: 2.2.1

> Reading globbed paths from S3 into DF doesn't work if filesystem caching is disabled
> -------------------------------------------------------------------------------------
>
>                 Key: SPARK-21374
>                 URL: https://issues.apache.org/jira/browse/SPARK-21374
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.0.2, 2.1.1
>            Reporter: Andrey Taptunov
>            Assignee: Andrey Taptunov
>             Fix For: 2.2.1, 2.3.0
>
> *Motivation:*
> In my case I want to disable the filesystem cache so that I can change the S3 access key and secret key on the fly and read from buckets with different permissions. This works perfectly fine for RDDs but doesn't work for DataFrames.
>
> *Example (works for RDD but fails for DataFrame):*
> {code:java}
> import org.apache.spark.SparkContext
> import org.apache.spark.SparkConf
> import org.apache.spark.sql.SparkSession
>
> object SimpleApp {
>   def main(args: Array[String]) {
>     val awsAccessKeyId = "something"
>     val awsSecretKey = "something else"
>
>     val conf = new SparkConf().setAppName("Simple Application").setMaster("local[*]")
>     val sc = new SparkContext(conf)
>     sc.hadoopConfiguration.set("fs.s3.awsAccessKeyId", awsAccessKeyId)
>     sc.hadoopConfiguration.set("fs.s3.awsSecretAccessKey", awsSecretKey)
>     sc.hadoopConfiguration.setBoolean("fs.s3.impl.disable.cache", true)
>     sc.hadoopConfiguration.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
>     sc.hadoopConfiguration.set("fs.s3.buffer.dir", "/tmp")
>
>     val spark = SparkSession.builder().config(conf).getOrCreate()
>
>     val rddFile = sc.textFile("s3://bucket/file.csv").count // ok
>     val rddGlob = sc.textFile("s3://bucket/*").count // ok
>     val dfFile = spark.read.format("csv").load("s3://bucket/file.csv").count // ok
>     val dfGlob = spark.read.format("csv").load("s3://bucket/*").count
>     // IllegalArgumentException: AWS Access Key ID and Secret Access Key must be
>     // specified as the username or password (respectively) of a s3 URL, or by
>     // setting the fs.s3.awsAccessKeyId or fs.s3.awsSecretAccessKey properties
>     // (respectively).
>
>     sc.stop()
>   }
> }
> {code}
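For context on why the cache has to be disabled at all: the Hadoop FileSystem cache keys instances by URI scheme and authority (plus the current user), not by the contents of the Configuration. The sketch below demonstrates this against the Hadoop API alone; the bucket name, credential values, and the CacheKeyDemo object are placeholders, and it assumes the s3n classes (hadoop-aws and jets3t) are on the classpath.

{code:java}
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem

object CacheKeyDemo {
  def main(args: Array[String]): Unit = {
    def s3Conf(keyId: String, secret: String): Configuration = {
      val c = new Configuration()
      c.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
      c.set("fs.s3.awsAccessKeyId", keyId)       // placeholder credentials
      c.set("fs.s3.awsSecretAccessKey", secret)
      c
    }

    // The cache key is built from (scheme, authority, user) only, so the second
    // call returns the instance created by the first one, and the credentials
    // in the second Configuration are silently ignored:
    val fsA = FileSystem.get(new URI("s3://bucket/"), s3Conf("key-A", "secret-A"))
    val fsB = FileSystem.get(new URI("s3://bucket/"), s3Conf("key-B", "secret-B"))
    assert(fsA eq fsB)
  }
}
{code}

Disabling the cache via fs.s3.impl.disable.cache avoids the stale instance, but then every FileSystem is built from whichever Configuration reaches the construction site, which is exactly where the DataFrame glob path goes wrong in the analysis below.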
[jira] [Updated] (SPARK-21374) Reading globbed paths from S3 into DF doesn't work if filesystem caching is disabled
[ https://issues.apache.org/jira/browse/SPARK-21374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrey Taptunov updated SPARK-21374:
------------------------------------
    Description: 
*Motivation:*
In my case I want to disable the filesystem cache so that I can change the S3 access key and secret key on the fly and read from buckets with different permissions. This works perfectly fine for RDDs but doesn't work for DataFrames.

*Example (works for RDD but fails for DataFrame):*
{code:java}
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

object SimpleApp {
  def main(args: Array[String]) {
    val awsAccessKeyId = "something"
    val awsSecretKey = "something else"

    val conf = new SparkConf().setAppName("Simple Application").setMaster("local[*]")
    val sc = new SparkContext(conf)
    sc.hadoopConfiguration.set("fs.s3.awsAccessKeyId", awsAccessKeyId)
    sc.hadoopConfiguration.set("fs.s3.awsSecretAccessKey", awsSecretKey)
    sc.hadoopConfiguration.setBoolean("fs.s3.impl.disable.cache", true)
    sc.hadoopConfiguration.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
    sc.hadoopConfiguration.set("fs.s3.buffer.dir", "/tmp")

    val spark = SparkSession.builder().config(conf).getOrCreate()

    val rddFile = sc.textFile("s3://bucket/file.csv").count // ok
    val rddGlob = sc.textFile("s3://bucket/*").count // ok
    val dfFile = spark.read.format("csv").load("s3://bucket/file.csv").count // ok
    val dfGlob = spark.read.format("csv").load("s3://bucket/*").count
    // IllegalArgumentException: AWS Access Key ID and Secret Access Key must be
    // specified as the username or password (respectively) of a s3 URL, or by
    // setting the fs.s3.awsAccessKeyId or fs.s3.awsSecretAccessKey properties
    // (respectively).

    sc.stop()
  }
}
{code}

  was:
*Motivation:*
The filesystem configuration is not part of the cache key that is used to look up FileSystem instances in the filesystem cache, where they are stored by default. In my case I have to disable the filesystem cache to be able to change the access key and secret key on the fly and read from buckets with different permissions.

*Example (works for RDD but fails for DataFrame):*
{code:java}
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

object SimpleApp {
  def main(args: Array[String]) {
    val awsAccessKeyId = "something"
    val awsSecretKey = "something else"

    val conf = new SparkConf().setAppName("Simple Application").setMaster("local[*]")
    val sc = new SparkContext(conf)
    sc.hadoopConfiguration.set("fs.s3.awsAccessKeyId", awsAccessKeyId)
    sc.hadoopConfiguration.set("fs.s3.awsSecretAccessKey", awsSecretKey)
    sc.hadoopConfiguration.setBoolean("fs.s3.impl.disable.cache", true)
    sc.hadoopConfiguration.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
    sc.hadoopConfiguration.set("fs.s3.buffer.dir", "/tmp")

    val spark = SparkSession.builder().config(conf).getOrCreate()

    val rddFile = sc.textFile("s3://bucket/file.csv").count // ok
    val rddGlob = sc.textFile("s3://bucket/*").count // ok
    val dfFile = spark.read.format("csv").load("s3://bucket/file.csv").count // ok
    val dfGlob = spark.read.format("csv").load("s3://bucket/*").count
    // IllegalArgumentException: AWS Access Key ID and Secret Access Key must be
    // specified as the username or password (respectively) of a s3 URL, or by
    // setting the fs.s3.awsAccessKeyId or fs.s3.awsSecretAccessKey properties
    // (respectively).

    sc.stop()
  }
}
{code}

*Analysis:*
It looks like the problem lies in the SparkHadoopUtil.globPath method, which uses its own "conf" object to create an instance of FileSystem. If caching is enabled (the default behavior), the FileSystem instance is retrieved from the cache and that "conf" object is simply ignored; however, if caching is disabled (my case), the "conf" object is used to create a FileSystem instance without the settings set by the user.
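For reference, the method called out in the analysis above has roughly the following shape (paraphrased from memory of the Spark 2.x source, not verbatim):

{code:java}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Paraphrased sketch of SparkHadoopUtil.globPath, not the actual Spark source.
// With caching enabled, pattern.getFileSystem(conf) returns the cached instance
// that was already initialized with the user's hadoopConfiguration, so this
// conf is never consulted. With caching disabled, a fresh FileSystem is built
// from this conf alone; if it lacks the user's S3 keys, initialization fails
// with the IllegalArgumentException quoted in the example.
def globPath(pattern: Path, conf: Configuration): Seq[Path] = {
  val fs: FileSystem = pattern.getFileSystem(conf)
  Option(fs.globStatus(pattern))
    .map(_.map(_.getPath).toSeq)
    .getOrElse(Seq.empty[Path])
}
{code}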
[jira] [Updated] (SPARK-21374) Reading globbed paths from S3 into DF doesn't work if filesystem caching is disabled
[ https://issues.apache.org/jira/browse/SPARK-21374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrey Taptunov updated SPARK-21374:
------------------------------------
    Description: 
*Motivation:*
The filesystem configuration is not part of the cache key that is used to look up FileSystem instances in the filesystem cache, where they are stored by default. In my case I have to disable the filesystem cache to be able to change the access key and secret key on the fly and read from buckets with different permissions.

*Example (works for RDD but fails for DataFrame):*
{code:java}
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

object SimpleApp {
  def main(args: Array[String]) {
    val awsAccessKeyId = "something"
    val awsSecretKey = "something else"

    val conf = new SparkConf().setAppName("Simple Application").setMaster("local[*]")
    val sc = new SparkContext(conf)
    sc.hadoopConfiguration.set("fs.s3.awsAccessKeyId", awsAccessKeyId)
    sc.hadoopConfiguration.set("fs.s3.awsSecretAccessKey", awsSecretKey)
    sc.hadoopConfiguration.setBoolean("fs.s3.impl.disable.cache", true)
    sc.hadoopConfiguration.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
    sc.hadoopConfiguration.set("fs.s3.buffer.dir", "/tmp")

    val spark = SparkSession.builder().config(conf).getOrCreate()

    val rddFile = sc.textFile("s3://bucket/file.csv").count // ok
    val rddGlob = sc.textFile("s3://bucket/*").count // ok
    val dfFile = spark.read.format("csv").load("s3://bucket/file.csv").count // ok
    val dfGlob = spark.read.format("csv").load("s3://bucket/*").count
    // IllegalArgumentException: AWS Access Key ID and Secret Access Key must be
    // specified as the username or password (respectively) of a s3 URL, or by
    // setting the fs.s3.awsAccessKeyId or fs.s3.awsSecretAccessKey properties
    // (respectively).

    sc.stop()
  }
}
{code}

*Analysis:*
It looks like the problem lies in the SparkHadoopUtil.globPath method, which uses its own "conf" object to create an instance of FileSystem. If caching is enabled (the default behavior), the FileSystem instance is retrieved from the cache and that "conf" object is simply ignored; however, if caching is disabled (my case), the "conf" object is used to create a FileSystem instance without the settings set by the user.

  was:
*Motivation:*
The filesystem configuration is not part of the cache key that is used to look up FileSystem instances in the filesystem cache, where they are stored by default. In my case I have to disable the filesystem cache to be able to change the access key and secret key on the fly and read from buckets with different permissions.

*Example (works for RDD but fails for DataFrame):*
{code:java}
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

object SimpleApp {
  def main(args: Array[String]) {
    val awsAccessKeyId = "something"
    val awsSecretKey = "something else"

    val conf = new SparkConf().setAppName("Simple Application").setMaster("local[*]")
    val sc = new SparkContext(conf)
    sc.hadoopConfiguration.set("fs.s3.awsAccessKeyId", awsAccessKeyId)
    sc.hadoopConfiguration.set("fs.s3.awsSecretAccessKey", awsSecretKey)
    sc.hadoopConfiguration.setBoolean("fs.s3.impl.disable.cache", true)
    sc.hadoopConfiguration.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
    sc.hadoopConfiguration.set("fs.s3.buffer.dir", "/tmp")

    val spark = SparkSession.builder().config(conf).getOrCreate()

    val rddFile = sc.textFile("s3://bucket/file.csv").count // ok
    val rddGlob = sc.textFile("s3://bucket/*").count // ok
    val dfFile = spark.read.format("csv").load("s3://bucket/file.csv").count // ok
    val dfGlob = spark.read.format("csv").load("s3://bucket/*").count
    /* IllegalArgumentException: AWS Access Key ID and Secret Access Key must be
       specified as the username or password (respectively) of a s3 URL, or by
       setting the fs.s3.awsAccessKeyId or fs.s3.awsSecretAccessKey properties
       (respectively). */

    sc.stop()
  }
}
{code}

*Analysis:*
It looks like the problem lies in the SparkHadoopUtil.globPath method, which uses its own "conf" object to create an instance of FileSystem. If caching is enabled (the default behavior), the FileSystem instance is retrieved from the cache and that "conf" object is simply ignored; however, if caching is disabled (my case), the "conf" object is used to create a FileSystem instance without the settings set by the user.

I would be happy to help with a pull request if you agree that this is a bug.
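One way to observe the cache-bypass semantics in isolation is FileSystem.newInstance, which always constructs a fresh instance from the Configuration it is given instead of consulting the cache. The sketch below assumes sc is the SparkContext from the example above, with the S3 keys already set on its hadoopConfiguration; it only demonstrates manual globbing, not a fix for spark.read itself.

{code:java}
import java.net.URI
import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}

// newInstance bypasses the FileSystem cache, so the supplied configuration is
// always honoured. Unlike cached instances, the caller must close it.
val fs = FileSystem.newInstance(new URI("s3://bucket/"), sc.hadoopConfiguration)
try {
  // globStatus returns null when nothing matches, hence the Option guard.
  val statuses = Option(fs.globStatus(new Path("s3://bucket/*")))
    .getOrElse(Array.empty[FileStatus])
  statuses.foreach(s => println(s.getPath))
} finally {
  fs.close()
}
{code}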
[jira] [Updated] (SPARK-21374) Reading globbed paths from S3 into DF doesn't work if filesystem caching is disabled
[ https://issues.apache.org/jira/browse/SPARK-21374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrey Taptunov updated SPARK-21374:
------------------------------------
    Description: 
*Motivation:*
The filesystem configuration is not part of the cache key that is used to look up FileSystem instances in the filesystem cache, where they are stored by default. In my case I have to disable the filesystem cache to be able to change the access key and secret key on the fly and read from buckets with different permissions.

*Example (works for RDD but fails for DataFrame):*
{code:java}
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

object SimpleApp {
  def main(args: Array[String]) {
    val awsAccessKeyId = "something"
    val awsSecretKey = "something else"

    val conf = new SparkConf().setAppName("Simple Application").setMaster("local[*]")
    val sc = new SparkContext(conf)
    sc.hadoopConfiguration.set("fs.s3.awsAccessKeyId", awsAccessKeyId)
    sc.hadoopConfiguration.set("fs.s3.awsSecretAccessKey", awsSecretKey)
    sc.hadoopConfiguration.setBoolean("fs.s3.impl.disable.cache", true)
    sc.hadoopConfiguration.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
    sc.hadoopConfiguration.set("fs.s3.buffer.dir", "/tmp")

    val spark = SparkSession.builder().config(conf).getOrCreate()

    val rddFile = sc.textFile("s3://bucket/file.csv").count // ok
    val rddGlob = sc.textFile("s3://bucket/*").count // ok
    val dfFile = spark.read.format("csv").load("s3://bucket/file.csv").count // ok
    val dfGlob = spark.read.format("csv").load("s3://bucket/*").count
    /* IllegalArgumentException: AWS Access Key ID and Secret Access Key must be
       specified as the username or password (respectively) of a s3 URL, or by
       setting the fs.s3.awsAccessKeyId or fs.s3.awsSecretAccessKey properties
       (respectively). */

    sc.stop()
  }
}
{code}

*Analysis:*
It looks like the problem lies in the SparkHadoopUtil.globPath method, which uses its own "conf" object to create an instance of FileSystem. If caching is enabled (the default behavior), the FileSystem instance is retrieved from the cache and that "conf" object is simply ignored; however, if caching is disabled (my case), the "conf" object is used to create a FileSystem instance without the settings set by the user.

I would be happy to help with a pull request if you agree that this is a bug.

  was:
*Motivation:*
The filesystem configuration is not part of the cache key that is used to look up FileSystem instances in the filesystem cache, where they are stored by default. In my case I have to disable the filesystem cache to be able to change the access key and secret key on the fly and read from buckets with different permissions.

*Example (works for RDD but fails for DataFrame):*
{code:java}
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

object SimpleApp {
  def main(args: Array[String]) {
    val awsAccessKeyId = "something"
    val awsSecretKey = "something else"

    val conf = new SparkConf().setAppName("Simple Application").setMaster("local[*]")
    val sc = new SparkContext(conf)
    sc.hadoopConfiguration.set("fs.s3.awsAccessKeyId", awsAccessKeyId)
    sc.hadoopConfiguration.set("fs.s3.awsSecretAccessKey", awsSecretKey)
    sc.hadoopConfiguration.setBoolean("fs.s3.impl.disable.cache", true)
    sc.hadoopConfiguration.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
    sc.hadoopConfiguration.set("fs.s3.buffer.dir", "/tmp")

    val spark = SparkSession.builder().config(conf).getOrCreate()

    val rddFile = sc.textFile("s3://bucket/file.csv").count // ok
    val rddGlob = sc.textFile("s3://bucket/*").count // ok
    val dfFile = spark.read.format("csv").load("s3://bucket/file.csv").count // ok
    val dfGlob = spark.read.format("csv").load("s3://bucket/*").count // IllegalArgumentException

    sc.stop()
  }
}
{code}

*Analysis:*
It looks like the problem lies in the SparkHadoopUtil.globPath method, which uses its own "conf" object to create an instance of FileSystem. If caching is enabled (the default behavior), the FileSystem instance is retrieved from the cache and that "conf" object is simply ignored; however, if caching is disabled (my case), the "conf" object is used to create a FileSystem instance without the settings set by the user.

I would be happy to help with a pull request if you agree that this is a bug.
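A possible configuration-level workaround consistent with this analysis is to set the S3 keys as spark.hadoop.*-prefixed properties, which Spark copies into the Hadoop Configurations it creates. In spark-submit deployments those entries also reach the Configuration that SparkHadoopUtil builds internally, so the FileSystem constructed during glob resolution should see the credentials even with caching disabled. The sketch below is not a verified fix: the key values are placeholders, the behaviour is worth checking when the SparkConf is built purely programmatically rather than through spark-submit properties, and it does not satisfy the "change keys on the fly" requirement, since SparkConf is fixed once the context starts.

{code:java}
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Sketch of the spark.hadoop.* route; every "spark.hadoop.foo" entry on the
// SparkConf is copied into the Hadoop Configurations Spark creates.
val conf = new SparkConf()
  .setAppName("Simple Application")
  .setMaster("local[*]")
  .set("spark.hadoop.fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
  .set("spark.hadoop.fs.s3.impl.disable.cache", "true")
  .set("spark.hadoop.fs.s3.awsAccessKeyId", "something")
  .set("spark.hadoop.fs.s3.awsSecretAccessKey", "something else")

val spark = SparkSession.builder().config(conf).getOrCreate()
val dfGlob = spark.read.format("csv").load("s3://bucket/*").count()
{code}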