[jira] [Assigned] (SPARK-21374) Reading globbed paths from S3 into DF doesn't work if filesystem caching is disabled

2017-08-04 Thread Xiao Li (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-21374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiao Li reassigned SPARK-21374:
---

Assignee: Andrey Taptunov

> Reading globbed paths from S3 into DF doesn't work if filesystem caching is disabled
>
> Key: SPARK-21374
> URL: https://issues.apache.org/jira/browse/SPARK-21374
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.0.2, 2.1.1
> Reporter: Andrey Taptunov
> Assignee: Andrey Taptunov
>
> *Motivation:*
> In my case I want to disable the filesystem cache so that I can change the S3
> access key and secret key on the fly and read from buckets with different
> permissions. This works perfectly for RDDs but does not work for DataFrames.
> *Example (works for RDD but fails for DataFrame):*
> {code:java}
> import org.apache.spark.SparkContext
> import org.apache.spark.SparkConf
> import org.apache.spark.sql.SparkSession
>
> object SimpleApp {
>   def main(args: Array[String]): Unit = {
>     val awsAccessKeyId = "something"
>     val awsSecretKey = "something else"
>
>     val conf = new SparkConf().setAppName("Simple Application").setMaster("local[*]")
>     val sc = new SparkContext(conf)
>     sc.hadoopConfiguration.set("fs.s3.awsAccessKeyId", awsAccessKeyId)
>     sc.hadoopConfiguration.set("fs.s3.awsSecretAccessKey", awsSecretKey)
>     sc.hadoopConfiguration.setBoolean("fs.s3.impl.disable.cache", true)
>     sc.hadoopConfiguration.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
>     sc.hadoopConfiguration.set("fs.s3.buffer.dir", "/tmp")
>
>     val spark = SparkSession.builder().config(conf).getOrCreate()
>     val rddFile = sc.textFile("s3://bucket/file.csv").count  // ok
>     val rddGlob = sc.textFile("s3://bucket/*").count         // ok
>     val dfFile = spark.read.format("csv").load("s3://bucket/file.csv").count // ok
>
>     val dfGlob = spark.read.format("csv").load("s3://bucket/*").count
>     // IllegalArgumentException: AWS Access Key ID and Secret Access Key must
>     // be specified as the username or password (respectively) of a s3 URL,
>     // or by setting the fs.s3.awsAccessKeyId or fs.s3.awsSecretAccessKey
>     // properties (respectively).
>
>     sc.stop()
>   }
> }
> {code}
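
A sketch of a possible workaround, not taken from the ticket itself: it assumes the failure happens because the fresh, uncached FileSystem built during glob resolution starts from a Hadoop configuration that never saw the keys set on sc.hadoopConfiguration. Spark copies any spark.hadoop.*-prefixed property into every Hadoop Configuration it constructs internally, so passing the same fs.s3.* settings that way should make the credentials visible to the uncached instance as well. Bucket name and credentials below are placeholders, as in the report above.

{code:java}
import org.apache.spark.sql.SparkSession

object SimpleAppWorkaround {
  def main(args: Array[String]): Unit = {
    val awsAccessKeyId = "something"     // placeholder, as in the report
    val awsSecretKey = "something else"  // placeholder, as in the report

    // spark.hadoop.* properties are folded into every Hadoop Configuration
    // Spark builds internally, so a freshly constructed (uncached) FileSystem
    // still sees the credentials during glob resolution.
    val spark = SparkSession.builder()
      .appName("Simple Application")
      .master("local[*]")
      .config("spark.hadoop.fs.s3.awsAccessKeyId", awsAccessKeyId)
      .config("spark.hadoop.fs.s3.awsSecretAccessKey", awsSecretKey)
      .config("spark.hadoop.fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
      .config("spark.hadoop.fs.s3.impl.disable.cache", "true")
      .config("spark.hadoop.fs.s3.buffer.dir", "/tmp")
      .getOrCreate()

    // The globbed DataFrame read that failed above should now resolve.
    val dfGlob = spark.read.format("csv").load("s3://bucket/*").count()
    println(dfGlob)

    spark.stop()
  }
}
{code}

This only restores the single-credential case, though; switching keys on the fly, as the reporter wants, would still require separate sessions (or, as the error message itself suggests, embedding the keys in the URL as s3://ACCESS_KEY:SECRET_KEY@bucket/*, which breaks for secrets containing "/").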





[jira] [Assigned] (SPARK-21374) Reading globbed paths from S3 into DF doesn't work if filesystem caching is disabled

2017-07-13 Thread Apache Spark (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-21374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-21374:


Assignee: (was: Apache Spark)




[jira] [Assigned] (SPARK-21374) Reading globbed paths from S3 into DF doesn't work if filesystem caching is disabled

2017-07-13 Thread Apache Spark (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-21374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-21374:


Assignee: Apache Spark




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
