Hi, 

I'm running Spark on an EMR cluster and can read from S3 from the REPL
without problems:

val input_file = "s3://<bucket-name>/test_data.txt"
val rawdata = sc.textFile(input_file)  
val test = rawdata.collect

but when I run a simple standalone application that reads the same data,
I get an error saying that I should provide the access keys:

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

object test {

  def main(args: Array[String]) {

    val master = "spark://ec2-xx-xx-xxx-xxx.eu-west-1.compute.amazonaws.com:7077"
    val sparkHome = "/home/hadoop/spark/"

    val sc = new SparkContext(master, "test", sparkHome, Seq())

    val input_file = "s3://<bucket-name>/test_data.txt"
    val rawdata = sc.textFile(input_file)  
    val test = rawdata.collect
    sc.stop() 
  }
}

[error] (run-main-0) java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3 URL, or by setting the fs.s3.awsAccessKeyId or fs.s3.awsSecretAccessKey properties (respectively).
java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3 URL, or by setting the fs.s3.awsAccessKeyId or fs.s3.awsSecretAccessKey properties (respectively).
        at org.apache.hadoop.fs.s3.S3Credentials.initialize(S3Credentials.java:70)
        at org.apache.hadoop.fs.s3.Jets3tFileSystemStore.initialize(Jets3tFileSystemStore.java:93)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:190)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:103)
        at com.sun.proxy.$Proxy13.initialize(Unknown Source)
        at org.apache.hadoop.fs.s3.S3FileSystem.initialize(S3FileSystem.java:92)
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2397)
        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:89)
        at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2431)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2413)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:368)
        at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
        at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:256)
        at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)
        at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:304)
        at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:172)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
        at scala.Option.getOrElse(Option.scala:120)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
        at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
        at scala.Option.getOrElse(Option.scala:120)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1094)
        at org.apache.spark.rdd.RDD.collect(RDD.scala:717)
        at test$.main(test.scala:17)
        at test.main(test.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)

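The error message names two routes: embed the keys in the s3 URL, or set the fs.s3.awsAccessKeyId / fs.s3.awsSecretAccessKey properties. My reading of the second route, as a minimal sketch only (property names copied verbatim from the error, placeholder values), is to set those properties on the SparkContext's Hadoop configuration:

val sc = new SparkContext(master, "test", sparkHome, Seq())

// Hand the credentials to Hadoop's S3 filesystem.
// <access key> and <secret access key> are placeholders for the real keys.
sc.hadoopConfiguration.set("fs.s3.awsAccessKeyId", "<access key>")
sc.hadoopConfiguration.set("fs.s3.awsSecretAccessKey", "<secret access key>")

Embedding the keys in the URL is what I tried first.
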
When I add the keys to the input URL,

val input_file = "s3://<access key>:<secret access key>@<bucket-name>/test_data.txt"

I get an "Input path does not exist" error (keys and bucket name redacted
in the error message, naturally):

[error] (run-main-0) org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: s3://<access key>:<secret access key>@<bucket-name>/test_data.txt
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: s3://<access key>:<secret access key>@<bucket-name>/test_data.txt
        at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:285)
        at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)
        at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:304)
        at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:172)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
        at scala.Option.getOrElse(Option.scala:120)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
        at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
        at scala.Option.getOrElse(Option.scala:120)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1094)
        at org.apache.spark.rdd.RDD.collect(RDD.scala:717)
        at test3$.main(test3.scala:17)
        at test3.main(test3.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)

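One thing I'm not sure about with the URL route: if the secret key contains characters such as '/' or '+', I assume it would have to be URL-encoded before being embedded in the path, roughly like this (placeholder values again):

import java.net.URLEncoder

// Encode the secret in case it contains characters that are special in URLs.
val accessKey = "<access key>"
val encodedSecret = URLEncoder.encode("<secret access key>", "UTF-8")
val input_file = s"s3://$accessKey:$encodedSecret@<bucket-name>/test_data.txt"
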
Any idea what's happening here? 


