Tried almost all the options, but none of them worked. So I ended up creating a new IAM user, and that user's keys are working fine. I am no longer getting the Forbidden (403) exception, but now my program seems to run forever. It doesn't throw any exception; it just keeps running with the following trace:
. . . .
15/05/18 17:35:44 INFO HttpServer: Starting HTTP Server
15/05/18 17:35:44 INFO Server: jetty-8.y.z-SNAPSHOT
15/05/18 17:35:44 INFO AbstractConnector: Started SocketConnector@0.0.0.0:60316
15/05/18 17:35:44 INFO Utils: Successfully started service 'HTTP file server' on port 60316.
15/05/18 17:35:44 INFO SparkEnv: Registering OutputCommitCoordinator
15/05/18 17:35:44 INFO Server: jetty-8.y.z-SNAPSHOT
15/05/18 17:35:44 INFO AbstractConnector: Started SelectChannelConnector@0.0.0.0:4040
15/05/18 17:35:44 INFO Utils: Successfully started service 'SparkUI' on port 4040.
15/05/18 17:35:44 INFO SparkUI: Started SparkUI at http://172.28.210.74:4040
15/05/18 17:35:44 INFO Executor: Starting executor ID <driver> on host localhost
15/05/18 17:35:44 INFO AkkaUtils: Connecting to HeartbeatReceiver: akka.tcp://sparkDriver@172.28.210.74:60315/user/HeartbeatReceiver
15/05/18 17:35:44 INFO NettyBlockTransferService: Server created on 60317
15/05/18 17:35:44 INFO BlockManagerMaster: Trying to register BlockManager
15/05/18 17:35:44 INFO BlockManagerMasterActor: Registering block manager localhost:60317 with 66.9 MB RAM, BlockManagerId(<driver>, localhost, 60317)
15/05/18 17:35:44 INFO BlockManagerMaster: Registered BlockManager
15/05/18 17:35:45 WARN AmazonHttpClient: Detected a possible problem with the current JVM version (1.6.0_65). If you experience XML parsing problems using the SDK, try upgrading to a more recent JVM update.
15/05/18 17:35:47 INFO S3AFileSystem: Getting path status for s3a://bucket-name/avro_data/episodes.avro (avro_data/episodes.avro)
15/05/18 17:35:47 INFO S3AFileSystem: Getting path status for s3a://bucket-name/avro_data/episodes.avro (avro_data/episodes.avro)
15/05/18 17:35:47 INFO S3AFileSystem: Getting path status for s3a://bucket-name/avro_data/episodes.avro (avro_data/episodes.avro)
15/05/18 17:35:48 INFO S3AFileSystem: Opening 's3a://bucket-name/avro_data/episodes.avro' for reading
15/05/18 17:35:48 INFO S3AFileSystem: Getting path status for s3a://bucket-name/avro_data/episodes.avro (avro_data/episodes.avro)
15/05/18 17:35:48 INFO S3AFileSystem: Actually opening file avro_data/episodes.avro at pos 0
15/05/18 17:35:48 INFO S3AFileSystem: Reopening avro_data/episodes.avro to seek to new offset -4
15/05/18 17:35:48 INFO S3AFileSystem: Actually opening file avro_data/episodes.avro at pos 0
15/05/18 17:35:50 INFO MemoryStore: ensureFreeSpace(230868) called with curMem=0, maxMem=70177259
15/05/18 17:35:50 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 225.5 KB, free 66.7 MB)
15/05/18 17:35:50 INFO MemoryStore: ensureFreeSpace(31491) called with curMem=230868, maxMem=70177259
15/05/18 17:35:50 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 30.8 KB, free 66.7 MB)
15/05/18 17:35:50 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:60317 (size: 30.8 KB, free: 66.9 MB)
15/05/18 17:35:50 INFO BlockManagerMaster: Updated info of block broadcast_0_piece0
15/05/18 17:35:50 INFO SparkContext: Created broadcast 0 from hadoopFile at AvroRelation.scala:82
15/05/18 17:35:50 INFO S3AFileSystem: Getting path status for s3a://bucket-name/avro_data/episodes.avro (avro_data/episodes.avro)
15/05/18 17:35:50 INFO FileInputFormat: Total input paths to process : 1
15/05/18 17:35:50 INFO SparkContext: Starting job: runJob at SparkPlan.scala:122
15/05/18 17:35:50 INFO DAGScheduler: Got job 0 (runJob at SparkPlan.scala:122) with 1 output partitions (allowLocal=false)
15/05/18 17:35:50 INFO DAGScheduler: Final stage: Stage 0(runJob at SparkPlan.scala:122)
15/05/18 17:35:50 INFO DAGScheduler: Parents of final stage: List()
15/05/18 17:35:50 INFO DAGScheduler: Missing parents: List()
15/05/18 17:35:50 INFO DAGScheduler: Submitting Stage 0 (MapPartitionsRDD[2] at map at SparkPlan.scala:97), which has no missing parents
15/05/18 17:35:50 INFO MemoryStore: ensureFreeSpace(3448) called with curMem=262359, maxMem=70177259
15/05/18 17:35:50 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 3.4 KB, free 66.7 MB)
15/05/18 17:35:50 INFO MemoryStore: ensureFreeSpace(2386) called with curMem=265807, maxMem=70177259
15/05/18 17:35:50 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 2.3 KB, free 66.7 MB)
15/05/18 17:35:50 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:60317 (size: 2.3 KB, free: 66.9 MB)
15/05/18 17:35:50 INFO BlockManagerMaster: Updated info of block broadcast_1_piece0
15/05/18 17:35:50 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:839
15/05/18 17:35:50 INFO DAGScheduler: Submitting 1 missing tasks from Stage 0 (MapPartitionsRDD[2] at map at SparkPlan.scala:97)
15/05/18 17:35:50 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
15/05/18 17:35:50 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, PROCESS_LOCAL, 1306 bytes)
15/05/18 17:35:50 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
15/05/18 17:35:50 INFO HadoopRDD: Input split: s3a://bucket-name/avro_data/episodes.avro:0+1
15/05/18 17:35:50 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
15/05/18 17:35:50 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
15/05/18 17:35:50 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
15/05/18 17:35:50 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
15/05/18 17:35:50 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
15/05/18 17:35:50 INFO S3AFileSystem: Getting path status for s3a://bucket-name/avro_data/episodes.avro (avro_data/episodes.avro)
15/05/18 17:35:51 INFO S3AFileSystem: Opening 's3a://bucket-name/avro_data/episodes.avro' for reading
15/05/18 17:35:51 INFO S3AFileSystem: Getting path status for s3a://bucket-name/avro_data/episodes.avro (avro_data/episodes.avro)
15/05/18 17:35:51 INFO S3AFileSystem: Actually opening file avro_data/episodes.avro at pos 0
15/05/18 17:35:51 INFO S3AFileSystem: Reopening avro_data/episodes.avro to seek to new offset -4
15/05/18 17:35:51 INFO S3AFileSystem: Actually opening file avro_data/episodes.avro at pos 0
15/05/18 17:35:53 INFO S3AFileSystem: Reopening avro_data/episodes.avro to seek to new offset -597
15/05/18 17:35:53 INFO S3AFileSystem: Actually opening file avro_data/episodes.avro at pos 0
15/05/18 17:35:53 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 1800 bytes result sent to driver
15/05/18 17:35:53 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 2782 ms on localhost (1/1)
15/05/18 17:35:53 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
15/05/18 17:35:53 INFO DAGScheduler: Stage 0 (runJob at SparkPlan.scala:122) finished in 2.797 s
15/05/18 17:35:53 INFO DAGScheduler: Job 0 finished: runJob at SparkPlan.scala:122, took 2.974724 s
15/05/18 17:35:53 INFO SparkContext: Starting job: runJob at SparkPlan.scala:122
15/05/18 17:35:53 INFO DAGScheduler: Got job 1 (runJob at SparkPlan.scala:122) with 596 output partitions (allowLocal=false)
15/05/18 17:35:53 INFO DAGScheduler: Final stage: Stage 1(runJob at SparkPlan.scala:122)
15/05/18 17:35:53 INFO DAGScheduler: Parents of final stage: List()
15/05/18 17:35:53 INFO DAGScheduler: Missing parents: List()
15/05/18 17:35:53 INFO DAGScheduler: Submitting Stage 1 (MapPartitionsRDD[2] at map at SparkPlan.scala:97), which has no missing parents
15/05/18 17:35:53 INFO MemoryStore: ensureFreeSpace(3448) called with curMem=268193, maxMem=70177259
15/05/18 17:35:53 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 3.4 KB, free 66.7 MB)
15/05/18 17:35:53 INFO MemoryStore: ensureFreeSpace(2386) called with curMem=271641, maxMem=70177259
15/05/18 17:35:53 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 2.3 KB, free 66.7 MB)
15/05/18 17:35:53 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on localhost:60317 (size: 2.3 KB, free: 66.9 MB)
15/05/18 17:35:53 INFO BlockManagerMaster: Updated info of block broadcast_2_piece0
15/05/18 17:35:53 INFO SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:839
15/05/18 17:35:53 INFO DAGScheduler: Submitting 596 missing tasks from Stage 1 (MapPartitionsRDD[2] at map at SparkPlan.scala:97)
15/05/18 17:35:53 INFO TaskSchedulerImpl: Adding task set 1.0 with 596 tasks
15/05/18 17:35:53 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 1, localhost, PROCESS_LOCAL, 1306 bytes)
15/05/18 17:35:53 INFO Executor: Running task 0.0 in stage 1.0 (TID 1)
15/05/18 17:35:53 INFO HadoopRDD: Input split: s3a://bucket-name/avro_data/episodes.avro:1+1
15/05/18 17:35:53 INFO S3AFileSystem: Getting path status for s3a://bucket-name/avro_data/episodes.avro (avro_data/episodes.avro)
15/05/18 17:35:54 INFO S3AFileSystem: Opening 's3a://bucket-name/avro_data/episodes.avro' for reading
15/05/18 17:35:54 INFO S3AFileSystem: Getting path status for s3a://bucket-name/avro_data/episodes.avro (avro_data/episodes.avro)
15/05/18 17:35:54 INFO S3AFileSystem: Actually opening file avro_data/episodes.avro at pos 0
15/05/18 17:35:54 INFO S3AFileSystem: Reopening avro_data/episodes.avro to seek to new offset -4
15/05/18 17:35:54 INFO S3AFileSystem: Actually opening file avro_data/episodes.avro at pos 0
15/05/18 17:35:55 INFO S3AFileSystem: Reopening avro_data/episodes.avro to seek to new offset -596
15/05/18 17:35:55 INFO S3AFileSystem: Actually opening file avro_data/episodes.avro at pos 1
15/05/18 17:35:56 INFO Executor: Finished task 0.0 in stage 1.0 (TID 1). 1800 bytes result sent to driver
15/05/18 17:35:56 INFO TaskSetManager: Starting task 1.0 in stage 1.0 (TID 2, localhost, PROCESS_LOCAL, 1306 bytes)
15/05/18 17:35:56 INFO Executor: Running task 1.0 in stage 1.0 (TID 2)
15/05/18 17:35:56 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 1) in 2224 ms on localhost (1/596)
15/05/18 17:35:56 INFO HadoopRDD: Input split: s3a://bucket-name/avro_data/episodes.avro:2+1
15/05/18 17:35:56 INFO S3AFileSystem: Getting path status for s3a://bucket-name/avro_data/episodes.avro (avro_data/episodes.avro)
15/05/18 17:35:56 INFO BlockManager: Removing broadcast 1
15/05/18 17:35:56 INFO BlockManager: Removing block broadcast_1_piece0
15/05/18 17:35:56 INFO MemoryStore: Block broadcast_1_piece0 of size 2386 dropped from memory (free 69905618)
15/05/18 17:35:56 INFO BlockManagerInfo: Removed broadcast_1_piece0 on localhost:60317 in memory (size: 2.3 KB, free: 66.9 MB)
15/05/18 17:35:56 INFO BlockManagerMaster: Updated info of block broadcast_1_piece0
15/05/18 17:35:56 INFO BlockManager: Removing block broadcast_1
15/05/18 17:35:56 INFO MemoryStore: Block broadcast_1 of size 3448 dropped from memory (free 69909066)
15/05/18 17:35:56 INFO ContextCleaner: Cleaned broadcast 1
15/05/18 17:35:56 INFO S3AFileSystem: Opening 's3a://bucket-name/avro_data/episodes.avro' for reading
15/05/18 17:35:56 INFO S3AFileSystem: Getting path status for s3a://bucket-name/avro_data/episodes.avro (avro_data/episodes.avro)
15/05/18 17:35:56 INFO S3AFileSystem: Actually opening file avro_data/episodes.avro at pos 0
15/05/18 17:35:57 INFO S3AFileSystem: Reopening avro_data/episodes.avro to seek to new offset -4
15/05/18 17:35:57 INFO S3AFileSystem: Actually opening file avro_data/episodes.avro at pos 0
15/05/18 17:35:58 INFO S3AFileSystem: Reopening avro_data/episodes.avro to seek to new offset -595
15/05/18 17:35:58 INFO S3AFileSystem: Actually opening file avro_data/episodes.avro at pos 2
15/05/18 17:35:58 INFO Executor: Finished task 1.0 in stage 1.0 (TID 2). 1800 bytes result sent to driver
15/05/18 17:35:58 INFO TaskSetManager: Starting task 2.0 in stage 1.0 (TID 3, localhost, PROCESS_LOCAL, 1306 bytes)
15/05/18 17:35:58 INFO Executor: Running task 2.0 in stage 1.0 (TID 3)
15/05/18 17:35:58 INFO TaskSetManager: Finished task 1.0 in stage 1.0 (TID 2) in 2655 ms on localhost (2/596)
15/05/18 17:35:58 INFO HadoopRDD: Input split: s3a://bucket-name/avro_data/episodes.avro:3+1
15/05/18 17:35:58 INFO S3AFileSystem: Getting path status for s3a://bucket-name/avro_data/episodes.avro (avro_data/episodes.avro)
15/05/18 17:35:58 INFO S3AFileSystem: Opening 's3a://bucket-name/avro_data/episodes.avro' for reading
15/05/18 17:35:58 INFO S3AFileSystem: Getting path status for s3a://bucket-name/avro_data/episodes.avro (avro_data/episodes.avro)
15/05/18 17:35:59 INFO S3AFileSystem: Actually opening file avro_data/episodes.avro at pos 0
. . . .

And this is my code:

    public static void main(String[] args) {
        System.out.println("START...");
        SparkConf conf = new SparkConf().setAppName("DataFrameDemo").setMaster("local");
        JavaSparkContext sc = new JavaSparkContext(conf);
        Configuration config = sc.hadoopConfiguration();
        // for s3a
        config.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem");
        config.set("fs.s3a.access.key", "**********************");
        config.set("fs.s3a.secret.key", "***********************************");
        SQLContext sqlContext = new SQLContext(sc);
        DataFrame df = sqlContext.load("s3a://bucket-name/avro_data/episodes.avro", "com.databricks.spark.avro");
        // DataFrame df = sqlContext.load("/Users/miqbal1/avro_data/episodes.avro", "com.databricks.spark.avro");
        df.show();
        df.printSchema();
        df.select("name").show();
        System.out.println("DONE");
    }

The same code works fine with a local file, though. Am I missing something here? Any help would be highly appreciated. Thank you.
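For what it's worth, the trace suggests the job isn't actually hanging: stage 1 was submitted with 596 tasks, and each input split covers a single byte ("Input split: ...episodes.avro:1+1", "... (1/596)"), so the driver is grinding through hundreds of one-byte S3 reads. The old-style FileInputFormat sizes splits as max(minSize, min(goalSize, blockSize)), so a tiny reported block size yields roughly one split per byte. A minimal sketch of that arithmetic (plain Java; the 597-byte file length is inferred from the seek offsets in the log, and the 1-byte block size is a hypothetical value chosen to reproduce the symptom):

```java
// Sketch of FileInputFormat's split sizing: splitSize = max(minSize, min(goalSize, blockSize)).
// With a (hypothetical) reported block size of 1 byte, a ~597-byte file becomes
// ~597 one-byte splits, which is in line with the 596 tasks seen in the log.
public class SplitMath {
    static long splitSize(long goalSize, long minSize, long blockSize) {
        return Math.max(minSize, Math.min(goalSize, blockSize));
    }

    public static void main(String[] args) {
        long fileLen = 597;                        // approximate file length, from the log offsets
        long split = splitSize(fileLen, 1, 1);     // block size of 1 byte (hypothetical)
        long numSplits = (fileLen + split - 1) / split;
        System.out.println(numSplits);             // prints 597: one split per byte
    }
}
```

If that is what is happening here, nudging the reported block size up (e.g. via the `fs.s3a.block.size` Hadoop property) should collapse the file back into a single split; that property name is from the Hadoop S3A configuration, not from this thread.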
Tariq, Mohammad
about.me/mti

On Sun, May 17, 2015 at 8:51 PM, Akhil Das <ak...@sigmoidanalytics.com> wrote:

> I think you can try this way also:
>
> DataFrame df = sqlContext.load("s3n://ACCESS-KEY:SECRET-KEY@bucket-name/file.avro", "com.databricks.spark.avro");
>
> Thanks
> Best Regards
>
> On Sat, May 16, 2015 at 2:02 AM, Mohammad Tariq <donta...@gmail.com> wrote:
>
>> Thanks for the suggestion Steve. I'll try that out.
>>
>> Read the long story last night while struggling with this :). I made sure that I don't have any '/' in my key.
>>
>> On Saturday, May 16, 2015, Steve Loughran <ste...@hortonworks.com> wrote:
>>
>>> > On 15 May 2015, at 21:20, Mohammad Tariq <donta...@gmail.com> wrote:
>>> >
>>> > Thank you Ayan and Ted for the prompt response. It isn't working with s3n either.
>>> >
>>> > And I am able to download the file. In fact I am able to read the same file using the s3 API without any issue.
>>>
>>> Sounds like an S3n config problem. Check your configurations - you can test locally via the hdfs dfs command without even starting Spark.
>>>
>>> Oh, and if there is a "/" in your secret key, you're going to need to generate a new one. Long story.
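Steve's suggestion of testing outside Spark amounts to putting the credentials where the plain `hdfs dfs` / `hadoop fs` tooling can see them and listing the bucket directly. A sketch of the relevant core-site.xml fragment (property names from the Hadoop s3a documentation; the values are placeholders, not from this thread):

```
<property>
  <name>fs.s3a.impl</name>
  <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
</property>
<property>
  <name>fs.s3a.access.key</name>
  <value>YOUR_ACCESS_KEY</value>
</property>
<property>
  <name>fs.s3a.secret.key</name>
  <value>YOUR_SECRET_KEY</value>
</property>
```

With that in place, `hdfs dfs -ls s3a://bucket-name/avro_data/` exercises the same filesystem code path as the Spark job, so a failure there rules Spark out of the picture.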