Tried almost all the options, but none of them worked. So I ended up
creating a new IAM user, and that user's keys are working fine. I am no
longer getting the Forbidden (403) exception, but now my program seems to
run forever. It doesn't throw any exception; it just keeps running with
the following trace:

.
.
.
.
15/05/18 17:35:44 INFO HttpServer: Starting HTTP Server
15/05/18 17:35:44 INFO Server: jetty-8.y.z-SNAPSHOT
15/05/18 17:35:44 INFO AbstractConnector: Started
SocketConnector@0.0.0.0:60316
15/05/18 17:35:44 INFO Utils: Successfully started service 'HTTP file
server' on port 60316.
15/05/18 17:35:44 INFO SparkEnv: Registering OutputCommitCoordinator
15/05/18 17:35:44 INFO Server: jetty-8.y.z-SNAPSHOT
15/05/18 17:35:44 INFO AbstractConnector: Started
SelectChannelConnector@0.0.0.0:4040
15/05/18 17:35:44 INFO Utils: Successfully started service 'SparkUI' on
port 4040.
15/05/18 17:35:44 INFO SparkUI: Started SparkUI at http://172.28.210.74:4040
15/05/18 17:35:44 INFO Executor: Starting executor ID <driver> on host
localhost
15/05/18 17:35:44 INFO AkkaUtils: Connecting to HeartbeatReceiver:
akka.tcp://sparkDriver@172.28.210.74:60315/user/HeartbeatReceiver
15/05/18 17:35:44 INFO NettyBlockTransferService: Server created on 60317
15/05/18 17:35:44 INFO BlockManagerMaster: Trying to register BlockManager
15/05/18 17:35:44 INFO BlockManagerMasterActor: Registering block manager
localhost:60317 with 66.9 MB RAM, BlockManagerId(<driver>, localhost, 60317)
15/05/18 17:35:44 INFO BlockManagerMaster: Registered BlockManager
15/05/18 17:35:45 WARN AmazonHttpClient: Detected a possible problem with
the current JVM version (1.6.0_65).  If you experience XML parsing problems
using the SDK, try upgrading to a more recent JVM update.
15/05/18 17:35:47 INFO S3AFileSystem: Getting path status for
s3a://bucket-name/avro_data/episodes.avro (avro_data/episodes.avro)
15/05/18 17:35:47 INFO S3AFileSystem: Getting path status for
s3a://bucket-name/avro_data/episodes.avro (avro_data/episodes.avro)
15/05/18 17:35:47 INFO S3AFileSystem: Getting path status for
s3a://bucket-name/avro_data/episodes.avro (avro_data/episodes.avro)
15/05/18 17:35:48 INFO S3AFileSystem: Opening
's3a://bucket-name/avro_data/episodes.avro' for reading
15/05/18 17:35:48 INFO S3AFileSystem: Getting path status for
s3a://bucket-name/avro_data/episodes.avro (avro_data/episodes.avro)
15/05/18 17:35:48 INFO S3AFileSystem: Actually opening file
avro_data/episodes.avro at pos 0
15/05/18 17:35:48 INFO S3AFileSystem: Reopening avro_data/episodes.avro to
seek to new offset -4
15/05/18 17:35:48 INFO S3AFileSystem: Actually opening file
avro_data/episodes.avro at pos 0
15/05/18 17:35:50 INFO MemoryStore: ensureFreeSpace(230868) called with
curMem=0, maxMem=70177259
15/05/18 17:35:50 INFO MemoryStore: Block broadcast_0 stored as values in
memory (estimated size 225.5 KB, free 66.7 MB)
15/05/18 17:35:50 INFO MemoryStore: ensureFreeSpace(31491) called with
curMem=230868, maxMem=70177259
15/05/18 17:35:50 INFO MemoryStore: Block broadcast_0_piece0 stored as
bytes in memory (estimated size 30.8 KB, free 66.7 MB)
15/05/18 17:35:50 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory
on localhost:60317 (size: 30.8 KB, free: 66.9 MB)
15/05/18 17:35:50 INFO BlockManagerMaster: Updated info of block
broadcast_0_piece0
15/05/18 17:35:50 INFO SparkContext: Created broadcast 0 from hadoopFile at
AvroRelation.scala:82
15/05/18 17:35:50 INFO S3AFileSystem: Getting path status for
s3a://bucket-name/avro_data/episodes.avro (avro_data/episodes.avro)
15/05/18 17:35:50 INFO FileInputFormat: Total input paths to process : 1
15/05/18 17:35:50 INFO SparkContext: Starting job: runJob at
SparkPlan.scala:122
15/05/18 17:35:50 INFO DAGScheduler: Got job 0 (runJob at
SparkPlan.scala:122) with 1 output partitions (allowLocal=false)
15/05/18 17:35:50 INFO DAGScheduler: Final stage: Stage 0(runJob at
SparkPlan.scala:122)
15/05/18 17:35:50 INFO DAGScheduler: Parents of final stage: List()
15/05/18 17:35:50 INFO DAGScheduler: Missing parents: List()
15/05/18 17:35:50 INFO DAGScheduler: Submitting Stage 0
(MapPartitionsRDD[2] at map at SparkPlan.scala:97), which has no missing
parents
15/05/18 17:35:50 INFO MemoryStore: ensureFreeSpace(3448) called with
curMem=262359, maxMem=70177259
15/05/18 17:35:50 INFO MemoryStore: Block broadcast_1 stored as values in
memory (estimated size 3.4 KB, free 66.7 MB)
15/05/18 17:35:50 INFO MemoryStore: ensureFreeSpace(2386) called with
curMem=265807, maxMem=70177259
15/05/18 17:35:50 INFO MemoryStore: Block broadcast_1_piece0 stored as
bytes in memory (estimated size 2.3 KB, free 66.7 MB)
15/05/18 17:35:50 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory
on localhost:60317 (size: 2.3 KB, free: 66.9 MB)
15/05/18 17:35:50 INFO BlockManagerMaster: Updated info of block
broadcast_1_piece0
15/05/18 17:35:50 INFO SparkContext: Created broadcast 1 from broadcast at
DAGScheduler.scala:839
15/05/18 17:35:50 INFO DAGScheduler: Submitting 1 missing tasks from Stage
0 (MapPartitionsRDD[2] at map at SparkPlan.scala:97)
15/05/18 17:35:50 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
15/05/18 17:35:50 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID
0, localhost, PROCESS_LOCAL, 1306 bytes)
15/05/18 17:35:50 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
15/05/18 17:35:50 INFO HadoopRDD: Input split:
s3a://bucket-name/avro_data/episodes.avro:0+1
15/05/18 17:35:50 INFO deprecation: mapred.tip.id is deprecated. Instead,
use mapreduce.task.id
15/05/18 17:35:50 INFO deprecation: mapred.task.id is deprecated. Instead,
use mapreduce.task.attempt.id
15/05/18 17:35:50 INFO deprecation: mapred.task.is.map is deprecated.
Instead, use mapreduce.task.ismap
15/05/18 17:35:50 INFO deprecation: mapred.task.partition is deprecated.
Instead, use mapreduce.task.partition
15/05/18 17:35:50 INFO deprecation: mapred.job.id is deprecated. Instead,
use mapreduce.job.id
15/05/18 17:35:50 INFO S3AFileSystem: Getting path status for
s3a://bucket-name/avro_data/episodes.avro (avro_data/episodes.avro)
15/05/18 17:35:51 INFO S3AFileSystem: Opening
's3a://bucket-name/avro_data/episodes.avro' for reading
15/05/18 17:35:51 INFO S3AFileSystem: Getting path status for
s3a://bucket-name/avro_data/episodes.avro (avro_data/episodes.avro)
15/05/18 17:35:51 INFO S3AFileSystem: Actually opening file
avro_data/episodes.avro at pos 0
15/05/18 17:35:51 INFO S3AFileSystem: Reopening avro_data/episodes.avro to
seek to new offset -4
15/05/18 17:35:51 INFO S3AFileSystem: Actually opening file
avro_data/episodes.avro at pos 0
15/05/18 17:35:53 INFO S3AFileSystem: Reopening avro_data/episodes.avro to
seek to new offset -597
15/05/18 17:35:53 INFO S3AFileSystem: Actually opening file
avro_data/episodes.avro at pos 0
15/05/18 17:35:53 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0).
1800 bytes result sent to driver
15/05/18 17:35:53 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID
0) in 2782 ms on localhost (1/1)
15/05/18 17:35:53 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks
have all completed, from pool
15/05/18 17:35:53 INFO DAGScheduler: Stage 0 (runJob at
SparkPlan.scala:122) finished in 2.797 s
15/05/18 17:35:53 INFO DAGScheduler: Job 0 finished: runJob at
SparkPlan.scala:122, took 2.974724 s
15/05/18 17:35:53 INFO SparkContext: Starting job: runJob at
SparkPlan.scala:122
15/05/18 17:35:53 INFO DAGScheduler: Got job 1 (runJob at
SparkPlan.scala:122) with 596 output partitions (allowLocal=false)
15/05/18 17:35:53 INFO DAGScheduler: Final stage: Stage 1(runJob at
SparkPlan.scala:122)
15/05/18 17:35:53 INFO DAGScheduler: Parents of final stage: List()
15/05/18 17:35:53 INFO DAGScheduler: Missing parents: List()
15/05/18 17:35:53 INFO DAGScheduler: Submitting Stage 1
(MapPartitionsRDD[2] at map at SparkPlan.scala:97), which has no missing
parents
15/05/18 17:35:53 INFO MemoryStore: ensureFreeSpace(3448) called with
curMem=268193, maxMem=70177259
15/05/18 17:35:53 INFO MemoryStore: Block broadcast_2 stored as values in
memory (estimated size 3.4 KB, free 66.7 MB)
15/05/18 17:35:53 INFO MemoryStore: ensureFreeSpace(2386) called with
curMem=271641, maxMem=70177259
15/05/18 17:35:53 INFO MemoryStore: Block broadcast_2_piece0 stored as
bytes in memory (estimated size 2.3 KB, free 66.7 MB)
15/05/18 17:35:53 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory
on localhost:60317 (size: 2.3 KB, free: 66.9 MB)
15/05/18 17:35:53 INFO BlockManagerMaster: Updated info of block
broadcast_2_piece0
15/05/18 17:35:53 INFO SparkContext: Created broadcast 2 from broadcast at
DAGScheduler.scala:839
15/05/18 17:35:53 INFO DAGScheduler: Submitting 596 missing tasks from
Stage 1 (MapPartitionsRDD[2] at map at SparkPlan.scala:97)
15/05/18 17:35:53 INFO TaskSchedulerImpl: Adding task set 1.0 with 596 tasks
15/05/18 17:35:53 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID
1, localhost, PROCESS_LOCAL, 1306 bytes)
15/05/18 17:35:53 INFO Executor: Running task 0.0 in stage 1.0 (TID 1)
15/05/18 17:35:53 INFO HadoopRDD: Input split:
s3a://bucket-name/avro_data/episodes.avro:1+1
15/05/18 17:35:53 INFO S3AFileSystem: Getting path status for
s3a://bucket-name/avro_data/episodes.avro (avro_data/episodes.avro)
15/05/18 17:35:54 INFO S3AFileSystem: Opening
's3a://bucket-name/avro_data/episodes.avro' for reading
15/05/18 17:35:54 INFO S3AFileSystem: Getting path status for
s3a://bucket-name/avro_data/episodes.avro (avro_data/episodes.avro)
15/05/18 17:35:54 INFO S3AFileSystem: Actually opening file
avro_data/episodes.avro at pos 0
15/05/18 17:35:54 INFO S3AFileSystem: Reopening avro_data/episodes.avro to
seek to new offset -4
15/05/18 17:35:54 INFO S3AFileSystem: Actually opening file
avro_data/episodes.avro at pos 0
15/05/18 17:35:55 INFO S3AFileSystem: Reopening avro_data/episodes.avro to
seek to new offset -596
15/05/18 17:35:55 INFO S3AFileSystem: Actually opening file
avro_data/episodes.avro at pos 1
15/05/18 17:35:56 INFO Executor: Finished task 0.0 in stage 1.0 (TID 1).
1800 bytes result sent to driver
15/05/18 17:35:56 INFO TaskSetManager: Starting task 1.0 in stage 1.0 (TID
2, localhost, PROCESS_LOCAL, 1306 bytes)
15/05/18 17:35:56 INFO Executor: Running task 1.0 in stage 1.0 (TID 2)
15/05/18 17:35:56 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID
1) in 2224 ms on localhost (1/596)
15/05/18 17:35:56 INFO HadoopRDD: Input split:
s3a://bucket-name/avro_data/episodes.avro:2+1
15/05/18 17:35:56 INFO S3AFileSystem: Getting path status for
s3a://bucket-name/avro_data/episodes.avro (avro_data/episodes.avro)
15/05/18 17:35:56 INFO BlockManager: Removing broadcast 1
15/05/18 17:35:56 INFO BlockManager: Removing block broadcast_1_piece0
15/05/18 17:35:56 INFO MemoryStore: Block broadcast_1_piece0 of size 2386
dropped from memory (free 69905618)
15/05/18 17:35:56 INFO BlockManagerInfo: Removed broadcast_1_piece0 on
localhost:60317 in memory (size: 2.3 KB, free: 66.9 MB)
15/05/18 17:35:56 INFO BlockManagerMaster: Updated info of block
broadcast_1_piece0
15/05/18 17:35:56 INFO BlockManager: Removing block broadcast_1
15/05/18 17:35:56 INFO MemoryStore: Block broadcast_1 of size 3448 dropped
from memory (free 69909066)
15/05/18 17:35:56 INFO ContextCleaner: Cleaned broadcast 1
15/05/18 17:35:56 INFO S3AFileSystem: Opening
's3a://bucket-name/avro_data/episodes.avro' for reading
15/05/18 17:35:56 INFO S3AFileSystem: Getting path status for
s3a://bucket-name/avro_data/episodes.avro (avro_data/episodes.avro)
15/05/18 17:35:56 INFO S3AFileSystem: Actually opening file
avro_data/episodes.avro at pos 0
15/05/18 17:35:57 INFO S3AFileSystem: Reopening avro_data/episodes.avro to
seek to new offset -4
15/05/18 17:35:57 INFO S3AFileSystem: Actually opening file
avro_data/episodes.avro at pos 0
15/05/18 17:35:58 INFO S3AFileSystem: Reopening avro_data/episodes.avro to
seek to new offset -595
15/05/18 17:35:58 INFO S3AFileSystem: Actually opening file
avro_data/episodes.avro at pos 2
15/05/18 17:35:58 INFO Executor: Finished task 1.0 in stage 1.0 (TID 2).
1800 bytes result sent to driver
15/05/18 17:35:58 INFO TaskSetManager: Starting task 2.0 in stage 1.0 (TID
3, localhost, PROCESS_LOCAL, 1306 bytes)
15/05/18 17:35:58 INFO Executor: Running task 2.0 in stage 1.0 (TID 3)
15/05/18 17:35:58 INFO TaskSetManager: Finished task 1.0 in stage 1.0 (TID
2) in 2655 ms on localhost (2/596)
15/05/18 17:35:58 INFO HadoopRDD: Input split:
s3a://bucket-name/avro_data/episodes.avro:3+1
15/05/18 17:35:58 INFO S3AFileSystem: Getting path status for
s3a://bucket-name/avro_data/episodes.avro (avro_data/episodes.avro)
15/05/18 17:35:58 INFO S3AFileSystem: Opening
's3a://bucket-name/avro_data/episodes.avro' for reading
15/05/18 17:35:58 INFO S3AFileSystem: Getting path status for
s3a://bucket-name/avro_data/episodes.avro (avro_data/episodes.avro)
15/05/18 17:35:59 INFO S3AFileSystem: Actually opening file
avro_data/episodes.avro at pos 0
.
.
.
.

And this is my code:

import org.apache.hadoop.conf.Configuration;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public class DataFrameDemo {

    public static void main(String[] args) {
        System.out.println("START...");

        SparkConf conf = new SparkConf().setAppName("DataFrameDemo").setMaster("local");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // S3A configuration (keys redacted)
        Configuration config = sc.hadoopConfiguration();
        config.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem");
        config.set("fs.s3a.access.key", "**********************");
        config.set("fs.s3a.secret.key", "***********************************");

        SQLContext sqlContext = new SQLContext(sc);
        DataFrame df = sqlContext.load("s3a://bucket-name/avro_data/episodes.avro",
                "com.databricks.spark.avro");
        // The same read against a local copy of the file works fine:
        // DataFrame df = sqlContext.load("/Users/miqbal1/avro_data/episodes.avro",
        //         "com.databricks.spark.avro");

        df.show();
        df.printSchema();
        df.select("name").show();
        System.out.println("DONE");
    }
}
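
Looking at the trace again, one thing stands out to me: job 1 was submitted
with 596 output partitions, and every input split is a single byte
(episodes.avro:0+1, :1+1, :2+1, ...). So the job may not be hung at all;
it may just be crawling through 596 one-byte tasks, each of which re-opens
the S3 object. That would suggest S3A is advertising a 1-byte block size
for the file. This is only a guess on my part, but if that is the cause,
something like the following might help (assuming the Hadoop build in use
honours the fs.s3a.block.size property):

// Guesswork, not a confirmed fix: advertise a larger block size so
// FileInputFormat computes sensible splits instead of one per byte.
config.set("fs.s3a.block.size", "33554432"); // 32 MB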

The same code works fine with a local file. Am I missing something here?
Any help would be highly appreciated.
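
In case it helps narrow this down, I can also take Spark out of the
picture and hit S3A directly through the Hadoop FileSystem API. This is
just a sketch with placeholder keys, assuming the same Hadoop jars are on
the classpath:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class S3ACheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem");
        conf.set("fs.s3a.access.key", "ACCESS-KEY"); // placeholder
        conf.set("fs.s3a.secret.key", "SECRET-KEY"); // placeholder

        Path path = new Path("s3a://bucket-name/avro_data/episodes.avro");
        FileSystem fs = path.getFileSystem(conf);

        // A sane length and block size here would mean the S3A setup is OK.
        FileStatus status = fs.getFileStatus(path);
        System.out.println("length=" + status.getLen()
                + ", blockSize=" + status.getBlockSize());

        // Reading a few bytes confirms the credentials allow GETs as well.
        FSDataInputStream in = fs.open(path);
        byte[] buf = new byte[16];
        int n = in.read(buf);
        in.close();
        System.out.println("read " + n + " bytes");
    }
}

If the printed block size comes back as something tiny, that would line up
with the one-byte splits in the trace above.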

Thank you.

Tariq, Mohammad
about.me/mti


On Sun, May 17, 2015 at 8:51 PM, Akhil Das <ak...@sigmoidanalytics.com>
wrote:

> I think you can try this way also:
>
> DataFrame df = 
> sqlContext.load("s3n://ACCESS-KEY:SECRET-KEY@bucket-name/file.avro",
> "com.databricks.spark.avro");
>
>
> Thanks
> Best Regards
>
> On Sat, May 16, 2015 at 2:02 AM, Mohammad Tariq <donta...@gmail.com>
> wrote:
>
>> Thanks for the suggestion Steve. I'll try that out.
>>
>> Read the long story last night while struggling with this :). I made sure
>> that I don't have any '/' in my key.
>>
>> On Saturday, May 16, 2015, Steve Loughran <ste...@hortonworks.com> wrote:
>>
>>>
>>> > On 15 May 2015, at 21:20, Mohammad Tariq <donta...@gmail.com> wrote:
>>> >
>>> > Thank you Ayan and Ted for the prompt response. It isn't working with
>>> s3n either.
>>> >
>>> > And I am able to download the file. In fact I am able to read the same
>>> file using s3 API without any issue.
>>> >
>>>
>>>
>>> Sounds like an S3n config problem. Check your configuration - you can
>>> test it locally via the hdfs dfs command without even starting Spark.
>>>
>>>  Oh, and if there is a "/" in your secret key, you're going to need
>>> to generate a new one. Long story.
>>>
>>
>>
>> --
>>
>> Tariq, Mohammad
>> about.me/mti
>>
>>
>>
>
