Hello,

I'm building a Spark app that needs to read a large number of log files from
S3. I construct the file list in code and pass it to the context as follows:

val myRDD = sc.textFile("s3n://mybucket/file1,s3n://mybucket/file2,...,s3n://mybucket/fileN")

When running locally there are no issues, but when running on the cluster in
yarn-cluster mode (Spark 1.1.0, Hadoop 2.4), I see each input file being
listed sequentially on the driver, which looks like it could easily be
parallelized:


[main] INFO  com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem - listStatus s3n://mybucket/file1
[main] INFO  com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem - listStatus s3n://mybucket/file2
[main] INFO  com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem - listStatus s3n://mybucket/file3
...
[main] INFO  com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem - listStatus s3n://mybucket/fileN
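
Each of those listStatus calls is an S3 round trip made from the driver while
the HadoopRDD computes its partitions, so they serialize. A workaround I'm
considering (only a sketch, untested on EMR; "paths" stands in for however the
file list is actually built, and it assumes creating RDDs from several driver
threads is safe, which it generally is since that part is just driver-side
bookkeeping) is to force the partition computation per file from a parallel
collection and union the results:

val paths: Seq[String] = (1 to 100).map(i => s"s3n://mybucket/file$i") // hypothetical list
val rdds = paths.par.map { p =>   // .par runs on a thread pool, overlapping the S3 round trips
  val rdd = sc.textFile(p)
  rdd.partitions                  // forces the listStatus for this one path now, on this thread
  rdd
}.seq
val myRDD = sc.union(rdds)

If the files share a common prefix, an even simpler option may be a single
glob, e.g. sc.textFile("s3n://mybucket/file*"), which should replace the N
per-file lookups with one listing.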


I believe there is some difference between my local classpath and the
cluster's: locally I see that
*org.apache.hadoop.fs.s3native.NativeS3FileSystem* is being used, whereas on
the cluster *com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem* is being
used. Any suggestions?
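
For what it's worth, a quick way to confirm which implementation resolves the
s3n scheme in each environment (using the standard Hadoop FileSystem API; the
fs.s3n.impl lookup just shows an explicit override, if one is set) is:

import java.net.URI
import org.apache.hadoop.fs.FileSystem

val fs = FileSystem.get(new URI("s3n://mybucket/"), sc.hadoopConfiguration)
println(fs.getClass.getName)                       // class actually handling s3n
println(sc.hadoopConfiguration.get("fs.s3n.impl")) // explicit override, may be null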


Thanks,

Tomer
