Hello, I'm building a Spark app that needs to read a large number of log files from S3. I build the file list in code and pass it to the context as follows:
    val myRDD = sc.textFile("s3n://mybucket/file1, s3n://mybucket/file2, ... , s3n://mybucket/fileN")

When I run it locally there are no issues, but when I run it in yarn-cluster mode (Spark 1.1.0, Hadoop 2.4), the input paths are listed sequentially, one listStatus call at a time, which could probably be parallelized easily:

[main] INFO com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem - listStatus s3n://mybucket/file1
[main] INFO com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem - listStatus s3n://mybucket/file2
[main] INFO com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem - listStatus s3n://mybucket/file3
...
[main] INFO com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem - listStatus s3n://mybucket/fileN

I believe there is a difference between my local classpath and the cluster's classpath: locally *org.apache.hadoop.fs.s3native.NativeS3FileSystem* is used, whereas on the cluster *com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem* is used.

Any suggestions?

Thanks,
Tomer
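
P.S. To illustrate what I mean by "parallelized", here is a rough, untested sketch of a workaround I'm considering. It builds one RDD per path from a Scala parallel collection and forces each RDD's driver-side file listing on its own thread, then unions the results. The paths sequence below is hypothetical (my real code builds the list dynamically), and I haven't verified that this behaves the same against EMR's S3NativeFileSystem:

    // Hypothetical list of input paths; my real code builds this dynamically.
    val paths: Seq[String] = Seq(
      "s3n://mybucket/file1",
      "s3n://mybucket/file2"
      // ..., "s3n://mybucket/fileN"
    )

    // Create one RDD per path on a parallel collection, so the driver-side
    // listStatus calls overlap instead of running one after another.
    val perFileRDDs = paths.par.map { path =>
      val rdd = sc.textFile(path)
      rdd.partitions // force the otherwise-lazy listStatus for this path now
      rdd
    }.seq

    // Union the per-file RDDs back into a single RDD of log lines.
    val myRDD = sc.union(perFileRDDs)

I'm assuming here that it's safe to create RDDs from multiple driver threads; if there's a cleaner way to get the paths listed in parallel, I'd be happy to hear it.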