Thanks - this is very helpful!

On Thu, Nov 27, 2014 at 5:20 AM, Michael Armbrust <mich...@databricks.com> wrote:
> In the past I have worked around this problem by avoiding sc.textFile().
> Instead I read the data directly inside of a Spark job. Basically, you
> start with an RDD where each entry is a file in S3 and then flatMap that
> with something that reads the files and returns the lines.
>
> Here's an example: https://gist.github.com/marmbrus/fff0b058f134fa7752fe
>
> Using this class you can do something like:
>
> sc.parallelize("s3n://mybucket/file1" :: "s3n://mybucket/file1" ... :: Nil).flatMap(new ReadLinesSafe(_))
>
> You can also build up the list of files by running a Spark job:
> https://gist.github.com/marmbrus/15e72f7bc22337cf6653
>
> Michael
>
> On Wed, Nov 26, 2014 at 9:23 AM, Aaron Davidson <ilike...@gmail.com> wrote:
>
>> Spark has a known problem where it will do a pass of metadata on a large
>> number of small files serially, in order to find the partition information
>> prior to starting the job. This will probably not be repaired by switching
>> the FS impl.
>>
>> However, you can change the FS being used like so (prior to the first
>> usage):
>> sc.hadoopConfiguration.set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
>>
>> On Wed, Nov 26, 2014 at 1:47 AM, Tomer Benyamini <tomer....@gmail.com> wrote:
>>
>>> Thanks Lalit; Setting the access + secret keys in the configuration
>>> works even when calling sc.textFile. Is there a way to select which hadoop
>>> s3 native filesystem implementation would be used at runtime using the
>>> hadoop configuration?
>>>
>>> Thanks,
>>> Tomer
>>>
>>> On Wed, Nov 26, 2014 at 11:08 AM, lalit1303 <la...@sigmoidanalytics.com> wrote:
>>>
>>>>
>>>> you can try creating hadoop Configuration and set s3 configuration i.e.
>>>> access keys etc.
>>>> Now, for reading files from s3 use newAPIHadoopFile and pass the config
>>>> object here along with key, value classes.
>>>>
>>>> -----
>>>> Lalit Yadav
>>>> la...@sigmoidanalytics.com
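
For anyone reading this thread later, here is a minimal sketch of the pattern Michael describes: start from a list of file paths and flatMap each path through a function that reads the file and returns its lines. This is not the code from the gist. The names ReadLinesSketch, readLinesSafe and readAll are hypothetical, local files stand in for S3 objects, and a plain Scala Seq stands in for the RDD so the example runs without a Spark dependency. The assumption is that in a Spark job the same function would be passed to flatMap on the RDD built from the path list.

```scala
import scala.io.Source
import scala.util.control.NonFatal

object ReadLinesSketch {

  // Hypothetical helper: read every line of one file.
  // On any non-fatal error (missing file, permission problem, ...)
  // it logs to stderr and returns no lines, so one bad file
  // does not fail the whole pass.
  def readLinesSafe(path: String): List[String] = {
    try {
      val src = Source.fromFile(path, "UTF-8")
      // Materialize the lines before closing the underlying stream.
      try src.getLines().toList
      finally src.close()
    } catch {
      case NonFatal(e) =>
        System.err.println(s"Skipping $path: ${e.getMessage}")
        Nil
    }
  }

  // One entry per file, flatMapped into one entry per line.
  // With Spark this would be the flatMap step applied to the
  // RDD of paths instead of a local Seq (assumption, not run here).
  def readAll(paths: Seq[String]): Seq[String] =
    paths.flatMap(readLinesSafe)
}
```

Because each path is read inside the flatMap, the per-file work happens in the tasks themselves rather than in a serial metadata pass up front, which is the point of the workaround. Returning an empty list on failure is a design choice of this sketch; a real job may prefer to fail loudly or collect the bad paths.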