The problem is that listing the metadata for all these files in S3 takes a long time. Something you can try is the following: split your files into several non-overlapping paths (e.g. s3n://bucket/purchase/2014/01, s3n://bucket/purchase/2014/02, etc.), then do sc.parallelize over a list of such paths, and in each task use a single-node S3 library to list the contents of that directory only and read the files. You can use Hadoop's FileSystem class, for example (FileSystem.open("s3n://...") or something like that). That way more nodes will be querying the metadata for these files in parallel.

Matei
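A minimal sketch of that approach in Scala, assuming a spark-shell where sc is in scope and the Hadoop configuration can reach S3; the bucket, prefixes, output path, and partition count below are placeholders, not values from this thread:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}
    import scala.io.Source

    // One non-overlapping prefix per month; each becomes one task that lists
    // and reads only its own subtree, spreading the S3 listing work across the cluster.
    val prefixes = (1 to 12).map(m => f"s3n://bucket/purchase/2014/$m%02d")

    val records = sc.parallelize(prefixes, prefixes.size).flatMap { prefix =>
      val path = new Path(prefix)
      // Build the Hadoop config inside the task (the driver-side config is not
      // serializable); it must carry the fs.s3n.* credentials or find them on the classpath.
      val fs = FileSystem.get(path.toUri, new Configuration())
      // Recursively list every file under this prefix, then read each one line by line.
      val files = fs.listFiles(path, true)
      val lines = scala.collection.mutable.ArrayBuffer[String]()
      while (files.hasNext) {
        val in = fs.open(files.next().getPath)
        try lines ++= Source.fromInputStream(in).getLines()
        finally in.close()
      }
      lines
    }

    records.coalesce(1000).saveAsTextFile("s3n://bucket/output")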
On Oct 6, 2014, at 12:59 PM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:

> Unfortunately not. Again, I wonder if adding support targeted at this "small files problem" would make sense for Spark core, as it is a common problem in our space.
>
> Right now, I don't know of any other options.
>
> Nick
>
> On Mon, Oct 6, 2014 at 2:24 PM, Landon Kuhn <lan...@janrain.com> wrote:
>
> Nicholas, thanks for the tip. Your suggestion certainly seemed like the right approach, but after a few days of fiddling I've come to the conclusion that s3distcp will not work for my use case. It is unable to flatten directory hierarchies, which I need because my source directories contain hour/minute/second parts.
>
> See https://forums.aws.amazon.com/message.jspa?messageID=478960. It seems that s3distcp can only combine files in the same path.
>
> Thanks again. That gave me a lot to go on. Any further suggestions?
>
> L
>
> On Thu, Oct 2, 2014 at 4:15 PM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
>
> I believe this is known as the "Hadoop Small Files Problem", and it affects Spark as well. The best approach I've seen to merging small files like this is to use s3distcp, as suggested here, as a pre-processing step.
>
> It would be great if Spark could somehow handle this common situation out of the box, but for now I don't think it does.
>
> Nick
>
> On Thu, Oct 2, 2014 at 7:10 PM, Landon Kuhn <lan...@janrain.com> wrote:
>
> Hello, I'm trying to use Spark to process a large number of files in S3. I'm running into an issue that I believe is related to the high number of files and the resources required to build the listing within the driver program. If anyone in the Spark community can provide insight or guidance, it would be greatly appreciated.
>
> The task at hand is to read ~100 million files stored in S3 and repartition the data into a sensible number of files (perhaps 1,000). The files are organized in a directory structure like so:
>
> s3://bucket/event_type/year/month/day/hour/minute/second/customer_id/file_name
>
> (Note that each file is very small, containing 1-10 records. Unfortunately this is an artifact of the upstream systems that put data in S3.)
>
> My Spark program is simple, and works when I target a relatively specific subdirectory. For example:
>
> sparkContext.textFile("s3n://bucket/purchase/2014/01/01/00/*/*/*/*").coalesce(...).write(...)
>
> This targets one hour's worth of purchase records, containing about 10,000 files. The driver program blocks (I assume it is making S3 calls to traverse the directories), and during this time no activity is visible in the driver UI. After about a minute, the stages and tasks appear in the UI, and then everything progresses and completes within a few minutes.
>
> I need to process all the data (several years' worth). Something like:
>
> sparkContext.textFile("s3n://bucket/*/*/*/*/*/*/*/*/*").coalesce(...).write(...)
>
> This blocks "forever" (I have only run the program for as long as overnight). The stages and tasks never appear in the UI. I assume Spark is building the file listing, which will either take too long and/or cause the driver to eventually run out of memory.
>
> I would appreciate any comments or suggestions. I'm happy to provide more information if that would be helpful.
>
> Thanks
>
> Landon
>
> --
> Landon Kuhn, Software Architect, Janrain, Inc.