Nicholas, thanks for the tip. Your suggestion certainly seemed like the right approach, but after a few days of fiddling I've come to the conclusion that s3distcp will not work for my use case. It is unable to flatten directory hierarchies, which I need because my source directories contain hour/minute/second parts.
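For anyone following along, this is roughly the shape of the s3distcp invocation I was experimenting with on EMR. The bucket, --groupBy pattern, and target size here are illustrative placeholders rather than my actual values, and the jar path will vary by EMR version:

    hadoop jar /home/hadoop/lib/emr-s3distcp-1.0.jar \
      --src s3n://bucket/purchase/2014/01/01/ \
      --dest s3n://bucket/merged/purchase/2014/01/01/ \
      --groupBy '.*/(purchase)/2014/01/01/.*' \
      --targetSize 128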
See https://forums.aws.amazon.com/message.jspa?messageID=478960. It seems that s3distcp can only combine files within the same path. Thanks again, that gave me a lot to go on. Any further suggestions? (For reference, I've appended a minimal sketch of the Spark job itself at the bottom of this message.)

L

On Thu, Oct 2, 2014 at 4:15 PM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:

> I believe this is known as the "Hadoop Small Files Problem", and it affects Spark as well. The best approach I've seen to merging small files like this is to use s3distcp, as suggested here <http://snowplowanalytics.com/blog/2013/05/30/dealing-with-hadoops-small-files-problem/>, as a pre-processing step.
>
> It would be great if Spark could somehow handle this common situation out of the box, but for now I don't think it does.
>
> Nick
>
> On Thu, Oct 2, 2014 at 7:10 PM, Landon Kuhn <lan...@janrain.com> wrote:
>
>> Hello, I'm trying to use Spark to process a large number of files in S3. I'm running into an issue that I believe is related to the high number of files and the resources required to build the listing within the driver program. If anyone in the Spark community can provide insight or guidance, it would be greatly appreciated.
>>
>> The task at hand is to read ~100 million files stored in S3 and repartition the data into a sensible number of files (perhaps 1,000). The files are organized in a directory structure like so:
>>
>>     s3://bucket/event_type/year/month/day/hour/minute/second/customer_id/file_name
>>
>> (Note that each file is very small, containing 1-10 records. Unfortunately this is an artifact of the upstream systems that put data in S3.)
>>
>> My Spark program is simple, and it works when I target a relatively specific subdirectory. For example:
>>
>>     sparkContext.textFile("s3n://bucket/purchase/2014/01/01/00/*/*/*/*").coalesce(...).write(...)
>>
>> This targets one hour's worth of purchase records, about 10,000 files. The driver program blocks (I assume it is making S3 calls to traverse the directories), and during this time no activity is visible in the driver UI. After about a minute, the stages and tasks appear in the UI, and then everything progresses and completes within a few minutes.
>>
>> I need to process all the data (several years' worth). Something like:
>>
>>     sparkContext.textFile("s3n://bucket/*/*/*/*/*/*/*/*/*").coalesce(...).write(...)
>>
>> This blocks "forever" (I have only run the program for as long as overnight). The stages and tasks never appear in the UI. I assume Spark is building the file listing, which will either take too long or cause the driver to eventually run out of memory.
>>
>> I would appreciate any comments or suggestions. I'm happy to provide more information if that would be helpful.
>>
>> Thanks
>>
>> Landon

--
Landon Kuhn, Software Architect, Janrain, Inc.
E: lan...@janrain.com | M: 971-645-5501 | F: 888-267-9025
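For reference, here is a minimal, self-contained sketch of the job described in the quoted message. The bucket name, output path, and partition count are placeholders rather than my real values, and I've used saveAsTextFile where the snippets above say .write(...):

    import org.apache.spark.{SparkConf, SparkContext}

    object RepartitionS3Events {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("repartition-s3-events"))

        // One hour's worth of purchase records; widening this glob to cover
        // all event types and years is what triggers the endless listing.
        val input  = "s3n://bucket/purchase/2014/01/01/00/*/*/*/*"
        val output = "s3n://bucket/merged/purchase/2014/01/01/00"

        sc.textFile(input)
          .coalesce(1000)         // collapse the many tiny input splits into ~1,000 partitions
          .saveAsTextFile(output)

        sc.stop()
      }
    }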