The problem is that listing the metadata for all these files in S3 takes a long 
time. Something you can try is the following: split your files into several 
non-overlapping paths (e.g. s3n://bucket/purchase/2014/01, 
s3n://bucket/purchase/2014/02, etc.), then do sc.parallelize over a list of such 
paths, and in each task use a single-node S3 library to list the contents of 
that directory only and read them. You can use Hadoop's FileSystem class, for 
example (FileSystem.open("s3n://...") or something like that). That way more 
nodes will be querying the metadata for these files in parallel.
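
A rough, untested sketch of what that could look like from the Spark shell 
(assuming the workers have S3 credentials in their Hadoop configuration; the 
bucket and paths below are made up):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}
    import scala.io.Source

    // One non-overlapping prefix per month (hypothetical layout).
    val prefixes = (1 to 12).map(m => f"s3n://bucket/purchase/2014/$m%02d")

    val records = sc.parallelize(prefixes, prefixes.size).flatMap { dir =>
      val path = new Path(dir)
      // Each task lists and reads only its own directory, so the S3
      // metadata calls happen in parallel across the cluster.
      val fs = path.getFileSystem(new Configuration())
      fs.listStatus(path).iterator.filter(_.isFile).flatMap { status =>
        val in = fs.open(status.getPath)
        try Source.fromInputStream(in).getLines().toList
        finally in.close()
      }
    }

If the prefixes contain further subdirectories (as in your layout), you would 
recurse or parallelize over the leaf directories instead.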

Matei

On Oct 6, 2014, at 12:59 PM, Nicholas Chammas <nicholas.cham...@gmail.com> 
wrote:

> Unfortunately not. Again, I wonder if adding support targeted at this "small 
> files problem" would make sense for Spark core, as it is a common problem in 
> our space.
> 
> Right now, I don't know of any other options.
> 
> Nick
> 
> 
> On Mon, Oct 6, 2014 at 2:24 PM, Landon Kuhn <lan...@janrain.com> wrote:
> Nicholas, thanks for the tip. Your suggestion certainly seemed like the right 
> approach, but after a few days of fiddling I've come to the conclusion that 
> s3distcp will not work for my use case. It is unable to flatten directory 
> hierarchies, which I need because my source directories contain 
> hour/minute/second parts.
> 
> See https://forums.aws.amazon.com/message.jspa?messageID=478960. It seems 
> that s3distcp can only combine files in the same path.
> 
> Thanks again. That gave me a lot to go on. Any further suggestions?
> 
> L
> 
> 
> On Thu, Oct 2, 2014 at 4:15 PM, Nicholas Chammas <nicholas.cham...@gmail.com> 
> wrote:
> I believe this is known as the "Hadoop Small Files Problem", and it affects 
> Spark as well. The best approach I've seen to merging small files like this 
> is by using s3distcp, as suggested here, as a pre-processing step.
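> 
> For reference, a typical s3distcp invocation on EMR looks something like this 
> (the jar location, paths, and regex below are placeholders; check the EMR 
> docs for the exact options on your cluster):
> 
>     hadoop jar /home/hadoop/lib/emr-s3distcp-1.0.jar \
>       --src s3n://bucket/purchase/2014/01/ \
>       --dest s3n://bucket/merged/purchase/2014/01/ \
>       --groupBy '.*/(\d+/\d+/\d+)/.*' \
>       --targetSize 128
> 
> Files whose paths match the same --groupBy capture group get concatenated 
> into output files of roughly --targetSize megabytes.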
> 
> It would be great if Spark could somehow handle this common situation out of 
> the box, but for now I don't think it does.
> 
> Nick
> 
> On Thu, Oct 2, 2014 at 7:10 PM, Landon Kuhn <lan...@janrain.com> wrote:
> Hello, I'm trying to use Spark to process a large number of files in S3. I'm 
> running into an issue that I believe is related to the high number of files, 
> and the resources required to build the listing within the driver program. If 
> anyone in the Spark community can provide insight or guidance, it would be 
> greatly appreciated.
> 
> The task at hand is to read ~100 million files stored in S3, and repartition 
> the data into a sensible number of files (perhaps 1,000). The files are 
> organized in a directory structure like so:
> 
>     s3://bucket/event_type/year/month/day/hour/minute/second/customer_id/file_name
> 
> (Note that each file is very small, containing 1-10 records. Unfortunately, 
> this is an artifact of the upstream systems that put data in S3.)
> 
> My Spark program is simple, and works when I target a relatively specific 
> subdirectory. For example:
> 
>     sparkContext.textFile("s3n://bucket/purchase/2014/01/01/00/*/*/*/*").coalesce(...).write(...)
> 
> This targets one hour's worth of purchase records, containing about 10,000 
> files. The driver program blocks (I assume it is making S3 calls to traverse 
> the directories), and during this time no activity is visible in the driver 
> UI. After about a minute, the stages and tasks appear in the UI, and then 
> everything progresses and completes within a few minutes.
> 
> I need to process all the data (several years' worth). Something like:
> 
>     sparkContext.textFile("s3n://bucket/*/*/*/*/*/*/*/*/*").coalesce(...).write(...)
> 
> This blocks "forever" (the longest I have let it run is overnight). The 
> stages and tasks never appear in the UI. I assume Spark is building the file 
> listing, which either takes too long or eventually causes the driver to run 
> out of memory.
> 
> I would appreciate any comments or suggestions. I'm happy to provide more 
> information if that would be helpful.
> 
> Thanks
> 
> Landon
> 
> 
> 
> 
> 
> -- 
> Landon Kuhn, Software Architect, Janrain, Inc.
> E: lan...@janrain.com | M: 971-645-5501 | F: 888-267-9025
> 
