I believe this is known as the "Hadoop Small Files Problem", and it affects
Spark as well. The best approach I've seen to merging small files like this
is to use s3distcp as a pre-processing step, as suggested here:
<http://snowplowanalytics.com/blog/2013/05/30/dealing-with-hadoops-small-files-problem/>.
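
As a rough sketch, on EMR an invocation along these lines would compact one
hour's worth of files into a handful of larger ones before the Spark job runs
(the jar location, paths, target size, and --groupBy regex here are
illustrative assumptions, not exact values for your setup):

  hadoop jar /home/hadoop/lib/emr-s3distcp-1.0.jar \
    --src  s3://bucket/purchase/2014/01/01/00/ \
    --dest s3://bucket/compacted/purchase/2014/01/01/00/ \
    --groupBy '.*/(2014)/(01)/(01)/(00)/.*' \
    --targetSize 128 \
    --outputCodec gzip

Files whose paths share the same --groupBy capture groups get merged into a
single output file of roughly --targetSize MB, so Spark then only has to list
and read a few thousand objects instead of millions.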

It would be great if Spark could somehow handle this common situation out
of the box, but for now I don't think it does.

Nick

On Thu, Oct 2, 2014 at 7:10 PM, Landon Kuhn <lan...@janrain.com> wrote:

> Hello, I'm trying to use Spark to process a large number of files in S3.
> I'm running into an issue that I believe is related to the high number of
> files, and the resources required to build the listing within the driver
> program. If anyone in the Spark community can provide insight or guidance,
> it would be greatly appreciated.
>
> The task at hand is to read ~100 million files stored in S3, and
> repartition the data into a sensible number of files (perhaps 1,000). The
> files are organized in a directory structure like so:
>
>
> s3://bucket/event_type/year/month/day/hour/minute/second/customer_id/file_name
>
> (Note that each file is very small, containing 1-10 records. Unfortunately,
> this is an artifact of the upstream systems that put data in S3.)
>
> My Spark program is simple, and works when I target a relatively specific
> subdirectory. For example:
>
>
> sparkContext.textFile("s3n://bucket/purchase/2014/01/01/00/*/*/*/*").coalesce(...).write(...)
>
> This targets one hour's worth of purchase records, containing about 10,000
> files. The driver program blocks (I assume it is making S3 calls to
> traverse the directories), and during this time no activity is visible in
> the driver UI. After about a minute, the stages and tasks appear in the
> UI, and then everything progresses and completes within a few minutes.
>
> I need to process all the data (several years' worth). Something like:
>
>
> sparkContext.textFile("s3n://bucket/*/*/*/*/*/*/*/*/*").coalesce(...).write(...)
>
> This blocks "forever" (I have only run the program for as long as
> overnight). The stages and tasks never appear in the UI. I assume Spark is
> building the file listing, which will either take too long and/or cause the
> driver to eventually run out of memory.
>
> I would appreciate any comments or suggestions. I'm happy to provide more
> information if that would be helpful.
>
> Thanks
>
> Landon
>
>
