Nicholas, thanks for the tip. Your suggestion certainly seemed like the
right approach, but after a few days of fiddling I've come to the
conclusion that s3distcp will not work for my use case. It is unable to
flatten directory hierarchies, which I need because my source directories
contain hour/minute/second parts.

See https://forums.aws.amazon.com/message.jspa?messageID=478960. It seems
that s3distcp can only combine files in the same path.
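
In the meantime, the rough direction I'm experimenting with (not what you
suggested, just a workaround sketch) is to take the listing off the driver
entirely: generate the event-type/hour prefixes up front, then have the
executors do the S3 listing and reads themselves via the AWS Java SDK.
Assuming the SDK is on the classpath, and with the bucket name, event types,
and date ranges purely illustrative:

  import com.amazonaws.services.s3.AmazonS3Client
  import scala.collection.JavaConverters._
  import scala.io.Source

  val bucket = "bucket"                    // illustrative
  val eventTypes = Seq("purchase")         // illustrative; add the rest

  // One prefix per event-type/hour, e.g. "purchase/2014/01/01/00/".
  // Prefixes that don't exist just list as empty, so over-generating is fine.
  val hourPrefixes = for {
    eventType <- eventTypes
    year      <- Seq("2013", "2014")
    month     <- (1 to 12).map(m => f"$m%02d")
    day       <- (1 to 31).map(d => f"$d%02d")
    hour      <- (0 to 23).map(h => f"$h%02d")
  } yield s"$eventType/$year/$month/$day/$hour/"

  val records = sparkContext
    .parallelize(hourPrefixes, hourPrefixes.size)
    .flatMap { prefix =>
      // Each task lists and reads its own slice of the keyspace, so the
      // driver never materializes the full file listing.
      val s3 = new AmazonS3Client()        // credentials from the environment
      val keys = scala.collection.mutable.Buffer.empty[String]
      var listing = s3.listObjects(bucket, prefix)
      keys ++= listing.getObjectSummaries.asScala.map(_.getKey)
      while (listing.isTruncated) {
        listing = s3.listNextBatchOfObjects(listing)
        keys ++= listing.getObjectSummaries.asScala.map(_.getKey)
      }
      keys.flatMap { key =>
        val in = s3.getObject(bucket, key).getObjectContent
        try Source.fromInputStream(in).getLines().toList finally in.close()
      }
    }

  records.coalesce(1000).saveAsTextFile("s3n://bucket/flattened/")

The listing still takes a while in aggregate, but it happens in parallel and
the driver stays small. I have no idea yet whether this holds up at the full
~100 million files.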

Thanks again. That gave me a lot to go on. Any further suggestions?

L


On Thu, Oct 2, 2014 at 4:15 PM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:

> I believe this is known as the "Hadoop Small Files Problem", and it
> affects Spark as well. The best approach I've seen to merging small files
> like this is by using s3distcp, as suggested here
> <http://snowplowanalytics.com/blog/2013/05/30/dealing-with-hadoops-small-files-problem/>,
> as a pre-processing step.
>
> It would be great if Spark could somehow handle this common situation out
> of the box, but for now I don't think it does.
>
> Nick
>
> On Thu, Oct 2, 2014 at 7:10 PM, Landon Kuhn <lan...@janrain.com> wrote:
>
>> Hello, I'm trying to use Spark to process a large number of files in S3.
>> I'm running into an issue that I believe is related to the high number of
>> files, and the resources required to build the listing within the driver
>> program. If anyone in the Spark community can provide insight or guidance,
>> it would be greatly appreciated.
>>
>> The task at hand is to read ~100 million files stored in S3, and
>> repartition the data into a sensible number of files (perhaps 1,000). The
>> files are organized in a directory structure like so:
>>
>>
>> s3://bucket/event_type/year/month/day/hour/minute/second/customer_id/file_name
>>
>> (Note that each file is very small, containing only 1-10 records.
>> Unfortunately, this is an artifact of the upstream systems that put data
>> in S3.)
>>
>> My Spark program is simple, and works when I target a relatively specific
>> subdirectory. For example:
>>
>>
>> sparkContext.textFile("s3n://bucket/purchase/2014/01/01/00/*/*/*/*").coalesce(...).write(...)
>>
>> This targets 1 hour's worth of purchase records, containing about 10,000
>> files. The driver program blocks (I assume it is making S3 calls to
>> traverse the directories), and during this time no activity is visible in
>> the driver UI. After about a minute, the stages and tasks appear in the
>> UI, and then everything progresses and completes within a few minutes.
>>
>> I need to process all the data (several years' worth). Something like:
>>
>>
>> sparkContext.textFile("s3n://bucket/*/*/*/*/*/*/*/*/*").coalesce(...).write(...)
>>
>> This blocks "forever" (the longest I have let it run is overnight). The
>> stages and tasks never appear in the UI. I assume Spark is building the
>> file listing, which will either take far too long or eventually cause the
>> driver to run out of memory.
>>
>> I would appreciate any comments or suggestions. I'm happy to provide more
>> information if that would be helpful.
>>
>> Thanks
>>
>> Landon
>>
>>
>


-- 
*Landon Kuhn*, *Software Architect*, Janrain, Inc. <http://bit.ly/cKKudR>
E: lan...@janrain.com | M: 971-645-5501 | F: 888-267-9025
