> everything works best if your sources are a few tens to hundreds of MB or more
Are you referring to the size of the zip file or individual unzipped files?
Any issues with storing a 60 MB zipped file containing heaps of text files
inside?

On 11 Apr. 2017 9:09 pm, "Steve Loughran" <ste...@hortonworks.com> wrote:

> On 11 Apr 2017, at 11:07, Zeming Yu <zemin...@gmail.com> wrote:
>
> > Hi all,
> >
> > I'm a beginner with spark, and I'm wondering if someone could provide
> > guidance on the following 2 questions I have.
> >
> > Background: I have a data set growing by 6 TB p.a. I plan to use spark
> > to read in all the data, manipulate it and build a predictive model on
> > it (say GBM). I plan to store the data in S3, and use EMR to launch
> > spark, reading in data from S3.
> >
> > 1. Which option is best for storing the data on S3 for the purpose of
> > analysing it in EMR spark?
> > Option A: storing the 6TB file as 173 million individual text files
> > Option B: zipping up the above 173 million text files as 240,000 zip files
> > Option C: appending the individual text files, so have 240,000 text files p.a.
> > Option D: combining the text files even further
>
> Everything works best if your sources are a few tens to hundreds of MB or
> more of your data; work can be partitioned up by file. If you use more
> structured formats (Avro compressed with Snappy, etc.), you can throw more
> than one executor at the work inside a file. Structure is handy all round,
> even if it's just adding timestamp and provenance columns to each data file.
>
> There's the HAR file format from Hadoop, which can merge lots of small
> files into larger ones, allowing work to be scheduled per HAR file. It's
> recommended for HDFS, as HDFS hates small files; on S3 you still have
> limits on small files (including throttling of HTTP requests to shards of
> a bucket), but they are less significant.
>
> One thing to be aware of is that the S3 clients Spark uses are very
> inefficient at listing wide directory trees, and Spark is not always the
> best at partitioning work because of this. You can accidentally create a
> very inefficient tree structure like
> datasets/year=2017/month=5/day=10/hour=12/, with only one file per hour.
> Listing and partitioning suffer here, and while s3a on Hadoop 2.8 is
> better, Spark hasn't yet fully adapted to those changes (use of specific
> API calls). There's also a lot more to be done in S3A to handle wildcards
> in the directory tree much more efficiently (HADOOP-13204); it needs to
> address patterns like datasets/year=201?/month=*/day=10 without
> treewalking and without fetching too much data from wildcards near the
> top of the tree. We need to avoid implementing something which works well
> on *my* layouts but absolutely dies on other people's. As is usual in
> OSS, help is welcome; early testing here is as critical as coding, so as
> to ensure things will work with your file structures.
>
> -Steve
>
> > 2. Any recommendations on the EMR set up to analyse the 6TB of data all
> > at once and build a GBM, in terms of
> > 1) The type of EC2 instances I need?
> > 2) The number of such instances I need?
> > 3) Rough estimate of cost?
>
> No opinion there.
>
> > Thanks so much,
> > Zeming
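
On the zip question at the top of the thread: Spark cannot split work inside
a .zip archive, so each zipped bundle is read by a single task and
parallelism comes only from having many archives. A minimal PySpark sketch of
expanding such archives into lines, with the s3a:// path and UTF-8 encoding
as assumptions rather than anything stated in the thread:

# Sketch only (PySpark; the s3a:// path and UTF-8 encoding are assumptions):
# expand zip archives of small text files into an RDD of lines. Spark cannot
# split inside a .zip, so each archive is one task; parallelism comes from
# having many archives, not from within a single 60 MB zip.
import io
import zipfile

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-zipped-text").getOrCreate()
sc = spark.sparkContext

def unzip_to_lines(path_and_bytes):
    """Yield every text line from every member file of one zip archive."""
    _path, payload = path_and_bytes
    with zipfile.ZipFile(io.BytesIO(payload)) as zf:
        for name in zf.namelist():
            if name.endswith("/"):          # skip directory entries
                continue
            with zf.open(name) as member:
                for raw_line in member:
                    yield raw_line.decode("utf-8", errors="replace").rstrip("\n")

# binaryFiles gives (path, bytes) pairs, one pair per archive.
lines = sc.binaryFiles("s3a://my-bucket/archives/*.zip").flatMap(unzip_to_lines)
print(lines.take(5))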
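
On the structured-format suggestion: Steve names Avro compressed with Snappy;
the sketch below uses Snappy-compressed Parquet instead, which is also
splittable and ships with Spark, and adds the timestamp and provenance
columns he mentions. The bucket, paths and repartition count are illustrative
assumptions:

# Sketch only (PySpark; bucket, paths and the repartition count are made up):
# rewrite many small text files as Snappy-compressed Parquet, adding the
# timestamp/provenance columns suggested above, and aim for a modest number
# of output files in the tens-to-hundreds-of-MB range.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("compact-to-parquet").getOrCreate()

raw = spark.read.text("s3a://my-bucket/raw/2017/*.txt")   # one row per line

df = (raw
      .withColumn("ingest_ts", F.current_timestamp())     # provenance:
      .withColumn("source_file", F.input_file_name()))    # when and from where

(df.repartition(200)         # 200 output files; tune so each lands ~100s of MB
   .write
   .option("compression", "snappy")
   .mode("overwrite")
   .parquet("s3a://my-bucket/datasets/year=2017"))

Keeping the output tree shallow (year=, perhaps month=) rather than going all
the way down to hour= also keeps the S3 directory listings described above
cheap.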
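
And for reading such a layout back with a wildcard of the same shape as the
HADOOP-13204 example, again with made-up paths (a wide glob like this is
exactly the case where S3 listing gets expensive today):

# Sketch only (paths are made up): reading a partitioned tree back with a
# Hadoop-style glob, the same shape as the HADOOP-13204 example above.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("glob-read").getOrCreate()

day10 = (spark.read
         .option("basePath", "s3a://my-bucket/datasets")  # keep partition columns
         .parquet("s3a://my-bucket/datasets/year=201?/month=*/day=10"))
print(day10.count())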