I have run Hadoop + Spark jobs on large s3n files without issue. That said, if you have very large files you might want to consider using s3:// instead, as that uses an HDFS-compatible block storage format, which means your large file can be split more effectively between map tasks.
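To make the difference concrete, here is a minimal sketch of what switching schemes looks like from the Spark side (the bucket names and path are made up, and this assumes AWS credentials are already set in the Hadoop configuration):

    import org.apache.spark.SparkContext

    val sc = new SparkContext("local[*]", "s3-scheme-demo")

    // s3n:// reads each S3 object as a single plain file over the network.
    val viaS3n = sc.textFile("s3n://my-bucket/big-input.log")

    // s3:// is the block-based filesystem: data is stored in HDFS-style
    // blocks inside the bucket, which (per the advice above) splits more
    // effectively across map tasks - at the cost that the objects are no
    // longer readable as plain files in the AWS console.
    val viaS3Block = sc.textFile("s3://my-bucket-blockfs/big-input.log")

    println(viaS3n.count())
    println(viaS3Block.count())

Only the URL scheme changes; the rest of the job is identical.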
In my experience I also had reliability issues with jobs failing due to read problems when using s3n with large files. These issues went away when switching to s3://. The downside, of course, is that you can no longer view files written with s3:// in the AWS console, which means you need to use an HDFS-compatible viewing tool such as the hdfs command-line utility.

-Ryan

On Sun, Jan 26, 2014 at 7:58 PM, Ognen Duzlevski <[email protected]> wrote:

> I have run Spark jobs on multiple 20GB+ files (groupByKey() on filtered
> contents of these files) via s3n:// and it all worked. Well, if you
> consider taking forever to read in 20GB worth of a file over a network
> connection (which is the limiting factor in this scenario) as "worked".
>
> I quickly realized that the best thing is to set up a Hadoop cluster (I
> have a name node running with a bunch of data nodes on the same nodes as
> the Spark cluster) using the ephemeral space on each node for speed.
> Running the same jobs on the same 20GB files in this setup is factors
> faster than over s3n; I am talking a few seconds to read in the files in a
> 16-node cluster.
>
> You can pick the m1.xlarge instance for this (or any other instance that
> offers lots of ephemeral disk space); it comes with 1.6TB of ephemeral
> disk in 4x400GB partitions - you can put these in a RAID0 stripe
> configuration to create one device you can put in your HDFS pool. If you
> take a 10+ node cluster, this adds up to quite a lot of local space. If a
> machine goes down the ephemeral space goes with it, but you can set the
> replication factor in Hadoop so you are covered. Of course I do not rely
> on the ephemeral space for real persistence, but for transient
> calculations it is great as a cache for jobs that you would otherwise run
> on S3 or EBS.
>
> HDFS is one of the rare truly free distributed parallel filesystems out
> there. I did not have the time to spend 3 months learning how Lustre
> works ;) or the money to pay IBM for GPFS, so the only thing really left
> is HDFS.
>
> Ognen
>
>
> On Sun, Jan 26, 2014 at 8:18 PM, kamatsuoka <[email protected]> wrote:
>
>> The Hadoop docs about S3 <http://wiki.apache.org/hadoop/AmazonS3>
>> (linked to by the Spark docs) say that s3n files are subject to "the
>> 5GB limit on file size imposed by S3." However, that limit was raised
>> <http://www.computerworld.com/s/article/9200763/Amazon_s_S3_can_now_store_files_of_up_to_5TB>
>> about three years ago, so it wasn't clear to me whether it still
>> applies to Hadoop's s3n URLs.
>>
>> Well, I tried running a Spark job on a 200GB s3n file, and it ran fine.
>> Has this been other people's experience?
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/s3n-5GB-tp943.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
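For reference, the kind of job described in the thread above (filter the lines of a large file, then groupByKey) looks roughly like this once the data lives on the local HDFS cluster; the namenode address, key extraction, and filter predicate are all hypothetical placeholders:

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._  // pair-RDD implicits (pre-1.0 style), needed for groupByKey

    val sc = new SparkContext("spark://master:7077", "groupByKey-demo")

    // Reading from the local HDFS cluster instead of s3n:// is where the
    // speedup described above comes from; the computation is unchanged.
    val lines = sc.textFile("hdfs://namenode:9000/data/big-input.log")

    val grouped = lines
      .filter(_.nonEmpty)                   // hypothetical filter predicate
      .map { line =>
        val key = line.takeWhile(_ != '\t') // hypothetical key extraction
        (key, line)
      }
      .groupByKey()

    println(grouped.count())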
