Yes, it does read it in parallel, based on the input format you use (e.g. text file, SequenceFile, etc.). By default it uses 32 MB blocks. All of this just goes through Hadoop's S3 library, so anything Hadoop can do can be done here.
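For example, a minimal sketch of reading from S3 (assuming the Spark 0.8-era Scala API, placeholder bucket/master names, and that your AWS credentials are already configured for Hadoop via fs.s3n.awsAccessKeyId / fs.s3n.awsSecretAccessKey):

import org.apache.spark.SparkContext

object S3ReadExample {
  def main(args: Array[String]): Unit = {
    // "spark://master:7077" and the bucket/path below are placeholders
    val sc = new SparkContext("spark://master:7077", "S3ReadExample")

    // Hadoop's S3 filesystem splits the object into blocks, so each
    // partition is fetched by a worker in parallel
    val lines = sc.textFile("s3n://my-bucket/path/to/file.txt")

    // You can also request a minimum number of splits explicitly
    val moreSplits = sc.textFile("s3n://my-bucket/path/to/file.txt", 64)

    println("line count: " + lines.count())
    sc.stop()
  }
}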
Matei

On Oct 23, 2013, at 6:36 PM, Ankur Chauhan <[email protected]> wrote:

> Just a follow up question. How does the spark task/job/master know how to
> split the file that is in s3. In most cases, it would be better to fetch
> different parts of the file in parallel. Is that something that is done by
> the workers?
>
> On Oct 23, 2013, at 18:28, Ayush Mishra <[email protected]> wrote:
>
>> You can check
>> http://blog.knoldus.com/2013/09/09/running-standalone-scala-job-on-amazon-ec2-spark-cluster/.
>>
>> On Thu, Oct 24, 2013 at 6:54 AM, Nan Zhu <[email protected]> wrote:
>> Great!!!
>>
>> On Wed, Oct 23, 2013 at 9:21 PM, Matei Zaharia <[email protected]> wrote:
>> Yes, take a look at
>> http://spark.incubator.apache.org/docs/latest/ec2-scripts.html#accessing-data-in-s3
>>
>> Matei
>>
>> On Oct 23, 2013, at 6:17 PM, Nan Zhu <[email protected]> wrote:
>>
>>> Hi, all
>>>
>>> Is there any solution running Spark with Amazon S3?
>>>
>>> Best,
>>>
>>> Nan
