Note that you can set minSplits to more than the number of cores in the
cluster; Spark will simply run as many tasks at a time as there are cores
and queue the rest. That can actually be better if, for instance, certain
nodes are slow, since the faster nodes can pick up the remaining tasks.

In general, it is not necessarily the case that doubling the number of
cores doing IO will double the throughput, because you may already be
saturating the network link with fewer cores. However, S3 is odd in that
each connection gets far less bandwidth than your network link can provide,
and throughput does seem to scale roughly linearly with the number of
connections. So, yes, taking minSplits up to 4 (or higher) will likely
result in about a 2x performance improvement over the default of 2.
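
For instance, a minimal sketch (the bucket path and split count here are
made up, and this assumes sc is an existing SparkContext, e.g. from the
pyspark shell):

# Ask for more splits than the default of 2 so that more cores can pull
# from S3 concurrently; splits beyond the core count simply queue up.
lines = sc.textFile("s3n://my-bucket/big-file.txt", 16)
print(lines.count())  # forces the read so you can time it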

saveAsTextFile() will use as many partitions (aka splits) as the RDD it's
being called on has. So, for instance:

sc.textFile(myInputFile, 15) \
  .map(lambda x: x + "!!!") \
  .saveAsTextFile(myOutputFile)

will use 15 partitions (i.e., up to 15 cores at a time) to read the text
file, and the same 15 partitions again when saving back to S3.
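
If you want the write to use a different number of partitions than the
read, one option is to repartition first. A rough sketch (the S3 paths are
placeholders, and this assumes your PySpark version exposes
RDD.repartition(); coalesce() works similarly when reducing partitions):

rdd = sc.textFile("s3n://my-bucket/input.txt", 15)  # 15 read splits
out = rdd.map(lambda x: x + "!!!")                  # still 15 partitions
# Reshuffle into 60 partitions so the save can use more cores and
# produces 60 part files in the output directory.
out.repartition(60).saveAsTextFile("s3n://my-bucket/output")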



On Mon, Mar 31, 2014 at 9:46 AM, Nicholas Chammas <
nicholas.cham...@gmail.com> wrote:

> So setting minSplits
> <http://spark.incubator.apache.org/docs/latest/api/pyspark/pyspark.context.SparkContext-class.html#textFile>
> will set the parallelism on the read in SparkContext.textFile(), assuming
> I have the cores in the cluster to deliver that level of parallelism. And
> if I don't explicitly provide it, Spark will set the minSplits to 2.
>
> So for example, say I have a cluster with 4 cores total, and it takes 40
> minutes to read a single file from S3 with minSplits at 2. It should take
> roughly 20 minutes to read the same file if I up minSplits to 4.
>
> Did I understand that correctly?
>
> RDD.saveAsTextFile() doesn't have an analog to minSplits, so I'm guessing
> that's not an operation the user can tune.
>
>
> On Mon, Mar 31, 2014 at 12:29 PM, Aaron Davidson <ilike...@gmail.com> wrote:
>
>> Spark will only use each core for one task at a time, so doing
>>
>> sc.textFile(<s3 location>, <num reducers>)
>>
>> where you set "num reducers" to at least as many as the total number of
>> cores in your cluster, is about as fast as you can get out of the box. Same
>> goes for saveAsTextFile.
>>
>>
>> On Mon, Mar 31, 2014 at 8:49 AM, Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>>> Howdy-doody,
>>>
>>> I have a single, very large file sitting in S3 that I want to read in
>>> with sc.textFile(). What are the best practices for reading in this file as
>>> quickly as possible? How do I parallelize the read as much as possible?
>>>
>>> Similarly, say I have a single, very large RDD sitting in memory that I
>>> want to write out to S3 with RDD.saveAsTextFile(). What are the best
>>> practices for writing this file out as quickly as possible?
>>>
>>> Nick
>>>
>>>
>>
>>
>
