So setting minSplits <http://spark.incubator.apache.org/docs/latest/api/pyspark/pyspark.context.SparkContext-class.html#textFile> will set the parallelism on the read in SparkContext.textFile(), assuming I have the cores in the cluster to deliver that level of parallelism. And if I don't explicitly provide it, Spark will set minSplits to 2.
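For concreteness, here's roughly what I mean in PySpark (an untested sketch; the bucket path and split count are made up):

    from pyspark import SparkContext

    sc = SparkContext()

    # Ask Spark for at least 4 input splits; actual parallelism is
    # still capped by the number of cores available in the cluster.
    lines = sc.textFile("s3n://my-bucket/big-file.txt", minSplits=4)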
So, for example, say I have a cluster with 4 cores total, and it takes 40 minutes to read a single file from S3 with minSplits at 2. It should take roughly 20 minutes to read the same file if I up minSplits to 4. Did I understand that correctly?

RDD.saveAsTextFile() doesn't have an analog to minSplits, so I'm guessing that's not an operation the user can tune (see the sketch at the bottom of this message for the kind of thing I was imagining).

On Mon, Mar 31, 2014 at 12:29 PM, Aaron Davidson <ilike...@gmail.com> wrote:

> Spark will only use each core for one task at a time, so doing
>
>     sc.textFile(<s3 location>, <num reducers>)
>
> where you set "num reducers" to at least as many as the total number of
> cores in your cluster, is about as fast as you can get out of the box.
> Same goes for saveAsTextFile.
>
>
> On Mon, Mar 31, 2014 at 8:49 AM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> Howdy-doody,
>>
>> I have a single, very large file sitting in S3 that I want to read in
>> with sc.textFile(). What are the best practices for reading in this file
>> as quickly as possible? How do I parallelize the read as much as possible?
>>
>> Similarly, say I have a single, very large RDD sitting in memory that I
>> want to write out to S3 with RDD.saveAsTextFile(). What are the best
>> practices for writing this file out as quickly as possible?
>>
>> Nick
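P.S. If the write side does turn out to be tunable after all, I'm guessing it would look something like this (an untested sketch, assuming a Spark version where RDD.repartition() is available in PySpark; the output path and partition count are made up):

    # saveAsTextFile() writes one part-NNNNN file per partition of the
    # RDD, so repartitioning first should control the write parallelism.
    rdd.repartition(8).saveAsTextFile("s3n://my-bucket/output")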