Re: Best practices: Parallelized write to / read from S3

Nicholas Chammas Mon, 31 Mar 2014 10:17:20 -0700

OK sweet. Thanks for walking me through that.

I wish this were StackOverflow so I could bestow some nice rep on all you
helpful people.



On Mon, Mar 31, 2014 at 1:06 PM, Aaron Davidson <ilike...@gmail.com> wrote:

> Note that you may have minSplits set to more than the number of cores in
> the cluster, and Spark will just run as many as possible at a time. This is
> better if certain nodes may be slow, for instance.
>
> In general, it is not necessarily the case that doubling the number of
> cores doing IO will double the throughput, because you could be saturating
> the throughput with fewer cores. However, S3 is odd in that each connection
> gets way less bandwidth than your network link can provide, and it does
> seem to scale linearly with the number of connections. So, yes, taking
> minSplits up to 4 (or higher) will likely result in a 2x performance
> improvement.
>
> saveAsTextFile() will use as many partitions (aka splits) as the RDD it's
> being called on. So for instance:
>
> sc.textFile(myInputFile, 15).map(lambda x: x +
> "!!!").saveAsTextFile(myOutputFile)
>
> will use 15 partitions to read the text file (i.e., up to 15 cores at a
> time) and then again to save back to S3.
>
>
>
> On Mon, Mar 31, 2014 at 9:46 AM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> So setting 
>> minSplits<http://spark.incubator.apache.org/docs/latest/api/pyspark/pyspark.context.SparkContext-class.html#textFile>
>>  will
>> set the parallelism on the read in SparkContext.textFile(), assuming I have
>> the cores in the cluster to deliver that level of parallelism. And if I
>> don't explicitly provide it, Spark will set the minSplits to 2.
>>
>> So for example, say I have a cluster with 4 cores total, and it takes 40
>> minutes to read a single file from S3 with minSplits at 2. Tt should take
>> roughly 20 minutes to read the same file if I up minSplits to 4.
>>
>> Did I understand that correctly?
>>
>> RDD.saveAsTextFile() doesn't have an analog to minSplits, so I'm guessing
>> that's not an operation the user can tune.
>>
>>
>> On Mon, Mar 31, 2014 at 12:29 PM, Aaron Davidson <ilike...@gmail.com>wrote:
>>
>>> Spark will only use each core for one task at a time, so doing
>>>
>>> sc.textFile(<s3 location>, <num reducers>)
>>>
>>> where you set "num reducers" to at least as many as the total number of
>>> cores in your cluster, is about as fast you can get out of the box. Same
>>> goes for saveAsTextFile.
>>>
>>>
>>> On Mon, Mar 31, 2014 at 8:49 AM, Nicholas Chammas <
>>> nicholas.cham...@gmail.com> wrote:
>>>
>>>> Howdy-doody,
>>>>
>>>> I have a single, very large file sitting in S3 that I want to read in
>>>> with sc.textFile(). What are the best practices for reading in this file as
>>>> quickly as possible? How do I parallelize the read as much as possible?
>>>>
>>>> Similarly, say I have a single, very large RDD sitting in memory that I
>>>> want to write out to S3 with RDD.saveAsTextFile(). What are the best
>>>> practices for writing this file out as quickly as possible?
>>>>
>>>> Nick
>>>>
>>>>
>>>> ------------------------------
>>>> View this message in context: Best practices: Parallelized write to /
>>>> read from 
>>>> S3<http://apache-spark-user-list.1001560.n3.nabble.com/Best-practices-Parallelized-write-to-read-from-S3-tp3516.html>
>>>> Sent from the Apache Spark User List mailing list 
>>>> archive<http://apache-spark-user-list.1001560.n3.nabble.com/>at Nabble.com.
>>>>
>>>
>>>
>>
>

Re: Best practices: Parallelized write to / read from S3

Reply via email to