Hi. Actually, you can set the partition number yourself by changing the 'spark.default.parallelism' property. Otherwise, Spark will use the default, defaultParallelism.

For local mode, defaultParallelism = totalCores. For local cluster mode, defaultParallelism = math.max(totalCores, 2). In addition, for hadoopFile the default minimum number of partitions is different:

def defaultMinSplits: Int = math.min(defaultParallelism, 2)

2014-04-16 5:54 GMT+08:00 Nicholas Chammas <nicholas.cham...@gmail.com>:

> Looking at the Python version of textFile()
> <http://spark.apache.org/docs/latest/api/pyspark/pyspark.context-pysrc.html#SparkContext.textFile>,
> shouldn't it be "*max*(self.defaultParallelism, 2)"?
>
> If the default parallelism is, say, 4, wouldn't we want to use that for
> minSplits instead of 2?
>
> On Tue, Apr 15, 2014 at 1:04 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
>
>> Yup, one reason it's 2 actually is to give people a similar experience to
>> working with large files, in case their code doesn't deal well with the
>> file being partitioned.
>>
>> Matei
>>
>> On Apr 15, 2014, at 9:53 AM, Aaron Davidson <ilike...@gmail.com> wrote:
>>
>> Take a look at the minSplits argument for SparkContext#textFile [1] --
>> the default value is 2. You can simply set this to 1 if you'd prefer not to
>> split your data.
>>
>> [1] http://spark.apache.org/docs/latest/api/core/index.html#org.apache.spark.SparkContext
>>
>> On Tue, Apr 15, 2014 at 8:44 AM, Diana Carroll <dcarr...@cloudera.com> wrote:
>>
>>> I loaded a very tiny file into Spark -- 23 lines of text, 2.6 KB.
>>>
>>> Given the size, and that it is a single file, I assumed it would only be
>>> in a single partition. But when I cache it, I can see in the Spark App UI
>>> that it actually splits it into two partitions:
>>>
>>> <sparkdev_2014-04-11.png>
>>>
>>> Is this correct behavior? How does Spark decide how big a partition
>>> should be, or how many partitions to create for an RDD?
>>>
>>> If it matters, I have only a single worker in my "cluster", so both
>>> partitions are stored on the same worker.
>>>
>>> The file was on HDFS and was only a single block.
>>>
>>> Thanks for any insight.
>>>
>>> Diana
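To make the min-vs-max question above concrete, here is a minimal sketch in plain Python (not actual Spark or PySpark code; the function names are hypothetical) of the two behaviors being compared, with `default_parallelism` standing in for sc.defaultParallelism:

```python
def default_min_splits(default_parallelism):
    # Current Spark behavior: min(defaultParallelism, 2). The default
    # number of splits is capped at 2, so small files are still split
    # (per Matei's point), but never forced any higher.
    return min(default_parallelism, 2)

def proposed_max_variant(default_parallelism):
    # Nicholas's suggestion: max(defaultParallelism, 2) would instead
    # split even a tiny file across all default slots.
    return max(default_parallelism, 2)

# With defaultParallelism = 4 (e.g. a 4-core machine in local mode):
print(default_min_splits(4))    # -> 2 (current behavior)
print(proposed_max_variant(4))  # -> 4
# With defaultParallelism = 1, min() would give a single split:
print(default_min_splits(1))    # -> 1
```

So with min(), Diana's 23-line file still lands in 2 partitions regardless of cores, which matches what she saw in the UI; passing minSplits=1 to textFile, as Aaron suggests, overrides that default.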