Re: partitioning of small data sets

2014-04-15 Thread YouPeng Yang
Hi
  Actually, you can set the number of partitions yourself by changing the
'spark.default.parallelism' property. Otherwise, Spark will use the default,
defaultParallelism.

 For local mode, defaultParallelism = totalCores.
 For local cluster mode, defaultParallelism = math.max(totalCores, 2).

In addition, for hadoopFile the default minimum number of partitions is
determined differently:
  def defaultMinSplits: Int = math.min(defaultParallelism, 2)
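
A minimal sketch of the above (the app name, core count, and file path are
hypothetical, and it assumes a Spark release of this era where SparkContext
still exposes defaultMinSplits):

  import org.apache.spark.{SparkConf, SparkContext}

  // Hypothetical local setup: 4 cores, parallelism overridden to 8.
  val conf = new SparkConf()
    .setAppName("partition-demo")
    .setMaster("local[4]")
    .set("spark.default.parallelism", "8")
  val sc = new SparkContext(conf)

  println(sc.defaultParallelism) // 8, taken from the property above
  println(sc.defaultMinSplits)   // min(defaultParallelism, 2) = 2

  // textFile uses defaultMinSplits unless you pass minSplits yourself,
  // so a tiny file still comes back in 2 partitions by default.
  val tiny = sc.textFile("hdfs:///tmp/tiny.txt", 1) // hypothetical path
  println(tiny.partitions.length) // 1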



2014-04-16 5:54 GMT+08:00 Nicholas Chammas :

> Looking at the Python version of textFile(), shouldn't it be
> "*max*(self.defaultParallelism, 2)"?
>
> If the default parallelism is, say 4, wouldn't we want to use that for
> minSplits instead of 2?
>
>
> On Tue, Apr 15, 2014 at 1:04 PM, Matei Zaharia wrote:
>
>> Yup, one reason it’s 2 actually is to give people a similar experience to
>> working with large files, in case their code doesn’t deal well with the
>> file being partitioned.
>>
>> Matei
>>
>> On Apr 15, 2014, at 9:53 AM, Aaron Davidson  wrote:
>>
>> Take a look at the minSplits argument for SparkContext#textFile [1] --
>> the default value is 2. You can simply set this to 1 if you'd prefer not to
>> split your data.
>>
>> [1]
>> http://spark.apache.org/docs/latest/api/core/index.html#org.apache.spark.SparkContext
>>
>>
>> On Tue, Apr 15, 2014 at 8:44 AM, Diana Carroll wrote:
>>
>>> I loaded a very tiny file into Spark -- 23 lines of text, 2.6kb
>>>
>>> Given the size, and that it is a single file, I assumed it would only be
>>> in a single partition.  But when I cache it,  I can see in the Spark App UI
>>> that it actually splits it into two partitions:
>>>
>>> [image: Inline image 1]
>>>
>>> Is this correct behavior?  How does Spark decide how big a partition
>>> should be, or how many partitions to create for an RDD?
>>>
>>> If it matters, I have only a single worker in my "cluster", so both
>>> partitions are stored on the same worker.
>>>
>>> The file was on HDFS and was only a single block.
>>>
>>> Thanks for any insight.
>>>
>>> Diana
>>>
>>>
>>>
>>
>>
>


Re: partitioning of small data sets

2014-04-15 Thread Nicholas Chammas
Looking at the Python version of textFile(), shouldn't it be
"*max*(self.defaultParallelism, 2)"?

If the default parallelism is, say 4, wouldn't we want to use that for
minSplits instead of 2?
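
A quick sketch of the arithmetic in question, assuming a hypothetical
defaultParallelism of 4:

  val defaultParallelism = 4
  math.min(defaultParallelism, 2) // current default minSplits: 2
  math.max(defaultParallelism, 2) // what the question proposes: 4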


On Tue, Apr 15, 2014 at 1:04 PM, Matei Zaharia wrote:

> Yup, one reason it’s 2 actually is to give people a similar experience to
> working with large files, in case their code doesn’t deal well with the
> file being partitioned.
>
> Matei
>
> On Apr 15, 2014, at 9:53 AM, Aaron Davidson  wrote:
>
> Take a look at the minSplits argument for SparkContext#textFile [1] -- the
> default value is 2. You can simply set this to 1 if you'd prefer not to
> split your data.
>
> [1]
> http://spark.apache.org/docs/latest/api/core/index.html#org.apache.spark.SparkContext
>
>
> On Tue, Apr 15, 2014 at 8:44 AM, Diana Carroll wrote:
>
>> I loaded a very tiny file into Spark -- 23 lines of text, 2.6kb
>>
>> Given the size, and that it is a single file, I assumed it would only be
>> in a single partition.  But when I cache it,  I can see in the Spark App UI
>> that it actually splits it into two partitions:
>>
>> [image: Inline image 1]
>>
>> Is this correct behavior?  How does Spark decide how big a partition
>> should be, or how many partitions to create for an RDD?
>>
>> If it matters, I have only a single worker in my "cluster", so both
>> partitions are stored on the same worker.
>>
>> The file was on HDFS and was only a single block.
>>
>> Thanks for any insight.
>>
>> Diana
>>
>>
>>
>
>


Re: partitioning of small data sets

2014-04-15 Thread Matei Zaharia
Yup, one reason it’s 2 actually is to give people a similar experience to 
working with large files, in case their code doesn’t deal well with the file 
being partitioned.

Matei

On Apr 15, 2014, at 9:53 AM, Aaron Davidson  wrote:

> Take a look at the minSplits argument for SparkContext#textFile [1] -- the 
> default value is 2. You can simply set this to 1 if you'd prefer not to split 
> your data.
> 
> [1] 
> http://spark.apache.org/docs/latest/api/core/index.html#org.apache.spark.SparkContext
> 
> 
> On Tue, Apr 15, 2014 at 8:44 AM, Diana Carroll  wrote:
> I loaded a very tiny file into Spark -- 23 lines of text, 2.6kb
> 
> Given the size, and that it is a single file, I assumed it would only be in a 
> single partition.  But when I cache it,  I can see in the Spark App UI that 
> it actually splits it into two partitions:
> 
> [image: Inline image 1]
> 
> Is this correct behavior?  How does Spark decide how big a partition should 
> be, or how many partitions to create for an RDD?
> 
> If it matters, I have only a single worker in my "cluster", so both 
> partitions are stored on the same worker.
> 
> The file was on HDFS and was only a single block.
> 
> Thanks for any insight.
> 
> Diana
> 
> 
> 



Re: partitioning of small data sets

2014-04-15 Thread Aaron Davidson
Take a look at the minSplits argument for SparkContext#textFile [1] -- the
default value is 2. You can simply set this to 1 if you'd prefer not to
split your data.

[1]
http://spark.apache.org/docs/latest/api/core/index.html#org.apache.spark.SparkContext
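
For example, a minimal sketch (sc is an existing SparkContext and the HDFS
path is hypothetical):

  // Default: minSplits = 2, so even a one-block tiny file gets 2 partitions.
  val twoParts = sc.textFile("hdfs:///tmp/tiny.txt")
  println(twoParts.partitions.length) // 2

  // Pass minSplits = 1 explicitly to keep the file in a single partition.
  val onePart = sc.textFile("hdfs:///tmp/tiny.txt", 1)
  println(onePart.partitions.length) // 1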


On Tue, Apr 15, 2014 at 8:44 AM, Diana Carroll wrote:

> I loaded a very tiny file into Spark -- 23 lines of text, 2.6kb
>
> Given the size, and that it is a single file, I assumed it would only be
> in a single partition.  But when I cache it,  I can see in the Spark App UI
> that it actually splits it into two partitions:
>
> [image: Inline image 1]
>
> Is this correct behavior?  How does Spark decide how big a partition
> should be, or how many partitions to create for an RDD?
>
> If it matters, I have only a single worker in my "cluster", so both
> partitions are stored on the same worker.
>
> The file was on HDFS and was only a single block.
>
> Thanks for any insight.
>
> Diana
>
>
>

partitioning of small data sets

2014-04-15 Thread Diana Carroll
I loaded a very tiny file into Spark -- 23 lines of text, 2.6kb

Given the size, and that it is a single file, I assumed it would only be in
a single partition.  But when I cache it,  I can see in the Spark App UI
that it actually splits it into two partitions:

[image: Inline image 1]

Is this correct behavior?  How does Spark decide how big a partition should
be, or how many partitions to create for an RDD?

If it matters, I have only a single worker in my "cluster", so both
partitions are stored on the same worker.

The file was on HDFS and was only a single block.

Thanks for any insight.

Diana