Re: partitioning of small data sets

2014-04-15 Thread YouPeng Yang
Hi. Actually, you can set the number of partitions yourself by changing the 'spark.default.parallelism' property. Otherwise, Spark will use the default, defaultParallelism. For local mode, defaultParallelism = totalCores. For local cluster mode, defaultParallelism = math.max(totalCores
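
A minimal sketch of setting that property from application code; the app name, master URL, and the value 4 are placeholders, not taken from the thread:

    import org.apache.spark.{SparkConf, SparkContext}

    // Override spark.default.parallelism so operations that fall back to the
    // default parallelism use this value instead of the detected core count.
    val conf = new SparkConf()
      .setAppName("partitioning-example")       // placeholder app name
      .setMaster("local[*]")                    // placeholder master
      .set("spark.default.parallelism", "4")    // placeholder value

    val sc = new SparkContext(conf)
    println(sc.defaultParallelism)              // expected to print 4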

Re: partitioning of small data sets

2014-04-15 Thread Nicholas Chammas
Looking at the Python version of textFile(), shouldn't it be "*max*(self.defaultParallelism, 2)"? If the default parallelism is, say, 4, wouldn't we want to use that for minSplits instead of 2?
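
A small sketch of the arithmetic being questioned here, using the defaultParallelism of 4 from the example above (variable names are illustrative, not Spark's actual source):

    // With the current behaviour, the default number of splits is capped at 2.
    val defaultParallelism = 4

    val currentDefault  = math.min(defaultParallelism, 2)  // 2: at most 2 splits
    val proposedDefault = math.max(defaultParallelism, 2)  // 4: at least defaultParallelism splits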

Re: partitioning of small data sets

2014-04-15 Thread Matei Zaharia
Yup, one reason it’s 2, actually, is to give people a similar experience to working with large files, in case their code doesn’t deal well with the file being partitioned. Matei
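
As an illustration (not from the thread) of code whose output changes when a file is split, here is a hypothetical per-partition line-numbering snippet; the file name is a placeholder:

    // Numbering lines per partition: with 2 partitions the index restarts at 0
    // in the second partition, so the "line numbers" are no longer global.
    val numbered = sc.textFile("tiny.txt").mapPartitionsWithIndex { (part, iter) =>
      iter.zipWithIndex.map { case (line, i) => s"partition=$part line=$i: $line" }
    }
    numbered.collect().foreach(println)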

Re: partitioning of small data sets

2014-04-15 Thread Aaron Davidson
Take a look at the minSplits argument for SparkContext#textFile [1] -- the default value is 2. You can simply set this to 1 if you'd prefer not to split your data. [1] http://spark.apache.org/docs/latest/api/core/index.html#org.apache.spark.SparkContext
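
A minimal sketch of that suggestion; the file path is a placeholder and the second argument is the minSplits parameter described above:

    // Request a single split so the tiny file is not divided across partitions.
    val lines = sc.textFile("tiny.txt", 1)
    println(lines.partitions.size)  // expected to print 1 for a small single-block file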

partitioning of small data sets

2014-04-15 Thread Diana Carroll
I loaded a very tiny file into Spark -- 23 lines of text, 2.6 KB. Given the size, and that it is a single file, I assumed it would only be in a single partition. But when I cache it, I can see in the Spark App UI that it actually splits it into two partitions: [image: Inline image 1] Is this cor
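
For anyone wanting to reproduce this without the web UI, a quick sketch of checking the partition count directly (the file path is a placeholder):

    // Check the partition count programmatically instead of in the Spark App UI.
    val tiny = sc.textFile("tiny.txt")
    tiny.cache()
    println(tiny.partitions.size)  // prints 2 here because the default minSplits is 2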