I am using globs, though:

raw = sc.textFile("/path/to/dir/*/*")

and I have tons of files, so 1 file per partition should not be a problem.
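
Something along these lines (a rough sketch with placeholder paths) is what
I would run to confirm the partition counts and to try repartitioning a
subset, as suggested below:

raw = sc.textFile("/path/to/dir/*/*")
print(raw.getNumPartitions())      # with gzipped inputs: one partition per file

# repartition only a sample, since the full data sets are too large
sample = raw.sample(False, 0.01).repartition(sc.defaultParallelism * 2)
print(sample.getNumPartitions())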

On Mon, Oct 20, 2014 at 7:14 PM, Nicholas Chammas <
nicholas.cham...@gmail.com> wrote:

> The biggest danger with gzipped files is this:
>
> >>> raw = sc.textFile("/path/to/file.gz", 8)
> >>> raw.getNumPartitions()
> 1
>
> You think you’re telling Spark to parallelize the reads on the input, but
> Spark cannot parallelize reads against gzipped files. So 1 gzipped file
> gets assigned to 1 partition.
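>
> A rough sketch of the usual workaround, at the cost of a shuffle (the
> target partition count here is just illustrative):
>
> >>> raw = raw.repartition(8)
> >>> raw.getNumPartitions()
> 8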
>
> It might be a nice user hint if Spark warned when parallelism is disabled
> by the input format.
>
> Nick
>
>
> On Mon, Oct 20, 2014 at 6:53 PM, Daniel Mahler <dmah...@gmail.com> wrote:
>
>> Hi Nicholas,
>>
>> Gzipping is an impressive guess! Yes, they are.
>> My data sets are too large to make repartitioning viable, but I could try
>> it on a subset.
>> I generally have many more partitions than cores.
>> This was happening before I started setting those configs.
>>
>> thanks
>> Daniel
>>
>>
>> On Mon, Oct 20, 2014 at 5:37 PM, Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>>> Are you dealing with gzipped files by any chance? Does explicitly
>>> repartitioning your RDD to match the number of cores in your cluster help
>>> at all? How about if you don't specify the configs you listed and just go
>>> with defaults all around?
>>>
>>> On Mon, Oct 20, 2014 at 5:22 PM, Daniel Mahler <dmah...@gmail.com>
>>> wrote:
>>>
>>>> I launch the cluster using vanilla spark-ec2 scripts.
>>>> I just specify the number of slaves and the instance type.
>>>>
>>>> On Mon, Oct 20, 2014 at 4:07 PM, Daniel Mahler <dmah...@gmail.com>
>>>> wrote:
>>>>
>>>>> I usually run interactively from the spark-shell.
>>>>> My data definitely has more than enough partitions to keep all the
>>>>> workers busy.
>>>>> When I first launch the cluster, I do:
>>>>>
>>>>> +++++++++++++++++++++++++++++++++++++++++++++++++
>>>>> cat <<EOF >>~/spark/conf/spark-defaults.conf
>>>>> spark.serializer        org.apache.spark.serializer.KryoSerializer
>>>>> spark.rdd.compress      true
>>>>> spark.shuffle.consolidateFiles  true
>>>>> spark.akka.frameSize  20
>>>>> EOF
>>>>>
>>>>> copy-dir /root/spark/conf
>>>>> spark/sbin/stop-all.sh
>>>>> sleep 5
>>>>> spark/sbin/start-all.sh
>>>>> +++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>
>>>>> before starting the spark-shell or running any jobs.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Mon, Oct 20, 2014 at 2:57 PM, Nicholas Chammas <
>>>>> nicholas.cham...@gmail.com> wrote:
>>>>>
>>>>>> Perhaps your RDD is not partitioned enough to utilize all the cores
>>>>>> in your system.
>>>>>>
>>>>>> Could you post a simple code snippet and explain what kind of
>>>>>> parallelism you are seeing for it? And can you report on how many
>>>>>> partitions your RDDs have?
>>>>>>
>>>>>> On Mon, Oct 20, 2014 at 3:53 PM, Daniel Mahler <dmah...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>>
>>>>>>> I am launching EC2 clusters using the spark-ec2 scripts.
>>>>>>> My understanding is that this configures spark to use the available
>>>>>>> resources.
>>>>>>> I can see that spark will use the available memory on larger instance
>>>>>>> types.
>>>>>>> However, I have never seen spark running at more than 400% (i.e. using
>>>>>>> 100% on each of 4 cores) on machines with many more cores.
>>>>>>> Am I misunderstanding the docs? Is it just that high-end EC2 instances
>>>>>>> get I/O starved when running spark? It would be strange if that
>>>>>>> consistently produced a 400% hard limit, though.
>>>>>>>
>>>>>>> thanks
>>>>>>> Daniel
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
