Hassan, how many of the cores is it using? IIRC, at default settings, Parquet partitions a 700MB file into 3 chunks, so we would expect a 1GB Parquet file to be split into roughly 4 partitions, and hence to use 4 cores. You can check how many partitions Parquet wrote your file in by doing an ls inside the Parquet directory.
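As a complement to the ls check, here is a minimal sketch of inspecting the partition count from inside Spark itself. The master string, app name, and path are placeholders, and textFile stands in for however you are actually loading the Parquet+Avro data:

    import org.apache.spark.SparkContext

    object PartitionCheck {
      def main(args: Array[String]): Unit = {
        // "local[4]" and the path are placeholders for your actual setup.
        val sc = new SparkContext("local[4]", "partition-check")
        // Stand-in for the real Parquet/Avro load in your job.
        val records = sc.textFile("/path/to/data.parquet")
        // The number of partitions is the number of tasks (and hence cores)
        // Spark can use for the first stage.
        println("partitions = " + records.partitions.length)
        sc.stop()
      }
    }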
If the processing is a bottleneck and you don't mind incurring a shuffle, you could do a coalesce on the RDD after you load the data from disk. In your case, you would want numPartitions to be at least the number of cores in your system (I think the typical recommendation is 2-3 partitions per core for load balancing) and shuffle=true; a rough sketch follows after the quoted message below. Otherwise, I'd change the Parquet settings you create the file with so that it is written in more partitions.

Regards,

Frank Austin Nothaft
[email protected]
[email protected]
202-340-0466

On Feb 2, 2014, at 8:47 AM, Hassan Syed <[email protected]> wrote:

> I know it is Sunday, but I would be eternally grateful if someone could
> help me sort out this issue. If I can't get Spark working soon, I am going
> to have to do this processing on my laptop, and I'd have to write a
> resumable batch operation using a database to maintain state.
>
> Any suggestions on the top things to try would help.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Can-t-get-a-local-job-to-parallelise-using-0-9-0-from-git-with-parquet-and-avro-tp1130p1132.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
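A rough sketch of the coalesce suggestion above. The core count, partition multiplier, and path are placeholders, and textFile again stands in for the real Parquet load; with shuffle = true, coalesce can also increase the partition count, which is what you want when the file was written in too few chunks:

    import org.apache.spark.SparkContext

    object RepartitionExample {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext("local[8]", "repartition-example")
        // Stand-in for the real Parquet/Avro load in your job.
        val raw = sc.textFile("/path/to/data.parquet")
        val numCores = 8
        // 2-3 partitions per core for load balancing, per the advice above.
        val balanced = raw.coalesce(numCores * 3, shuffle = true)
        println("before = " + raw.partitions.length +
                ", after = " + balanced.partitions.length)
        sc.stop()
      }
    }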
