Hassan, how many of the cores is it using? IIRC, at default settings, Parquet partitions a 700MB file into 3 chunks, so we would expect a 1GB Parquet file to be split into roughly 4 partitions, and hence to use 4 cores. You can check how many partitions Parquet wrote your file in by doing an ls inside the Parquet directory.
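As a complement to the ls check, here is a minimal sketch of inspecting the partition count from inside Spark itself. The master string, app name, and path are placeholders, and textFile stands in for however you are actually loading the Parquet+Avro data:

    import org.apache.spark.SparkContext

    object PartitionCheck {
      def main(args: Array[String]): Unit = {
        // "local[4]" and the path are placeholders for your actual setup.
        val sc = new SparkContext("local[4]", "partition-check")
        // Stand-in for the real Parquet/Avro load in your job.
        val records = sc.textFile("/path/to/data.parquet")
        // The number of partitions is the number of tasks (and hence cores)
        // Spark can use for the first stage.
        println("partitions = " + records.partitions.length)
        sc.stop()
      }
    }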
If the processing is a bottleneck and you don't mind incurring a shuffle, you could do a coalesce on the RDD after you load the data from disk. In your case, you would want numPartitions to be at least the number of cores in your system (I think the typical recommendation is 2-3 partitions per core for load balancing) and shuffle=true; a rough sketch follows after the quoted message below. Otherwise, I'd change the Parquet settings you create the file with so that it is written in more partitions.

Regards,

Frank Austin Nothaft
[email protected]
[email protected]
202-340-0466

On Feb 2, 2014, at 8:47 AM, Hassan Syed <[email protected]> wrote:

> I know it is Sunday, but I would be eternally grateful if someone could
> help me sort out this issue. If I can't get Spark working soon, I am going
> to have to do this processing on my laptop, and I'd have to write a
> resumable batch operation using a database to maintain state.
>
> Any suggestions on the top things to try would help.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Can-t-get-a-local-job-to-parallelise-using-0-9-0-from-git-with-parquet-and-avro-tp1130p1132.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
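A rough sketch of the coalesce suggestion above. The core count, partition multiplier, and path are placeholders, and textFile again stands in for the real Parquet load; with shuffle = true, coalesce can also increase the partition count, which is what you want when the file was written in too few chunks:

    import org.apache.spark.SparkContext

    object RepartitionExample {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext("local[8]", "repartition-example")
        // Stand-in for the real Parquet/Avro load in your job.
        val raw = sc.textFile("/path/to/data.parquet")
        val numCores = 8
        // 2-3 partitions per core for load balancing, per the advice above.
        val balanced = raw.coalesce(numCores * 3, shuffle = true)
        println("before = " + raw.partitions.length +
                ", after = " + balanced.partitions.length)
        sc.stop()
      }
    }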
