Many thanks for replying.

Note that I am not running HDFS on my laptop; I am reading from the local
filesystem.

This is what I am seeing in the console output:

14/02/02 17:47:41 INFO rdd.NewHadoopRDD: Input split: ParquetInputSplit{part: file:///Users/hassan/code/scala/avro/forum_dataset.parq start: 0 length: 1023817737 hosts: [localhost] blocks: 4 requestedSchema: same as file fileSchema: message forumavroschema.Topic

So I guess there are indeed 4 partitions. Yet I am seeing only a single core
being used, and only the driver shows up in the web console as an executor.
Is this because I am not using HDFS?
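
For reference, here is a minimal sketch of the kind of context setup I mean
(the app name is invented for the example). I am wondering whether the
master string alone explains the single core:

    import org.apache.spark.SparkContext

    // Plain "local" runs the scheduler with a single worker thread, so only
    // one core is used no matter how many input splits there are; "local[4]"
    // (or local[N] generally) lets the 4 splits be processed in parallel.
    val sc = new SparkContext("local[4]", "parquet-avro-test")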

I had a hunch that the block size was not being picked up for some reason,
so I tried repartition(16) on the input RDD, and judging by the console
output the work does at least get distributed after the repartition.
However, the job now produces no output, and I do not think Kryo can
serialise Avro objects without my writing some custom serialization
methods :( (a rough sketch of what I think that would entail is below).
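
In case it is useful, here is the kind of thing I imagine I would have to
write: a rough, untested sketch of a custom Kryo serializer that round-trips
my generated Topic class through Avro's own binary encoding, plus a
registrator to hook it up (the names AvroTopicSerializer and AvroRegistrator
are just mine for the example):

    import com.esotericsoftware.kryo.{Kryo, Serializer}
    import com.esotericsoftware.kryo.io.{Input, Output}
    import org.apache.avro.io.{DecoderFactory, EncoderFactory}
    import org.apache.avro.specific.{SpecificDatumReader, SpecificDatumWriter}
    import org.apache.spark.serializer.KryoRegistrator
    import forumavroschema.Topic

    // Serialize the Avro-generated Topic with Avro's binary encoding, since
    // Kryo's default FieldSerializer does not cope with Avro's internals.
    class AvroTopicSerializer extends Serializer[Topic] {
      override def write(kryo: Kryo, output: Output, topic: Topic) {
        val writer = new SpecificDatumWriter[Topic](classOf[Topic])
        // Kryo's Output is a java.io.OutputStream; the "direct" encoder
        // writes straight through without its own buffering.
        val encoder = EncoderFactory.get.directBinaryEncoder(output, null)
        writer.write(topic, encoder)
        encoder.flush()
      }
      override def read(kryo: Kryo, input: Input, cls: Class[Topic]): Topic = {
        val reader = new SpecificDatumReader[Topic](classOf[Topic])
        // The "direct" decoder does not read ahead, so it should not consume
        // bytes belonging to the next record in Kryo's Input stream.
        val decoder = DecoderFactory.get.directBinaryDecoder(input, null)
        reader.read(null.asInstanceOf[Topic], decoder)
      }
    }

    class AvroRegistrator extends KryoRegistrator {
      override def registerClasses(kryo: Kryo) {
        kryo.register(classOf[Topic], new AvroTopicSerializer)
      }
    }

which would then be wired in before the SparkContext is created:

    System.setProperty("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    System.setProperty("spark.kryo.registrator", "AvroRegistrator")

I have not verified this against my job, so treat it as a sketch only. Is
this roughly the right direction?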

How do you advise I proceed? Should I continue using Avro/Parquet or switch
to something else? And do I need to set up HDFS on my laptop?

Kind Regards

Hassan
