Hassan, I’m not sure why only a single core is being used to process 4 partitions. It shouldn’t have anything to do with not using HDFS, but that’s pure conjecture on my part.
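One thing that might be worth ruling out first (a guess on my part, since I haven't seen how your SparkContext is created): in local mode, Spark runs only as many concurrent tasks as the master URL asks for, so a master of plain "local" will work through your 4 partitions one at a time on a single core. A minimal sketch, with a made-up application name:

    import org.apache.spark.SparkContext

    // "local" gives Spark a single worker thread, regardless of how many
    // partitions the input RDD has; "local[4]" asks for four threads.
    val sc = new SparkContext("local[4]", "forum-parquet-test")

    // Sanity check: the default number of concurrent tasks Spark will use.
    println(sc.defaultParallelism)

If you are already running with a local[N] master, please ignore this.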
RE: Kryo serialization, Matt Massie has a good blog post on using Parquet and Avro with Spark. Additionally, here is a link to source that is used in a project to register Avro with Kryo; I have also put a rough sketch of the same idea at the bottom of this message.

Regards,

Frank Austin Nothaft
[email protected]
[email protected]
202-340-0466

On Feb 2, 2014, at 10:01 AM, Hassan Syed <[email protected]> wrote:

> Many thanks for replying.
>
> Note that I am not running HDFS on my laptop; I am using the local
> filesystem.
>
> I am seeing this in the console output:
>
> 14/02/02 17:47:41 INFO rdd.NewHadoopRDD: Input split: ParquetInputSplit{part:
> file:///Users/hassan/code/scala/avro/forum_dataset.parq start: 0 length:
> 1023817737 hosts: [localhost] blocks: 4 requestedSchema: same as file
> fileSchema: message forumavroschema.Topic
>
> So I guess there are indeed 4 partitions. However, I am seeing only a single
> core being used, and only the driver shows up in the web console as an
> executor. Is this because I am not using HDFS?
>
> I had a hunch that the block size was not being picked up for some reason,
> so I tried repartition(16) on the input RDD, and from the spew on the
> console it seems that, after the repartition, the work is at least being
> delegated. However, I do not think Kryo can serialise Avro objects without
> me writing some serialization methods :( as the job now produces no output.
>
> How do you advise I proceed? Should I continue using Avro/Parquet, or switch
> to something else? And do I need to set up HDFS on my laptop?
>
> Kind regards,
>
> Hassan
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Can-t-get-a-local-job-to-parallelise-using-0-9-0-from-git-with-parquet-and-avro-tp1130p1135.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
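PS: here is the rough sketch of the Kryo registration I mentioned above. It is untested, and it assumes that forumavroschema.Topic (from your file schema) is an Avro-generated specific record class; the idea is to sidestep Kryo's field-by-field introspection by round-tripping each record through Avro's own binary encoding:

    import com.esotericsoftware.kryo.{Kryo, Serializer}
    import com.esotericsoftware.kryo.io.{Input, Output}
    import java.io.ByteArrayOutputStream
    import org.apache.avro.io.{DecoderFactory, EncoderFactory}
    import org.apache.avro.specific.{SpecificDatumReader, SpecificDatumWriter,
      SpecificRecordBase}
    import org.apache.spark.serializer.KryoRegistrator

    // Serializes an Avro specific record via Avro's binary encoding, so
    // Kryo never has to introspect the generated class.
    class AvroSpecificSerializer[T <: SpecificRecordBase](clazz: Class[T])
        extends Serializer[T] {

      override def write(kryo: Kryo, output: Output, record: T) {
        val bytes = new ByteArrayOutputStream()
        val encoder = EncoderFactory.get.binaryEncoder(bytes, null)
        new SpecificDatumWriter[T](clazz).write(record, encoder)
        encoder.flush()
        val arr = bytes.toByteArray
        output.writeInt(arr.length, true) // length prefix, then the payload
        output.writeBytes(arr)
      }

      override def read(kryo: Kryo, input: Input, c: Class[T]): T = {
        val arr = input.readBytes(input.readInt(true))
        val decoder = DecoderFactory.get.binaryDecoder(arr, null)
        new SpecificDatumReader[T](clazz).read(null.asInstanceOf[T], decoder)
      }
    }

    // Register the generated class with the serializer above. Point
    // spark.kryo.registrator at this class and set spark.serializer to
    // org.apache.spark.serializer.KryoSerializer before creating the
    // SparkContext.
    class TopicKryoRegistrator extends KryoRegistrator {
      override def registerClasses(kryo: Kryo) {
        kryo.register(classOf[forumavroschema.Topic],
          new AvroSpecificSerializer(classOf[forumavroschema.Topic]))
      }
    }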

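PPS: for completeness, here is roughly the read path I would expect for a Parquet file of Avro records, along the lines of Matt's post (again a sketch, with the same assumption about the generated Topic class, and it needs the parquet-avro artifact on the classpath):

    import org.apache.hadoop.mapreduce.Job
    import org.apache.spark.SparkContext
    import parquet.avro.AvroReadSupport
    import parquet.hadoop.ParquetInputFormat

    val sc = new SparkContext("local[4]", "forum-parquet-test")

    // Tell the Parquet input format to materialize records using Avro.
    val job = new Job()
    ParquetInputFormat.setReadSupportClass(job,
      classOf[AvroReadSupport[forumavroschema.Topic]])

    // ParquetInputFormat is a new-API Hadoop input format that yields a
    // Void key and the record itself as the value.
    val topics = sc.newAPIHadoopFile(
        "file:///Users/hassan/code/scala/avro/forum_dataset.parq",
        classOf[ParquetInputFormat[forumavroschema.Topic]],
        classOf[Void],
        classOf[forumavroschema.Topic],
        job.getConfiguration)
      .map(_._2)

    println(topics.count)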