Hassan,

I’m not sure why only a single core is being used to process the 4 partitions.
It shouldn’t have anything to do with not using HDFS, but that’s pure
conjecture on my part.
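
One thing worth double-checking, though (again, just a guess on my part): a
master URL of plain "local" runs Spark with a single worker thread, no matter
how many partitions the input has. Asking for several local threads explicitly
looks roughly like this (the app name is just a placeholder):

    import org.apache.spark.{SparkConf, SparkContext}

    // "local" uses one thread; "local[4]" asks for 4 worker threads,
    // so the 4 input splits can be processed in parallel.
    val conf = new SparkConf()
      .setMaster("local[4]")
      .setAppName("parquet-avro-test")
    val sc = new SparkContext(conf)

If you are already running with local[n], ignore this.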

RE: Kryo serialization, Matt Massie has a good blog post on using Parquet and
Avro with Spark. Additionally, here is a link to the source a project uses to
register Avro with Kryo.
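
In case it helps, the rough shape is below. This is an untested sketch, and
I'm assuming your generated class is forumavroschema.Topic based on the
schema name in your log; the serializer simply round-trips the record through
Avro's own binary encoding:

    import com.esotericsoftware.kryo.{Kryo, Serializer}
    import com.esotericsoftware.kryo.io.{Input, Output}
    import org.apache.avro.io.{DecoderFactory, EncoderFactory}
    import org.apache.avro.specific.{SpecificDatumReader,
      SpecificDatumWriter, SpecificRecordBase}
    import org.apache.spark.serializer.KryoRegistrator

    // Writes/reads an Avro specific record using Avro's binary
    // encoding instead of Kryo's default reflection-based fields.
    class AvroKryoSerializer[T <: SpecificRecordBase](clazz: Class[T])
        extends Serializer[T] {

      override def write(kryo: Kryo, output: Output, record: T) {
        val writer = new SpecificDatumWriter[T](clazz)
        // Kryo's Output is a java.io.OutputStream; the direct encoder
        // avoids buffering past the record boundary.
        val encoder = EncoderFactory.get().directBinaryEncoder(output, null)
        writer.write(record, encoder)
        encoder.flush()
      }

      override def read(kryo: Kryo, input: Input, clazz: Class[T]): T = {
        val reader = new SpecificDatumReader[T](clazz)
        // The direct decoder reads exactly one record and does not read
        // ahead, which would desync the surrounding Kryo stream.
        val decoder = DecoderFactory.get().directBinaryDecoder(input, null)
        reader.read(null.asInstanceOf[T], decoder)
      }
    }

    class MyRegistrator extends KryoRegistrator {
      override def registerClasses(kryo: Kryo) {
        // Class name assumed from the fileSchema in your log output.
        kryo.register(classOf[forumavroschema.Topic],
          new AvroKryoSerializer(classOf[forumavroschema.Topic]))
      }
    }

Then, on the SparkConf before the context is created (use the registrator's
fully-qualified name if it lives in a package):

    conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    conf.set("spark.kryo.registrator", "MyRegistrator")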

Regards,
 
Frank Austin Nothaft
[email protected]
[email protected]
202-340-0466

On Feb 2, 2014, at 10:01 AM, Hassan Syed <[email protected]> wrote:

> Many thanks for replying.
> 
> Note that I am not running HDFS on my laptop; I am using the local
> filesystem.
> 
> I am seeing this in the console output:
> 
> 14/02/02 17:47:41 INFO rdd.NewHadoopRDD: Input split:
> ParquetInputSplit{part:
> file:///Users/hassan/code/scala/avro/forum_dataset.parq start: 0 length:
> 1023817737 hosts: [localhost] blocks: 4 requestedSchema: same as file
> fileSchema: message forumavroschema.Topic 
> 
> So I guess there are indeed 4 partitions. I am seeing only a single core
> being used, and only the driver shows up in the web console as an
> executor. Is this because I am not using HDFS?
> 
> I had a hunch that the block size was not being picked up for some reason,
> so I tried repartition(16) on the input RDD, and from the spew on the
> console it now seems that, after the repartition, the work is at least
> being delegated. However, I do not think Kryo can serialise Avro objects
> without me writing some serialization methods :( as the job produces no
> output now.
> 
> How do you advise I proceed? Should I continue using Avro/Parquet or
> switch to something else? And do I need to set up HDFS on my laptop?
> 
> Kind Regards
> 
> Hassan
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Can-t-get-a-local-job-to-parallelise-using-0-9-0-from-git-with-parquet-and-avro-tp1130p1135.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
