from:"Yuri Makhno"

Re: Re: rdd.cache() not working ?

2015-04-01 Thread Yuri Makhno

The cache() method returns the RDD itself, now marked for caching, and nothing is stored until an action runs, so use something like this:

    val person = sc.textFile("hdfs://namenode_host:8020/user/person.txt")
      .map(_.split(","))
      .map(p => Person(p(0).trim.toInt, p(1)))
    val cached = person.cache
    cached.count

When you rerun count on cached you will see that ca
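The pattern in this reply can be tried locally with a minimal sketch like the one below. It rests on several assumptions that are not in the original message: a local[2] master, a three-line in-memory dataset standing in for the HDFS file, a two-field Person case class, and the hypothetical names CacheSketch and run.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumed shape of Person; the original message does not show its definition.
case class Person(id: Int, name: String)

object CacheSketch {
  // Returns (whether cache() handed back the same RDD instance, row count).
  def run(sc: SparkContext): (Boolean, Long) = {
    // In-memory stand-in for the HDFS text file used in the thread.
    val lines = sc.parallelize(Seq("1, Ann", "2, Bob", "3, Eve"))
    val person = lines.map(_.split(",")).map(p => Person(p(0).trim.toInt, p(1).trim))

    // cache() only marks the RDD for persistence; no job runs here.
    val cached = person.cache()

    // The first action computes the partitions and stores them.
    cached.count()
    // A second action on the same RDD can be served from the cached partitions.
    val n = cached.count()

    (cached eq person, n)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("cache-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try println(run(sc)) finally sc.stop()
  }
}
```

After the first count, the Storage tab of the Spark web UI should list the RDD, which is one way to check that caching took effect.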

spark.sql.shuffle.partitions parameter

2015-01-03 Thread Yuri Makhno

Hello everyone, I'm using SparkSQL and would like to understand how I can determine the right value for the "spark.sql.shuffle.partitions" parameter. For example, if I'm joining two RDDs where the first has 10 partitions and the second has 60, how large should this parameter be? Thank you, Yuri
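For context, the parameter in question sets how many partitions Spark SQL uses when it shuffles data for joins and aggregations. Its default is 200, and it is configured independently of the partition counts of the input RDDs. The sketch below only shows how the setting can be changed and read back with the SQLContext API from the Spark 1.x releases current at the time of this thread. The value 60 is taken from the example in the question purely for illustration and is not a recommendation. The local[2] master and the names ShufflePartitionsSketch and configure are assumptions.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object ShufflePartitionsSketch {
  // Sets the shuffle partition count and returns the value read back.
  def configure(sqlContext: SQLContext, partitions: Int): String = {
    sqlContext.setConf("spark.sql.shuffle.partitions", partitions.toString)
    sqlContext.getConf("spark.sql.shuffle.partitions")
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("shuffle-partitions-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val sqlContext = new SQLContext(sc)
      // 60 is illustrative only; it mirrors the larger input in the question.
      println(configure(sqlContext, 60))
    } finally sc.stop()
  }
}
```

The same setting can also be issued as a SQL statement, for example sqlContext.sql("SET spark.sql.shuffle.partitions=60").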