How to limit the number of concurrent tasks per node?

2015-01-06 Thread Pengcheng YIN
Hi Pro, One map() operation in my Spark app takes an RDD[A] as input and maps each element of RDD[A] to another object of type B using a custom mapping function func(x: A): B. I received lots of OutOfMemory errors, and after some debugging I found this is because func() requires a significant amount …
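A common way to limit how many tasks run concurrently on each node is to make each task claim more than one CPU slot. The sketch below assumes standalone/YARN-style executors; the app name and core counts are illustrative, but `spark.executor.cores` and `spark.task.cpus` are real Spark configuration keys:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical setup: each executor has 8 cores, but each task claims 4,
// so at most 8 / 4 = 2 tasks run concurrently per executor. Fewer
// concurrent tasks means more memory headroom for a heavy func().
val conf = new SparkConf()
  .setAppName("limit-concurrent-tasks")   // illustrative name
  .set("spark.executor.cores", "8")       // cores available per executor
  .set("spark.task.cpus", "4")            // cores each task reserves
val sc = new SparkContext(conf)
```

The same two settings can be passed on the command line via `spark-submit --conf`, which avoids hard-coding them in the application.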

Does calling cache()/persist() on an RDD trigger its immediate evaluation?

2015-01-03 Thread Pengcheng YIN
Hi Pro, I have a question about calling cache()/persist() on an RDD. All RDDs in Spark are lazily evaluated, but does calling cache()/persist() on an RDD trigger its immediate evaluation? My Spark app looks something like this: val rdd = sc.textFile().map() rdd.persist() while(true){ val c …
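For context, persist()/cache() are lazy in Spark: they only mark the RDD for storage, and the data is materialized the first time an action computes it. A common idiom is to force materialization with a cheap action before a loop, sketched here assuming an existing SparkContext `sc` and a hypothetical input path and parse function:

```scala
// Assumes sc: SparkContext is already available; path and parseLine are
// placeholders for the app's real input and mapping function.
val rdd = sc.textFile("hdfs://path/to/input").map(parseLine)
rdd.persist()   // lazy: only marks the RDD to be cached once computed
rdd.count()     // first action: computes the RDD and fills the cache
// Actions inside the loop now read the cached partitions instead of
// re-reading and re-parsing the input file on every iteration.
```

Without the up-front action, the first iteration of the loop would pay the full computation cost; persist() itself never computes anything.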

Merge elements in a Spark RDD under a custom condition

2014-12-01 Thread Pengcheng YIN
Hi Pro, I want to merge elements in a Spark RDD when two elements satisfy a certain condition. Suppose there is an RDD[Seq[Int]], where some Seq[Int] in this RDD contain overlapping elements. The task is to merge all overlapping Seq[Int] in this RDD and store the result in a new RDD. For ex …
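The merge step is essentially a connected-components problem: groups that share an element (directly or transitively) collapse into one. A minimal local sketch of that logic in plain Scala, with `mergeOverlapping` as a hypothetical helper name (on a large RDD one would instead collect the groups if small, or use an iterative/GraphX connected-components approach):

```scala
// Merge all groups that overlap, transitively: each incoming group
// absorbs every accumulated group it shares an element with.
def mergeOverlapping(groups: Seq[Set[Int]]): Seq[Set[Int]] =
  groups.foldLeft(List.empty[Set[Int]]) { (acc, g) =>
    // Split accumulated groups into those touching g and the rest.
    val (overlap, rest) = acc.partition(_.exists(g.contains))
    // Union g with everything it touches; keep the rest unchanged.
    overlap.foldLeft(g)(_ union _) :: rest
  }

val merged = mergeOverlapping(Seq(Set(1, 2), Set(2, 3), Set(5, 6)))
// merged contains Set(1, 2, 3) and Set(5, 6)
```

The fold handles transitive overlaps in one pass because a new group merges all previously accumulated groups it touches at once, so chains like {1,2}, {3,4}, {2,3} still end up in a single set.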