Hi Noorul,

Thank you for your suggestion. I tried that, but ran out of memory. I did some searching and found suggestions that we should avoid rdd.union (http://stackoverflow.com/questions/28343181/memory-efficient-way-of-union-a-sequence-of-rdds-from-files-in-apache-spark). I will try to come up with some other way.
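One direction I'm considering is restructuring the job so that no RDDs are unioned at all: keep everything in a single RDD and use flatMap. This is only a rough sketch under assumptions -- computeLocal is a hypothetical variant of compute(_) that returns a plain Scala collection instead of an RDD, and the shared external dataset is joined via a broadcast rather than an RDD join:

    import org.apache.spark.{SparkConf, SparkContext}

    object FlatMapSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("flatmap-sketch"))

        // Hypothetical stand-ins for the actual inputs.
        val items  = sc.parallelize(1 to 500000)            // the 0.5M data items
        val shared = sc.broadcast(Map(1 -> "a", 2 -> "b"))  // shared external dataset

        // computeLocal: hypothetical variant of compute(_) that returns a
        // local Seq (20-500 results per item) instead of a new RDD, looking
        // up the broadcast copy of the shared dataset instead of joining RDDs.
        def computeLocal(item: Int): Seq[(Int, String)] =
          shared.value.get(item).map(v => (item, v)).toSeq

        // One flatMap over a single RDD replaces map(compute(_)).reduce(_ ++ _),
        // so no RDD union (and no per-item RDD) is created at all.
        val result = items.flatMap(computeLocal)

        println(result.count())
        sc.stop()
      }
    }

No idea yet whether this works for my case; it assumes compute(_) can be rewritten this way and that the shared dataset fits in memory for broadcasting.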
Thank you,
Yang

On Thu, Mar 26, 2015 at 1:13 PM, Noorul Islam K M <noo...@noorul.com> wrote:

> sparkx <y...@yang-cs.com> writes:
>
> > Hi,
> >
> > I have a Spark job and a dataset of 0.5 million items. Each item performs
> > some sort of computation (joining a shared external dataset, if that
> > matters) and produces an RDD containing 20-500 result items. Now I would
> > like to combine all these RDDs and perform a next job. What I have found
> > is that the computation itself is quite fast, but combining these RDDs
> > takes much longer.
> >
> >     val result = data      // 0.5M data items
> >       .map(compute(_))     // Produces an RDD - fast
> >       .reduce(_ ++ _)      // Combining RDDs - slow
> >
> > I have also tried to collect the results of compute(_) and use a flatMap,
> > but that is also slow.
> >
> > Is there a way to do this efficiently? I'm thinking about writing the
> > result to HDFS and reading it from disk for the next job, but I am not
> > sure whether that's the preferred way in Spark.
>
> Are you looking for SparkContext.union() [1]?
>
> This does not perform well with the Spark Cassandra connector, so I am
> not sure whether it will help you.
>
> Thanks and Regards
> Noorul
>
> [1] http://spark.apache.org/docs/1.3.0/api/scala/index.html#org.apache.spark.SparkContext

--
Yang Chen
Dept. of CISE, University of Florida
Mail: y...@yang-cs.com
Web: www.cise.ufl.edu/~yang