Hi Mark,

That's true, but neither RDD#union nor SparkContext#union lets me combine the RDDs, so I have to avoid unions altogether.
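
For anyone following the thread, here is a minimal sketch of the distinction between the two calls as I understand it. The `compute` stub, the `UnionSketch` object, the item count of 100, and the `local[2]` master are placeholders made up for illustration; they are not my actual job. It only shows how the two unions differ in shape and does not solve the 0.5M-RDD case.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object UnionSketch {
  // Hypothetical stand-in for compute(_): one tiny RDD per input item.
  def compute(sc: SparkContext, item: Int): RDD[Int] =
    sc.parallelize(Seq(item, item * 10), 1)

  def main(args: Array[String]): Unit = {
    // Assumption: a local master, just so the sketch runs on one machine.
    val conf = new SparkConf().setMaster("local[2]").setAppName("union-sketch")
    val sc = new SparkContext(conf)
    try {
      // A local (driver-side) sequence of small RDDs.
      val parts: Seq[RDD[Int]] = (1 to 100).map(compute(sc, _))

      // RDD#union (alias ++): every call wraps the previous result,
      // so the lineage gets one level deeper per RDD.
      val chained: RDD[Int] = parts.reduce(_ ++ _)

      // SparkContext#union: one union built over the whole sequence at once.
      val flat: RDD[Int] = sc.union(parts)

      // Both hold the same elements; only the lineage shape differs.
      assert(chained.count() == flat.count())
      assert(chained.partitions.length == flat.partitions.length)
    } finally {
      sc.stop()
    }
  }
}
```

Calling `chained.toDebugString` and `flat.toDebugString` should make the difference in nesting visible, though I have not measured how either behaves at the scale of my real data.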

Thanks,
Yang

On Thu, Mar 26, 2015 at 5:31 PM, Mark Hamstra <m...@clearstorydata.com> wrote:

> RDD#union is not the same thing as SparkContext#union
>
> On Thu, Mar 26, 2015 at 2:27 PM, Yang Chen <y...@yang-cs.com> wrote:
>
>> Hi Noorul,
>>
>> Thank you for your suggestion. I tried that, but ran out of memory. I did
>> some search and found some suggestions that we should try to avoid
>> rdd.union
>> (http://stackoverflow.com/questions/28343181/memory-efficient-way-of-union-a-sequence-of-rdds-from-files-in-apache-spark).
>> I will try to come up with some other ways.
>>
>> Thank you,
>> Yang
>>
>> On Thu, Mar 26, 2015 at 1:13 PM, Noorul Islam K M <noo...@noorul.com>
>> wrote:
>>
>>> sparkx <y...@yang-cs.com> writes:
>>>
>>> > Hi,
>>> >
>>> > I have a Spark job and a dataset of 0.5 Million items. Each item performs
>>> > some sort of computation (joining a shared external dataset, if that does
>>> > matter) and produces an RDD containing 20-500 result items. Now I would like
>>> > to combine all these RDDs and perform a next job. What I have found out is
>>> > that the computation itself is quite fast, but combining these RDDs takes
>>> > much longer time.
>>> >
>>> > val result = data      // 0.5M data items
>>> >   .map(compute(_))     // Produces an RDD - fast
>>> >   .reduce(_ ++ _)      // Combining RDDs - slow
>>> >
>>> > I have also tried to collect results from compute(_) and use a flatMap, but
>>> > that is also slow.
>>> >
>>> > Is there a way to efficiently do this? I'm thinking about writing this
>>> > result to HDFS and reading from disk for the next job, but am not sure if
>>> > that's a preferred way in Spark.
>>> >
>>>
>>> Are you looking for SparkContext.union() [1] ?
>>>
>>> This is not performing well with spark cassandra connector. I am not
>>> sure whether this will help you.
>>>
>>> Thanks and Regards
>>> Noorul
>>>
>>> [1]
>>> http://spark.apache.org/docs/1.3.0/api/scala/index.html#org.apache.spark.SparkContext
>>>
>>
>> --
>> Yang Chen
>> Dept. of CISE, University of Florida
>> Mail: y...@yang-cs.com
>> Web: www.cise.ufl.edu/~yang
>>
>

--
Yang Chen
Dept. of CISE, University of Florida
Mail: y...@yang-cs.com
Web: www.cise.ufl.edu/~yang