Hi Noorul,

Thank you for your suggestion. I tried that, but ran out of memory. I did
some searching and found suggestions that rdd.union should be avoided (
http://stackoverflow.com/questions/28343181/memory-efficient-way-of-union-a-sequence-of-rdds-from-files-in-apache-spark
).
I will try to come up with some other approach.
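One idea I am thinking about (just a rough, untested sketch): if compute(_)
could return a plain local collection instead of an RDD, the combine step
would become a single flatMap over the distributed dataset, so there would be
no RDDs to union and no driver-side collect. The names below (compute, the
sample data, the HDFS path) are only placeholders:

    import org.apache.spark.{SparkConf, SparkContext}

    object CombineResults {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("combine-results"))

        // Placeholder input standing in for the 0.5M data items
        val data = sc.parallelize(1 to 500000)

        // Hypothetical compute that returns a small local collection
        // (20-500 items) instead of an RDD
        def compute(item: Int): Seq[String] = Seq(s"result-$item")

        // One distributed pass; Spark flattens the per-item results,
        // so no per-item RDDs are ever created or unioned
        val result = data.flatMap(compute)

        // Persist for the next job; the path here is made up
        result.saveAsTextFile("hdfs:///tmp/combined-results")

        sc.stop()
      }
    }

I am not sure yet whether compute can be rewritten that way, since it joins
the shared external dataset, but I will look into it.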

Thank you,
Yang

On Thu, Mar 26, 2015 at 1:13 PM, Noorul Islam K M <noo...@noorul.com> wrote:

> sparkx <y...@yang-cs.com> writes:
>
> > Hi,
> >
> > I have a Spark job and a dataset of 0.5 Million items. Each item performs
> > some sort of computation (joining a shared external dataset, if that does
> > matter) and produces an RDD containing 20-500 result items. Now I would like
> > to combine all these RDDs and perform a next job. What I have found out is
> > that the computation itself is quite fast, but combining these RDDs takes
> > much longer time.
> >
> >     val result = data        // 0.5M data items
> >       .map(compute(_))   // Produces an RDD - fast
> >       .reduce(_ ++ _)      // Combining RDDs - slow
> >
> > I have also tried to collect results from compute(_) and use a flatMap, but
> > that is also slow.
> >
> > Is there a way to efficiently do this? I'm thinking about writing this
> > result to HDFS and reading from disk for the next job, but am not sure if
> > that's a preferred way in Spark.
> >
>
> Are you looking for SparkContext.union() [1] ?
>
> This is not performing well with the Spark Cassandra connector, so I am
> not sure whether it will help you.
>
> Thanks and Regards
> Noorul
>
> [1]
> http://spark.apache.org/docs/1.3.0/api/scala/index.html#org.apache.spark.SparkContext
>



-- 
Yang Chen
Dept. of CISE, University of Florida
Mail: y...@yang-cs.com
Web: www.cise.ufl.edu/~yang
