Fwd: The life cycle of a ShuffleDependency
Hi, I'm learning Spark and wondering when shuffle data gets deleted. I found the ContextCleaner class, which cleans up shuffle data when the corresponding ShuffleDependency is garbage-collected. From the source code, it looks to me as if the shuffle dependency is GC-ed only after the active job finishes, but I'm not sure. Could you explain the life cycle of a ShuffleDependency?
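For context, here is a rough sketch of the lifetime in question. The class and method names follow the Spark source, but the mechanics are simplified and the timing claims are my reading, not authoritative: a ShuffleDependency is created when a wide transformation such as reduceByKey builds a ShuffledRDD; ContextCleaner keeps only a weak reference to the dependency, so cleanup is driven by driver-side reachability and JVM GC, not directly by job completion.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Minimal driver sketch (assumes a local Spark setup) illustrating when
// the ShuffleDependency becomes unreachable. Illustrative only.
object ShuffleLifetime {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("shuffle-lifetime"))

    def runJob(): Unit = {
      // reduceByKey builds a ShuffledRDD holding a ShuffleDependency;
      // ContextCleaner registers that dependency via a weak reference.
      val counts = sc.parallelize(1 to 1000).map(x => (x % 10, 1)).reduceByKey(_ + _)
      counts.count() // Runs the job; shuffle files stay around for reuse.
    }

    runJob()
    // After runJob() returns, the driver holds no reference to the
    // ShuffledRDD, so the ShuffleDependency is eligible for GC. When a GC
    // actually runs, the cleaner's reference queue fires and (per my reading
    // of ContextCleaner.doCleanupShuffle) the map-output tracker entries are
    // dropped and executors are asked to delete the shuffle files.
    System.gc() // Forcing GC here only to make the cleanup observable.
    sc.stop()
  }
}
```

So the dependency is typically unreachable once no live RDD in the driver refers to it; a finished job is the common case of that, but holding the RDD in a variable keeps the shuffle data alive across jobs.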
Re: Combining Many RDDs
Hi Kelvin,

Thank you. That works for me. I wrote my own joins that produce Scala collections, instead of using rdd.join.

Regards,
Yang

On Thu, Mar 26, 2015 at 5:51 PM, Kelvin Chu <2dot7kel...@gmail.com> wrote:

> Hi, I used union() before, and yes, it can be slow. I _guess_ your variable
> 'data' is a Scala collection and compute() returns an RDD. Right? If so, I
> tried the approach below to operate on a single RDD during the whole
> computation (yes, I also saw that too many RDDs hurt performance).
>
> Change compute() to return a Scala collection instead of an RDD:
>
>     val result = sc.parallelize(data) // Create and partition the 0.5M items in a single RDD
>       .flatMap(compute(_))            // Still one RDD, with each item already joined with the external data
>
> Hope this helps.
>
> Kelvin
>
> On Thu, Mar 26, 2015 at 2:37 PM, Yang Chen wrote:
>
>> Hi Mark,
>>
>> That's true, but either way I cannot combine the RDDs, so I have to
>> avoid unions.
>>
>> Thanks,
>> Yang
>>
>> On Thu, Mar 26, 2015 at 5:31 PM, Mark Hamstra wrote:
>>
>>> RDD#union is not the same thing as SparkContext#union.
>>>
>>> On Thu, Mar 26, 2015 at 2:27 PM, Yang Chen wrote:
>>>
>>>> Hi Noorul,
>>>>
>>>> Thank you for your suggestion. I tried that, but ran out of memory. I
>>>> did some searching and found suggestions that we should avoid rdd.union
>>>> (http://stackoverflow.com/questions/28343181/memory-efficient-way-of-union-a-sequence-of-rdds-from-files-in-apache-spark).
>>>> I will try to come up with some other approach.
>>>>
>>>> Thank you,
>>>> Yang
>>>>
>>>> On Thu, Mar 26, 2015 at 1:13 PM, Noorul Islam K M wrote:
>>>>
>>>>> sparkx writes:
>>>>>
>>>>> > Hi,
>>>>> >
>>>>> > I have a Spark job and a dataset of 0.5 million items. Each item
>>>>> > performs some sort of computation (joining a shared external dataset,
>>>>> > if that matters) and produces an RDD containing 20-500 result items.
>>>>> > Now I would like to combine all these RDDs and run a next job. What I
>>>>> > have found is that the computation itself is quite fast, but
>>>>> > combining these RDDs takes much longer.
>>>>> >
>>>>> >     val result = data   // 0.5M data items
>>>>> >       .map(compute(_))  // Produces an RDD - fast
>>>>> >       .reduce(_ ++ _)   // Combining RDDs - slow
>>>>> >
>>>>> > I have also tried to collect the results from compute(_) and use a
>>>>> > flatMap, but that is also slow.
>>>>> >
>>>>> > Is there a way to do this efficiently? I'm thinking about writing the
>>>>> > result to HDFS and reading it from disk for the next job, but I'm not
>>>>> > sure whether that's a preferred way in Spark.
>>>>>
>>>>> Are you looking for SparkContext.union() [1]?
>>>>>
>>>>> This does not perform well with the Spark Cassandra connector, so I am
>>>>> not sure whether it will help you.
>>>>>
>>>>> Thanks and Regards,
>>>>> Noorul
>>>>>
>>>>> [1] http://spark.apache.org/docs/1.3.0/api/scala/index.html#org.apache.spark.SparkContext

--
Yang Chen
Dept. of CISE, University of Florida
Mail: y...@yang-cs.com
Web: www.cise.ufl.edu/~yang
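The two fixes discussed in this thread can be summarized in one self-contained sketch. The dataset and compute() below are made-up stand-ins for illustration: keeping the whole computation inside a single RDD via flatMap avoids materializing 0.5M tiny RDDs, and if separate RDDs are unavoidable, SparkContext#union builds one UnionRDD in a single step instead of the deep pairwise chain produced by reduce(_ ++ _).

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Sketch of the approaches from this thread; compute() and the data are
// hypothetical placeholders, not the original poster's code.
object CombineResults {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("combine-results"))

    val data: Seq[Int] = 1 to 500000 // stand-in for the 0.5M items
    def compute(item: Int): Seq[Int] = // stand-in: returns a plain Scala
      Seq(item, -item)                 // collection, not an RDD

    // Kelvin's suggestion: one RDD for the entire computation.
    val combined: RDD[Int] = sc.parallelize(data).flatMap(compute(_))
    println(combined.count())

    // If compute() really must return RDDs, prefer SparkContext#union over
    // reduce(_ ++ _) (which is RDD#union applied pairwise):
    //   val rdds: Seq[RDD[Int]] = ...
    //   val combined2: RDD[Int] = sc.union(rdds)
    // Note Mark's point: RDD#union and SparkContext#union are different APIs.

    sc.stop()
  }
}
```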