Re: Combining Many RDDs
Hi Kelvin,

Thank you, that works for me. I wrote my own joins that produce Scala collections, instead of using rdd.join.

Regards,
Yang

On Thu, Mar 26, 2015 at 5:51 PM, Kelvin Chu <2dot7kel...@gmail.com> wrote:

> Hi, I used union() before and yes it may be slow sometimes. [...]

--
Yang Chen
Dept. of CISE, University of Florida
Mail: y...@yang-cs.com
Web: www.cise.ufl.edu/~yang
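Yang does not show the joins he wrote, but one common way to get this effect is a map-side join against a broadcast copy of the external dataset. The sketch below only illustrates that idea and is not Yang's actual code; the types, keys, and dataset contents are invented for the example:

import org.apache.spark.{SparkConf, SparkContext}

object BroadcastJoinSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("broadcast-join-sketch"))

    // Hypothetical inputs: 0.5M keyed items plus a small shared external dataset.
    val items: Seq[(Int, String)] = (1 to 500000).map(i => (i % 1000, s"item-$i"))
    val external: Map[Int, String] = (0 until 1000).map(k => k -> s"ext-$k").toMap

    // Ship the external dataset to every executor once, instead of rdd.join,
    // which would shuffle both sides.
    val extBc = sc.broadcast(external)

    val joined = sc.parallelize(items).flatMap { case (key, value) =>
      // A plain Scala lookup per element: no shuffle, no per-item RDDs.
      extBc.value.get(key).map(ext => (key, value, ext)).toSeq
    }

    println(joined.count()) // 500000 here, since every key has a match
    sc.stop()
  }
}

This only works when the external dataset is small enough to fit in each executor's memory; otherwise a shuffle-based join is hard to avoid.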
Re: Combining Many RDDs
Yang Chen <y...@yang-cs.com> writes:

> Hi Noorul,
> Thank you for your suggestion. I tried that, but ran out of memory. I did
> some searching and found suggestions that we should avoid rdd.union
> (http://stackoverflow.com/questions/28343181/memory-efficient-way-of-union-a-sequence-of-rdds-from-files-in-apache-spark).
> I will try to come up with some other way.

I think you are using rdd.union(), but I was referring to SparkContext.union(). I am not sure how many RDDs you have, but I had no memory issues when I used it to combine 2000 RDDs. Having said that, I did hit other performance problems with the spark-cassandra-connector.

Thanks and Regards,
Noorul
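For readers following the thread, the distinction Noorul is drawing can be made concrete. A minimal sketch, assuming sc is an existing SparkContext; both calls yield the same elements but build very different lineages:

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

def combineBothWays(sc: SparkContext, rdds: Seq[RDD[Int]]): (RDD[Int], RDD[Int]) = {
  // RDD#union applied pairwise nests each UnionRDD inside the next,
  // so 2000 inputs produce a lineage roughly 2000 levels deep.
  val pairwise = rdds.reduce(_ union _)

  // SparkContext#union takes the whole sequence at once and builds a
  // single flat UnionRDD over all the inputs.
  val flat = sc.union(rdds)

  (pairwise, flat)
}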
Re: Combining Many RDDs
sparkx <y...@yang-cs.com> writes:

> [...] Is there a way to efficiently do this? I'm thinking about writing
> the result to HDFS and reading it back from disk for the next job, but I
> am not sure whether that is the preferred way in Spark.

Are you looking for SparkContext.union() [1]? It does not perform well with the spark-cassandra-connector, so I am not sure whether it will help you.

Thanks and Regards,
Noorul

[1] http://spark.apache.org/docs/1.3.0/api/scala/index.html#org.apache.spark.SparkContext
Combining Many RDDs
Hi,

I have a Spark job and a dataset of 0.5 million items. For each item, the job performs some computation (a join against a shared external dataset, if that matters) and produces an RDD containing 20-500 result items. I would now like to combine all of these RDDs and run a next job on the result. What I have found is that the computation itself is quite fast, but combining the RDDs takes much longer:

val result = data          // 0.5M data items
  .map(compute(_))         // produces an RDD - fast
  .reduce(_ ++ _)          // combining RDDs - slow

I have also tried collecting the results of compute(_) and using a flatMap, but that was also slow. Is there a way to do this efficiently? I'm thinking about writing the result to HDFS and reading it back from disk for the next job, but I am not sure whether that is the preferred way to do it in Spark.

Thank you.
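On the HDFS idea at the end of the question: one simple way to materialize an RDD between jobs is the object-file API pair. A minimal sketch, assuming result is an RDD of pairs; the path is hypothetical, and saveAsObjectFile uses Java serialization, which is convenient but not compact:

// Job 1: write the combined results to HDFS.
result.saveAsObjectFile("hdfs:///tmp/combined-results")

// Job 2: read them back; the element type is supplied by the caller.
val reloaded = sc.objectFile[(Int, String)]("hdfs:///tmp/combined-results")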
Re: Combining Many RDDs
Hi,

I have used union() before and, yes, it can be slow sometimes. I _guess_ your variable 'data' is a Scala collection and compute() returns an RDD. Right? If so, I used the approach below to operate on only one RDD during the whole computation (yes, I also found that too many RDDs hurt performance). Change compute() to return a Scala collection instead of an RDD:

val result = sc.parallelize(data)  // create and partition the 0.5M items in a single RDD
  .flatMap(compute(_))             // still only one RDD, with each item already joined with the external data

Hope this helps.
Kelvin

On Thu, Mar 26, 2015 at 2:37 PM, Yang Chen <y...@yang-cs.com> wrote:

> [...]
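To make "change compute() to return a Scala collection" concrete, here is one hypothetical before-and-after in spark-shell style; Item, Result, and the body of compute() are stand-ins, not the original code:

type Item = Int
type Result = String

// Before: compute() built one small RDD per input item - 0.5M RDDs in total.
// def compute(item: Item): RDD[Result]

// After: a plain Scala collection (the 20-500 results per item), so the
// job never creates per-item RDDs at all.
def compute(item: Item): Seq[Result] =
  (0 to item % 5).map(i => s"result-$item-$i")

// The whole computation then lives in a single RDD:
// val result = sc.parallelize(data).flatMap(compute)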
Re: Combining Many RDDs
RDD#union is not the same thing as SparkContext#union.

On Thu, Mar 26, 2015 at 2:27 PM, Yang Chen <y...@yang-cs.com> wrote:

> Hi Noorul,
> Thank you for your suggestion. I tried that, but ran out of memory. [...]
Re: Combining Many RDDs
Hi Mark,

That's true, but neither way lets me combine the RDDs, so I have to avoid unions altogether.

Thanks,
Yang

On Thu, Mar 26, 2015 at 5:31 PM, Mark Hamstra <m...@clearstorydata.com> wrote:

> RDD#union is not the same thing as SparkContext#union

--
Yang Chen
Dept. of CISE, University of Florida
Mail: y...@yang-cs.com
Web: www.cise.ufl.edu/~yang