Re: Combining Many RDDs

2015-03-27 Thread Yang Chen
Hi Kelvin,

Thank you. That works for me. I wrote my own joins that produced Scala
collections, instead of using rdd.join.
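
A minimal sketch of what such a hand-rolled join can look like, assuming the
shared external dataset is small enough to broadcast (externalData, the item
type, and the sample values below are placeholders, not details from the
original job):

import org.apache.spark.{SparkConf, SparkContext}

object LocalJoinSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("local-join-sketch"))

    // Hypothetical shared external dataset, broadcast to every executor once.
    val externalData: Map[String, Int] = Map("a" -> 1, "b" -> 2)
    val externalBc = sc.broadcast(externalData)

    // Stand-in for the 0.5M input items, kept in a single RDD from the start.
    val data = sc.parallelize(Seq("a", "b", "c"))

    // Each item is joined against the broadcast map locally and yields an
    // ordinary Scala collection, so no rdd.join and no per-item RDDs.
    val result = data.flatMap { item =>
      externalBc.value.get(item).map(v => (item, v)).toSeq
    }

    result.collect().foreach(println)
    sc.stop()
  }
}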

Regards,
Yang

On Thu, Mar 26, 2015 at 5:51 PM, Kelvin Chu 2dot7kel...@gmail.com wrote:

 Hi, I used union() before and yes, it can be slow sometimes. I _guess_ your
 variable 'data' is a Scala collection and compute() returns an RDD. Right?
 If yes, I tried the approach below to operate on only one RDD during the
 whole computation (yes, I also saw that too many RDDs hurt performance).

 Change compute() to return a Scala collection instead of an RDD.

 val result = sc.parallelize(data)   // Create and partition the 0.5M items in a single RDD.
   .flatMap(compute(_))              // Still only one RDD; each item is already joined with the external data.

 Hope this helps.

 Kelvin

 On Thu, Mar 26, 2015 at 2:37 PM, Yang Chen y...@yang-cs.com wrote:

 Hi Mark,

 That's true, but neither form of union lets me combine the RDDs, so I have
 to avoid unions.

 Thanks,
 Yang

 On Thu, Mar 26, 2015 at 5:31 PM, Mark Hamstra m...@clearstorydata.com
 wrote:

 RDD#union is not the same thing as SparkContext#union

 On Thu, Mar 26, 2015 at 2:27 PM, Yang Chen y...@yang-cs.com wrote:

 Hi Noorul,

 Thank you for your suggestion. I tried that, but ran out of memory. I did
 some searching and found suggestions that we should avoid rdd.union (
 http://stackoverflow.com/questions/28343181/memory-efficient-way-of-union-a-sequence-of-rdds-from-files-in-apache-spark
 ).
 I will try to come up with another approach.

 Thank you,
 Yang

 On Thu, Mar 26, 2015 at 1:13 PM, Noorul Islam K M noo...@noorul.com
 wrote:

 sparkx y...@yang-cs.com writes:

  Hi,

  I have a Spark job and a dataset of 0.5 million items. For each item, the
  job performs some computation (joining against a shared external dataset,
  if that matters) and produces an RDD containing 20-500 result items. Now I
  would like to combine all these RDDs and run a follow-up job on the
  combined result. What I have found is that the computation itself is quite
  fast, but combining these RDDs takes much longer.

  val result = data        // 0.5M data items
    .map(compute(_))       // Produces an RDD - fast
    .reduce(_ ++ _)        // Combining RDDs - slow

  I have also tried collecting the results from compute(_) and using a
  flatMap, but that is also slow.

  Is there a way to do this efficiently? I'm thinking about writing the
  result to HDFS and reading it back from disk for the next job, but I am
  not sure whether that is a preferred way in Spark.
 

 Are you looking for SparkContext.union() [1]?

 This did not perform well for me with the Spark Cassandra connector, so I
 am not sure whether it will help you.

 Thanks and Regards
 Noorul

 [1]
 http://spark.apache.org/docs/1.3.0/api/scala/index.html#org.apache.spark.SparkContext




 --
 Yang Chen
 Dept. of CISE, University of Florida
 Mail: y...@yang-cs.com
 Web: www.cise.ufl.edu/~yang





 --
 Yang Chen
 Dept. of CISE, University of Florida
 Mail: y...@yang-cs.com
 Web: www.cise.ufl.edu/~yang





-- 
Yang Chen
Dept. of CISE, University of Florida
Mail: y...@yang-cs.com
Web: www.cise.ufl.edu/~yang


Re: Combining Many RDDs

2015-03-26 Thread Noorul Islam K M
Yang Chen y...@yang-cs.com writes:

 Hi Noorul,

 Thank you for your suggestion. I tried that, but ran out of memory. I did
 some searching and found suggestions that we should avoid rdd.union (
 http://stackoverflow.com/questions/28343181/memory-efficient-way-of-union-a-sequence-of-rdds-from-files-in-apache-spark
 ).
 I will try to come up with another approach.


I think you are using rdd.union(), but I was referring to
SparkContext.union(). I am not sure how many RDDs you have, but I had no
memory issues when I used it to combine 2000 RDDs. Having said that, I did
have other performance issues with the Spark Cassandra connector.
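
To make the distinction concrete, a rough sketch; the number of RDDs and
their contents below are illustrative only:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object UnionContrastSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("union-contrast"))

    // Many small RDDs, standing in for the per-item results.
    val rdds: Seq[RDD[Int]] = (1 to 2000).map(i => sc.parallelize(Seq(i)))

    // rdd.union chained through reduce nests one UnionRDD inside another,
    // giving a lineage roughly 2000 levels deep; this is the slow, memory-hungry path.
    val chained: RDD[Int] = rdds.reduce(_ union _)

    // SparkContext.union builds a single UnionRDD over all the parents at once.
    val flat: RDD[Int] = sc.union(rdds)

    println(s"chained: ${chained.count()}, flat: ${flat.count()}")
    sc.stop()
  }
}

Either way the data ends up in one RDD; the difference is that the single
flat union avoids building thousands of nested RDD objects and a very deep
lineage on the driver.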

Thanks and Regards
Noorul


 On Thu, Mar 26, 2015 at 1:13 PM, Noorul Islam K M noo...@noorul.com wrote:

 sparkx y...@yang-cs.com writes:

  Hi,

  I have a Spark job and a dataset of 0.5 million items. For each item, the
  job performs some computation (joining against a shared external dataset,
  if that matters) and produces an RDD containing 20-500 result items. Now I
  would like to combine all these RDDs and run a follow-up job on the
  combined result. What I have found is that the computation itself is quite
  fast, but combining these RDDs takes much longer.

  val result = data        // 0.5M data items
    .map(compute(_))       // Produces an RDD - fast
    .reduce(_ ++ _)        // Combining RDDs - slow

  I have also tried collecting the results from compute(_) and using a
  flatMap, but that is also slow.

  Is there a way to do this efficiently? I'm thinking about writing the
  result to HDFS and reading it back from disk for the next job, but I am
  not sure whether that is a preferred way in Spark.
 

 Are you looking for SparkContext.union() [1]?

 This did not perform well for me with the Spark Cassandra connector, so I
 am not sure whether it will help you.

 Thanks and Regards
 Noorul

 [1]
 http://spark.apache.org/docs/1.3.0/api/scala/index.html#org.apache.spark.SparkContext





Re: Combining Many RDDs

2015-03-26 Thread Noorul Islam K M
sparkx y...@yang-cs.com writes:

 Hi,

 I have a Spark job and a dataset of 0.5 million items. For each item, the
 job performs some computation (joining against a shared external dataset,
 if that matters) and produces an RDD containing 20-500 result items. Now I
 would like to combine all these RDDs and run a follow-up job on the
 combined result. What I have found is that the computation itself is quite
 fast, but combining these RDDs takes much longer.

 val result = data        // 0.5M data items
   .map(compute(_))       // Produces an RDD - fast
   .reduce(_ ++ _)        // Combining RDDs - slow

 I have also tried collecting the results from compute(_) and using a
 flatMap, but that is also slow.

 Is there a way to do this efficiently? I'm thinking about writing the
 result to HDFS and reading it back from disk for the next job, but I am
 not sure whether that is a preferred way in Spark.


Are you looking for SparkContext.union() [1]?

This did not perform well for me with the Spark Cassandra connector, so I
am not sure whether it will help you.

Thanks and Regards
Noorul

[1] 
http://spark.apache.org/docs/1.3.0/api/scala/index.html#org.apache.spark.SparkContext




Combining Many RDDs

2015-03-26 Thread sparkx
Hi,

I have a Spark job and a dataset of 0.5 million items. For each item, the
job performs some computation (joining against a shared external dataset,
if that matters) and produces an RDD containing 20-500 result items. Now I
would like to combine all these RDDs and run a follow-up job on the
combined result. What I have found is that the computation itself is quite
fast, but combining these RDDs takes much longer.

val result = data        // 0.5M data items
  .map(compute(_))       // Produces an RDD - fast
  .reduce(_ ++ _)        // Combining RDDs - slow

I have also tried collecting the results from compute(_) and using a
flatMap, but that is also slow.

Is there a way to do this efficiently? I'm thinking about writing the
result to HDFS and reading it back from disk for the next job, but I am
not sure whether that is a preferred way in Spark.
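
For the HDFS route mentioned above, a minimal sketch of the round trip (the
path and the element type are placeholders; saveAsObjectFile uses Java
serialization, so the result items must be serializable):

import org.apache.spark.{SparkConf, SparkContext}

object HdfsRoundTripSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("hdfs-round-trip"))

    // Stand-in for the combined result of the first job.
    val result = sc.parallelize(Seq(("a", 1), ("b", 2)))

    // Persist the intermediate result to HDFS (placeholder path).
    result.saveAsObjectFile("hdfs:///tmp/intermediate-result")

    // The next job reads it back from disk instead of recomputing it.
    val reloaded = sc.objectFile[(String, Int)]("hdfs:///tmp/intermediate-result")
    println(reloaded.count())

    sc.stop()
  }
}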

Thank you.







Re: Combining Many RDDs

2015-03-26 Thread Kelvin Chu
Hi, I used union() before and yes, it can be slow sometimes. I _guess_ your
variable 'data' is a Scala collection and compute() returns an RDD. Right?
If yes, I tried the approach below to operate on only one RDD during the
whole computation (yes, I also saw that too many RDDs hurt performance).

Change compute() to return a Scala collection instead of an RDD.

val result = sc.parallelize(data)   // Create and partition the 0.5M items in a single RDD.
  .flatMap(compute(_))              // Still only one RDD; each item is already joined with the external data.

Hope this helps.

Kelvin

On Thu, Mar 26, 2015 at 2:37 PM, Yang Chen y...@yang-cs.com wrote:

 Hi Mark,

 That's true, but neither form of union lets me combine the RDDs, so I have
 to avoid unions.

 Thanks,
 Yang

 On Thu, Mar 26, 2015 at 5:31 PM, Mark Hamstra m...@clearstorydata.com
 wrote:

 RDD#union is not the same thing as SparkContext#union

 On Thu, Mar 26, 2015 at 2:27 PM, Yang Chen y...@yang-cs.com wrote:

 Hi Noorul,

 Thank you for your suggestion. I tried that, but ran out of memory. I did
 some searching and found suggestions that we should avoid rdd.union (
 http://stackoverflow.com/questions/28343181/memory-efficient-way-of-union-a-sequence-of-rdds-from-files-in-apache-spark
 ).
 I will try to come up with another approach.

 Thank you,
 Yang

 On Thu, Mar 26, 2015 at 1:13 PM, Noorul Islam K M noo...@noorul.com
 wrote:

 sparkx y...@yang-cs.com writes:

  Hi,

  I have a Spark job and a dataset of 0.5 million items. For each item, the
  job performs some computation (joining against a shared external dataset,
  if that matters) and produces an RDD containing 20-500 result items. Now I
  would like to combine all these RDDs and run a follow-up job on the
  combined result. What I have found is that the computation itself is quite
  fast, but combining these RDDs takes much longer.

  val result = data        // 0.5M data items
    .map(compute(_))       // Produces an RDD - fast
    .reduce(_ ++ _)        // Combining RDDs - slow

  I have also tried collecting the results from compute(_) and using a
  flatMap, but that is also slow.

  Is there a way to do this efficiently? I'm thinking about writing the
  result to HDFS and reading it back from disk for the next job, but I am
  not sure whether that is a preferred way in Spark.
 

 Are you looking for SparkContext.union() [1]?

 This did not perform well for me with the Spark Cassandra connector, so I
 am not sure whether it will help you.

 Thanks and Regards
 Noorul

 [1]
 http://spark.apache.org/docs/1.3.0/api/scala/index.html#org.apache.spark.SparkContext




 --
 Yang Chen
 Dept. of CISE, University of Florida
 Mail: y...@yang-cs.com
 Web: www.cise.ufl.edu/~yang





 --
 Yang Chen
 Dept. of CISE, University of Florida
 Mail: y...@yang-cs.com
 Web: www.cise.ufl.edu/~yang



Re: Combining Many RDDs

2015-03-26 Thread Mark Hamstra
RDD#union is not the same thing as SparkContext#union

On Thu, Mar 26, 2015 at 2:27 PM, Yang Chen y...@yang-cs.com wrote:

 Hi Noorul,

 Thank you for your suggestion. I tried that, but ran out of memory. I did
 some searching and found suggestions that we should avoid rdd.union (
 http://stackoverflow.com/questions/28343181/memory-efficient-way-of-union-a-sequence-of-rdds-from-files-in-apache-spark
 ).
 I will try to come up with another approach.

 Thank you,
 Yang

 On Thu, Mar 26, 2015 at 1:13 PM, Noorul Islam K M noo...@noorul.com
 wrote:

 sparkx y...@yang-cs.com writes:

  Hi,

  I have a Spark job and a dataset of 0.5 million items. For each item, the
  job performs some computation (joining against a shared external dataset,
  if that matters) and produces an RDD containing 20-500 result items. Now I
  would like to combine all these RDDs and run a follow-up job on the
  combined result. What I have found is that the computation itself is quite
  fast, but combining these RDDs takes much longer.

  val result = data        // 0.5M data items
    .map(compute(_))       // Produces an RDD - fast
    .reduce(_ ++ _)        // Combining RDDs - slow

  I have also tried collecting the results from compute(_) and using a
  flatMap, but that is also slow.

  Is there a way to do this efficiently? I'm thinking about writing the
  result to HDFS and reading it back from disk for the next job, but I am
  not sure whether that is a preferred way in Spark.
 

 Are you looking for SparkContext.union() [1]?

 This did not perform well for me with the Spark Cassandra connector, so I
 am not sure whether it will help you.

 Thanks and Regards
 Noorul

 [1]
 http://spark.apache.org/docs/1.3.0/api/scala/index.html#org.apache.spark.SparkContext




 --
 Yang Chen
 Dept. of CISE, University of Florida
 Mail: y...@yang-cs.com
 Web: www.cise.ufl.edu/~yang



Re: Combining Many RDDs

2015-03-26 Thread Yang Chen
Hi Mark,

That's true, but neither form of union lets me combine the RDDs, so I have
to avoid unions.

Thanks,
Yang

On Thu, Mar 26, 2015 at 5:31 PM, Mark Hamstra m...@clearstorydata.com
wrote:

 RDD#union is not the same thing as SparkContext#union

 On Thu, Mar 26, 2015 at 2:27 PM, Yang Chen y...@yang-cs.com wrote:

 Hi Noorul,

 Thank you for your suggestion. I tried that, but ran out of memory. I did
 some searching and found suggestions that we should avoid rdd.union (
 http://stackoverflow.com/questions/28343181/memory-efficient-way-of-union-a-sequence-of-rdds-from-files-in-apache-spark
 ).
 I will try to come up with another approach.

 Thank you,
 Yang

 On Thu, Mar 26, 2015 at 1:13 PM, Noorul Islam K M noo...@noorul.com
 wrote:

 sparkx y...@yang-cs.com writes:

  Hi,

  I have a Spark job and a dataset of 0.5 million items. For each item, the
  job performs some computation (joining against a shared external dataset,
  if that matters) and produces an RDD containing 20-500 result items. Now I
  would like to combine all these RDDs and run a follow-up job on the
  combined result. What I have found is that the computation itself is quite
  fast, but combining these RDDs takes much longer.

  val result = data        // 0.5M data items
    .map(compute(_))       // Produces an RDD - fast
    .reduce(_ ++ _)        // Combining RDDs - slow

  I have also tried collecting the results from compute(_) and using a
  flatMap, but that is also slow.

  Is there a way to do this efficiently? I'm thinking about writing the
  result to HDFS and reading it back from disk for the next job, but I am
  not sure whether that is a preferred way in Spark.
 

 Are you looking for SparkContext.union() [1]?

 This did not perform well for me with the Spark Cassandra connector, so I
 am not sure whether it will help you.

 Thanks and Regards
 Noorul

 [1]
 http://spark.apache.org/docs/1.3.0/api/scala/index.html#org.apache.spark.SparkContext




 --
 Yang Chen
 Dept. of CISE, University of Florida
 Mail: y...@yang-cs.com
 Web: www.cise.ufl.edu/~yang





-- 
Yang Chen
Dept. of CISE, University of Florida
Mail: y...@yang-cs.com
Web: www.cise.ufl.edu/~yang