Re: Combining Many RDDs
Hi Kelvin,

Thank you, that works for me. I wrote my own joins that produce Scala collections, instead of using rdd.join.

Regards,
Yang

On Thu, Mar 26, 2015 at 5:51 PM, Kelvin Chu <2dot7kel...@gmail.com> wrote:

> Hi, I used union() before and yes it may be slow sometimes. [...]

--
Yang Chen
Dept. of CISE, University of Florida
Mail: y...@yang-cs.com
Web: www.cise.ufl.edu/~yang
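Yang does not show the joins he wrote, but one common way to get this effect is a map-side join against a broadcast copy of the external dataset. The sketch below only illustrates that idea and is not Yang's actual code; the types, keys, and dataset contents are invented for the example:

import org.apache.spark.{SparkConf, SparkContext}

object BroadcastJoinSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("broadcast-join-sketch"))

    // Hypothetical inputs: 0.5M keyed items plus a small shared external dataset.
    val items: Seq[(Int, String)] = (1 to 500000).map(i => (i % 1000, s"item-$i"))
    val external: Map[Int, String] = (0 until 1000).map(k => k -> s"ext-$k").toMap

    // Ship the external dataset to every executor once, instead of rdd.join,
    // which would shuffle both sides.
    val extBc = sc.broadcast(external)

    val joined = sc.parallelize(items).flatMap { case (key, value) =>
      // A plain Scala lookup per element: no shuffle, no per-item RDDs.
      extBc.value.get(key).map(ext => (key, value, ext)).toSeq
    }

    println(joined.count()) // 500000 here, since every key has a match
    sc.stop()
  }
}

This only works when the external dataset is small enough to fit in each executor's memory; otherwise a shuffle-based join is hard to avoid.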
Re: Combining Many RDDs
Yang Chen <y...@yang-cs.com> writes:

> Hi Noorul,
> Thank you for your suggestion. I tried that, but ran out of memory. I did
> some searching and found suggestions that we should avoid rdd.union
> (http://stackoverflow.com/questions/28343181/memory-efficient-way-of-union-a-sequence-of-rdds-from-files-in-apache-spark).
> I will try to come up with some other way.

I think you are using rdd.union(), but I was referring to SparkContext.union(). I am not sure how many RDDs you have, but I had no memory issues when I used it to combine 2000 RDDs. Having said that, I did hit other performance problems with the spark-cassandra-connector.

Thanks and Regards,
Noorul
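For readers following the thread, the distinction Noorul is drawing can be made concrete. A minimal sketch, assuming sc is an existing SparkContext; both calls yield the same elements but build very different lineages:

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

def combineBothWays(sc: SparkContext, rdds: Seq[RDD[Int]]): (RDD[Int], RDD[Int]) = {
  // RDD#union applied pairwise nests each UnionRDD inside the next,
  // so 2000 inputs produce a lineage roughly 2000 levels deep.
  val pairwise = rdds.reduce(_ union _)

  // SparkContext#union takes the whole sequence at once and builds a
  // single flat UnionRDD over all the inputs.
  val flat = sc.union(rdds)

  (pairwise, flat)
}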
Re: Combining Many RDDs
sparkx <y...@yang-cs.com> writes:

> [...] Is there a way to efficiently do this? I'm thinking about writing
> the result to HDFS and reading it back from disk for the next job, but I
> am not sure whether that is the preferred way in Spark.

Are you looking for SparkContext.union() [1]? It does not perform well with the spark-cassandra-connector, so I am not sure whether it will help you.

Thanks and Regards,
Noorul

[1] http://spark.apache.org/docs/1.3.0/api/scala/index.html#org.apache.spark.SparkContext
Combining Many RDDs
Hi,

I have a Spark job and a dataset of 0.5 million items. For each item, the job performs some computation (a join against a shared external dataset, if that matters) and produces an RDD containing 20-500 result items. I would now like to combine all of these RDDs and run a next job on the result. What I have found is that the computation itself is quite fast, but combining the RDDs takes much longer:

val result = data          // 0.5M data items
  .map(compute(_))         // produces an RDD - fast
  .reduce(_ ++ _)          // combining RDDs - slow

I have also tried collecting the results of compute(_) and using a flatMap, but that was also slow. Is there a way to do this efficiently? I'm thinking about writing the result to HDFS and reading it back from disk for the next job, but I am not sure whether that is the preferred way to do it in Spark.

Thank you.
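On the HDFS idea at the end of the question: one simple way to materialize an RDD between jobs is the object-file API pair. A minimal sketch, assuming result is an RDD of pairs; the path is hypothetical, and saveAsObjectFile uses Java serialization, which is convenient but not compact:

// Job 1: write the combined results to HDFS.
result.saveAsObjectFile("hdfs:///tmp/combined-results")

// Job 2: read them back; the element type is supplied by the caller.
val reloaded = sc.objectFile[(Int, String)]("hdfs:///tmp/combined-results")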
Re: Combining Many RDDs
Hi,

I have used union() before and, yes, it can be slow sometimes. I _guess_ your variable 'data' is a Scala collection and compute() returns an RDD. Right? If so, I used the approach below to operate on only one RDD during the whole computation (yes, I also found that too many RDDs hurt performance). Change compute() to return a Scala collection instead of an RDD:

val result = sc.parallelize(data)  // create and partition the 0.5M items in a single RDD
  .flatMap(compute(_))             // still only one RDD, with each item already joined with the external data

Hope this helps.
Kelvin

On Thu, Mar 26, 2015 at 2:37 PM, Yang Chen <y...@yang-cs.com> wrote:

> [...]
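To make "change compute() to return a Scala collection" concrete, here is one hypothetical before-and-after in spark-shell style; Item, Result, and the body of compute() are stand-ins, not the original code:

type Item = Int
type Result = String

// Before: compute() built one small RDD per input item - 0.5M RDDs in total.
// def compute(item: Item): RDD[Result]

// After: a plain Scala collection (the 20-500 results per item), so the
// job never creates per-item RDDs at all.
def compute(item: Item): Seq[Result] =
  (0 to item % 5).map(i => s"result-$item-$i")

// The whole computation then lives in a single RDD:
// val result = sc.parallelize(data).flatMap(compute)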
Re: Combining Many RDDs
RDD#union is not the same thing as SparkContext#union.

On Thu, Mar 26, 2015 at 2:27 PM, Yang Chen <y...@yang-cs.com> wrote:

> Hi Noorul,
> Thank you for your suggestion. I tried that, but ran out of memory. [...]
Re: Combining Many RDDs
Hi Mark,

That's true, but neither way lets me combine the RDDs, so I have to avoid unions altogether.

Thanks,
Yang

On Thu, Mar 26, 2015 at 5:31 PM, Mark Hamstra <m...@clearstorydata.com> wrote:

> RDD#union is not the same thing as SparkContext#union

--
Yang Chen
Dept. of CISE, University of Florida
Mail: y...@yang-cs.com
Web: www.cise.ufl.edu/~yang