Re: How to run two operations on the same RDD simultaneously

2015-11-25 Thread Jay Luan
Ah, thank you so much, this is perfect.



Re: How to run two operations on the same RDD simultaneously

2015-11-20 Thread Ali Tajeldin EDU
You can try using an Accumulator
(http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.Accumulator)
to keep a count in map1. Note that the final count may be higher than the
number of records if any tasks were retried or re-executed along the way,
because accumulator updates made inside a transformation are applied again
each time the task runs.
--
Ali
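
A minimal sketch of the accumulator approach suggested above, written against
the Spark 1.x Scala API that was current for this thread (sc.accumulator is
deprecated from Spark 2.0 onward in favor of sc.longAccumulator). The object
name, the mapAndCount helper, the accumulator label, the local master setting,
and the two map functions are all assumptions for illustration; they are not
from the original messages.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CountWhileMapping {

  // Hypothetical helper: runs map1 => map2 => collect in one job and
  // returns the collected results together with the accumulator's value.
  def mapAndCount(sc: SparkContext, data: Seq[Int]): (Array[Int], Long) = {
    // Spark 1.x accumulator; the name only labels it in the web UI.
    val recordCount = sc.accumulator(0L, "records seen in map1")

    val result = sc.parallelize(data)
      .map { x => recordCount += 1L; x * 2 } // "map1": count as a side effect
      .map(_ + 1)                            // "map2"
      .collect()                             // the single action that runs the job

    // Read the value on the driver only after the action has finished.
    (result, recordCount.value)
  }

  def main(args: Array[String]): Unit = {
    // Local master and app name are placeholders for a quick trial run.
    val conf = new SparkConf().setAppName("count-while-mapping").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val (result, seen) = mapAndCount(sc, 1 to 1000)
      println(s"collected ${result.length} records, accumulator saw $seen")
    } finally {
      sc.stop()
    }
  }
}
```

The accumulator is only populated once an action (collect here) has run, so
reading it before that returns the initial value. As noted above, the value
can exceed the true record count if tasks are retried, so treat it as an
upper bound unless the job ran without re-execution.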




How to run two operations on the same RDD simultaneously

2015-11-20 Thread jluan
As far as I understand, operations on RDDs usually come in the form

rdd => map1 => map2 => map2 => (maybe collect)

If I would also like to count my RDD, is there any way I could include this
at map1, so that as Spark runs through map1 it also does a count? Or would
count need to be a separate operation, such that I would have to run through
my dataset again? My dataset is really memory-intensive, so I'd rather not
cache() it if possible.




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-run-two-operations-on-the-same-RDD-simultaneously-tp25441.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org