Hi Spark devs,
I'm coding a spark job and at a certain point in execution I need to send
some data present in an RDD to an external system.
val myRdd =
myRdd.foreach { record =>
sendToWhtv(record)
}
The thing is that foreach forces materialization of the RDD and it seems to
be executed on the driver program.
*The thing is that foreach forces materialization of the RDD and it seems
to be executed on the driver program*
What makes you think that? No, foreach is run in the executors
(distributed) and not in the driver.
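
For illustration, here is a minimal sketch of sending records from the executors. It swaps in foreachPartition so that one connection is opened per partition instead of one per record. ExternalClient and its send/close methods are hypothetical stand-ins for whatever client the external system provides, and local[2] is only assumed so the sketch can run on one machine.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical stand-in for the external system's client; not a real library.
class ExternalClient {
  def send(record: String): Unit = println(s"sent: $record")
  def close(): Unit = ()
}

object ForeachSketch {
  def main(args: Array[String]): Unit = {
    // local[2] runs the executors as threads inside this JVM, for illustration only.
    val conf = new SparkConf().setAppName("foreach-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val myRdd = sc.parallelize(Seq("a", "b", "c", "d"), numSlices = 2)

      // This closure is shipped to the executors and runs once per partition.
      // The client is created inside it, so it never has to be serialized.
      myRdd.foreachPartition { records =>
        val client = new ExternalClient
        try records.foreach(client.send)
        finally client.close()
      }
    } finally {
      sc.stop()
    }
  }
}
```

On a real cluster the println output would show up in the executor logs rather than on the driver console, which is one way to check where the closure runs.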
2015-07-02 18:32 GMT+02:00 Alexandre Rodrigues
alex.jose.rodrig...@gmail.com:
Hi
?
From: Alexandre Rodrigues
Date: Thursday, July 2, 2015 at 12:32 PM
To: user@spark.apache.org
Subject: Fwd: map vs foreach for sending data to external system
Foreach is listed as an action[1]. I guess an *action* just means that it
forces materialization of the RDD.
I just noticed much faster executions with map although I don't like the
map approach. I'll look at it with new eyes if foreach is the way to go.
[1] –
Heh, an action, or materialization, means that it will trigger the
computation over the RDD. A transformation like map means that it will
create the transformation chain that must be applied to the data, but it is
not actually executed. It is executed only when an action is triggered over
that
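
A small sketch of the difference between a transformation and an action, assuming a local master and a toy in-memory RDD in place of the text file from the question. The object name and the println calls are only there for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LazyMapSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("lazy-map-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val lines = sc.parallelize(Seq("1", "2", "3"))

      // Transformation: only records the step in the lineage; nothing is parsed yet.
      val parsed = lines.map { line =>
        println(s"parsing $line") // runs only once an action is triggered
        line.toInt
      }

      // Action: triggers the whole chain, so the parsing happens here.
      val total = parsed.count()
      println(s"count = $total")

      // A second action recomputes the chain unless the RDD was cached,
      // so the map function runs again for every record.
      parsed.foreach(n => println(s"value $n"))
    } finally {
      sc.stop()
    }
  }
}
```

This may also explain the timing difference: a map with no action after it does no work at all, and a map with side effects (such as sending to an external system) repeats those side effects each time a later action recomputes it.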
What I'm doing in the RDD is parsing a text file and sending things to the
external system. I guess that it does that immediately when the action
(count) is triggered instead of being a two step process.
So I guess I should have parsing logic + sending to external system inside
the foreach (with