Hello everyone,

I am wondering what the effect of serialization is within a stage.

My understanding of Spark as an execution engine is that the data flow
graph is divided into stages, and a new stage always starts after an
operation/transformation that cannot be pipelined (such as groupBy or
join), because it can only complete after the entire data set has been
processed. At the end of a stage, shuffle files are written, and at the
beginning of the next stage they are read back.
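
To make the stage boundary concrete, here is a minimal sketch of what I
mean (running locally; the app name and numbers are just placeholders):

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(
      new SparkConf().setAppName("StageDemo").setMaster("local[*]"))

    val sums = sc.parallelize(1 to 100)
      .map(x => (x % 10, x))   // narrow: pipelined within stage 1
      .groupByKey()            // wide: shuffle files written at the boundary
      .mapValues(_.sum)        // stage 2 starts by reading those files
    sums.collect()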

Within a stage, my understanding is that operations are pipelined, so I
wonder whether any serialization overhead is involved when no shuffling
takes place. I am also assuming that my data set fits into memory and
does not need to be spilled to disk.

So if I chain multiple *map* or *flatMap* operations and they end up in
the same stage, will there be any serialization overhead for passing the
result of the first *map* operation as input to the following *map*
operation?
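
For example (a minimal sketch, reusing sc from above; the values are
arbitrary):

    val result = sc.parallelize(1 to 1000)
      .map(_ + 1)                 // same stage
      .map(_ * 2)                 // same stage
      .flatMap(x => Seq(x, -x))   // same stage: are the intermediate
      .collect()                  // values serialized between these steps?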

Any ideas and feedback appreciated, thanks a lot.

Best regards,
Mark
