Thanks a lot guys, that's exactly what I hoped for :-).

Cheers,
Mark

2015-08-13 6:35 GMT+02:00 Hemant Bhanawat <hemant9...@gmail.com>:

> A chain of map and flatMap does not cause any
> serialization-deserialization.
>
> On Wed, Aug 12, 2015 at 4:02 PM, Mark Heimann <mark.heim...@kard.info>
> wrote:
>
>> Hello everyone,
>>
>> I am wondering what the effect of serialization is within a stage.
>>
>> My understanding of Spark as an execution engine is that the data flow
>> graph is divided into stages, and a new stage always starts after an
>> operation/transformation that cannot be pipelined (such as groupBy or
>> join), because it can only be completed after the whole data set has
>> "been taken care of". At the end of a stage, shuffle files are written,
>> and at the beginning of the next stage they are read.
>>
>> Within a stage, my understanding is that pipelining is used, so I
>> wonder whether there is any serialization overhead involved when no
>> shuffling takes place. I am also assuming that my data set fits into
>> memory and does not have to be spilled to disk.
>>
>> So if I chain multiple *map* or *flatMap* operations and they end up in
>> the same stage, will there be any serialization overhead for piping the
>> result of the first *map* operation as a parameter into the following
>> *map* operation?
>>
>> Any ideas and feedback appreciated, thanks a lot.
>>
>> Best regards,
>> Mark
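
For anyone finding this thread later, here is a rough sketch of the pipelining idea discussed above. It is plain Python and not Spark code: it only models how chained map/flatMap steps within one stage can, as I understand it, be fused into a single pass over an iterator, so that each record is handed to the next step as an in-memory object without a serialize/deserialize round trip. The names `Record`, `narrow_map`, `narrow_flat_map` and `run_stage` are made up for this illustration.

```python
class Record:
    """Hypothetical record type; object identity shows whether it was copied."""

    def __init__(self, text):
        self.text = text


def narrow_map(fn, rows):
    # Lazily apply fn to each row and hand the result straight to the consumer.
    for row in rows:
        yield fn(row)


def narrow_flat_map(fn, rows):
    # Lazily apply fn to each row and emit every element it returns.
    for row in rows:
        yield from fn(row)


def run_stage(records):
    """Chain a map and a flatMap over one 'partition' and record what happens."""
    trace = []           # order in which the two steps run
    seen_by_map = []     # objects the map step received
    seen_by_flat = []    # objects the flatMap step received

    def tag(rec):
        trace.append("map")
        seen_by_map.append(rec)
        return rec  # pass the very same object downstream

    def split(rec):
        trace.append("flatMap")
        seen_by_flat.append(rec)
        return rec.text.split()

    words = list(narrow_flat_map(split, narrow_map(tag, records)))
    # True only if flatMap saw the identical objects that map emitted.
    same_objects = all(a is b for a, b in zip(seen_by_map, seen_by_flat))
    return words, same_objects, trace


if __name__ == "__main__":
    words, same_objects, trace = run_stage([Record("a b"), Record("c")])
    print(words, same_objects, trace)
```

In this model the trace alternates between the two steps record by record, and the second step receives the identical objects the first one returned. A shuffle boundary would be the point where that chain is cut and the records have to be written out and read back.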