Hi,

My understanding is that these two map functions will end up in a job with a single stage (as if you had written the two maps as a single map), so every executor needs enough vcores and memory for both map1 and map2. I initially thought dynamic allocation of executors might help you here, but since there's just one stage I don't think you can do much.
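To illustrate what I mean (function names and the input path below are made up, not taken from your job): map is a narrow transformation, so Spark pipelines consecutive maps into the same stage and the same tasks, roughly like this:

    import org.apache.spark.{SparkConf, SparkContext}

    // Hypothetical stand-ins for the two steps described in the question.
    object PipelinedMaps {
      def cpuIntensiveStep(line: String): String = line.toUpperCase // CPU-heavy, low memory
      def nlpModelStep(line: String): Int = line.length             // placeholder for the 4 GB NLP step

      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("pipelined-maps"))

        // Both map calls are narrow transformations, so Spark pipelines them into
        // a single stage: every task runs cpuIntensiveStep and then nlpModelStep
        // on the same executor. There is no boundary between them where a
        // different set of executors (with different memory) could take over,
        // so each executor has to be sized for the larger of the two needs.
        val counts = sc.textFile("hdfs:///path/to/input") // hypothetical input path
          .map(cpuIntensiveStep)
          .map(nlpModelStep)

        counts.count() // action that triggers the single-stage job
        sc.stop()
      }
    }

Even if you forced a stage boundary between the two maps (e.g. with a repartition), the executors you request from YARN keep the same cores and memory for the lifetime of the application, so the sizing problem would not go away.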
Regards,
Jacek Laskowski
----
https://medium.com/@jaceklaskowski/
Mastering Apache Spark http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski

On Fri, Jul 15, 2016 at 9:54 PM, Pavan Achanta <pacha...@sysomos.com> wrote:
> Hi All,
>
> Here is my use case:
>
> I have a pipeline job consisting of 2 map functions:
>
> 1. A CPU-intensive map operation that does not require a lot of memory.
> 2. A memory-intensive map operation that requires up to 4 GB of memory, and
> this 4 GB cannot be distributed since it is an NLP model.
>
> Ideally, what I'd like to do is use 20 nodes with 4 cores each and minimal
> memory for the first map operation, and then use only 3 nodes with minimal
> CPU but each having 4 GB of memory for the second operation.
>
> While it is possible to control this parallelism for each map operation in
> Spark, I am not sure how to control the resources for each operation.
> Obviously I don't want to start the job with 20 nodes, each with 4 cores
> and 4 GB of memory, since I cannot afford that much memory.
>
> We use YARN with Spark. Any suggestions?
>
> Thanks and regards,
> Pavan
>
---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org