I am actually facing a more general problem, which seems to be related to how many JVMs get launched. In my map task I read a file and fill a map from it. The data is static and the map function is called for every record of the RDD, but I want to read the file only once, so I made the map static (in Java), so that each JVM does the I/O at most once. Keeping it static, however, gives me an NPE, and sometimes an exception from somewhere deep inside (it seems Spark serializes the closure and cannot load the static members). Not keeping it static runs successfully.
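
Here is a minimal sketch of the static-map pattern in question (LookupHolder, its get method, and the tab-separated file format are illustrative assumptions, not code from this thread). The static field is populated lazily inside the task, so each executor JVM reads the file at most once, and the field is never assumed to have been initialized on the driver:

    // Hypothetical holder class; names are illustrative.
    // The static map lives in the executor JVM and is built at most
    // once per JVM, guarded so concurrent tasks in the same JVM
    // don't race on initialization.
    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    public class LookupHolder {
        // Static state is NOT serialized with the closure; it must be
        // (re)built inside each executor JVM, hence the lazy init below.
        private static volatile Map<String, String> lookup;

        public static Map<String, String> get(String path) throws IOException {
            if (lookup == null) {                      // first check, no lock
                synchronized (LookupHolder.class) {
                    if (lookup == null) {              // second check, under lock
                        Map<String, String> m = new HashMap<String, String>();
                        try (BufferedReader r = new BufferedReader(new FileReader(path))) {
                            String line;
                            while ((line = r.readLine()) != null) {
                                // Assumed format: key<TAB>value, one pair per line.
                                String[] kv = line.split("\t", 2);
                                if (kv.length == 2) m.put(kv[0], kv[1]);
                            }
                        }
                        lookup = m;                    // publish the fully built map
                    }
                }
            }
            return lookup;
        }
    }

Static fields are not carried along when Spark serializes a closure, so code that populates the map on the driver and then dereferences the static field inside the task sees null on the executors; that is consistent with the NPE described above. Calling get(path) inside the task, as sketched here, avoids that assumption.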
I know I can do it by reading the file on the master and then broadcasting it, but there is a reason I want to do it this way. (A sketch of the broadcast approach follows the quoted thread below.)

On Sun, Jan 5, 2014 at 1:43 AM, Archit Thakur <[email protected]> wrote:

> Ya, I had got that. Thx.
>
> On Sun, Jan 5, 2014 at 1:41 AM, Roshan Nair <[email protected]> wrote:
>
>> The driver JVM is the JVM in which you create the SparkContext and
>> launch your job. It's different from the master and worker daemons.
>>
>> Roshan
>>
>> On Jan 5, 2014 1:37 AM, "Archit Thakur" <[email protected]> wrote:
>>
>>> Yeah, I believed that too.
>>>
>>> "The last being the JVM in which your driver runs"? Isn't that one of
>>> the 3 worker daemons we have already counted?
>>>
>>> On Sun, Jan 5, 2014 at 1:28 AM, Roshan Nair <[email protected]> wrote:
>>>
>>>> I missed this. It's actually 1+3+3+1, the last being the JVM in
>>>> which your driver runs.
>>>>
>>>> Roshan
>>>>
>>>> On Jan 5, 2014 1:24 AM, "Roshan Nair" <[email protected]> wrote:
>>>>
>>>>> Hi Archit,
>>>>>
>>>>> I believe it's the last case: 1+3+3.
>>>>>
>>>>> From what I've seen, it's one JVM per worker per Spark application.
>>>>>
>>>>> You will have multiple threads within a worker JVM working on
>>>>> different partitions concurrently. The number of partitions a
>>>>> worker handles concurrently appears to be determined by the number
>>>>> of cores you've set the worker (or app) to use.
>>>>>
>>>>> If each stage ran in a fresh set of JVMs, you'd have to save an RDD
>>>>> to disk and reload it into memory between stages, which is why
>>>>> Spark doesn't do that.
>>>>>
>>>>> Roshan
>>>>>
>>>>> On Jan 5, 2014 1:06 AM, "Archit Thakur" <[email protected]> wrote:
>>>>>
>>>>>> A JVM reuse doubt.
>>>>>> Let's say I have a job with 5 stages. Each stage has 10 tasks (10
>>>>>> partitions), and each task has 3 transformations. My cluster is of
>>>>>> size 4 (1 master, 3 workers). How many JVMs will be launched?
>>>>>>
>>>>>> 1 master daemon, 3 worker daemons, and then:
>>>>>>
>>>>>> JVMs = 1+3+10*3*5 (10 tasks execute in parallel on the 3 machines,
>>>>>> but the transformations run sequentially, launching a new JVM for
>>>>>> every transformation in every stage)
>>>>>> OR
>>>>>> 1+3+5*10 (10 tasks execute in parallel on the 3 machines, but each
>>>>>> stage runs in a different set of JVMs)
>>>>>> OR
>>>>>> 1+3+5*3 (a JVM is reused for different partitions on a single
>>>>>> machine, but each stage runs in a different set of JVMs)
>>>>>> OR
>>>>>> 1+3+3 (one JVM per worker in any case)
>>>>>> OR
>>>>>> none of these
>>>>>>
>>>>>> Thx,
>>>>>> Archit_Thakur.
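
For reference, a minimal sketch of the broadcast alternative mentioned at the top of this message (the file names, the loadLookup stub, and the local[2] master are illustrative assumptions, not code from the thread). The driver reads the file once and ships the resulting map to each executor as a read-only broadcast variable:

    // Read the lookup data once on the driver, then broadcast it so
    // every executor gets one copy instead of reading the file itself.
    import java.util.Map;

    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.api.java.function.Function;
    import org.apache.spark.broadcast.Broadcast;

    public class BroadcastExample {
        public static void main(String[] args) {
            JavaSparkContext sc = new JavaSparkContext("local[2]", "broadcast-example");

            // One file read, on the driver only.
            Map<String, String> lookup = loadLookup("lookup.tsv");

            // Shipped once per executor, not once per task or per record.
            final Broadcast<Map<String, String>> bc = sc.broadcast(lookup);

            JavaRDD<String> records = sc.textFile("records.txt");
            JavaRDD<String> joined = records.map(new Function<String, String>() {
                public String call(String key) {
                    return bc.value().get(key);  // local map lookup on the executor
                }
            });

            System.out.println(joined.count());
            sc.stop();
        }

        private static Map<String, String> loadLookup(String path) {
            // Illustrative stub; real code would parse the file here.
            return new java.util.HashMap<String, String>();
        }
    }

This trades the per-JVM file read of the static-map approach for a single read on the driver plus one transfer of the map to each executor.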
