I am actually facing a more general problem, which seems to be related to how
many JVMs get launched.
In my map task I read a file and fill a map from it.
Since the data is static and the map function is called for every record of
the RDD, I want to read the file only once. So I kept the map as a static
field (in Java), so that at least within a single JVM I do not have to do
the I/O more than once. But keeping it static gives me an NPE and sometimes
throws an exception from somewhere deep inside (it seems Spark is
serializing things here and is not able to load the static members). Not
keeping it static, however, runs successfully.
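
To make it concrete, here is roughly the pattern I am after (a simplified
sketch; the class name, file path, and tab-separated file format are just
placeholders). The static map is filled lazily, once per JVM, the first time
a task touches it:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class LookupTable {

    // Filled once per JVM; every task running in the same JVM reuses it.
    private static volatile Map<String, String> table;

    public static Map<String, String> get(String path) throws IOException {
        if (table == null) {
            synchronized (LookupTable.class) {
                if (table == null) {
                    Map<String, String> m = new HashMap<String, String>();
                    BufferedReader reader = new BufferedReader(new FileReader(path));
                    try {
                        String line;
                        while ((line = reader.readLine()) != null) {
                            // Assume a simple tab-separated key/value file.
                            String[] parts = line.split("\t", 2);
                            if (parts.length == 2) {
                                m.put(parts[0], parts[1]);
                            }
                        }
                    } finally {
                        reader.close();
                    }
                    table = m;
                }
            }
        }
        return table;
    }
}

The map function then calls LookupTable.get(path) inside call(), so the file
is read at most once per JVM rather than once per record.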

I know I can do it by reading the file on the master and then broadcasting
it, but there is a reason I want to do it this way.
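
(For completeness, the broadcast route I am referring to would look roughly
like the fragment below; loadMap and the paths are placeholders, and sc and
input are assumed to be an existing JavaSparkContext and JavaRDD<String>.)

import java.util.Map;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.broadcast.Broadcast;

// Read the file once on the driver and ship the resulting map to executors.
final Broadcast<Map<String, String>> lookup =
    sc.broadcast(loadMap("/path/on/driver.tsv"));

JavaRDD<String> result = input.map(new Function<String, String>() {
    public String call(String record) {
        // Each task reads from the broadcast value; no per-record file I/O.
        return lookup.value().get(record);
    }
});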




On Sun, Jan 5, 2014 at 1:43 AM, Archit Thakur <[email protected]> wrote:

> Yes, I got that. Thanks.
>
>
> On Sun, Jan 5, 2014 at 1:41 AM, Roshan Nair <[email protected]> wrote:
>
>> The driver JVM is the JVM in which you create the SparkContext and launch
>> your job. It's different from the master and worker daemons.
>>
>> Roshan
>> On Jan 5, 2014 1:37 AM, "Archit Thakur" <[email protected]>
>> wrote:
>>
>>> Yeah, I believed that too.
>>>
>>> "The last being the JVM in which your driver runs"??? Isn't that one of
>>> the 3 worker daemons we have already considered?
>>>
>>>
>>> On Sun, Jan 5, 2014 at 1:28 AM, Roshan Nair <[email protected]> wrote:
>>>
>>>> I missed this. It's actually 1+3+3+1, the last being the JVM in which
>>>> your driver runs.
>>>>
>>>> Roshan
>>>> On Jan 5, 2014 1:24 AM, "Roshan Nair" <[email protected]> wrote:
>>>>
>>>>> Hi Archit,
>>>>>
>>>>> I believe it's the last case - 1+3+3.
>>>>>
>>>>> From what I've seen, it's one JVM per worker per Spark application.
>>>>>
>>>>> You will have multiple threads within a worker JVM working on
>>>>> different partitions concurrently. The number of partitions that a worker
>>>>> handles concurrently appears to be determined by the number of cores
>>>>> you've set the worker (or app) to use.
>>>>>
>>>>> Otherwise, you'd have to save an RDD to disk and reload it into memory
>>>>> between stages, which is why Spark won't do that.
>>>>>
>>>>> Roshan
>>>>> On Jan 5, 2014 1:06 AM, "Archit Thakur" <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> A doubt about JVM reuse.
>>>>>> Let's say I have a job which has 5 stages.
>>>>>> Each stage has 10 tasks (10 partitions), and each task has 3
>>>>>> transformations.
>>>>>> My cluster is of size 4 (1 master, 3 workers). How many JVMs will be
>>>>>> launched?
>>>>>>
>>>>>> 1 master daemon + 3 worker daemons, then:
>>>>>> JVMs = 1+3+10*3*5 (where 10 tasks run in parallel on the 3 machines at
>>>>>> a time, but the transformations run sequentially, launching a new JVM
>>>>>> for every transformation of every stage)
>>>>>> OR
>>>>>> 1+3+5*10 (where 10 tasks run in parallel on the 3 machines at a time,
>>>>>> but each stage runs in a different set of JVMs)
>>>>>> OR
>>>>>> 1+3+5*3 (so a JVM is reused for different partitions on a single
>>>>>> machine, but each stage runs in a different set of JVMs)
>>>>>> OR
>>>>>> 1+3+3 (so one JVM per worker in any case)
>>>>>> OR
>>>>>> none of the above
>>>>>>
>>>>>> Thx,
>>>>>> Archit_Thakur.
>>>>>>
>>>>>>
>>>>>>
>>>
>
