Re: Shared resources

Alan Gates Wed, 02 Mar 2011 09:27:47 -0800

Within a given task, a UDF is only instantiated once. For maps and reduces this should mean one per map or reduce. Since the combiner can be run multiple times there can be multiple instantiations per combine. But the warning on number of instantiations is about how many times the UDF is constructed on the front end (by which I mean compile time). Pig used to construct the UDF multiple times in the front end. Now we have it down to one construction on the front end and one per task in the backend.
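
For the combiner case, where several instances may be constructed inside one task JVM, one way to avoid repeating the parsing work is to keep the parsed data in a static holder that all instances share. The following is only a sketch in plain Java: `SharedIndex`, `LookupFunc`, and the sample entry are hypothetical names, and the class does not extend Pig's `EvalFunc`, so that it can run on its own. It also assumes all instances are loaded by the same class loader.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical holder for data that is expensive to build.
class SharedIndex {
    // Counts how often load() runs; present only to make the behavior visible.
    static final AtomicInteger LOADS = new AtomicInteger();

    // Initialization-on-demand holder: the JVM initializes Holder,
    // and therefore calls load(), at most once, on first access.
    private static final class Holder {
        static final Map<String, String> INDEX = load();
    }

    private static Map<String, String> load() {
        LOADS.incrementAndGet();
        // Placeholder for scanning the resource folder and parsing files.
        Map<String, String> m = new HashMap<>();
        m.put("resource-a", "parsed contents of a");
        return Collections.unmodifiableMap(m);
    }

    static Map<String, String> get() {
        return Holder.INDEX;
    }
}

// Hypothetical stand-in for the UDF; every instance reads the same index.
class LookupFunc {
    public String exec(String key) {
        return SharedIndex.get().get(key);
    }
}
```

Because the shared map is unmodifiable and built during class initialization, instances can read it without extra locking. Nothing here is shared across tasks, since each task runs in its own JVM.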


Alan.


On Mar 2, 2011, at 9:11 AM, Lai Will wrote:

I understand that there is no inter-task communication at all.
However, my question arises within one task. The documentation says that we should not make any assumptions on how many EvalFunc instances (of the same class) are instantiated. Therefore I assume that within the same task, there might be several instances of my EvalFunc, and if every one of them is doing the parsing of resource files into data structures, a lot of memory and computing power would be wasted. So it's not about inter-task communication but about inter-instance communication.
Thank you for your help.

Best,
Will

-----Original Message-----
From: Alan Gates [mailto:[email protected]]
Sent: Wednesday, March 02, 2011 5:17 PM
To: [email protected]
Subject: Re: Shared resources
There is no shared inter-task processing in Hadoop. Each task runs in a separate JVM and is locked off from all other tasks. This is partly because you do not know a priori which tasks will run together in which nodes, and partly for security. Data can be shared by all tasks on a node via the distributed cache. If all your work could be done once on the front end and then serialized to be later read by all tasks you could use this mechanism to share it. With the code in trunk UDFs can store data to the distributed cache, though this feature is not in a release yet.
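
The "do the work once, serialize it, read it in every task" idea can be sketched in plain Java as below. This is an illustration only: `IndexFile` and its methods are hypothetical names, the index is assumed to fit in a serializable `HashMap`, and the step that ships the file to the nodes through the distributed cache is deliberately not shown.

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: write the index once, read it back wherever needed.
class IndexFile {
    // Front-end side: build the index once and serialize it to a file.
    static void write(HashMap<String, String> index, Path file) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(Files.newOutputStream(file))) {
            out.writeObject(index);
        }
    }

    // Task side: deserialize the prebuilt index instead of rebuilding it.
    @SuppressWarnings("unchecked")
    static Map<String, String> read(Path file) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(Files.newInputStream(file))) {
            return (Map<String, String>) in.readObject();
        }
    }
}
```

Each task would still pay the cost of reading the file once, but not the cost of scanning and parsing the original resources.
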
Alan.

On Mar 2, 2011, at 7:54 AM, Lai Will wrote:
So I still get the redundant work whenever the same cluster node/VM
creates multiple instances of my EvalFunc?
And is it usual to have several instances of the EvalFunc on the same
cluster node/VM?

Will

-----Original Message-----
From: Alan Gates [mailto:[email protected]]
Sent: Wednesday, March 02, 2011 4:49 PM
To: [email protected]
Subject: Re: Shared resources

There is no method in the eval func that gets called on the backend
before any exec calls.  You can keep a flag that tracks whether you
have done the initialization so that you only do it the first time.
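
A minimal sketch of the flag approach, in plain Java so it runs on its own. `QueryRewriter`, `initCount`, and the sample index entry are hypothetical; a real UDF would extend Pig's `EvalFunc` and receive a `Tuple` in `exec`.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the UDF class.
class QueryRewriter {
    private boolean initialized = false;  // flag: has the setup run yet?
    private Map<String, String> index;    // expensive state, built once
    int initCount = 0;                    // only here to show how often setup runs

    private void initialize() {
        // Placeholder for the expensive work (parser setup, folder scan).
        index = new HashMap<>();
        index.put("orders", "orders_v2");
        initCount++;
        initialized = true;
    }

    public String exec(String query) {
        if (!initialized) {
            initialize();  // runs on the first call only
        }
        String out = query;
        for (Map.Entry<String, String> e : index.entrySet()) {
            out = out.replace(e.getKey(), e.getValue());
        }
        return out;
    }
}
```

The flag is per instance, so the setup runs once for each instance that is constructed, not once per machine.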

Alan.

On Mar 2, 2011, at 5:29 AM, Lai Will wrote:
Hello,

I wrote an EvalFunc implementation that


1)      Parses a SQL Query

2)      Scans a folder for resource files and creates an index on
these files

3)      According to certain properties of the SQL Query, accesses
the corresponding file and creates a Java object holding the relevant
information of the file (for reuse).

4)      Does some computation with the SQL Query and the information
found in the file

5)      Outputs a transformed SQL Query

Currently I'm doing local tests without Hadoop and the code works
fine.

The problem I see is that right now I initialize my parser in the
EvalFunc, so that every time it gets instantiated a new instance of
the parser is generated. Ideally only one instance per machine would
be created.
Even worse, right now I create the index and parse the corresponding
resource file once per call to exec in the EvalFunc and therefore do a lot
of redundant computation.

I do this only because I don't know where and how to put this shared
computation.
Does anybody have a solution for that?

Best,
Will
