There is no shared inter-task processing in Hadoop. Each task runs in a separate JVM and is isolated from all other tasks. This is partly because you do not know a priori which tasks will run together on which nodes, and partly for security. Data can be shared by all tasks on a node via the distributed cache. If all your work could be done once on the front end and then serialized to be read later by all tasks, you could use this mechanism to share it. With the code in trunk, UDFs can store data in the distributed cache, though this feature is not in a release yet.
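A minimal sketch of the "do the work once, serialize it, read it in every task" idea, using plain Java serialization. The `SharedIndex` class, the file name, and the index shape (resource name to offset) are illustrative assumptions; in practice the serialized file would be shipped to each node by the distributed cache and deserialized once per task.

```java
import java.io.*;
import java.util.HashMap;

// Sketch: build an index once on the front end, write it to a file, and let
// each task read the same file back. SharedIndex is a hypothetical helper,
// not a Pig or Hadoop API.
public class SharedIndex {

    // Front end: serialize the precomputed index once.
    public static void writeIndex(File f, HashMap<String, Long> index)
            throws IOException {
        try (ObjectOutputStream out =
                 new ObjectOutputStream(new FileOutputStream(f))) {
            out.writeObject(index);
        }
    }

    // Task side: deserialize the index instead of recomputing it.
    @SuppressWarnings("unchecked")
    public static HashMap<String, Long> readIndex(File f)
            throws IOException, ClassNotFoundException {
        try (ObjectInputStream in =
                 new ObjectInputStream(new FileInputStream(f))) {
            return (HashMap<String, Long>) in.readObject();
        }
    }
}
```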

Alan.

On Mar 2, 2011, at 7:54 AM, Lai Will wrote:

So I still get the redundant work whenever the same cluster node/VM creates multiple instances of my EvalFunc? And is it usual to have several instances of the EvalFunc on the same cluster node/VM?

Will

-----Original Message-----
From: Alan Gates [mailto:[email protected]]
Sent: Wednesday, March 02, 2011 4:49 PM
To: [email protected]
Subject: Re: Shared resources

There is no method in the eval func that gets called on the backend before any exec calls. You can keep a flag that tracks whether you have done the initialization so that you only do it the first time.
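The flag idea above can be sketched like this. `TransformQuery`, the parser field, and the `initCount` counter are hypothetical names for illustration; a real UDF would extend `org.apache.pig.EvalFunc` and take a `Tuple` argument, which is omitted here to keep the sketch self-contained.

```java
// Sketch of one-time lazy initialization inside exec(). The expensive setup
// (parser construction, index building) runs only on the first exec() call
// in this task's JVM, guarded by a boolean flag.
public class TransformQuery {
    private Object parser;            // stands in for an expensive SQL parser
    private boolean initialized = false;
    public int initCount = 0;         // counts how often setup actually ran

    public String exec(String query) {
        if (!initialized) {
            // Expensive one-time setup goes here, e.g. parser construction
            // and scanning the resource folder to build the index.
            parser = new Object();
            initCount++;
            initialized = true;
        }
        // Use the parser on every call; here we just echo the query.
        return query;
    }
}
```

With this pattern the setup cost is paid once per EvalFunc instance rather than once per exec call, which is usually the dominant saving.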

Alan.

On Mar 2, 2011, at 5:29 AM, Lai Will wrote:

Hello,

I wrote an EvalFunc implementation that


1) Parses a SQL query

2) Scans a folder for resource files and creates an index on these files

3) According to certain properties of the SQL query, accesses the corresponding file and creates a Java object holding the relevant information of the file (for reuse)

4) Does some computation with the SQL query and the information found in the file

5) Outputs a transformed SQL query

Currently I'm doing local tests without Hadoop and the code works fine.

The problem I see is that right now I initialize my parser in the EvalFunc, so that every time it gets instantiated a new instance of the parser is generated. Ideally only one instance per machine would be created.
Even worse, right now I create the index and parse the corresponding resource file once per exec call in the EvalFunc, and therefore do a lot of redundant computation.

This is simply because I don't know where and how to put this shared computation.
Does anybody have a solution for that?

Best,
Will

