RE: Shared object registry

Bikas Saha Wed, 02 Dec 2015 12:43:53 -0800

Yes. For your custom optimizations (beyond what stock Hive gives you) you will 
have to write custom code.

From: Raajay [mailto:[email protected]] 
Sent: Wednesday, December 2, 2015 11:18 AM
To: [email protected]
Subject: Re: Shared object registry

I did not write my own processor. I just re-use Tez Work created by Hive. So 
the processors are classes like HiveMap, HiveJoin defined by Hive.

So if I understand the setting correctly, only by modifying these processors 
can I take advantage of Shared Object Registry.

Thanks a lot!

Raajay

On Tue, Dec 1, 2015 at 3:39 PM, Bikas Saha <[email protected]> wrote:

To be clear, you have written your own processor that runs in your DAG 
vertices? Your processor runs your custom code for processing input data.

If yes, then the following applies.

You will get access to the registry from your context object.

You can use cacheForVertex() to cache for the lifetime of the vertex, 
cacheForDAG() to cache for the lifetime of the DAG, and cacheForSession() to 
cache for the lifetime of a session (which runs multiple DAGs). As for the 
key and value parameters: the key is any unique string used to look up the 
value, and the value is any Java object (say a map or a list). For performance 
you would want to cache the object in a form that can be used immediately, 
without any conversion.
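
To make the three lifetimes concrete, here is a minimal sketch that does not 
use Tez itself. SketchRegistry is a hypothetical in-memory stand-in: only the 
names cacheForVertex(), cacheForDAG() and cacheForSession() come from the 
message above. The get() lookup, the void return types, and the 
vertexFinished()/dagFinished() hooks are assumptions made for illustration; 
check the Javadoc for the real signatures.

```java
import java.util.HashMap;
import java.util.Map;

public class RegistryScopesSketch {

    /** Hypothetical stand-in; not the Tez class. Only the cacheFor* names come from the thread. */
    static final class SketchRegistry {
        private final Map<String, Object> vertexScope = new HashMap<>();
        private final Map<String, Object> dagScope = new HashMap<>();
        private final Map<String, Object> sessionScope = new HashMap<>();

        void cacheForVertex(String key, Object value) { vertexScope.put(key, value); }
        void cacheForDAG(String key, Object value) { dagScope.put(key, value); }
        void cacheForSession(String key, Object value) { sessionScope.put(key, value); }

        // Assumed lookup: check the narrowest scope first, then widen.
        Object get(String key) {
            if (vertexScope.containsKey(key)) return vertexScope.get(key);
            if (dagScope.containsKey(key)) return dagScope.get(key);
            return sessionScope.get(key);
        }

        // Hypothetical lifecycle hooks; in a real run the framework owns these transitions.
        void vertexFinished() { vertexScope.clear(); }
        void dagFinished() { vertexScope.clear(); dagScope.clear(); }
    }

    public static void main(String[] args) {
        SketchRegistry registry = new SketchRegistry();
        registry.cacheForVertex("lookup.vertex", "v");
        registry.cacheForDAG("lookup.dag", "d");
        registry.cacheForSession("lookup.session", "s");

        registry.vertexFinished();
        System.out.println(registry.get("lookup.vertex"));  // prints null
        System.out.println(registry.get("lookup.dag"));     // prints d

        registry.dagFinished();
        System.out.println(registry.get("lookup.dag"));     // prints null
        System.out.println(registry.get("lookup.session")); // prints s
    }
}
```

Only the session-scoped entry outlives a DAG, so that is the scope to pick 
when the goal is reuse across DAGs submitted to the same session.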

There is a toy example of the usage in the Tez source code in 
BroadcastAndOneToOneExample.java.

The Javadoc for the object registry would have more details. Please open a 
JIRA if the Javadoc is not clear enough.

From: Raajay [mailto:[email protected]] 
Sent: Tuesday, December 1, 2015 11:02 AM
To: [email protected]
Subject: Re: Shared object registry

I am running a custom application; however, the DAG is created similar to the 
DAG that Hive would have created for the TPC-DS query. I use "TezClient" to 
submit these DAGs.

How can I use Shared Objects explicitly?

I understand that Object Registry provides a key-value interface. But if I 
want to dump intermediate data (say, output of mappers for small jobs) into 
the shared object registry, how should I do that?

Raajay

On Tue, Dec 1, 2015 at 12:47 PM, Bikas Saha <[email protected]> wrote:

Object registry is a user-enabled feature provided by Tez to the application
(e.g. Hive and Pig). If the application chooses to use it, then it can do
some userland caching across tasks/vertices/DAGs. E.g. Hive caches the
smaller broadcast side of a broadcast join in the shared object registry.
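
As a rough illustration of that kind of application-level caching (not Hive's 
actual code), the sketch below has each task ask a registry for the small 
side's lookup table and build it only on a miss. REGISTRY is a hypothetical 
plain map standing in for the shared registry of a reused container, and 
smallSideFor(), buildSmallSide() and the sample rows are invented for the 
example.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BroadcastSideCacheSketch {

    // Hypothetical stand-in for a registry shared by tasks in one reused container.
    static final Map<String, Object> REGISTRY = new HashMap<>();

    // Counts how many times the small side was actually built.
    static int builds = 0;

    // Build the lookup table for the small side of the join (id -> name).
    static Map<Integer, String> buildSmallSide(List<String[]> rows) {
        builds++;
        Map<Integer, String> table = new HashMap<>();
        for (String[] row : rows) {
            table.put(Integer.parseInt(row[0]), row[1]);
        }
        return table;
    }

    // Ask the registry first; build and cache only on a miss.
    // Single-threaded sketch: the check-then-put below is not atomic.
    @SuppressWarnings("unchecked")
    static Map<Integer, String> smallSideFor(String key, List<String[]> rows) {
        Object cached = REGISTRY.get(key);
        if (cached == null) {
            cached = buildSmallSide(rows);
            REGISTRY.put(key, cached);
        }
        return (Map<Integer, String>) cached;
    }

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(
                new String[] {"1", "books"},
                new String[] {"2", "music"});

        // Two "tasks" asking for the same table: the second one reuses the first build.
        Map<Integer, String> first = smallSideFor("join.small.side", rows);
        Map<Integer, String> second = smallSideFor("join.small.side", rows);

        System.out.println(first == second); // prints true
        System.out.println(builds);          // prints 1
    }
}
```

The table is stored already hashed, so a later task can probe it directly 
instead of re-reading and re-parsing the broadcast input.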

Object registry is not an automatic data caching or input caching mechanism.

What application/job are you running? Hive/Pig/Custom? Unless the
application (like Hive) has used object caching for a cross-DAG scenario
(which AFAIK it does not) you will not see any difference. If it's custom,
then you will have to explicitly use the object registry in a manner that
makes sense for your app.

-----Original Message-----
From: Raajay [mailto:[email protected]]
Sent: Tuesday, December 1, 2015 10:36 AM
To: [email protected]
Subject: Shared object registry

How do I use the shared object registry effectively?

I created a Tez client as a session, and submitted a DAG twice sequentially.

However, I did not see a noticeable difference in their run times. The query
was TPC-DS query #3.

I had enabled container reuse in tez-site.xml. Are there other configs I
need to ensure are set correctly to use shared objects?

- Raajay
