Yes. For custom optimizations beyond what stock Hive gives you, you will have to write custom code.

From: Raajay [mailto:[email protected]]
Sent: Wednesday, December 2, 2015 11:18 AM
To: [email protected]
Subject: Re: Shared object registry

I did not write my own processor. I just re-use the Tez Work created by Hive, so the processors are classes like HiveMap and HiveJoin, defined by Hive.

So if I understand the setting correctly, I can take advantage of the Shared Object Registry only by modifying these processors.

Thanks a lot!
Raajay

On Tue, Dec 1, 2015 at 3:39 PM, Bikas Saha <[email protected]> wrote:

To be clear, you have written your own processor that runs in your DAG vertices? Your processor runs your custom code for processing input data. If yes, then the following applies.

You will get access to the registry from your context object. You can use cacheForVertex() to cache for the lifetime of the vertex, cacheForDAG() to cache for the lifetime of the DAG, and cacheForSession() to cache for the lifetime of a session (which runs multiple DAGs).

As for the key and value parameters: the key is any unique string used to look up the value. The value is any Java object (say a map or a list). For performance you would want to cache the object in a form that can be used immediately without any conversion.

There is a toy example of the usage in the Tez source code in BroadcastAndOneToOneExample.java. The Javadoc for the object registry would have more details. Please open a JIRA if the Javadoc is not clear enough.

From: Raajay [mailto:[email protected]]
Sent: Tuesday, December 1, 2015 11:02 AM
To: [email protected]
Subject: Re: Shared object registry

I am running a custom application; however, the DAG is created similarly to the DAG that Hive would have created for the TPC-DS query. I use "TezClient" to submit these DAGs.

How can I use Shared Objects explicitly? I understand that the Object Registry provides a key-value interface. But if I want to dump intermediate data (say, the output of mappers for small jobs) into the shared object registry, how should I do that?

Raajay

On Tue, Dec 1, 2015 at 12:47 PM, Bikas Saha <[email protected]> wrote:

The object registry is a user-enabled feature provided by Tez to the application (e.g. Hive and Pig). If the application chooses to use it, then it can do some user-land caching across tasks/vertices/DAGs with it. E.g. Hive caches the smaller broadcast side of a broadcast join in the shared object registry. The object registry is not an automatic data caching or input caching mechanism.

What application/job are you running? Hive/Pig/Custom?

Unless the application (like Hive) has used object caching for a cross-DAG scenario (which AFAIK it does not), you will not see any difference. If it's custom, then you will have to explicitly use the object registry in a manner that makes sense for your app.

-----Original Message-----
From: Raajay [mailto:[email protected]]
Sent: Tuesday, December 1, 2015 10:36 AM
To: [email protected]
Subject: Shared object registry

How do I effectively use the shared object registry? I created a Tez client as a session and submitted a DAG twice sequentially. However, I did not see a noticeable difference in their run times. The query was TPC-DS query #3.

I had enabled container reuse in tez-site.xml. Are there other configs I need to ensure are set correctly to use shared objects?

- Raajay
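
For readers following the thread, here is a minimal sketch of the "look up by key, otherwise build and cache" pattern that the replies describe. It is self-contained so it runs without Tez on the classpath: the Registry interface and InMemoryRegistry class are hypothetical stand-ins that only borrow the method names mentioned above (cacheForVertex, cacheForDAG, cacheForSession), plus an assumed get() for lookup. They are not the Tez interface, the getOrLoad helper and the "small_side" key are made up for illustration, and the exact signatures should be checked against the ObjectRegistry Javadoc.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

class RegistrySketch {

    // Hypothetical stand-in that mirrors the method names from the thread.
    // It is NOT the Tez interface; consult the ObjectRegistry Javadoc for that.
    interface Registry {
        Object cacheForVertex(String key, Object value);
        Object cacheForDAG(String key, Object value);
        Object cacheForSession(String key, Object value);
        Object get(String key);
    }

    // In-memory stand-in. It keeps every entry in one map and never evicts,
    // so the three lifetimes described in the thread are not modeled here.
    static class InMemoryRegistry implements Registry {
        private final Map<String, Object> entries = new ConcurrentHashMap<>();

        public Object cacheForVertex(String key, Object value) { return entries.put(key, value); }
        public Object cacheForDAG(String key, Object value) { return entries.put(key, value); }
        public Object cacheForSession(String key, Object value) { return entries.put(key, value); }
        public Object get(String key) { return entries.get(key); }
    }

    // Return the cached object for the key if present; otherwise build it once,
    // cache it with DAG scope, and return it in ready-to-use form.
    static Object getOrLoad(Registry registry, String key, Supplier<Object> loader) {
        Object cached = registry.get(key);
        if (cached != null) {
            return cached;
        }
        Object value = loader.get();
        registry.cacheForDAG(key, value);
        return value;
    }

    public static void main(String[] args) {
        Registry registry = new InMemoryRegistry();
        int[] loads = {0};

        // Pretend this builds the small side of a broadcast join.
        Supplier<Object> loader = () -> {
            loads[0]++;
            Map<String, String> small = new HashMap<>();
            small.put("1", "store_a");
            return small;
        };

        Object first = getOrLoad(registry, "small_side", loader);
        Object second = getOrLoad(registry, "small_side", loader);

        // prints "loads=1 sameObject=true"
        System.out.println("loads=" + loads[0] + " sameObject=" + (first == second));
    }
}
```

In a real processor the registry would come from the context object, as described in the thread, and the choice between vertex, DAG and session scope decides how long the cached object stays available to later tasks.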
