OK, so I think I got it to work. I did have to make changes to
JavaActionExecutor though, because of this Hadoop bug. The same bug might not
be present in the most recent version of Hadoop (even though MAPREDUCE-121
isn't marked as resolved), but it is definitely present in 0.20.2-cdh3u4.

As I said, the main problem comes from addToCache(), which adds fully
qualified URIs to the distributed cache, and this messes up the classpath
parsing later on when the job starts. So I changed addToCache() so that it
only adds URIs without a scheme and authority/host. Obviously this imposes
the limitation that everything must be on the same HDFS filesystem as the
application.

            uri = new URI(filePath);
            URI baseUri = appPath.toUri();

            /**
             * Don't re-resolve cache URIs with the application base URI,
             * otherwise the JARs added to the cache end up in the distributed
             * cache and in the mapred.job.classpath.files property with colons
             * in them, but colon is the classpath delimiter used by Hadoop, so
             * the job ends up with the wrong classpath and can't run correctly.
             *
             * if (uri.getScheme() == null) {
             *     String resolvedPath = uri.getPath();
             *     if (!resolvedPath.startsWith("/")) {
             *         resolvedPath = baseUri.getPath() + "/" + resolvedPath;
             *     }
             *     uri = new URI(baseUri.getScheme(), baseUri.getAuthority(),
             *             resolvedPath, uri.getQuery(), uri.getFragment());
             * }
             *
             * Instead, simply resolve a potential relative path and create a
             * new URI without a scheme and host/authority.
             */

            String resolvedPath = uri.getPath();
            if (!resolvedPath.startsWith("/")) {
                resolvedPath = baseUri.getPath() + "/" + resolvedPath;
            }
            uri = new URI(null, null, resolvedPath, uri.getQuery(),
                    uri.getFragment());

            if (archive) { ...

The other thing I had to do was make sure these JARs come ahead of all the
Hadoop distribution JARs (in my case I need to override Jackson to version
1.9.9, while Hadoop ships with 1.5.2). By default, JARs from the user's
classpath come after the Hadoop distribution's JARs, so that doesn't work.
Thankfully, since 0.20.203 (and 0.20.2-cdh3u4 contains that backport) one can
ask for the user's JARs to take precedence, so I added the following to the
end of createLauncherConf():

            // have user and Oozie sharelibs placed in the distributed cache
            // take precedence over Hadoop libs
            launcherJobConf.setUserClassesTakesPrecedence(true);

This allowed me to get the application's lib JARs and Oozie sharelibs to be 
correctly inserted into the job's classpath and take precedence over all the 
Hadoop JARs.
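
For what it's worth, my understanding (an assumption on my part, please
double-check against your Hadoop build) is that setUserClassesTakesPrecedence()
just flips the mapreduce.user.classpath.first flag introduced by the 0.20.203
backport, so on a JobConf that lacks the setter the same effect should be
achievable by setting the property directly:

            // Assumed equivalent of setUserClassesTakesPrecedence(true) on
            // Hadoop builds carrying the 0.20.203 user-classpath backport;
            // verify the property name against your distribution.
            launcherJobConf.setBoolean("mapreduce.user.classpath.first", true);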

I'm not sure which parts of this you guys are interested in integrating into
Oozie, especially since this is all needed because of a bug in Hadoop. Let me
know, and we can work on putting an actual patch together (or you can just use
the snippets above, I don't care).

Thanks for the help,
/Max
--
Maxime Petazzoni
Sr. Platform Engineer
m 408.310.0595
www.turn.com

________________________________________
From: Maxime Petazzoni [[email protected]]
Sent: Tuesday, August 06, 2013 3:24 PM
To: [email protected]
Subject: RE: Classpath (and extra JARs) for Java actions

Could this be related to https://issues.apache.org/jira/browse/MAPREDUCE-121 ?

I do see that mapred.job.classpath.files is colon-delimited (with ':'), but
Oozie puts fully qualified HDFS paths in there, and those contain colons (as
in hdfs://localhost:9000/path). I can confirm that only the paths that don't
start with hdfs://localhost:9000/... are correctly seen in the launcher's
classpath.
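
To make the failure mode concrete, here's a tiny standalone illustration (my
own throwaway snippet, not Oozie code) of what a naive colon split does to a
fully qualified HDFS URI:

    // Illustration only: splitting a colon-delimited classpath entry that is
    // a fully qualified HDFS URI yields pieces that are not valid classpath
    // elements.
    public class ColonSplitDemo {
        public static void main(String[] args) {
            String entry = "hdfs://localhost:9000/user/oozie/share/lib/some.jar";
            for (String piece : entry.split(":")) {
                System.out.println("classpath element: " + piece);
            }
            // Prints "hdfs", "//localhost" and
            // "9000/user/oozie/share/lib/some.jar", none of which points to
            // an actual JAR on disk.
        }
    }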

I'll try to change JavaActionExecutor to make sure things added to the
distributed cache are not added as fully qualified URIs, and see if that
works.

/Max
--
Maxime Petazzoni
Sr. Platform Engineer
m 408.310.0595
www.turn.com

________________________________________
From: Maxime Petazzoni [[email protected]]
Sent: Tuesday, August 06, 2013 2:31 PM
To: [email protected]
Subject: RE: Classpath (and extra JARs) for Java actions

The Hadoop job config correctly shows all my JARs in both
mapred.job.classpath.files and mapred.cache.files, which indicates that Oozie
knows they are there and did something with them. Yet the classpath listed by
the launcher doesn't show these JARs, and my job fails.
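
In case it helps to reproduce the check, running something as simple as the
following as the action's main class (a hypothetical throwaway class, not part
of my actual workflow) is enough to dump what the launcher JVM really sees on
its classpath:

    import java.io.File;

    // Throwaway diagnostic: print every element of the launcher JVM's
    // effective classpath, one per line.
    public class ClasspathDump {
        public static void main(String[] args) {
            String classpath = System.getProperty("java.class.path");
            for (String element : classpath.split(File.pathSeparator)) {
                System.out.println(element);
            }
        }
    }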

It seems cleaner to me to use the shared libs; I have them deployed on HDFS
and I see them in the listed classpath (apparently through the distributed
cache, if I read the paths correctly). But I don't see any of the
application's lib/ JAR files on the classpath.

Any idea what's up?
/Max
--
Maxime Petazzoni
Sr. Platform Engineer
m 408.310.0595
www.turn.com

________________________________________
From: Virag Kothari [[email protected]]
Sent: Tuesday, August 06, 2013 2:20 PM
To: [email protected]
Subject: Re: Classpath (and extra JARs) for Java actions

Hi Max,

JARs in the application's lib/ directory should be available to all actions.
The documentation might be incorrect.
Did you check your Hadoop job config to see which JARs are added to the
classpath?

The precedence order is: 1) application lib, 2) oozie.libpath, 3)
oozie.use.system.libpath.
Even though the precedence is defined, it is recommended to use only one of
these mechanisms at a time (see the example below).
Also, the switch to not shipping the launcher JAR is not mandatory; it is
governed by 'oozie.action.ship.launcher.jar', which is set to true for 4.x.
So you are not forced to use the shared lib (although it is recommended).
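
For example (just an illustration; the HDFS path is made up), in
job.properties you would set one of the following, not both:

    # Option A: application-specific JARs from a directory on HDFS
    oozie.libpath=hdfs://namenode:9000/user/max/libs

    # Option B: rely on the system sharelib instead
    oozie.use.system.libpath=true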

Thanks,
Virag



On 8/6/13 2:01 PM, "Maxime Petazzoni" <[email protected]> wrote:

>Hi all,
>
>My Java action needs some extra JAR files. If I understand the
>documentation (and my testing) correctly, the JARs I placed in the lib/
>folder in my application directory on HDFS are only added to the
>classpath of MapReduce and Pig actions (why not all??).
>
>In the past I used oozie.libpath and that worked pretty well, but now in
>Oozie 4.x with the switch to no launcher jar and the need for the
>sharelibs to be on HDFS, I set oozie.use.system.libpath, which
>apparently doesn't play well when oozie.libpath is also set
>(JARs from oozie.libpath seem to be ignored, but interestingly not
>other file types like text/config files?).
>
>What's the recommended way of having extra JARs for Java actions with
>Oozie? What combination of oozie.libpath, oozie.use.system.libpath
>should I use?
>
>Any help greatly appreciated!
>
>Thanks in advance,
>/Max
>--
>Maxime Petazzoni
>Sr. Platform Engineer
>m 408.310.0595
>www.turn.com
