Ok, so I think I got it to work. I did have to make changes to
JavaActionExecutor though, because of this Hadoop bug. The same bug might not
be present in the most recent version of Hadoop (even though MAPREDUCE-121
isn't marked as resolved), but it is definitely present in 0.20.2-cdh3u4.

As I said, the main problem comes from addToCache(), which adds fully qualified
URIs to the distributed cache; this breaks the classpath parsing later on, when
the job starts. So I changed addToCache() so that it only adds URIs without
scheme and authority/host. The obvious limitation is that everything must be
on the same HDFS filesystem as the application:
    uri = new URI(filePath);
    URI baseUri = appPath.toUri();
    /*
     * Don't re-resolve cache URIs with the application base URI,
     * otherwise the JARs added to the cache end up in the distributed
     * cache and in the mapred.job.classpath.files property with colons
     * in them, but colon is the classpath delimiter used by Hadoop so
     * the job ends up with the wrong classpath and can't run
     * correctly.
     *
     * if (uri.getScheme() == null) {
     *     String resolvedPath = uri.getPath();
     *     if (!resolvedPath.startsWith("/")) {
     *         resolvedPath = baseUri.getPath() + "/" + resolvedPath;
     *     }
     *     uri = new URI(baseUri.getScheme(), baseUri.getAuthority(),
     *             resolvedPath, uri.getQuery(), uri.getFragment());
     * }
     *
     * Instead, simply resolve a potential relative path and create a
     * new URI without a scheme and host/authority.
     */
    String resolvedPath = uri.getPath();
    if (!resolvedPath.startsWith("/")) {
        // Relative path: anchor it under the application directory.
        resolvedPath = baseUri.getPath() + "/" + resolvedPath;
    }
    // Drop scheme and authority so the resulting URI contains no colons.
    uri = new URI(null, null, resolvedPath, uri.getQuery(),
            uri.getFragment());
    if (archive) { ...
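
For anyone who wants to try the URI handling outside of Oozie, here is a minimal standalone sketch of the same idea using only java.net.URI. The class name SchemelessCacheUri, the helper toSchemeless() and the sample paths are made up for illustration; this is not the actual JavaActionExecutor code, just the resolve-then-strip step in isolation.

```java
import java.net.URI;
import java.net.URISyntaxException;

public class SchemelessCacheUri {

    // Hypothetical helper (not part of Oozie): resolves a possibly relative
    // path against the application path, then rebuilds the URI without
    // scheme or authority so that it contains no colons.
    static URI toSchemeless(String filePath, URI baseUri) {
        try {
            URI uri = new URI(filePath);
            String resolvedPath = uri.getPath();
            if (!resolvedPath.startsWith("/")) {
                resolvedPath = baseUri.getPath() + "/" + resolvedPath;
            }
            // Query and fragment (e.g. a "#name" symlink suffix) are kept.
            return new URI(null, null, resolvedPath,
                    uri.getQuery(), uri.getFragment());
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Bad cache path: " + filePath, e);
        }
    }

    public static void main(String[] args) {
        // Assumed application path, for illustration only.
        URI base = URI.create("hdfs://localhost:9000/user/max/app");

        // Relative path is anchored under the application directory.
        System.out.println(toSchemeless("lib/jackson-core-asl-1.9.9.jar", base));
        // prints /user/max/app/lib/jackson-core-asl-1.9.9.jar

        // Fully qualified URI loses its scheme and authority.
        System.out.println(toSchemeless(
                "hdfs://localhost:9000/share/lib/foo.jar#foo.jar", base));
        // prints /share/lib/foo.jar#foo.jar
    }
}
```

Note that the second case silently discards the host and port, which is exactly the same-filesystem limitation mentioned above.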
The other thing I had to do was make sure these JARs come ahead of all the
Hadoop distribution JARs on the classpath (in my case I need to override
Jackson with version 1.9.9, and Hadoop ships with 1.5.2). By default, JARs from
the user's classpath come after the Hadoop distribution's JARs, so the override
doesn't take effect. Thankfully, since 0.20.203 (and 0.20.2-cdh3u4 contains
that backport) one can ask for the user's JARs to take precedence, so I added
the following to the end of createLauncherConf():
    // have user and Oozie sharelibs placed in the distributed cache
    // take precedence over Hadoop libs
    launcherJobConf.setUserClassesTakesPrecedence(true);
This allowed me to get the application's lib JARs and Oozie sharelibs to be
correctly inserted into the job's classpath and take precedence over all the
Hadoop JARs.
I'm not sure which parts of this you guys are interested in integrating into
Oozie, especially since this is all needed because of a bug in Hadoop. Let me
know, and we can work on putting an actual patch together (or you can just use
the snippets above, I don't mind).
Thanks for the help,
/Max
--
Maxime Petazzoni
Sr. Platform Engineer
m 408.310.0595
www.turn.com
________________________________________
From: Maxime Petazzoni [[email protected]]
Sent: Tuesday, August 06, 2013 3:24 PM
To: [email protected]
Subject: RE: Classpath (and extra JARs) for Java actions
Could this be related to https://issues.apache.org/jira/browse/MAPREDUCE-121 ?
I do see that mapred.job.classpath.files is delimited with ':', but Oozie puts
fully qualified HDFS paths in there, which themselves contain colons (as in
hdfs://localhost:9000/path). I can confirm that only the paths that don't have
hdfs://localhost:9000/... at the beginning are correctly seen in the launcher's
classpath.
I'll try to change JavaActionExecutor to make sure things added to the
distributed cache are not added as fully qualified URIs and see if that
works.
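
To illustrate what I think is happening, here is a tiny sketch that assumes the property value is simply split on ':'. The class ClasspathSplitDemo, the helper splitClasspath() and the JAR paths are made up for the example; the real parsing inside Hadoop may differ in its details.

```java
import java.util.Arrays;
import java.util.List;

public class ClasspathSplitDemo {

    // Assumption: a colon-delimited property is parsed with a plain split.
    static List<String> splitClasspath(String value) {
        return Arrays.asList(value.split(":"));
    }

    public static void main(String[] args) {
        // One fully qualified entry followed by one scheme-less entry.
        String value = "hdfs://localhost:9000/app/lib/a.jar:/app/lib/b.jar";

        for (String entry : splitClasspath(value)) {
            System.out.println(entry);
        }
        // prints:
        // hdfs
        // //localhost
        // 9000/app/lib/a.jar
        // /app/lib/b.jar
    }
}
```

Only /app/lib/b.jar survives as a usable entry; the qualified one is cut into three meaningless pieces, which matches what I see in the launcher's classpath.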
/Max
________________________________________
From: Maxime Petazzoni [[email protected]]
Sent: Tuesday, August 06, 2013 2:31 PM
To: [email protected]
Subject: RE: Classpath (and extra JARs) for Java actions
The Hadoop job config correctly shows all my JARs in both
mapred.job.classpath.files and mapred.cache.files, which indicates that Oozie
knows they are there and did something with them. Yet the classpath listed by
the launcher doesn't show these JARs and my job fails.
It seems cleaner to me to use the shared libs, and I have them deployed on HDFS
and I see them in the listed classpath (apparently through distcache, if I read
the path correctly). But I don't see any of the application's lib JAR files on
the classpath.
Any idea what's up?
/Max
________________________________________
From: Virag Kothari [[email protected]]
Sent: Tuesday, August 06, 2013 2:20 PM
To: [email protected]
Subject: Re: Classpath (and extra JARs) for Java actions
Hi Max,
Jars in the application lib directory should be available for all actions. The
documentation might be incorrect.
Did you check your Hadoop job config to see which jars are added to the
classpath?
The precedence order is as follows:
1) Application lib
2) oozie.libpath
3) oozie.use.system.libpath
Even though the priority is defined, it is recommended to use only one of
these at a time.
Also, the switch to no launcher jar is not mandatory; it is governed by
'oozie.action.ship.launcher.jar', which is set to true for 4.x. So you are not
forced to use the shared lib (although it is recommended).
Thanks,
Virag
On 8/6/13 2:01 PM, "Maxime Petazzoni" <[email protected]> wrote:
>Hi all,
>
>My Java action needs some extra JAR files. If I understand the
>documentation (and my testing) correctly, the JARs I placed in the lib/
>folder in my application directory on HDFS are only added to the
>classpath of MapReduce and Pig actions (why not all??).
>
>In the past I used oozie.libpath and that worked pretty well, but now in
>Oozie 4.x with the switch to no launcher jar and the need for the
>sharelibs to be on HDFS, I set oozie.use.system.libpath, which
>apparently doesn't play well when oozie.libpath is also set
>(JARs from oozie.libpath seem to be ignored, but interestingly not
>other file types like text/config files?).
>
>What's the recommended way of having extra JARs for Java actions with
>Oozie? What combination of oozie.libpath, oozie.use.system.libpath
>should I use?
>
>Any help greatly appreciated!
>
>Thanks in advance,
>/Max