>> The primary job, which implements Tool, is able to run; it's just the
>> jobs launched by the doFn() which fail.

You mean from the pipeline.run()/done() calls, right, and not an actual DoFn? The reason I'm asking is that if you are launching jobs inside a DoFn, that might relate to some issues.

As far as Oozie and Crunch integration goes, you typically specify the driver class when creating the MRPipeline instance. This helps Crunch find the jar containing the driver and automatically push it to the DistributedCache. If the jar has more dependencies needed to run, I believe those need to be specified through the "-libjars" argument when launching.[1] This should flow through the Configuration object that Tool/ToolRunner pass in and that you ideally use to create your Pipeline.

I haven't checked it out in a while, but you could look at tools like Kite, which has a Maven plugin that can help generate the "-libjars" command line options and would handle the DistributedCache for you.[2] Last I looked it had some limitations out of the box, but it could be a pattern to emulate. Crunch also has a class, DistCache,[3] with a few convenience methods for pushing those files into HDFS.

[1] - http://stackoverflow.com/questions/23862309/oozie-throws-java-lang-classnotfoundexception
[2] - http://kitesdk.org/docs/current/kite-maven-plugin/package-app-mojo.html
[3] - http://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/util/DistCache.html

On Tue, Dec 2, 2014 at 4:07 PM, Mike Barretta <[email protected]> wrote:

> FWIW, I solved this by manually adding all necessary jars into the
> DistributedCache...ugly, but effective!
>
> On Wed, Nov 26, 2014 at 12:29 PM, Mike Barretta <[email protected]> wrote:
>
>> Thank you for the quick reply.
>>
>> I am indeed using the Oozie workflow lib directory as described here:
>> http://oozie.apache.org/docs/3.3.2/WorkflowFunctionalSpec.html#a7_Workflow_Application_Deployment
>>
>> The primary job, which implements Tool, is able to run; it's just the
>> jobs launched by the doFn() which fail. Is there a step where I might need
>> to tell the Crunch pipeline about the jars loaded by Oozie?
>>
>> On Fri, Nov 21, 2014 at 5:27 PM, Micah Whitacre <[email protected]> wrote:
>>
>>> The support of a lib folder inside of a jar is not necessarily
>>> guaranteed on all versions of Hadoop.[1]
>>>
>>> We typically go with the "uber" jar approach, where we use the
>>> maven-shade-plugin to explode the Crunch dependencies and others into
>>> the assembly jar. Another approach, since you are using Oozie, is to
>>> include the jar in the workflow lib directory; that should put the jar
>>> on the classpath. The last approach is to manually use the
>>> DistributedCache yourself, which will distribute the jar out to the
>>> cluster.
>>>
>>> [1] - http://blog.cloudera.com/blog/2011/01/how-to-include-third-party-libraries-in-your-map-reduce-job/
>>>
>>> On Fri, Nov 21, 2014 at 4:15 PM, Mike Barretta <[email protected]> wrote:
>>>
>>>> All,
>>>>
>>>> I'm running an MRPipeline from crunch-core 0.11.0-hadoop2 on a CDH5.1
>>>> cluster via Oozie. While the main job runs okay, the doFn() it calls
>>>> fails due to a CNFE (ClassNotFoundException). The jar containing my
>>>> classes does indeed contain lib/crunch-core-0.11.0-hadoop2.jar.
>>>>
>>>> Does the Crunch jar need to be added to the Hadoop lib on all nodes?
>>>> It seems like that would/should be unnecessary.
>>>>
>>>> Thanks,
>>>> Mike
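[Editor's note] The Tool/ToolRunner/MRPipeline pattern discussed in this thread can be sketched roughly as below. This is a minimal, hypothetical example (the `WordCountTool` class name and the argument paths are made up), assuming the crunch-core 0.11.0-hadoop2 API; the key points are passing the driver class so Crunch can locate and ship its containing jar, and passing `getConf()` so options parsed by ToolRunner, such as `-libjars`, flow into the pipeline:

```java
import org.apache.crunch.PCollection;
import org.apache.crunch.PTable;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class WordCountTool extends Configured implements Tool {

  @Override
  public int run(String[] args) throws Exception {
    // Passing the driver class lets Crunch find the containing jar and
    // push it to the DistributedCache; passing getConf() carries through
    // any -libjars and -D options that ToolRunner parsed from the command line.
    Pipeline pipeline = new MRPipeline(WordCountTool.class, getConf());

    PCollection<String> lines = pipeline.readTextFile(args[0]);
    // The count() call runs DoFns in the launched MR jobs -- the jobs that
    // were failing with ClassNotFoundException in this thread.
    PTable<String, Long> counts = lines.count();
    pipeline.writeTextFile(counts, args[1]);

    return pipeline.done().succeeded() ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    System.exit(ToolRunner.run(new Configuration(), new WordCountTool(), args));
  }
}
```

A launch command would then name extra dependency jars explicitly, along the lines of `hadoop jar my-app.jar WordCountTool -libjars /path/to/dep1.jar,/path/to/dep2.jar <in> <out>` (paths hypothetical).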

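[Editor's note] The manual workaround Mike describes (pushing the needed jars into the DistributedCache yourself) might look roughly like the sketch below, using the convenience methods on Crunch's `org.apache.crunch.util.DistCache` class referenced in [3]. The jar paths are hypothetical, and the exact method signatures should be checked against the 0.11.0 javadoc:

```java
import java.io.File;
import org.apache.crunch.util.DistCache;
import org.apache.hadoop.conf.Configuration;

public class PushDependencyJars {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Push a single dependency jar out to the cluster so task JVMs can load it.
    DistCache.addJarToDistributedCache(conf, new File("lib/crunch-core-0.11.0-hadoop2.jar"));

    // Or push every jar in a local directory in one call.
    DistCache.addJarDirToDistributedCache(conf, "lib/");

    // Create the MRPipeline with this same Configuration afterwards, so the
    // MR jobs launched for the DoFns see the cached jars on their classpath.
  }
}
```

Ugly but effective, as the thread says; the uber-jar (maven-shade-plugin) approach avoids this bookkeeping entirely by bundling the dependencies into one assembly jar.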