I think a small program that writes the jars to the distributed cache should take care of your issue, as described here:
http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/filecache/DistributedCache.html
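
For the distributed cache route, something along these lines might work (an untested sketch against the 0.20 API; the paths and class name are made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CacheJars {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Put the dependency jar on HDFS once (placeholder paths).
    Path onHdfs = new Path("/libs/dependency1.jar");
    fs.copyFromLocalFile(new Path("lib/dependency1.jar"), onHdfs);

    // Add it to the classpath of every task via the distributed cache.
    DistributedCache.addFileToClassPath(onHdfs, conf);

    // ...then configure and submit the job with this conf as usual.
  }
}

If I read Daan's suggestion below correctly, it amounts to the same idea: you pay the upload once and every job reuses the jars from HDFS.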
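
Also, a guess on the "Mkdirs failed to create .../META-INF/license" error quoted below: the /Users path suggests a Mac, and on a case-insensitive filesystem unpacking a jar that contains both a META-INF/LICENSE file and a META-INF/license/ directory fails exactly like that. Deleting the offending entry from the jar before submitting, e.g. 'zip -d yourjob.jar META-INF/LICENSE' (jar name is a placeholder), might get you past it.
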
On Wed, Nov 16, 2011 at 8:33 AM, Something Something <[email protected]> wrote:

> Thanks Bejoy & Friso. When I use the all-in-one jar file created by Maven
> I get this:
>
> Mkdirs failed to create
> /Users/xyz/hdfs/hadoop-unjar4743660161930001886/META-INF/license
>
> Do you recall coming across this? Our 'all-in-one' jar is not exactly how
> you have described it. It doesn't contain any JARs, but it has all the
> classes from all the dependent JARs.
>
> On Wed, Nov 16, 2011 at 7:59 AM, Friso van Vollenhoven
> <[email protected]> wrote:
>
>> We usually package our jobs as a single jar with a /lib directory inside
>> that contains all the other jars the job code depends on. Hadoop
>> understands this layout when run as 'hadoop jar'. So the jar layout
>> would be something like:
>>
>> /META-INF/manifest.mf
>> /com/mypackage/MyMapperClass.class
>> /com/mypackage/MyReducerClass.class
>> /lib/dependency1.jar
>> /lib/dependency2.jar
>> etc.
>>
>> If you use Maven or some other build tool with dependency management,
>> you can usually produce this jar as part of your build. We also have
>> Maven write the main class to the manifest, such that there is no need
>> to type it. So for us, submitting a job looks like:
>>
>> hadoop jar jar-with-all-deps-in-lib.jar arg1 arg2 argN
>>
>> Then Hadoop will take care of submitting and distributing, etc. Of
>> course you pay the penalty of always sending all of your dependencies
>> over the wire (the job jar gets replicated to 10 machines by default).
>> Pre-distributing sounds tedious and error prone to me. What if you have
>> different jobs that require different versions of the same dependency?
>>
>> HTH,
>> Friso
>>
>> On 16 nov. 2011, at 15:42, Something Something wrote:
>>
>> Bejoy - Thanks for the reply. The '-libjars' is not working for me with
>> 'hadoop jar'. Also, as per the documentation
>> (http://hadoop.apache.org/common/docs/current/commands_manual.html#jar):
>>
>> Generic Options
>>
>> The following options are supported by dfsadmin, fs, fsck, job and
>> fetchdt.
>>
>> Does it work for you? If it does, please let me know. "Pre-distributing"
>> definitely works, but is that the best way? If you have a big cluster
>> and jars are changing often it will be time-consuming.
>>
>> Also, how does Pig do it? We update Pig UDFs often and put them only on
>> the 'client' machine (machine that starts the Pig job) and the UDF
>> becomes available to all machines in the cluster - automagically! Is
>> Pig doing the pre-distributing for us?
>>
>> Thanks for your patience & help with our questions.
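
On the -libjars question quoted above: as far as I can tell, 'hadoop jar' does pass the generic options through, but only when the main class actually parses them, which usually means running it through ToolRunner. If main() builds its own Configuration and ignores those arguments, -libjars is silently dropped. A minimal sketch (package and class names are made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyJob extends Configured implements Tool {
  public int run(String[] args) throws Exception {
    // getConf() already has the -libjars entries applied.
    Job job = new Job(getConf(), "my job");
    job.setJarByClass(MyJob.class);
    // ...set mapper/reducer and input/output paths from args...
    return job.waitForCompletion(true) ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    // ToolRunner strips the generic options (-libjars, -D, -files, ...)
    // out of args and applies them to the Configuration.
    System.exit(ToolRunner.run(new Configuration(), new MyJob(), args));
  }
}

Note that the generic options have to come before the job's own arguments:

hadoop jar myjob.jar com.mypackage.MyJob -libjars dep1.jar,dep2.jar arg1 arg2

And as far as I know, Pig does the same trick for you: the jars you REGISTER get shipped with each job it submits, which is why updating a UDF on the client machine is enough.
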
>> On Wed, Nov 16, 2011 at 6:29 AM, Something Something
>> <[email protected]> wrote:
>>
>>> Hmm... there must be a different way 'cause we don't need to do that
>>> to run Pig jobs.
>>>
>>> On Tue, Nov 15, 2011 at 10:58 PM, Daan Gerits <[email protected]>
>>> wrote:
>>>
>>>> There might be different ways but currently we are storing our jars
>>>> on HDFS and registering them from there. They will be copied to the
>>>> machine once the job starts. Is that an option?
>>>>
>>>> Daan.
>>>>
>>>> On 16 Nov 2011, at 07:24, Something Something wrote:
>>>>
>>>>> Until now we were manually copying our jars to all machines in a
>>>>> Hadoop cluster. This used to work while our cluster was small. Now
>>>>> our cluster is getting bigger. What's the best way to start a Hadoop
>>>>> job that automatically distributes the jar to all machines in a
>>>>> cluster?
>>>>>
>>>>> I read the doc at:
>>>>> http://hadoop.apache.org/common/docs/current/commands_manual.html#jar
>>>>>
>>>>> Would -libjars do the trick? But we need to use 'hadoop job' for
>>>>> that, right? Until now, we were using 'hadoop jar' to start all our
>>>>> jobs.
>>>>>
>>>>> Needless to say, we are getting our feet wet with Hadoop, so we
>>>>> appreciate your help with our dumb questions.
>>>>>
>>>>> Thanks.
>>>>>
>>>>> PS: We use Pig a lot, which automatically does this, so there must
>>>>> be a clean way to do this.

--
Thanks,
John C
