I think a small program that writes the jars to the distributed cache should take care of your issue, as described here:
http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/filecache/DistributedCache.html
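
For the distributed cache route, something along these lines might work (an untested sketch against the 0.20 API; the paths and class name are made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CacheJars {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Put the dependency jar on HDFS once (placeholder paths).
    Path onHdfs = new Path("/libs/dependency1.jar");
    fs.copyFromLocalFile(new Path("lib/dependency1.jar"), onHdfs);

    // Add it to the classpath of every task via the distributed cache.
    DistributedCache.addFileToClassPath(onHdfs, conf);

    // ...then configure and submit the job with this conf as usual.
  }
}

If I read Daan's suggestion below correctly, it amounts to the same idea: you pay the upload once and every job reuses the jars from HDFS.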
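
Also, a guess on the "Mkdirs failed to create .../META-INF/license" error quoted below: the /Users path suggests a Mac, and on a case-insensitive filesystem unpacking a jar that contains both a META-INF/LICENSE file and a META-INF/license/ directory fails exactly like that. Deleting the offending entry from the jar before submitting, e.g. 'zip -d yourjob.jar META-INF/LICENSE' (jar name is a placeholder), might get you past it.
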
On Wed, Nov 16, 2011 at 8:33 AM, Something Something <[email protected]> wrote:

> Thanks Bejoy & Friso. When I use the all-in-one jar file created by Maven
> I get this:
>
> Mkdirs failed to create
> /Users/xyz/hdfs/hadoop-unjar4743660161930001886/META-INF/license
>
> Do you recall coming across this? Our 'all-in-one' jar is not exactly how
> you have described it. It doesn't contain any JARs, but it has all the
> classes from all the dependent JARs.
>
> On Wed, Nov 16, 2011 at 7:59 AM, Friso van Vollenhoven
> <[email protected]> wrote:
>
>> We usually package our jobs as a single jar with a /lib directory inside
>> that contains all the other jars the job code depends on. Hadoop
>> understands this layout when run as 'hadoop jar'. So the jar layout
>> would be something like:
>>
>> /META-INF/manifest.mf
>> /com/mypackage/MyMapperClass.class
>> /com/mypackage/MyReducerClass.class
>> /lib/dependency1.jar
>> /lib/dependency2.jar
>> etc.
>>
>> If you use Maven or some other build tool with dependency management,
>> you can usually produce this jar as part of your build. We also have
>> Maven write the main class to the manifest, such that there is no need
>> to type it. So for us, submitting a job looks like:
>>
>> hadoop jar jar-with-all-deps-in-lib.jar arg1 arg2 argN
>>
>> Then Hadoop will take care of submitting and distributing, etc. Of
>> course you pay the penalty of always sending all of your dependencies
>> over the wire (the job jar gets replicated to 10 machines by default).
>> Pre-distributing sounds tedious and error prone to me. What if you have
>> different jobs that require different versions of the same dependency?
>>
>> HTH,
>> Friso
>>
>> On 16 nov. 2011, at 15:42, Something Something wrote:
>>
>> Bejoy - Thanks for the reply. The '-libjars' is not working for me with
>> 'hadoop jar'. Also, as per the documentation
>> (http://hadoop.apache.org/common/docs/current/commands_manual.html#jar):
>>
>> Generic Options
>>
>> The following options are supported by dfsadmin, fs, fsck, job and
>> fetchdt.
>>
>> Does it work for you? If it does, please let me know. "Pre-distributing"
>> definitely works, but is that the best way? If you have a big cluster
>> and jars are changing often it will be time-consuming.
>>
>> Also, how does Pig do it? We update Pig UDFs often and put them only on
>> the 'client' machine (machine that starts the Pig job) and the UDF
>> becomes available to all machines in the cluster - automagically! Is
>> Pig doing the pre-distributing for us?
>>
>> Thanks for your patience & help with our questions.
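
On the -libjars question quoted above: as far as I can tell, 'hadoop jar' does pass the generic options through, but only when the main class actually parses them, which usually means running it through ToolRunner. If main() builds its own Configuration and ignores those arguments, -libjars is silently dropped. A minimal sketch (package and class names are made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyJob extends Configured implements Tool {
  public int run(String[] args) throws Exception {
    // getConf() already has the -libjars entries applied.
    Job job = new Job(getConf(), "my job");
    job.setJarByClass(MyJob.class);
    // ...set mapper/reducer and input/output paths from args...
    return job.waitForCompletion(true) ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    // ToolRunner strips the generic options (-libjars, -D, -files, ...)
    // out of args and applies them to the Configuration.
    System.exit(ToolRunner.run(new Configuration(), new MyJob(), args));
  }
}

Note that the generic options have to come before the job's own arguments:

hadoop jar myjob.jar com.mypackage.MyJob -libjars dep1.jar,dep2.jar arg1 arg2

And as far as I know, Pig does the same trick for you: the jars you REGISTER get shipped with each job it submits, which is why updating a UDF on the client machine is enough.
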
>> On Wed, Nov 16, 2011 at 6:29 AM, Something Something
>> <[email protected]> wrote:
>>
>>> Hmm... there must be a different way 'cause we don't need to do that
>>> to run Pig jobs.
>>>
>>> On Tue, Nov 15, 2011 at 10:58 PM, Daan Gerits <[email protected]>
>>> wrote:
>>>
>>>> There might be different ways but currently we are storing our jars
>>>> on HDFS and registering them from there. They will be copied to the
>>>> machine once the job starts. Is that an option?
>>>>
>>>> Daan.
>>>>
>>>> On 16 Nov 2011, at 07:24, Something Something wrote:
>>>>
>>>>> Until now we were manually copying our jars to all machines in a
>>>>> Hadoop cluster. This used to work while our cluster was small. Now
>>>>> our cluster is getting bigger. What's the best way to start a Hadoop
>>>>> job that automatically distributes the jar to all machines in a
>>>>> cluster?
>>>>>
>>>>> I read the doc at:
>>>>> http://hadoop.apache.org/common/docs/current/commands_manual.html#jar
>>>>>
>>>>> Would -libjars do the trick? But we need to use 'hadoop job' for
>>>>> that, right? Until now, we were using 'hadoop jar' to start all our
>>>>> jobs.
>>>>>
>>>>> Needless to say, we are getting our feet wet with Hadoop, so we
>>>>> appreciate your help with our dumb questions.
>>>>>
>>>>> Thanks.
>>>>>
>>>>> PS: We use Pig a lot, which automatically does this, so there must
>>>>> be a clean way to do this.

--
Thanks,
John C
