Re: Porting legacy MapReduce application to Tez

Lewis John McGibbney Sat, 19 Dec 2020 21:38:30 -0800

Hi Jonathan,

Thank you for the response. This is very useful.


Using your configuration I am able to execute the Tez examples without any
problem. The issue arises when I attempt to run Nutch. No matter what I try,
the dependencies for Nutch are never found.
I've tried building a binary .tar.gz distribution of Nutch and referencing its
URI on HDFS; this does not work and I get ClassNotFound exceptions. I've also
tried referencing the Nutch .job artifact, which contains all dependencies;
this does not work either.

Just to confirm, I can successfully execute all Nutch jobs when the
'mapreduce.framework.name' value is set to 'yarn'. We execute the jobs as
follows:

hadoop jar ${NUTCH_JOB} $CLASS $arguments
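
For comparison, the Tez install guide describes running existing MapReduce
jobs on Tez by switching that same property from 'yarn' to 'yarn-tez' in
mapred-site.xml. A minimal sketch of the fragment, assuming it sits inside the
existing <configuration> element and nothing else in the file changes:

```xml
<!-- mapred-site.xml fragment (sketch): submit MapReduce jobs through Tez
     instead of the default YARN MapReduce runtime -->
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn-tez</value>
</property>
```

With this value the hadoop jar invocation above should stay the same; only the
execution framework changes.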

I feel like I am very close to getting this running. I wonder if someone on 
this list could make an attempt at running a job and seeing if they can 
reproduce? I've uploaded the compiled .job and the nutch bash script at 
https://drive.google.com/drive/folders/1yjGi8UWVZithcYWLgUINm9v6IU2Scmy5?usp=sharing

You can execute the Injector tool by running

./nutch inject crawldb urls

This assumes that urls is a directory on HDFS containing a simple text file
with one URL entry, e.g. http://tez.apache.org

Again, thank you to you all for any further direction. I am really keen to get 
Nutch running on Tez.

lewismc

On 2020/12/17 18:09:02, Jonathan Eagles <jeag...@gmail.com> wrote: 
> This is what I use in production; it has several benefits. In this case
> mapreduce.application.framework.path is the runtime classpath tar.gz file,
> a custom-built MapReduce runtime environment, perhaps similar to Nutch's.
> 1) Localizing one tar.gz file instead of many individual jars.
> 2) A minimal jar has fewer class conflicts and a smaller footprint.
> 3) Localizing Tez to a tez folder (#tez) allows better control of the
> classpath and avoids Java's inconsistent classpath resolution of jars in the
> same directory.
> 4) Setting tez.use.cluster.hadoop-libs to false avoids using the jars from
> the individual NodeManagers and relies only on the jars listed in
> tez.lib.uris.
> 
>   <property>
>     <name>mapreduce.application.framework.path</name>
>     <value>/hdfs/path/hadoop-mapreduce-${mapreduce.application.framework.version}.tgz#hadoop-mapreduce</value>
>   </property>
> 
>   <property>
>     <name>tez.lib.uris</name>
>     <value>/hdfs/path/tez-0.9.2-minimal.tar.gz#tez,${mapreduce.application.framework.path}</value>
>   </property>
>   <property>
>     <name>tez.lib.uris.classpath</name>
>     <value>${mapreduce.application.classpath},./tez/*,./tez/lib/*</value>
>   </property>
>   <property>
>     <name>tez.use.cluster.hadoop-libs</name>
>     <value>false</value>
>   </property>
> 
> On Thu, Dec 17, 2020 at 11:57 AM Lewis John McGibbney <lewi...@apache.org>
> wrote:
> 
> > I tried the following configuration in tez-site.xml with no luck
> >
> > <configuration>
> > <property>
> >   <name>tez.lib.uris</name>
> >   <value>${fs.defaultFS}/apps/tez-0.10.1-SNAPSHOT,${fs.defaultFS}/apps/tez-0.10.1-SNAPSHOT/lib,${fs.defaultFS}/apps/nutch/apache-nutch-1.18-SNAPSHOT.job</value>
> > </property>
> >
> > <property>
> >   <name>tez.lib.uris.classpath</name>
> >   <value>${fs.defaultFS}/apps/nutch/apache-nutch-1.18-SNAPSHOT.job</value>
> > </property>
> > </configuration>
> >
> > On 2020/12/17 17:35:28, Lewis John McGibbney <lewi...@apache.org> wrote:
> > > Hi Zhiyuan,
> > > Thanks for the guidance. I'm making progress but I am still battling
> > > initial configuration management issues.
> > > I'm running HDFS and YARN v3.1.4 in pseudo-distributed mode.
> > > My tez-site.xml contains the following content:
> > >
> > > <configuration>
> > > <property>
> > >   <name>tez.lib.uris</name>
> > >   <value>${fs.defaultFS}/apps/tez-0.10.1-SNAPSHOT,${fs.defaultFS}/apps/tez-0.10.1-SNAPSHOT/lib,${fs.defaultFS}/apps/nutch</value>
> > > </property>
> > > </configuration>
> > >
> > > N.B. When I attempted to use the compressed Tez tar.gz, I was running
> > > into classpath issues which are largely documented in the installation
> > > documentation you pointed me to. I overcame these issues by simply
> > > uploading the minimal directory. All seems fine at this stage as I can run
> > > all of the Tez examples.
> > >
> > > I run into trouble when I try to run any job from the Nutch application.
> > > For example, when I run the Injector, one of the Nutch plugin extension
> > > points (x point org.apache.nutch.net.URLNormalizer) cannot be found.
> > > The relevant log can be seen at https://paste.apache.org/4whoe.
> > > I should note that the entire Nutch .job is available on HDFS at the URI
> > > defined in the tez-site.xml above.
> > >
> > > The output of jar -tf on the nutch.job artifact can be seen at
> > > https://paste.apache.org/hl8tk.
> > > Am I required to somehow describe the structural hierarchy of this
> > > artifact in the tez.lib.uris.classpath configuration property?
> > >
> > > Thank you again for any guidance.
> > >
> > > lewismc
> > >
> > > On 2020/12/14 03:23:48, Zhiyuan Yang <zhiyu...@apache.org> wrote:
> > > > Hi Lewis,
> > > >
> > > > If there is no incompatibility, your existing job will run well on Tez
> > > > without code changes. You can just follow this guide
> > > > <https://tez.apache.org/install.html> (especially step 4) to try it
> > > > out.
> > > >
> > > > Thanks,
> > > > Zhiyuan
> > > >
> > > > On Mon, Dec 14, 2020 at 9:04 AM Lewis John McGibbney <
> > lewi...@apache.org>
> > > > wrote:
> > > >
> > >
> > >
> >
>
