Hi!

This is how I made it work (Hadoop 3.1.3, Tez 0.10.0). The files I used
are on Drive:
<https://drive.google.com/file/d/1eFMUPSxFpJ0p7fi7IrsI3HAACa4m5s7n/view?usp=sharing>

1.
hdfs dfs -mkdir -p /apps/tez
hdfs dfs -put ~/Applications/apache/tez/tez.tar.gz /apps/tez

hdfs dfs -mkdir /nutch
hdfs dfs -put nutch.tar.gz /nutch

hdfs dfs -mkdir /user/$USER/
echo "https://www.jpl.nasa.gov/news/" > seed.txt
hdfs dfs -mkdir -p /user/$USER/urls
hdfs dfs -put seed.txt /user/$USER/urls #some examples
hadoop jar apache-nutch-1.18-SNAPSHOT.jar org.apache.nutch.crawl.Injector crawldb /user/$USER/urls
hadoop jar apache-nutch-1.18-SNAPSHOT.jar org.apache.nutch.crawl.CrawlDb crawldb

2. Some notes
The tez-site.xml that I used is included in the package; its content is:
<property>
<name>tez.lib.uris</name>
<value>/apps/tez/tez.tar.gz#tez,/nutch/nutch.tar.gz#nutch</value>
</property>
<property>
<name>tez.lib.uris.classpath</name>
<value>./tez/*,./tez/lib/*,./nutch/nutch/*,./nutch/nutch/lib/*,./nutch/nutch/classes/plugins/</value>
</property>
<property>
<name>tez.use.cluster.hadoop-libs</name>
<value>false</value>
</property>
<property>
<name>plugin.folders</name>
<value>nutch/nutch/classes/plugins</value>
</property>
I needed to create a nutch.tar.gz archive so that Tez could localize it
to the YARN containers through tez.lib.uris. Jar files are not
decompressed, so, for instance, a lib folder inside a jar won't be on the
classpath (docs
<https://tez.apache.org/releases/0.9.2/tez-api-javadocs/configs/TezConfiguration.html>).

*tez.lib.uris.classpath*: be aware of where the beautiful "/nutch/nutch/*"
comes from (it's just my quick repro). The first "/nutch" is there because
tez.lib.uris localizes nutch.tar.gz into that directory according to
"#nutch"; the second is there because my nutch.tar.gz has an inner root
folder named "nutch", so the files end up under PWD/nutch/nutch/* in the
container. You're free to create a nutch.tar.gz without a root folder
inside and configure the paths accordingly to make them prettier.
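
To see how the archive layout maps to container paths, here is a minimal
local sketch. It uses placeholder files instead of a real Nutch build, and
it only emulates the "#nutch" alias by unpacking into a directory named
nutch. The file names (apache-nutch.jar, dep.jar) and the rooted/flat
directory names are made up for the illustration.

```shell
#!/bin/sh
# Sketch only: placeholder files stand in for a real Nutch build.
set -eu

work=$(mktemp -d)
cd "$work"

# Stand-in for a Nutch runtime directory (empty files, not real jars).
mkdir -p nutch/lib nutch/classes/plugins
touch nutch/apache-nutch.jar nutch/lib/dep.jar

# Variant 1: archive WITH an inner root folder, as in the mail above.
tar -czf nutch-rooted.tar.gz nutch

# Variant 2: archive WITHOUT a root folder (-C enters nutch/ first).
tar -czf nutch-flat.tar.gz -C nutch .

# Emulate the "#nutch" alias: unpack each archive into a dir named "nutch".
mkdir -p rooted/nutch flat/nutch
tar -xzf nutch-rooted.tar.gz -C rooted/nutch
tar -xzf nutch-flat.tar.gz -C flat/nutch

# Print jar paths relative to the emulated container working directory.
echo "rooted archive:"
(cd rooted && find nutch -name '*.jar' | sort)
echo "flat archive:"
(cd flat && find nutch -name '*.jar' | sort)
```

With the flat archive, the classpath entries would presumably shrink to
./nutch/*,./nutch/lib/*,./nutch/classes/plugins/ and plugin.folders to
nutch/classes/plugins. That variant is untested here.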

*plugin.folders*: this must also point to the plugins directory under the
localized path.
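
A quick way to sanity-check a plugin.folders value before submitting a job
is to unpack the archive locally the same way and look for plugin
descriptors. The sketch below assumes each plugin sits in its own
subdirectory containing a plugin.xml. check_plugin_folder is a
hypothetical helper, not something Nutch or Tez ships, and the demo tree
is a placeholder.

```shell
#!/bin/sh
# Hypothetical helper for illustration; not part of Nutch or Tez.
set -eu

# Succeeds if DIR contains at least one <plugin-id>/plugin.xml.
check_plugin_folder() {
    dir=$1
    [ -d "$dir" ] || { echo "missing directory: $dir" >&2; return 1; }
    if find "$dir" -mindepth 2 -maxdepth 2 -name plugin.xml | grep -q .; then
        echo "ok: $dir"
    else
        echo "no plugin.xml found under: $dir" >&2
        return 1
    fi
}

# Demo with a placeholder tree that mimics the localized layout above.
demo=$(mktemp -d)
cd "$demo"
mkdir -p nutch/nutch/classes/plugins/urlnormalizer-basic
touch nutch/nutch/classes/plugins/urlnormalizer-basic/plugin.xml

check_plugin_folder nutch/nutch/classes/plugins
```

Passing a path that does not match the unpacked layout (for example
nutch/classes/plugins with the rooted archive) makes the helper return a
non-zero status.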

I've attached the Tez app logs for a successful Injector run.

Regards,
Laszlo Bodor

On Sun, 20 Dec 2020 at 06:38, Lewis John McGibbney <lewi...@apache.org>
wrote:

> Hi Jonathan,
>
> Thank you for the response. This is very useful.
>
> Using your configuration I am able to execute the Tez examples no problem.
> The issue is when I attempt to run Nutch. No matter what I've tried, the
> dependencies for Nutch are never found.
> I've tried building a binary .tar.gz distribution of Nutch and referencing
> its URI on HDFS... this does not work and I get ClassNotFound exceptions.
> I've tried referencing the Nutch .job artifact which contains all
> dependencies... this does not work.
>
> Just to confirm, I can successfully execute all Nutch jobs when the
> 'mapreduce.framework.name' value is set to 'yarn'. We execute the jobs as
> follows
>
> hadoop jar ${NUTCH.job} $CLASS $arguments
>
> I feel like I am very close to getting this running. I wonder if someone
> on this list could make an attempt at running a job and seeing if they can
> reproduce? I've uploaded the compiled .job and the nutch bash script at
> https://drive.google.com/drive/folders/1yjGi8UWVZithcYWLgUINm9v6IU2Scmy5?usp=sharing
>
> You can execute the Injector tool by running
>
> ./nutch inject crawldb urls //assuming that urls is a directory on HDFS
> containing a simple text file with one URL entry i.e.
> http://tez.apache.org
>
> Again, thank you to you all for any further direction. I am really keen to
> get Nutch running on Tez.
>
> lewismc
>
> On 2020/12/17 18:09:02, Jonathan Eagles <jeag...@gmail.com> wrote:
> > This is what I use in production; it has many benefits. In this case
> > mapreduce.application.framework.path is the runtime classpath tar.gz
> > file, a custom-built MapReduce runtime environment, perhaps similar to
> > Nutch.
> > 1) localizing one tar.gz file instead of many individual jars
> > 2) a minimal jar has fewer class conflicts and a smaller footprint
> > 3) localizing Tez to the tez folder (#tez) allows better control of the
> > classpath and avoids Java's inconsistent classpath resolution of jars in
> > the same directory
> > 4) setting tez.use.cluster.hadoop-libs to false avoids using the jars
> > from the individual nodemanagers and relies only on the jars listed in
> > tez.lib.uris
> >
> >   <property>
> >     <name>mapreduce.application.framework.path</name>
> >     <value>/hdfs/path/hadoop-mapreduce-${mapreduce.application.framework.version}.tgz#hadoop-mapreduce</value>
> >   </property>
> >
> >   <property>
> >     <name>tez.lib.uris</name>
> >     <value>/hdfs/path/tez-0.9.2-minimal.tar.gz#tez,${mapreduce.application.framework.path}</value>
> >   </property>
> >   <property>
> >     <name>tez.lib.uris.classpath</name>
> >     <value>${mapreduce.application.classpath},./tez/*,./tez/lib/*</value>
> >   </property>
> >   <property>
> >     <name>tez.use.cluster.hadoop-libs</name>
> >     <value>false</value>
> >   </property>
> >
> > On Thu, Dec 17, 2020 at 11:57 AM Lewis John McGibbney
> > <lewi...@apache.org> wrote:
> >
> > > I tried the following configuration in tez-site.xml with no luck
> > >
> > > <configuration>
> > > <property>
> > >   <name>tez.lib.uris</name>
> > >   <value>${fs.defaultFS}/apps/tez-0.10.1-SNAPSHOT,${fs.defaultFS}/apps/tez-0.10.1-SNAPSHOT/lib,${fs.defaultFS}/apps/nutch/apache-nutch-1.18-SNAPSHOT.job</value>
> > > </property>
> > >
> > > <property>
> > >   <name>tez.lib.uris.classpath</name>
> > >   <value>${fs.defaultFS}/apps/nutch/apache-nutch-1.18-SNAPSHOT.job</value>
> > > </property>
> > > </configuration>
> > >
> > > On 2020/12/17 17:35:28, Lewis John McGibbney <lewi...@apache.org>
> > > wrote:
> > > > Hi Zhiyuan,
> > > > Thanks for the guidance. I'm making progress but I am still battling
> > > > initial configuration management issues.
> > > > I'm running HDFS and YARN v3.1.4 in pseudo-mode.
> > > > My tez-site.xml contains the following content
> > > >
> > > > <configuration>
> > > > <property>
> > > >   <name>tez.lib.uris</name>
> > > >   <value>${fs.defaultFS}/apps/tez-0.10.1-SNAPSHOT,${fs.defaultFS}/apps/tez-0.10.1-SNAPSHOT/lib,${fs.defaultFS}/apps/nutch</value>
> > > > </property>
> > > > </configuration>
> > > >
> > > > N.B. When I attempted to use the compressed Tez tar.gz, I was running
> > > > into classpath issues which are largely documented in the installation
> > > > documentation you pointed me to. I overcame these issues by simply
> > > > uploading the minimal directory. All seems fine at this stage as I can
> > > > run all of the Tez examples.
> > > >
> > > > I run into trouble when I try to run any job from the Nutch
> > > > application. For example, when I run the Injector, one of the Nutch
> > > > plugin extension points (x point org.apache.nutch.net.URLNormalizer)
> > > > cannot be found. The relevant log can be seen at
> > > > https://paste.apache.org/4whoe.
> > > > I should note that the entire Nutch .job is available on HDFS at the
> > > > URI defined in the tez-site.xml above.
> > > >
> > > > The output of jar -tf on the nutch.job artifact can be seen at
> > > > https://paste.apache.org/hl8tk.
> > > > Am I required to somehow describe the structural hierarchy of this
> > > > artifact in the tez.lib.uris.classpath configuration property?
> > > >
> > > > Thank you again for any guidance.
> > > >
> > > > lewismc
> > > >
> > > > On 2020/12/14 03:23:48, Zhiyuan Yang <zhiyu...@apache.org> wrote:
> > > > > Hi Lewis,
> > > > >
> > > > > If there is no incompatibility, your existing job will run well on
> > > > > Tez without code change. You can just follow this guide
> > > > > <https://tez.apache.org/install.html> (especially step 4) to try it
> > > > > out.
> > > > >
> > > > > Thanks,
> > > > > Zhiyuan
> > > > >
> > > > > On Mon, Dec 14, 2020 at 9:04 AM Lewis John McGibbney
> > > > > <lewi...@apache.org> wrote:
> > > > >
> > > >
> > > >
> > >
> >
>

Attachment: application_1608498768395_0036.tar.gz
Description: GNU Zip compressed data
