including /plugins/classes in plugin.folders made it work. thank you!!!

On Tue, Sep 18, 2012 at 10:58 AM, Walter Tietze <[email protected]> wrote:
> On 18.09.2012 18:46, Casey McTaggart wrote:
> > thanks Walter, I still am unable to get anything to run - I think it's
> > because Hadoop is for some reason not finding the Tika jar. I tried
> > running Hadoop with -libjars, including both the Nutch jar and the
> > Tika jar, and when I do this it gives me 0 URLs - it doesn't even fetch
> > the seed list! When I don't run it with -libjars, it fetches the seed
> > list, then stops with the ClassNotFound exception in the CrawlDatum.
> >
> > I'll try the solution you just posted. But any idea why this is
> > happening?
> > thanks!
> > Casey
>
> Hi Casey,
>
> sorry, but I think the changes I mentioned really were all the changes I
> made. I'll check my code again to see whether I forgot to post something.
>
> Remark: I also tried to insert the workaround into the nutch-2.0 code
> base, but was unable to make it work, because nutch-2.0 already uses
> the new MapReduce classes and does not seem to implement the same
> loading mechanism for the plugin repository.
>
> Any other ideas?
>
> Cheers, Walter
>
> > On Mon, Sep 17, 2012 at 11:30 AM, Walter Tietze <[email protected]> wrote:
> >
> > Hi,
> >
> > I had the same problems and couldn't find a properly satisfying way
> > around them.
> >
> > I also tried nutch-2.0 with CDH4 and YARN / MR_v2 (without MR_v1)
> > and couldn't make it work either.
> >
> > But I found a workaround to make nutch 1.5.1 work on CDH4.
> >
> > Since MR_v2 it is no longer possible to pack the whole project as a
> > *nutch*.job, and since the former JobTracker/TaskTracker pair has been
> > split into the ResourceManager and the NodeManager, the NodeManager
> > seems unable to handle the packed nutch project.
> >
> > (see also:
> > http://www.cloudera.com/blog/2011/01/how-to-include-third-party-libraries-in-your-map-reduce-job/
> > )
> >
> > Something one can do is to unpack the job on the NodeManager manually
> > and to load the classes from within the code into the current
> > classloader.
> >
> > I modified org/apache/nutch/plugin/PluginManifestParser.java
> > slightly and everything works fine, at least for the moment.
> >
> > I attached the modified file.
> >
> > Please note, I don't yet know whether CDH4 removes the application
> > directories and the unpacked files properly. You should consider
> > checking whether the directories are still needed after the crawl
> > has succeeded.
> >
> > Hope this helps, cheers, Walter
> >
> >
> > On 17.09.2012 18:31, Casey McTaggart wrote:
> > > I would also like to add that I can run the same crawl locally and
> > > it's successful. So, it's just the distributed mode that's not
> > > working. Can anyone offer any advice? Do you think it might be
> > > something with CDH4?
> > >
> > > On Sat, Sep 15, 2012 at 5:22 PM, Casey McTaggart
> > > <[email protected]> wrote:
> > >
> > >> Hi everyone,
> > >>
> > >> I'm using Hadoop as installed by Cloudera (CDH4)... I think it's
> > >> version 1.0.1. I can run a local filesystem crawl with Nutch, and
> > >> it returns what I'd expect. However, I need to take advantage of
> > >> the mapreduce functionality, since I want to crawl a local
> > >> filesystem with many GB of files. I'm going to put all of these
> > >> files on an Apache server so they can be crawled. First, though,
> > >> I want to just crawl a simple website, and I can't make it work.
> > >>
> > >> My urls/seed.txt is on hdfs and contains this:
> > >> http://lucene.apache.org
> > >>
> > >> I run this command:
> > >> sudo -u hdfs hadoop jar build/apache-nutch-1.5.1.job
> > >> org.apache.nutch.crawl.Crawl urls/seed.txt -dir crawl
> > >>
> > >> Sometimes it fetches the URL, but does not go beyond depth 1...
> > >> and when I examine the CrawlDatum that's in
> > >> /user/hdfs/crawl/crawldb/current/part-00000/data, it has one
> > >> entry: the seed url as the key, and the value of the CrawlDatum is
> > >> _pst_=exception(16), lastModified=0: java.lang.NoClassDefFoundError:
> > >> org/apache/tika/mime/MimeTypeException
> > >>
> > >> Okay, so I tried running the command again with -libjars
> > >> nutch1.5.1.jar, and it fails with an ArrayIndexOutOfBoundsException.
> > >> I tried running it with -libjars /user/hdfs/lib/tika-core-1.1.jar,
> > >> and that fails with:
> > >>
> > >> 12/09/15 17:09:55 WARN crawl.Generator: Generator: 0 records
> > >> selected for fetching, exiting ...
> > >> 12/09/15 17:09:55 INFO crawl.Crawl: Stopping at depth=0 - no more
> > >> URLs to fetch.
> > >> 12/09/15 17:09:55 WARN crawl.Crawl: No URLs to fetch - check your
> > >> seed list and URL filters.
> > >> 12/09/15 17:09:55 INFO crawl.Crawl: crawl finished: crawl
> > >>
> > >> I tried copying lib/tika-core-1.1.jar to /usr/local/hadoop-1.0.1/lib,
> > >> and still 0 URLs are fetched.
> > >>
> > >> I'm totally at a loss. Can someone help?
> > >>
> > >> Here's my regex-urlfilter:
> > >>
> > >> # skip file: ftp: and mailto: urls
> > >> -^(file|ftp|mailto):
> > >> # skip image and other suffixes we can't yet parse
> > >> # for a more extensive coverage use the urlfilter-suffix plugin
> > >> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
> > >> # skip URLs containing certain characters as probable queries, etc.
> > >> -[?*!@=]
> > >> # skip URLs with slash-delimited segment that repeats 3+ times,
> > >> # to break loops
> > >> -.*(/[^/]+)/[^/]+\1/[^/]+\1/
> > >> # accept anything else
> > >> +.
> > >>
> > >> here's my nutch-site.xml:
> > >>
> > >> <configuration>
> > >>   <property>
> > >>     <name>http.agent.name</name>
> > >>     <value>nutchtest</value>
> > >>   </property>
> > >>   <property>
> > >>     <name>plugin.folders</name>
> > >>     <value>/projects/nutch/apache-nutch-1.5.1/build/plugins,/projects/nutch/apache-nutch-1.5.1/lib</value>
> > >>   </property>
> > >> </configuration>
> > >>
> > >> which also does not work if I include this part:
> > >>
> > >> <property>
> > >>   <name>plugin.includes</name>
> > >>   <value>protocol-http|urlnormalizer-(basic|pass|regex)|urlfilter-regex|parse-(xml|text|html|tika)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|addhdfskey</value>
> > >> </property>
> >
> > --
> > --------------------------------
> > Walter Tietze
> > Senior Software Engineer
> > Research
> >
> > Neofonie GmbH
> > Robert-Koch-Platz 4
> > 10115 Berlin
> >
> > T +49.30 24627 318
> > F +49.30 24627 120
> >
> > [email protected]
> > http://www.neofonie.de
> >
> > Handelsregister
> > Berlin-Charlottenburg: HRB 67460
> >
> > Geschäftsführung:
> > Thomas Kitlitschko
> > --------------------------------
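[Editor's note: the fix reported at the top of the thread amounts to adding the plugins/classes directory to plugin.folders in nutch-site.xml. A sketch of the resulting property, assuming the build paths quoted in the thread (the exact path prefix depends on where the Nutch build tree lives):]

```xml
<!-- Hypothetical nutch-site.xml fragment; the "/plugins/classes" entry
     is the addition Casey reports as the fix. Paths are illustrative. -->
<property>
  <name>plugin.folders</name>
  <value>/projects/nutch/apache-nutch-1.5.1/build/plugins,/projects/nutch/apache-nutch-1.5.1/build/plugins/classes,/projects/nutch/apache-nutch-1.5.1/lib</value>
</property>
```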

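[Editor's note: Walter's workaround of loading the unpacked plugin jars into the running classloader can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the actual PluginManifestParser change from the attachment; the class and method names here are invented for the example.]

```java
import java.io.File;
import java.net.URL;
import java.net.URLClassLoader;

// Sketch: collect the jars from an unpacked plugin directory and expose
// them through a URLClassLoader chained to the current context loader,
// so plugin classes (e.g. Tika) become resolvable at task runtime.
public class PluginClassloaderSketch {
    static ClassLoader extendClassLoader(File pluginDir) throws Exception {
        File[] jars = pluginDir.listFiles((d, n) -> n.endsWith(".jar"));
        URL[] urls = new URL[jars == null ? 0 : jars.length];
        for (int i = 0; i < urls.length; i++) {
            urls[i] = jars[i].toURI().toURL();
        }
        // Parent delegation keeps already-loaded classes visible.
        return new URLClassLoader(urls, Thread.currentThread().getContextClassLoader());
    }

    public static void main(String[] args) throws Exception {
        ClassLoader cl = extendClassLoader(new File(System.getProperty("java.io.tmpdir")));
        // Classes from the parent loader stay resolvable through the new loader.
        System.out.println(cl.loadClass("java.lang.String").getName());
    }
}
```

In the real workaround the resulting loader would have to be installed as the thread's context classloader before the plugin repository is initialized.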

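[Editor's note: the regex-urlfilter rules quoted in the thread are plain Java regular expressions, so a quick sanity check of a rule against the seed URL is easy. The snippet below uses an abridged version of the suffix-exclusion rule, purely as an illustration:]

```java
import java.util.regex.Pattern;

public class UrlFilterCheck {
    public static void main(String[] args) {
        // Abridged form of the suffix-exclusion rule from the
        // regex-urlfilter above; a match means the URL is skipped.
        Pattern skip = Pattern.compile(
            "\\.(gif|GIF|jpg|JPG|png|PNG|css|CSS|js|JS)$");
        System.out.println(skip.matcher("http://lucene.apache.org/images/logo.png").find()); // true -> skipped
        System.out.println(skip.matcher("http://lucene.apache.org/index.html").find());      // false -> kept
    }
}
```

The same check applied to the query-character rule `-[?*!@=]` is worth running against one's seed URLs when the Generator reports "0 records selected for fetching".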