On 18.09.2012 18:46, Casey McTaggart wrote:
> thanks Walter, I still am unable to get anything to run - I think it's
> because Hadoop is for some reason not finding the Tika jar. I tried
> running Hadoop with -libjars, including both the Nutch jar and the
> Tika jar, and when I do this it gives me 0 URLs - it doesn't even fetch
> the seed list! When I don't run it with -libjars, it fetches the seed
> list, then stops with the ClassNotFound exception in the CrawlDatum.
>
> I'll try the solution you just posted. But any idea why this is
> happening?
> thanks!
> Casey
Hi Casey,

sorry, but I think the changes I mentioned were really all the changes I made. I'll check my code again in case I forgot to post something.

Remark: I also tried to apply the workaround to the nutch-2.0 code base, but was unable to make it work, because nutch-2.0 already uses the new MapReduce classes and does not seem to implement the same loading mechanism for the plugin repository.

Any other ideas?

Cheers, Walter

> On Mon, Sep 17, 2012 at 11:30 AM, Walter Tietze <[email protected]> wrote:
>
> Hi,
>
> I had the same problems and couldn't get around them in a satisfying way.
>
> I also tried nutch-2.0 with CDH4 and YARN / MR_v2 (without MR_v1) and
> couldn't make that work either.
>
> But I found a workaround to make nutch 1.5.1 work on CDH4.
>
> With MR_v2 it is no longer possible to pack the whole project as a
> single *nutch*.job file: since the former TaskTracker is split into
> the ResourceManager and the NodeManager, the NodeManager seems unable
> to handle the packed nutch project.
>
> (see also:
> http://www.cloudera.com/blog/2011/01/how-to-include-third-party-libraries-in-your-map-reduce-job/
> )
>
> One thing you can do is unpack the job file on the NodeManager manually
> and load the classes into the current classloader from within the code.
>
> I modified org/apache/nutch/plugin/PluginManifestParser.java slightly,
> and everything works fine, at least for the moment.
>
> I attached the modified file.
>
> Please note that I don't yet know whether CDH4 removes the application
> directories and the unpacked files properly. You should check whether
> the directories are still needed after the crawl has succeeded.
>
> Hope this helps, cheers, Walter
>
> > On 17.09.2012 18:31, Casey McTaggart wrote:
> > I would also like to add that I can run the same crawl locally and
> > it's successful. So, it's just the distributed mode that's not working.
> > Can anyone offer any advice? Do you think it might be something with CDH4?
> >
> > On Sat, Sep 15, 2012 at 5:22 PM, Casey McTaggart <[email protected]> wrote:
> >
> >> Hi everyone,
> >>
> >> I'm using Hadoop as installed by Cloudera (CDH4)... I think it's version
> >> 1.0.1. I can run a local filesystem crawl with Nutch, and it returns what
> >> I'd expect. However, I need to take advantage of the MapReduce
> >> functionality, since I want to crawl a local filesystem with many GB of
> >> files. I'm going to put all of these files on an Apache server so they can
> >> be crawled. First, though, I want to just crawl a simple website, and I
> >> can't make it work.
> >>
> >> My urls/seed.txt is on HDFS and contains this:
> >> http://lucene.apache.org
> >>
> >> I run this command:
> >> sudo -u hdfs hadoop jar build/apache-nutch-1.5.1.job
> >> org.apache.nutch.crawl.Crawl urls/seed.txt -dir crawl
> >>
> >> Sometimes, it fetches the URL, but does not go beyond depth 1... and when
> >> I examine the CrawlDatum that's in
> >> /user/hdfs/crawl/crawldb/current/part-00000/data, it has one entry: the
> >> seed url as the key, and the value of the CrawlDatum is
> >> _pst_=exception(16), lastModified=0: java.lang.NoClassDefFoundError:
> >> org/apache/tika/mime/MimeTypeException
> >>
> >> Okay, so I tried running the command again with -libjars nutch1.5.1.jar,
> >> and it fails with an ArrayIndexOutOfBoundsException. I tried running it
> >> with -libjars /user/hdfs/lib/tika-core-1.1.jar, and that fails with:
> >>
> >> 12/09/15 17:09:55 WARN crawl.Generator: Generator: 0 records selected for fetching, exiting ...
> >> 12/09/15 17:09:55 INFO crawl.Crawl: Stopping at depth=0 - no more URLs to fetch.
> >> 12/09/15 17:09:55 WARN crawl.Crawl: No URLs to fetch - check your seed list and URL filters.
> >> 12/09/15 17:09:55 INFO crawl.Crawl: crawl finished: crawl
> >>
> >> I tried copying lib/tika-core-1.1.jar to /usr/local/hadoop-1.0.1/lib, and
> >> still 0 URLs are fetched.
> >>
> >> I'm totally at a loss. Can someone help?
> >>
> >> Here's my regex-urlfilter:
> >>
> >> # skip file: ftp: and mailto: urls
> >> -^(file|ftp|mailto):
> >> # skip image and other suffixes we can't yet parse
> >> # for a more extensive coverage use the urlfilter-suffix plugin
> >> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
> >> # skip URLs containing certain characters as probable queries, etc.
> >> -[?*!@=]
> >> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> >> -.*(/[^/]+)/[^/]+\1/[^/]+\1/
> >> # accept anything else
> >> +.
> >>
> >> Here's my nutch-site.xml:
> >>
> >> <configuration>
> >>   <property>
> >>     <name>http.agent.name</name>
> >>     <value>nutchtest</value>
> >>   </property>
> >>   <property>
> >>     <name>plugin.folders</name>
> >>     <value>/projects/nutch/apache-nutch-1.5.1/build/plugins,/projects/nutch/apache-nutch-1.5.1/lib</value>
> >>   </property>
> >> </configuration>
> >>
> >> which also does not work if I include this part:
> >>
> >>   <property>
> >>     <name>plugin.includes</name>
> >>     <value>protocol-http|urlnormalizer-(basic|pass|regex)|urlfilter-regex|parse-(xml|text|html|tika)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|addhdfskey</value>
> >>   </property>
>
> --
> --------------------------------
> Walter Tietze
> Senior Software Engineer
> Research
>
> Neofonie GmbH
> Robert-Koch-Platz 4
> 10115 Berlin
>
> T +49.30 24627 318
> F +49.30 24627 120
>
> [email protected]
> http://www.neofonie.de
>
> Handelsregister
> Berlin-Charlottenburg: HRB 67460
>
> Geschäftsführung:
> Thomas Kitlitschko
> --------------------------------
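[List archive note: the modified PluginManifestParser.java attachment is not preserved here. As a rough illustration of the classloading idea Walter describes - collecting the jars from a manually unpacked job directory and exposing them through a URLClassLoader parented to the current one - the sketch below may help. The class name, method names, and directory layout are invented for this example; this is not the actual patch.]

```java
import java.io.File;
import java.io.IOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: make the jars inside an unpacked *.job directory
// visible to the running task by building a classloader over them.
public class UnpackedJobClassLoader {

    // Recursively gather file:// URLs for every *.jar below `dir`.
    static List<URL> collectJarUrls(File dir) throws IOException {
        List<URL> urls = new ArrayList<URL>();
        File[] entries = dir.listFiles();
        if (entries == null) {
            return urls; // not a directory, or unreadable
        }
        for (File entry : entries) {
            if (entry.isDirectory()) {
                urls.addAll(collectJarUrls(entry));
            } else if (entry.getName().endsWith(".jar")) {
                urls.add(entry.toURI().toURL());
            }
        }
        return urls;
    }

    // Build a URLClassLoader that sees the unpacked jars and delegates
    // to `parent` (e.g. Thread.currentThread().getContextClassLoader()).
    static ClassLoader forUnpackedJob(File unpackedJobDir, ClassLoader parent)
            throws IOException {
        List<URL> urls = collectJarUrls(unpackedJobDir);
        return new URLClassLoader(urls.toArray(new URL[urls.size()]), parent);
    }
}
```

A plugin class could then be resolved with `Class.forName(name, true, loader)` against the returned loader instead of the default one. Whether the unpacked directory is cleaned up afterwards is, as Walter notes above, something to verify on your cluster.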

