Hi all,

I recently downloaded Nutch onto my local machine. I wrote a few plugins for it
and successfully crawled a few sites to make sure that my parsers and indexers
worked well. I then moved the Nutch installation onto our pre-existing Hadoop
cluster by copying the needed libs, confs, and the build/plugins dir onto every
machine in the Hadoop cluster. I also adjusted nutch-site.xml to point the
plugins to the hard-coded path where the plugins sit. The Nutch system runs
without errors, but it never gets past a few pages. It seems to get stuck
grabbing only one page per level, and it fetches that same page on every pass.
I have included the interesting files and sys logs in the attachment for easy
viewing. Does anyone have any ideas on why it's not going forward? It also
seems to abort threads; any ideas there?
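
For context, this is roughly what the plugin section of my nutch-site.xml looks like on each node (the /path/to/nutch/build/plugins value below is a placeholder for the actual hard-coded path; plugin.includes is abbreviated to the stock pattern plus our custom plugin ids):

```xml
<!-- nutch-site.xml: point Nutch at the plugins dir copied to every Hadoop node -->
<property>
  <name>plugin.folders</name>
  <value>/path/to/nutch/build/plugins</value>
  <description>Hard-coded absolute path to the plugins dir on each node.</description>
</property>
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-basic|scoring-opic</value>
  <description>Regex of plugin ids to activate, including our custom parsers/indexers.</description>
</property>
```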

2011-06-03 16:20:51,559 WARN org.apache.nutch.parse.ParserFactory: ParserFactory:Plugin: org.apache.nutch.parse.html.HtmlParser mapped to contentType application/xhtml+xml via parse-plugins.xml, but its plugin.xml file does not claim to support contentType: application/xhtml+xml
2011-06-03 16:20:51,629 INFO org.apache.nutch.fetcher.Fetcher: -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=19
2011-06-03 16:20:51,629 WARN org.apache.nutch.fetcher.Fetcher: Aborting with 10 hung threads.

-- 
Brian Griffey
ShopSavvy Android and Big Data Developer
650-352-1429
