Hi Brian,

It would be easier to simply generate a job file and use the script in bin to
run the tasks. Hard-copying the plugins and jars onto each machine is not
practical. The reason we separated the jars+plugins layout from the job file
in the 1.3 runtimes was to avoid possible conflicts.
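As a rough sketch, assuming a stock Nutch 1.3 source checkout (artifact name and crawldb/urls paths are illustrative, adjust to your setup), the deploy runtime bundles the plugins into the .job file so nothing needs to be copied onto the worker nodes by hand:

```shell
# Build both runtimes (runtime/local and runtime/deploy) from the source tree.
ant runtime

# The deploy runtime holds a self-contained job file plus the launcher script;
# the job file ships the plugins and jars with each Hadoop job.
ls runtime/deploy/apache-nutch-1.3.job runtime/deploy/bin/nutch

# Run a step against the cluster via the bin script, e.g. injecting seed URLs
# (crawl/crawldb and urls are example paths on HDFS).
runtime/deploy/bin/nutch inject crawl/crawldb urls
```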

Julien



> I recently downloaded Nutch onto my local machine.  I wrote a few plugins
> for it and successfully crawled a few sites to make sure that my parsers and
> indexers worked well.  I then moved the Nutch installation onto our
> pre-existing Hadoop cluster by copying the needed libs, confs, and the
> build/plugins dir onto every machine in the cluster. I also adjusted
> nutch-site.xml to point to the hard-coded path where the plugins sit.  The
> Nutch system runs without errors, but it never gets past a few pages. It
> seems to get stuck grabbing only one page per level, and it fetches that
> same page on every pass. I have included the interesting files and sys
> logs in the attachment for easy viewing. Does anyone have any ideas on why
> it's not going forward? It also seems to abort threads; any ideas?
>
> 2011-06-03 16:20:51,559 WARN org.apache.nutch.parse.ParserFactory: 
> ParserFactory:Plugin: org.apache.nutch.parse.html.HtmlParser mapped to 
> contentType application/xhtml+xml via parse-plugins.xml, but its plugin.xml 
> file does not claim to support contentType: application/xhtml+xml
> 2011-06-03 16:20:51,629 INFO org.apache.nutch.fetcher.Fetcher: 
> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=19
> 2011-06-03 16:20:51,629 WARN org.apache.nutch.fetcher.Fetcher: Aborting with 
> 10 hung threads.
>
>
> --
> Brian Griffey
> ShopSavvy Android and Big Data Developer
> 650-352-1429
>
>


-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
