Re: Error parsing html

Sebastian Nagel Tue, 09 Oct 2012 15:30:35 -0700

> I should mention, that I'm using Nutch in a Web-Application.
It's possible, though it's hard.


> While debugging I came across the runParser method in ParseUtil class in
> which the task.get(MAX_PARSE_TIME, TimeUnit.SECONDS); returns null.
See http://wiki.apache.org/nutch/RunNutchInEclipse#Debugging_and_Timeouts
(the default timeout is 30 sec., so you cannot seriously debug within this time)
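
To illustrate why a debugging session can end up with a null result: the following is a generic sketch of a Future.get() call with a timeout, not the actual ParseUtil code. The class name ParseTimeoutSketch and the helper runWithTimeout are made up for this example, and returning null on timeout is an assumption modelled on the behaviour described above.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical illustration; names are not taken from Nutch.
class ParseTimeoutSketch {

    /** Runs the task in a separate thread and returns null if it does not finish in time. */
    public static String runWithTimeout(Callable<String> task, long timeout, TimeUnit unit) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<String> future = executor.submit(task);
        try {
            // Blocks for at most the given time, then throws TimeoutException.
            return future.get(timeout, unit);
        } catch (TimeoutException e) {
            // The task took too long (e.g. the thread is paused at a breakpoint).
            future.cancel(true);
            return null;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        } catch (ExecutionException e) {
            // The task itself threw an exception.
            return null;
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        String fast = runWithTimeout(() -> "parsed", 1, TimeUnit.SECONDS);
        String slow = runWithTimeout(() -> {
            Thread.sleep(2000); // simulates a parser held up in the debugger
            return "parsed";
        }, 200, TimeUnit.MILLISECONDS);
        System.out.println("fast task: " + fast);
        System.out.println("slow task: " + slow);
    }
}
```

If the parser thread is stopped at a breakpoint for longer than the timeout, the caller gives up in the same way, which is why the wiki page recommends raising the timeout while debugging.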

> Therefore i included nutch.jar (which i found in the bin.zip download), i 
> copied the
> following folders to the project workspace: conf, crawl, plugins, runtime
What about lib/ and all the jars it contains? You need all of them. The
libs required by the parse plugins are among them. This would explain why
fetching succeeds and parsing fails.
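
A minimal sketch of collecting the jars into a class path string, assuming a layout with a conf/ directory and a lib/ directory full of jars. The class ClassPathSketch and its build method are hypothetical, and the directory names passed in main are only placeholders for your installation.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper; not part of Nutch.
class ClassPathSketch {

    /** Joins the conf directory and every *.jar in libDir into one class path string. */
    public static String build(File confDir, File libDir) {
        List<String> entries = new ArrayList<>();
        entries.add(confDir.getPath());
        // listFiles returns null if libDir does not exist or is not a directory.
        File[] jars = libDir.listFiles((dir, name) -> name.endsWith(".jar"));
        if (jars != null) {
            Arrays.sort(jars); // stable order makes the result easier to compare
            for (File jar : jars) {
                entries.add(jar.getPath());
            }
        }
        return String.join(File.pathSeparator, entries);
    }

    public static void main(String[] args) {
        // Assumed layout: <home>/conf and <home>/lib; pass the home directory as argument.
        File home = new File(args.length > 0 ? args[0] : ".");
        System.out.println(build(new File(home, "conf"), new File(home, "lib")));
    }
}
```

Printing the result and comparing it with the class path your web application actually uses is a quick way to spot missing jars.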

In general, setting up the class path is not trivial.
Have a look at the script bin/nutch and try to construct the
path the same way. Or even better (and much easier to develop):
run the crawler from your webapp via Runtime.getRuntime().exec()
(or ProcessBuilder) calling a shell script which does the job.
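
A rough sketch of the external-process approach, using the standard ProcessBuilder API. The class CrawlLauncher, its run method, and the idea of a wrapper script passed as an argument are assumptions made for illustration; the script itself is whatever you write around bin/nutch. Returning -1 when the process cannot be started is a design choice of this sketch.

```java
import java.io.File;
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

// Hypothetical launcher; not part of Nutch.
class CrawlLauncher {

    /**
     * Starts the command, waits for it to finish and returns its exit code,
     * or -1 if the process could not be started or the wait was interrupted.
     */
    public static int run(List<String> command, File workDir) {
        ProcessBuilder builder = new ProcessBuilder(command);
        if (workDir != null) {
            builder.directory(workDir);
        }
        // Send the child's stdout/stderr to this JVM's console or container log.
        builder.inheritIO();
        try {
            Process process = builder.start();
            return process.waitFor();
        } catch (IOException e) {
            return -1;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return -1;
        }
    }

    public static void main(String[] args) {
        if (args.length == 0) {
            System.err.println("usage: CrawlLauncher <script> [args...]");
            return;
        }
        // The script path is supplied by the caller; nothing is assumed about its location.
        int exitCode = run(Arrays.asList(args), null);
        System.out.println("script exited with " + exitCode);
    }
}
```

The advantage of this design is that the script sets up the class path in its own process, so the web application's class loader never has to see the Nutch jars.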

To give more detailed help we need more information:
 - class path
 - exact call of the crawler

Sebastian
