Hi,

I am currently trying to integrate Nutch into a Java application.
I adapted the original Crawl class to my needs and perform the following
steps:
- configuring
- injecting
- generating
- fetching
- updating CrawlDB
- indexing into Solr

Since Nutch is normally configured through various XML files, I am not sure
how to configure it correctly through the Nutch API.

According to http://wiki.apache.org/nutch/JavaDemoApplication
setting plugin.folders should be enough, but that does not work for me.

I currently set the following configuration:


    Configuration conf = NutchConfiguration.createCrawlConfiguration();
    conf.set("plugin.folders", "C:/server/nutch/plugins/");
    conf.set("plugin.includes",
        "myplugin|protocol-http|urlfilter-regex|parse-(text|html|js)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)");
    conf.set("urlfilter.regex.file",
        "C:/server/nutch/conf/regex-urlfilter.txt");
    conf.set("urlnormalizer.regex.file",
        "C:/server/nutch/conf/regex-normalize.xml");

I get no exceptions, but the following log entries show up:

12/07/21 14:29:24 ERROR api.RegexURLFilterBase: Can't find resource: C:/server/nutch/conf/regex-urlfilter.txt
12/07/21 14:29:24 WARN regex.RegexURLNormalizer: Can't load the default config file! C:/server/nutch/conf/regex-normalize.xml
12/07/21 14:29:24 INFO crawl.FetchScheduleFactory: Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedu

But these files are certainly there and accessible.
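
One way to double-check that claim is the small sketch below, which uses only the JDK. It rests on an assumption I have not confirmed against the Nutch source: that the "Can't find resource" wording means the configured names are looked up as classpath resources rather than opened as file system paths. The class name ResourceCheck and its two methods are made up for illustration and are not part of Nutch.

```java
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.InvalidPathException;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of Nutch: compares two ways of locating a file.
public class ResourceCheck {

    // True if the given path is a readable regular file on disk.
    public static boolean existsOnFileSystem(String path) {
        try {
            Path p = Paths.get(path);
            return Files.isRegularFile(p) && Files.isReadable(p);
        } catch (InvalidPathException e) {
            return false;
        }
    }

    // True if the given name can be resolved through the class loader,
    // i.e. it is visible as a resource on the application's classpath.
    public static boolean existsOnClasspath(String name) {
        ClassLoader cl = Thread.currentThread().getContextClassLoader();
        if (cl == null) {
            cl = ResourceCheck.class.getClassLoader();
        }
        URL url = cl.getResource(name);
        return url != null;
    }

    public static void main(String[] args) {
        String[] names = {
            // Absolute paths, as set in the configuration above.
            "C:/server/nutch/conf/regex-urlfilter.txt",
            "C:/server/nutch/conf/regex-normalize.xml",
            // Bare names, which would only resolve if the conf directory
            // is on the classpath (an assumption of this sketch).
            "regex-urlfilter.txt",
            "regex-normalize.xml"
        };
        for (String name : names) {
            System.out.printf("%-45s file=%b classpath=%b%n",
                    name, existsOnFileSystem(name), existsOnClasspath(name));
        }
    }
}
```

If the absolute paths report file=true but classpath=false, that would be consistent with the files being present on disk yet invisible to a classpath-based lookup.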


At the end I get:

Stopping at depth=0 - no more URLs to fetch.
No URLs to fetch - check your seed list and URL filters.
crawl finished: C:/server/nutch/crawl

It created files under C:/server/nutch/crawl/crawldb/current/part-00000
but it does not seem to fetch any pages:
WARN crawl.Generator: Generator: 0 records selected for fetching, exiting

Using the bin/nutch script, everything worked fine, so I assume there is a
configuration issue somewhere.
Any advice on what to configure when integrating Nutch into a Java application?

Regards,

Max


 
