Re: Integrating Nutch

Sebastian Nagel Sun, 22 Jul 2012 12:59:16 -0700

>     conf.set("urlfilter.regex.file",
>         "C:/server/nutch/conf/regex-urlfilter.txt");
>     conf.set("urlnormalizer.regex.file",
>         "C:/server/nutch/conf/regex-normalize.xml");
>
> I get no exceptions, but the following log entries show up:
>
> 12/07/21 14:29:24 ERROR api.RegexURLFilterBase: Can't find resource: C:/server/nutch/conf/regex-urlfilter.txt
> 12/07/21 14:29:24 WARN regex.RegexURLNormalizer: Can't load the default config file! C:/server/nutch/conf/regex-normalize.xml


Resources such as the URL filter and normalizer rule files
are usually specified as plain file names without a path and are
looked up on the classpath. So it should work if
 C:/server/nutch/conf/
is on the classpath and the resources are simply named "regex-urlfilter.txt"
and "regex-normalize.xml", respectively.
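
To illustrate the difference, here is a minimal sketch in plain Java (no
Nutch or Hadoop classes involved). It assumes that the rule files are
resolved by name through a class loader, which is how the log messages
above read to me. The class name ClasspathLookupSketch and the helper
findOnClasspath are made up for this example, and a temporary directory
stands in for C:/server/nutch/conf/.

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical demo class, not part of Nutch or Hadoop.
class ClasspathLookupSketch {

    // Resolves resourceName through a class loader whose classpath
    // contains confDir. Returns null if the name cannot be resolved.
    static URL findOnClasspath(Path confDir, String resourceName) throws IOException {
        URL[] classpath = { confDir.toUri().toURL() };
        try (URLClassLoader loader = new URLClassLoader(classpath)) {
            return loader.getResource(resourceName);
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the real conf directory (assumption for the demo).
        Path confDir = Files.createTempDirectory("nutch-conf");
        Files.write(confDir.resolve("regex-urlfilter.txt"),
                "+.\n".getBytes(StandardCharsets.UTF_8));

        // A plain file name is resolved relative to the classpath entries.
        System.out.println("plain name:    "
                + findOnClasspath(confDir, "regex-urlfilter.txt"));

        // A full file system path is not a valid classpath resource name.
        System.out.println("absolute path: "
                + findOnClasspath(confDir, "C:/server/nutch/conf/regex-urlfilter.txt"));
    }
}
```

If the assumption holds, the two conf.set(...) calls for
"urlfilter.regex.file" and "urlnormalizer.regex.file" would only need
the plain file names once the conf directory is on the classpath.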

http://hadoop.apache.org/common/docs/r0.20.1/api/org/apache/hadoop/conf/Configuration.html
says that resources can also be specified by a Path, which would be "global"
if it starts with a slash. Maybe
 /cygdrive/c/...
works. But the solution with plain file names and the classpath looks more
portable.

Sebastian


On 07/21/2012 02:45 PM, Max Stricker wrote:
> Hi,
> 
> I am currently trying to integrate Nutch into a Java application.
> I adapted the original Crawl class to my needs and perform the following
> steps:
> - configuring
> - injecting
> - generating
> - fetching
> - updating CrawlDB
> - indexing into Solr
> 
> As Nutch is originally configured through various XML files, I am not sure
> how to correctly configure the Nutch API.
> 
> According to http://wiki.apache.org/nutch/JavaDemoApplication
> setting plugin.folders should be enough, but that does not work for me.
> 
> I currently set the following configuration:
> 
> 
>     Configuration conf = NutchConfiguration.createCrawlConfiguration();
>     conf.set("plugin.folders", "C:/server/nutch/plugins/");
>     conf.set("plugin.includes",
>         "myplugin|protocol-http|urlfilter-regex|parse-(text|html|js)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)");
>     conf.set("urlfilter.regex.file",
>         "C:/server/nutch/conf/regex-urlfilter.txt");
>     conf.set("urlnormalizer.regex.file",
>         "C:/server/nutch/conf/regex-normalize.xml");
> 
> I get no exceptions, but the following log entries show up:
> 
> 12/07/21 14:29:24 ERROR api.RegexURLFilterBase: Can't find resource: C:/server/nutch/conf/regex-urlfilter.txt
> 12/07/21 14:29:24 WARN regex.RegexURLNormalizer: Can't load the default config file! C:/server/nutch/conf/regex-normalize.xml
> 12/07/21 14:29:24 INFO crawl.FetchScheduleFactory: Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedu
> 
> But these files are certainly there and accessible.
> 
> 
> At the end I get
> 
> Stopping at depth=0 - no more URLs to fetch.
> No URLs to fetch - check your seed list and URL filters.
> crawl finished: C:/server/nutch/crawl
> 
> It created files under C:/server/nutch/crawl/crawldb/current/part-00000
> but it does not seem to fetch any pages:
>  WARN crawl.Generator: Generator: 0 records selected for fetching, exiting
> 
> Using the bin/nutch script everything worked fine, so I assume there is a
> configuration issue somewhere.
> Any advice on what to configure when integrating Nutch into a Java
> application?
> 
> Regards,
> 
> Max
> 
