I am using Nutch 1.2 on Windows 7.

On Saturday, July 21, 2012 at 14:53, lewis john mcgibbney [via Lucene] wrote:
> Looks to be a file path issue. What OS are you running on? Also, which
> version of Nutch?
>
> On Sat, Jul 21, 2012 at 1:45 PM, Max Stricker <[hidden email]> wrote:
>
> > Hi,
> >
> > I am currently trying to integrate Nutch into a Java application.
> > I adapted the original Crawl class to my needs and perform the following
> > steps:
> > - configuring
> > - injecting
> > - generating
> > - fetching
> > - updating CrawlDB
> > - indexing into Solr
> >
> > Since Nutch is originally configured through various XML files, I am not
> > sure how to correctly configure the Nutch API.
> >
> > According to http://wiki.apache.org/nutch/JavaDemoApplication
> > setting plugin.folders should be enough, but that does not work for me.
> >
> > I currently set the following configuration:
> >
> > Configuration conf = NutchConfiguration.createCrawlConfiguration();
> > conf.set("plugin.folders", "C:/server/nutch/plugins/");
> > conf.set("plugin.includes",
> >     "myplugin|protocol-http|urlfilter-regex|parse-(text|html|js)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)");
> >
> > conf.set("urlfilter.regex.file",
> >     "C:/server/nutch/conf/regex-urlfilter.txt");
> > conf.set("urlnormalizer.regex.file",
> >     "C:/server/nutch/conf/regex-normalize.xml");
> >
> > I get no exceptions, but the following log entries show up:
> >
> > 12/07/21 14:29:24 ERROR api.RegexURLFilterBase: Can't find resource: C:/server/nutch/conf/regex-urlfilter.txt
> > 12/07/21 14:29:24 WARN regex.RegexURLNormalizer: Can't load the default config file! C:/server/nutch/conf/regex-normalize.xml
> > 12/07/21 14:29:24 INFO crawl.FetchScheduleFactory: Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedu
> >
> > But these files are certainly there and accessible.
> >
> > At the end I get
> >
> > Stopping at depth=0 - no more URLs to fetch.
> > No URLs to fetch - check your seed list and URL filters.
> > crawl finished: C:/server/nutch/crawl
> >
> > It created files under C:/server/nutch/crawl/crawldb/current/part-00000
> > but it does not seem to fetch any page:
> >
> > WARN crawl.Generator: Generator: 0 records selected for fetching, exiting
> >
> > Using the bin/nutch script everything worked fine, so I assume there is a
> > configuration issue somewhere.
> > Any advice on what to configure when integrating Nutch into a Java
> > application?
> >
> > Regards,
> >
> > Max
>
> --
> Lewis

View this message in context: http://lucene.472066.n3.nabble.com/Integrating-Nutch-tp3996461p3996463.html
Sent from the Nutch - User mailing list archive at Nabble.com.
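
One possible reading of the "Can't find resource" entries quoted above, offered as an assumption rather than something confirmed in this thread: the values of urlfilter.regex.file and urlnormalizer.regex.file may be resolved as classpath resources (via Hadoop's Configuration) instead of being opened as file system paths. A file can then exist on disk and still not be found. The sketch below uses only the JDK to compare the two kinds of lookup. The class name ConfResourceCheck, both helper methods, and the temporary directory (a stand-in for C:/server/nutch/conf) are hypothetical and not part of Nutch.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.Files;
import java.nio.file.Path;

public class ConfResourceCheck {

    /** True if `name` resolves as a resource when `confDir` is the only classpath entry. */
    static boolean resolvesAsResource(Path confDir, String name) {
        // Parent is null so that only confDir (plus the bootstrap loader) is searched.
        try (URLClassLoader loader =
                 new URLClassLoader(new URL[] { confDir.toUri().toURL() }, null)) {
            return loader.getResource(name) != null;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Creates a throwaway conf directory containing an empty regex-urlfilter.txt. */
    static Path createSampleConfDir() {
        try {
            Path dir = Files.createTempDirectory("nutch-conf");
            Files.createFile(dir.resolve("regex-urlfilter.txt"));
            return dir;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path confDir = createSampleConfDir();
        Path rules = confDir.resolve("regex-urlfilter.txt");

        // The file is present on disk ...
        System.out.println("exists on disk:        " + Files.exists(rules));
        // ... but a resource lookup by full path and by bare name can disagree.
        System.out.println("full path as resource: "
            + resolvesAsResource(confDir, rules.toString()));
        System.out.println("bare name as resource: "
            + resolvesAsResource(confDir, "regex-urlfilter.txt"));
    }
}
```

If the classpath assumption holds for Nutch 1.2, it may be worth putting C:/server/nutch/conf on the application's classpath and setting urlfilter.regex.file and urlnormalizer.regex.file to the bare file names (regex-urlfilter.txt, regex-normalize.xml) instead of absolute paths.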

