With Cygwin? Some comments from me.

1) I've not had great experiences with the Windows setup.
2) Take a look at the file paths. I'm not sure they are correctly defined. Should they not be set with C:// (i.e. a double forward slash)? I've not used Windows in ages, so excuse me if this is nonsense.

On Sat, Jul 21, 2012 at 2:01 PM, jasimop <[email protected]> wrote:
> I am using nutch 1.2 on Windows 7.
>
> On Saturday, 21 July 2012 at 14:53, lewis john mcgibbney [via Lucene] wrote:
>
>> Looks to be a file path issue. What OS are you running on? Also, which
>> version of Nutch?
>>
>> On Sat, Jul 21, 2012 at 1:45 PM, Max Stricker <[hidden email]> wrote:
>>
>> > Hi,
>> >
>> > I am currently trying to integrate Nutch into a Java application.
>> > I adapted the original Crawl class to my needs and perform the following
>> > steps:
>> > - configuring
>> > - injecting
>> > - generating
>> > - fetching
>> > - updating CrawlDB
>> > - indexing into Solr
>> >
>> > As Nutch is originally configured by various XML files, I am not sure
>> > how to correctly configure the Nutch API.
>> >
>> > According to http://wiki.apache.org/nutch/JavaDemoApplication
>> > setting plugin.folders should be enough, but that does not work for me.
>> >
>> > I currently set the following configuration:
>> >
>> > Configuration conf = NutchConfiguration.createCrawlConfiguration();
>> > conf.set("plugin.folders", "C:/server/nutch/plugins/");
>> > conf.set("plugin.includes",
>> >     "myplugin|protocol-http|urlfilter-regex|parse-(text|html|js)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)");
>> > conf.set("urlfilter.regex.file",
>> >     "C:/server/nutch/conf/regex-urlfilter.txt");
>> > conf.set("urlnormalizer.regex.file",
>> >     "C:/server/nutch/conf/regex-normalize.xml");
>> >
>> > I get no exceptions, but the following log entries show up:
>> >
>> > 12/07/21 14:29:24 ERROR api.RegexURLFilterBase: Can't find resource: C:/server/nutch/conf/regex-urlfilter.txt
>> > 12/07/21 14:29:24 WARN regex.RegexURLNormalizer: Can't load the default config file! C:/server/nutch/conf/regex-normalize.xml
>> > 12/07/21 14:29:24 INFO crawl.FetchScheduleFactory: Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedu
>> >
>> > But these files are certainly there and accessible.
>> >
>> > At the end I get
>> >
>> > Stopping at depth=0 - no more URLs to fetch.
>> > No URLs to fetch - check your seed list and URL filters.
>> > crawl finished: C:/server/nutch/crawl
>> >
>> > It created files under C:/server/nutch/crawl/crawldb/current/part-00000
>> > but it seems not to fetch any page:
>> >
>> > WARN crawl.Generator: Generator: 0 records selected for fetching, exiting
>> >
>> > Using the bin/nutch script everything worked fine, so I assume there is a
>> > configuration issue somewhere.
>> > Any advice on what to configure when integrating Nutch into a java
>> > application?
>> >
>> > Regards,
>> >
>> > Max
>>
>> --
>> Lewis

--
Lewis
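
To narrow down the file path question, a small standalone check along these lines might help. It is only a sketch: it tests whether a configured value is visible as a plain file on disk and whether it can be resolved as a classpath resource. The "Can't find resource" wording in the log makes me suspect the filter looks the name up on the classpath rather than opening it as a file, but that is my assumption and I have not checked it against the 1.2 source. The class and method names (ConfPathCheck, existsOnDisk, existsOnClasspath, describe) are made up for this example and are not part of Nutch.

```java
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of Nutch: reports how a configured
// value such as "urlfilter.regex.file" can (or cannot) be resolved.
class ConfPathCheck {

    /** True if the value names an existing, readable regular file on disk. */
    static boolean existsOnDisk(String value) {
        Path p = Paths.get(value);
        return Files.isRegularFile(p) && Files.isReadable(p);
    }

    /** True if the value can be resolved as a resource through the class loader. */
    static boolean existsOnClasspath(String value) {
        ClassLoader cl = Thread.currentThread().getContextClassLoader();
        if (cl == null) {
            cl = ConfPathCheck.class.getClassLoader();
        }
        URL url = cl.getResource(value);
        return url != null;
    }

    /** One-line summary of both lookups for a single value. */
    static String describe(String value) {
        return value + " -> disk=" + existsOnDisk(value)
                + ", classpath=" + existsOnClasspath(value);
    }

    public static void main(String[] args) {
        // The absolute path from the original mail, and the bare file name
        // as it would be looked up if the conf directory is on the classpath.
        String[] values = {
            "C:/server/nutch/conf/regex-urlfilter.txt",
            "regex-urlfilter.txt"
        };
        for (String v : values) {
            System.out.println(describe(v));
        }
    }
}
```

If the absolute path shows up on disk but not on the classpath, while the bare file name resolves on the classpath once C:/server/nutch/conf is added to it, then the configuration value rather than the slashes would be the thing to change. Again, that is a guess to be confirmed by running the check in the same JVM as the crawl.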

