Yes, I use Cygwin; otherwise Hadoop has problems because it uses the chmod command.
I do not have any issues with the current Windows setup when using the bin/nutch command. However, I need to control basic functionality from my application, and calling the Nutch shell scripts from there is a dirty hack that causes new problems. I am also not restricted to Windows: I develop on Windows, but the final system will be deployed to a Linux box.

The file paths should be OK, as plugin.folders is interpreted correctly and the plugins are loaded:

12/07/21 21:26:36 INFO mapred.LocalJobRunner:
12/07/21 21:26:36 INFO plugin.PluginRepository: Plugins: looking in: C:\server\nutch\plugins
12/07/21 21:26:37 INFO plugin.PluginRepository: Plugin Auto-activation mode: [true]
12/07/21 21:26:37 INFO plugin.PluginRepository: Registered Plugins:
12/07/21 21:26:37 INFO plugin.PluginRepository: the nutch core extension points (nutch-extensionpoints)
12/07/21 21:26:37 INFO plugin.PluginRepository: Basic Query Filter (query-basic)
12/07/21 21:26:37 INFO plugin.PluginRepository: Basic URL Normalizer (urlnormalizer-basic)
12/07/21 21:26:37 INFO plugin.PluginRepository: Html Parse Plug-in (parse-html)
12/07/21 21:26:37 INFO plugin.PluginRepository: Basic Indexing Filter (index-basic)
12/07/21 21:26:37 INFO plugin.PluginRepository: Site Query Filter (query-site)
12/07/21 21:26:37 INFO plugin.PluginRepository: Basic Summarizer Plug-in (summary-basic)
12/07/21 21:26:37 INFO plugin.PluginRepository: HTTP Framework (lib-http)
12/07/21 21:26:37 INFO plugin.PluginRepository: Text Parse Plug-in (parse-text)
12/07/21 21:26:37 INFO plugin.PluginRepository: Regex URL Filter (urlfilter-regex)
12/07/21 21:26:37 INFO plugin.PluginRepository: Pass-through URL Normalizer (urlnormalizer-pass)
12/07/21 21:26:37 INFO plugin.PluginRepository: Http Protocol Plug-in (protocol-http)
12/07/21 21:26:37 INFO plugin.PluginRepository: Regex URL Normalizer (urlnormalizer-regex)
12/07/21 21:26:37 INFO plugin.PluginRepository: OPIC Scoring Plug-in (scoring-opic)
12/07/21 21:26:37 INFO plugin.PluginRepository: CyberNeko HTML Parser (lib-nekohtml)
12/07/21 21:26:37 INFO plugin.PluginRepository: JavaScript Parser (parse-js)
12/07/21 21:26:37 INFO plugin.PluginRepository: URL Query Filter (query-url)
12/07/21 21:26:37 INFO plugin.PluginRepository: Regex URL Filter Framework (lib-regex-filter)

Do I need to configure anything else?

Max

On Saturday, 21 July 2012 at 15:07, lewis john mcgibbney [via Lucene] wrote:
> With Cygwin?
>
> Some comments from me.
>
> 1) I've not had great experiences using the Windows setup
> 2) Take a look at the file paths. I'm not sure if they are correctly
> defined.
> Should they not be set to C:// e.g. double fwd slash?
>
> I've not used Windows in ages so excuse if this is nonsense.
>
> On Sat, Jul 21, 2012 at 2:01 PM, jasimop <[hidden email]> wrote:
>
> > I am using nutch 1.2 on Windows 7.
> >
> > On Saturday, 21 July 2012 at 14:53, lewis john mcgibbney [via Lucene] wrote:
> >
> >> Looks to be a file path issue. What OS are you running on? Also which
> >> version of Nutch?
> >>
> >> On Sat, Jul 21, 2012 at 1:45 PM, Max Stricker <[hidden email]> wrote:
> >>
> >> > Hi,
> >> >
> >> > I currently try to integrate Nutch into a Java application.
> >> > I adapted the original Crawl class to my needs and perform the following
> >> > steps:
> >> > - configuring
> >> > - injecting
> >> > - generating
> >> > - fetching
> >> > - updating CrawlDB
> >> > - indexing into Solr
> >> >
> >> > As originally, Nutch is configured by various xml files, I am not sure
> >> > how to correctly configure the nutch API.
> >> >
> >> > According to http://wiki.apache.org/nutch/JavaDemoApplication
> >> > setting plugin.folders should be enough, but that does not work for me.
> >> >
> >> > I currently set the following configuration:
> >> >
> >> > Configuration conf = NutchConfiguration.createCrawlConfiguration();
> >> > conf.set("plugin.folders", "C:/server/nutch/plugins/");
> >> > conf.set("plugin.includes",
> >> > "myplugin|protocol-http|urlfilter-regex|parse-(text|html|js)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)");
> >> >
> >> > conf.set("urlfilter.regex.file",
> >> > "C:/server/nutch/conf/regex-urlfilter.txt");
> >> > conf.set("urlnormalizer.regex.file",
> >> > "C:/server/nutch/conf/regex-normalize.xml");
> >> >
> >> > I get no exceptions, but the following log entries show up:
> >> >
> >> > 12/07/21 14:29:24 ERROR api.RegexURLFilterBase: Can't find resource:
> >> > C:/server/nutch/conf/regex-urlfilter.txt
> >> > 12/07/21 14:29:24 WARN regex.RegexURLNormalizer: Can't load the default
> >> > config file! C:/server/nutch/conf/regex-normalize.xml
> >> > 12/07/21 14:29:24 INFO crawl.FetchScheduleFactory: Using FetchSchedule
> >> > impl: org.apache.nutch.crawl.DefaultFetchSchedu
> >> >
> >> > But these files are certainly there and accessible.
> >> >
> >> > At the end I get
> >> >
> >> > Stopping at depth=0 - no more URLs to fetch.
> >> > No URLs to fetch - check your seed list and URL filters.
> >> > crawl finished: C:/server/nutch/crawl
> >> >
> >> > It created files under C:/server/nutch/crawl/crawldb/current/part-00000
> >> > but it seems not to fetch any page:
> >> > WARN crawl.Generator: Generator: 0 records selected for fetching,
> >> > exiting
> >> >
> >> > Using the bin/nutch script everything worked fine, so I assume there is
> >> > a configuration
> >> > issue somewhere.
> >> > Any advice on what to configure when integrating Nutch into a java
> >> > application?
> >> >
> >> > Regards,
> >> >
> >> > Max
> >>
> >> --
> >> Lewis
> >
> > --
> > View this message in context:
> > http://lucene.472066.n3.nabble.com/Integrating-Nutch-tp3996461p3996463.html
> > Sent from the Nutch - User mailing list archive at Nabble.com
> > (http://Nabble.com).
>
> --
> Lewis
--
View this message in context: http://lucene.472066.n3.nabble.com/Integrating-Nutch-tp3996461p3996495.html
Sent from the Nutch - User mailing list archive at Nabble.com.
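
A note on the "Can't find resource" messages in the quoted log: the files exist on disk, yet the lookup fails. One possible explanation, which is an assumption and is not confirmed anywhere in this thread, is that the values of urlfilter.regex.file and urlnormalizer.regex.file are resolved as classpath resource names rather than as filesystem paths. The sketch below uses only the Java standard library to show the difference between the two lookups. The class ResourceCheck and its resolve method are hypothetical helpers written for illustration and are not part of Nutch or Hadoop.

```java
import java.io.File;

public class ResourceCheck {

    /**
     * Reports where a configured name can be found:
     * "classpath", "filesystem", "both", or "none".
     * Hypothetical helper, not part of the Nutch API.
     */
    static String resolve(String name, ClassLoader loader) {
        // Classpath lookup: the name is relative to the classpath roots.
        boolean onClasspath = loader.getResource(name) != null;
        // Filesystem lookup: the name is an absolute or working-directory-relative path.
        boolean onDisk = new File(name).isFile();

        if (onClasspath && onDisk) return "both";
        if (onClasspath) return "classpath";
        if (onDisk) return "filesystem";
        return "none";
    }

    public static void main(String[] args) {
        ClassLoader loader = ClassLoader.getSystemClassLoader();
        // The values below are taken from the configuration quoted above.
        String[] names = {
            "C:/server/nutch/conf/regex-urlfilter.txt",
            "regex-urlfilter.txt"
        };
        for (String name : names) {
            System.out.println(name + " -> " + resolve(name, loader));
        }
    }
}
```

If the absolute path reports "filesystem" while the bare file name reports "none", and the assumption above holds, it may be worth adding C:/server/nutch/conf to the application's classpath and setting the two properties to the plain file names.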

