With Cygwin?

Some comments from me.

1) I've not had great experiences with the Windows setup.
2) Take a look at the file paths. I'm not sure they are defined correctly.
Should they not be set with C:// (i.e. a double forward slash)?
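
If it helps, here is a minimal sketch (plain Java, no Nutch classes) for checking how the JVM sees those two paths. The class name ResourceCheck and its helper methods are made up for illustration. It tests each value both as a file on disk and as a classpath resource; the second check rests on my assumption, which I have not verified against Nutch 1.2, that these settings may be resolved as resource names on the classpath rather than as absolute file paths.

```java
import java.io.File;
import java.net.URL;

// Hypothetical helper, not part of Nutch: reports how the JVM resolves a configured path.
class ResourceCheck {

    /** Returns true if the path names an existing, readable regular file. */
    static boolean readableFile(String path) {
        File f = new File(path);
        return f.isFile() && f.canRead();
    }

    /** Returns true if the name can be resolved as a resource on the system classpath. */
    static boolean onClasspath(String name) {
        URL url = ClassLoader.getSystemResource(name);
        return url != null;
    }

    public static void main(String[] args) {
        // The two values from the configuration quoted below.
        String[] configured = {
            "C:/server/nutch/conf/regex-urlfilter.txt",
            "C:/server/nutch/conf/regex-normalize.xml"
        };
        for (String value : configured) {
            // Bare file name, in case the lookup is by resource name (assumption).
            String bareName = new File(value).getName();
            System.out.println(value);
            System.out.println("  readable as file:        " + readableFile(value));
            System.out.println("  full value on classpath: " + onClasspath(value));
            System.out.println("  bare name on classpath:  " + onClasspath(bareName));
        }
    }
}
```

If the file check passes but the classpath checks fail, it may be worth putting the conf directory on the application's classpath and setting the properties to the bare file names. Again, that is a guess rather than a confirmed fix.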

I've not used Windows in ages, so excuse me if this is nonsense.

On Sat, Jul 21, 2012 at 2:01 PM, jasimop wrote:
> I am using nutch 1.2 on Windows 7.
>
>
> On Saturday, 21 July 2012 at 14:53, lewis john mcgibbney [via Lucene] wrote:
>
>> Looks to be a file path issue. What OS are you running on? Also, which
>> version of Nutch?
>>
>>
>> On Sat, Jul 21, 2012 at 1:45 PM, Max Stricker wrote:
>>
>> > Hi,
>> >
>> > I am currently trying to integrate Nutch into a Java application.
>> > I adapted the original Crawl class to my needs and perform the following
>> > steps:
>> > - configuring
>> > - injecting
>> > - generating
>> > - fetching
>> > - updating CrawlDB
>> > - indexing into Solr
>> >
>> > Since Nutch is normally configured through various XML files, I am not sure
>> > how to configure the Nutch API correctly.
>> >
>> > According to http://wiki.apache.org/nutch/JavaDemoApplication
>> > setting plugin.folders should be enough, but that does not work for me.
>> >
>> > I currently set the following configuration:
>> >
>> >
>> >     Configuration conf = NutchConfiguration.createCrawlConfiguration();
>> >     conf.set("plugin.folders", "C:/server/nutch/plugins/");
>> >     conf.set("plugin.includes",
>> >         "myplugin|protocol-http|urlfilter-regex|parse-(text|html|js)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)");
>> >     conf.set("urlfilter.regex.file", "C:/server/nutch/conf/regex-urlfilter.txt");
>> >     conf.set("urlnormalizer.regex.file", "C:/server/nutch/conf/regex-normalize.xml");
>> >
>> > I get no exceptions, but the following log entries show up:
>> >
>> > 12/07/21 14:29:24 ERROR api.RegexURLFilterBase: Can't find resource: C:/server/nutch/conf/regex-urlfilter.txt
>> > 12/07/21 14:29:24 WARN regex.RegexURLNormalizer: Can't load the default config file! C:/server/nutch/conf/regex-normalize.xml
>> > 12/07/21 14:29:24 INFO crawl.FetchScheduleFactory: Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedu
>> >
>> > But these files are certainly there and accessible.
>> >
>> >
>> > At the end I get
>> >
>> > Stopping at depth=0 - no more URLs to fetch.
>> > No URLs to fetch - check your seed list and URL filters.
>> > crawl finished: C:/server/nutch/crawl
>> >
>> > It created files under C:/server/nutch/crawl/crawldb/current/part-00000
>> > but it does not seem to fetch any pages:
>> >  WARN crawl.Generator: Generator: 0 records selected for fetching, exiting
>> >
>> > Using the bin/nutch script everything worked fine, so I assume there is a
>> > configuration issue somewhere.
>> > Any advice on what to configure when integrating Nutch into a Java
>> > application?
>> >
>> > Regards,
>> >
>> > Max
>> >
>> >
>> >
>> >
>>
>>
>> --
>> Lewis
>>
>>



-- 
Lewis
