Re: Integrating Nutch

jasimop Sat, 21 Jul 2012 12:30:36 -0700

Yes, I use Cygwin;
otherwise Hadoop runs into problems because it calls the chmod command.


I do not have any issues with the current Windows setup when using
the bin/nutch command.
But I need to control basic functionality from my application, and
calling the Nutch shell scripts from there is a dirty hack that
creates new problems.
I am also not restricted to Windows: I develop on Windows, but the
final system will be deployed to a Linux box.

The file paths should be ok, as plugin.folders is interpreted
correctly and the plugins are loaded:

12/07/21 21:26:36 INFO mapred.LocalJobRunner: 
12/07/21 21:26:36 INFO plugin.PluginRepository: Plugins: looking in: C:\server\nutch\plugins
12/07/21 21:26:37 INFO plugin.PluginRepository: Plugin Auto-activation mode: [true]
12/07/21 21:26:37 INFO plugin.PluginRepository: Registered Plugins:
12/07/21 21:26:37 INFO plugin.PluginRepository:         the nutch core extension points (nutch-extensionpoints)
12/07/21 21:26:37 INFO plugin.PluginRepository:         Basic Query Filter (query-basic)
12/07/21 21:26:37 INFO plugin.PluginRepository:         Basic URL Normalizer (urlnormalizer-basic)
12/07/21 21:26:37 INFO plugin.PluginRepository:         Html Parse Plug-in (parse-html)
12/07/21 21:26:37 INFO plugin.PluginRepository:         Basic Indexing Filter (index-basic)
12/07/21 21:26:37 INFO plugin.PluginRepository:         Site Query Filter (query-site)
12/07/21 21:26:37 INFO plugin.PluginRepository:         Basic Summarizer Plug-in (summary-basic)
12/07/21 21:26:37 INFO plugin.PluginRepository:         HTTP Framework (lib-http)
12/07/21 21:26:37 INFO plugin.PluginRepository:         Text Parse Plug-in (parse-text)
12/07/21 21:26:37 INFO plugin.PluginRepository:         Regex URL Filter (urlfilter-regex)
12/07/21 21:26:37 INFO plugin.PluginRepository:         Pass-through URL Normalizer (urlnormalizer-pass)
12/07/21 21:26:37 INFO plugin.PluginRepository:         Http Protocol Plug-in (protocol-http)
12/07/21 21:26:37 INFO plugin.PluginRepository:         Regex URL Normalizer (urlnormalizer-regex)
12/07/21 21:26:37 INFO plugin.PluginRepository:         OPIC Scoring Plug-in (scoring-opic)
12/07/21 21:26:37 INFO plugin.PluginRepository:         CyberNeko HTML Parser (lib-nekohtml)
12/07/21 21:26:37 INFO plugin.PluginRepository:         JavaScript Parser (parse-js)
12/07/21 21:26:37 INFO plugin.PluginRepository:         URL Query Filter (query-url)
12/07/21 21:26:37 INFO plugin.PluginRepository:         Regex URL Filter Framework (lib-regex-filter)
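
A plugin folder that loads from an absolute path does not by itself show
that the two regex config files are resolved the same way. One possibility,
stated here as an assumption and not verified against the Nutch 1.2 source,
is that urlfilter.regex.file and urlnormalizer.regex.file are looked up as
classpath resource names, not as file system paths. The stand-alone sketch
below uses only the JDK (no Nutch or Hadoop classes; ClasspathLookupDemo and
findOnClasspath are hypothetical names) to show how the two kinds of lookup
differ for a file that exists on disk:

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical demo class; not part of Nutch or Hadoop. */
class ClasspathLookupDemo {

    /**
     * Looks up a resource name through a class loader whose only
     * classpath entry is confDir. Returns null when nothing matches.
     */
    public static URL findOnClasspath(Path confDir, String name) throws IOException {
        URL[] classpath = { confDir.toUri().toURL() };
        // Null parent: only confDir (and the JDK itself) is searched.
        try (URLClassLoader loader = new URLClassLoader(classpath, null)) {
            return loader.getResource(name);
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a conf directory such as C:/server/nutch/conf
        Path confDir = Files.createTempDirectory("nutch-conf");
        Path rules = confDir.resolve("regex-urlfilter.txt");
        Files.write(rules, "+.\n".getBytes(StandardCharsets.UTF_8));

        String absolute = rules.toAbsolutePath().toString().replace('\\', '/');

        System.out.println("exists on disk:          " + Files.exists(rules));
        System.out.println("found by bare name:      "
                + (findOnClasspath(confDir, "regex-urlfilter.txt") != null));
        System.out.println("found by absolute path:  "
                + (findOnClasspath(confDir, absolute) != null));
    }
}
```

If the assumption holds, the equivalent setup for Nutch would be to put the
conf directory on the application classpath and set the two properties to
the bare file names (regex-urlfilter.txt, regex-normalize.xml).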


Do I need to configure anything else?

Max





On Saturday, 21 July 2012 at 15:07, lewis john mcgibbney [via Lucene] wrote:

> With Cygwin? 
> 
> Some comments from me. 
> 
> 1) I've not had great experiences using the windows setup 
> 2) Take a look at the file paths. I'm not sure if they are correctly 
> defined. 
> Should they not be set to C://, i.e. with a double forward slash? 
> 
> I've not used Windows in ages, so excuse me if this is nonsense. 
> 
> On Sat, Jul 21, 2012 at 2:01 PM, jasimop <[hidden email]> wrote: 
> 
> > I am using nutch 1.2 on Windows 7. 
> > 
> > 
> > On Saturday, 21 July 2012 at 14:53, lewis john mcgibbney [via 
> > Lucene] wrote: 
> > 
> >> Looks to be a file path issue. What OS are you running on? Also, which 
> >> version of Nutch? 
> >> 
> >> 
> >> On Sat, Jul 21, 2012 at 1:45 PM, Max Stricker <[hidden email]> wrote: 
> >> 
> >> > Hi, 
> >> > 
> >> > I am currently trying to integrate Nutch into a Java application. 
> >> > I adapted the original Crawl class to my needs and perform the following 
> >> > steps: 
> >> > - configuring 
> >> > - injecting 
> >> > - generating 
> >> > - fetching 
> >> > - updating CrawlDB 
> >> > - indexing into Solr 
> >> > 
> >> > As Nutch is originally configured through various XML files, I am not 
> >> > sure how to configure the Nutch API correctly. 
> >> > 
> >> > According to http://wiki.apache.org/nutch/JavaDemoApplication
> >> > setting plugin.folders should be enough, but that does not work for me. 
> >> > 
> >> > I currently set the following configuration: 
> >> > 
> >> > 
> >> >     Configuration conf = NutchConfiguration.createCrawlConfiguration(); 
> >> >     conf.set("plugin.folders", "C:/server/nutch/plugins/"); 
> >> >     conf.set("plugin.includes", "myplugin|protocol-http|urlfilter-regex|parse-(text|html|js)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)");
> >> >     conf.set("urlfilter.regex.file", "C:/server/nutch/conf/regex-urlfilter.txt"); 
> >> >     conf.set("urlnormalizer.regex.file", "C:/server/nutch/conf/regex-normalize.xml"); 
> >> > 
> >> > I get no exceptions, but the following log entries show up: 
> >> > 
> >> > 12/07/21 14:29:24 ERROR api.RegexURLFilterBase: Can't find resource: 
> >> > C:/server/nutch/conf/regex-urlfilter.txt 
> >> > 12/07/21 14:29:24 WARN regex.RegexURLNormalizer: Can't load the default 
> >> > config file! C:/server/nutch/conf/regex-normalize.xml 
> >> > 12/07/21 14:29:24 INFO crawl.FetchScheduleFactory: Using FetchSchedule 
> >> > impl: org.apache.nutch.crawl.DefaultFetchSchedu 
> >> > 
> >> > But these files are certainly there and accessible. 
> >> > 
> >> > 
> >> > At the end I get 
> >> > 
> >> > Stopping at depth=0 - no more URLs to fetch. 
> >> > No URLs to fetch - check your seed list and URL filters. 
> >> > crawl finished: C:/server/nutch/crawl 
> >> > 
> >> > It created files under C:/server/nutch/crawl/crawldb/current/part-00000 
> >> > but it seems not to fetch any page: 
> >> >  WARN crawl.Generator: Generator: 0 records selected for fetching, 
> >> > exiting 
> >> > 
> >> > Using the bin/nutch script everything worked fine, so I assume there is 
> >> > a configuration issue somewhere. 
> >> > Any advice on what to configure when integrating Nutch into a Java 
> >> > application? 
> >> > 
> >> > Regards, 
> >> > 
> >> > Max 
> >> > 
> >> > 
> >> > 
> >> > 
> >> 
> >> 
> >> -- 
> >> Lewis 
> >> 
> >> 
> > 
> > 
> > 
> > 
> > 
> > -- 
> > View this message in context: 
> > http://lucene.472066.n3.nabble.com/Integrating-Nutch-tp3996461p3996463.html
> > Sent from the Nutch - User mailing list archive at Nabble.com 
> > (http://Nabble.com). 
> 
> 
> -- 
> Lewis 
> 
> 





--
View this message in context: 
http://lucene.472066.n3.nabble.com/Integrating-Nutch-tp3996461p3996495.html
Sent from the Nutch - User mailing list archive at Nabble.com.
