RE: Nutch2 - What are exactly the steps to execute?
Hi Daniele,

In short, if I were you I would look into using the readdb command:
https://wiki.apache.org/nutch/bin/nutch%20readdb

This will let you take a peek into your MongoDB table and find out which documents are present. By the looks of your Gist, nothing is being fetched and therefore no outlinks are being parsed out... however, I may be wrong. You can check using readdb as above.

hth

On Sat, Nov 19, 2016 at 8:09 AM, wrote:
> From: Daniele Cremonini
> To:
> Cc:
> Date: Fri, 18 Nov 2016 15:28:49 +0100 (CET)
> Subject: Nutch2 - What are exactly the steps to execute?
>
> Hello,
>
> I installed and configured Nutch2 with MongoDB and Elasticsearch.
>
> I’m pretty convinced that the configuration is correct but I don’t see how to invoke Nutch.
>
> In this page : https://wiki.apache.org/nutch/NutchTutorial there are I think enough details to call Nutch 1.x
> but in this page : https://wiki.apache.org/nutch/Nutch2Tutorial the Invoke chapter is pretty poor.
>
> What I did :
>
> bin/nutch inject /apps/nutch-urls/
> bin/nutch generate -topN 40
> bin/nutch fetch -all
> bin/nutch parse -all
> bin/nutch updatedb -all
> bin/nutch index -all
>
> but Nutch never tries to index data I know because I enriched the logging activity of ElasticIndexWriter a little bit.
>
> May anybody give me some ideas?
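A minimal sketch of the readdb check suggested above, assuming a Nutch 2.x install whose `bin/nutch readdb` accepts `-stats` and `-dump <out_dir>` (running `bin/nutch readdb` with no arguments should print the options your version actually supports). `NUTCH_BIN`, `DRY_RUN`, the `run` helper and the dump directory are illustrative names, not part of Nutch. By default the script only prints the commands it would run.

```shell
#!/usr/bin/env bash
# Sketch: inspect the Nutch 2.x web table before re-running the crawl.
# NUTCH_BIN, DRY_RUN and the dump directory are illustrative assumptions.
NUTCH_BIN="${NUTCH_BIN:-bin/nutch}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = execute them

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

inspect_webtable() {
  local dump_dir="$1"
  # Summary statistics for the stored rows.
  run "$NUTCH_BIN" readdb -stats
  # Dump the stored rows to a directory for manual inspection.
  run "$NUTCH_BIN" readdb -dump "$dump_dir"
}

inspect_webtable "${1:-/tmp/nutch-readdb-dump}"
```

Set `DRY_RUN=0` to execute the commands from the Nutch runtime directory. If the statistics list no fetched entries after a fetch step, that would be consistent with the suspicion that nothing is being fetched.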
RE: Nutch2 - What are exactly the steps to execute?
Thank you Tom and Marty,

Here is the snippet for configuring the plugin:

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-elastic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>Regular expression naming plugin directory names to include. Any plugin not matching this expression is excluded. In any case you need at least include the nutch-extensionpoints plugin. By default Nutch includes crawling just HTML and plain text via HTTP, and basic indexing and search plugins. In order to use HTTPS please enable protocol-httpclient, but be aware of possible intermittent problems with the underlying commons-httpclient library.</description>
</property>

And here is the Gist: https://gist.github.com/dcremonini/563e612e9d5c7051ea31c3a7fd9f5966

One thing, among others, that I might be missing is the invertlinks step.

Cheers
Daniele Cremonini

-----Original Message-----
From: Marty-Scott Sainty (NWIS - Software Development) [mailto:marty-scott.sai...@wales.nhs.uk]
Sent: Friday, 18 November 2016 16:44
To: user@nutch.apache.org
Subject: RE: Nutch2 - What are exactly the steps to execute?

Hi Tom,

Make sure you have specified the Elasticsearch indexer plugin (indexer-elastic) in the plugin.includes property in /conf/nutch-site.xml.

-----Original Message-----
From: Tom Chiverton [mailto:t...@extravision.com]
Sent: 18 November 2016 15:38
To: user@nutch.apache.org
Subject: Re: Nutch2 - What are exactly the steps to execute?

Please post the output of each step. You might want to use something like a GitHub Gist for that as it could be fairly long over email.

Tom

On 18/11/16 14:28, Daniele Cremonini wrote:
> Hello,
>
> I installed and configured Nutch2 with MongoDB and Elasticsearch.
>
> I'm pretty convinced that the configuration is correct but I don't see how to invoke Nutch.
>
> In this page : https://wiki.apache.org/nutch/NutchTutorial there are I think enough details to call Nutch 1.x but in this page : https://wiki.apache.org/nutch/Nutch2Tutorial the Invoke chapter is pretty poor.
>
> What I did :
>
> bin/nutch inject /apps/nutch-urls/
> bin/nutch generate -topN 40
> bin/nutch fetch -all
> bin/nutch parse -all
> bin/nutch updatedb -all
> bin/nutch index -all
>
> but Nutch never tries to index data I know because I enriched the logging activity of ElasticIndexWriter a little bit.
>
> May anybody give me some ideas?
>
> Thanks
> Daniele
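The plugin.includes description in this message says that any plugin whose directory name does not match the expression is excluded. The sketch below tests a few plugin names against the posted value. It assumes the expression must match the whole name, and it uses `grep -E` as a stand-in for Nutch's Java regular expressions, which should behave the same for an expression this simple. `PLUGIN_INCLUDES` and `plugin_included` are illustrative names.

```shell
#!/usr/bin/env bash
# Sketch: check which plugin directory names a plugin.includes value selects.
# Assumption: the expression must match the whole plugin name; grep -E stands
# in for Nutch's Java regex engine.
PLUGIN_INCLUDES='protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-elastic|scoring-opic|urlnormalizer-(pass|regex|basic)'

plugin_included() {
  # Succeeds when the plugin name in $1 matches the whole expression.
  printf '%s\n' "$1" | grep -Eq "^(${PLUGIN_INCLUDES})\$"
}

for name in indexer-elastic index-anchor protocol-httpclient indexer-solr; do
  if plugin_included "$name"; then
    echo "$name: included"
  else
    echo "$name: excluded"
  fi
done
```

Under these assumptions `indexer-elastic` and `index-anchor` are selected, while `protocol-httpclient` is not, because `protocol-http` only matches that exact name.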
RE: Nutch2 - What are exactly the steps to execute?
Hi Tom,

Make sure you have specified the Elasticsearch indexer plugin (indexer-elastic) in the plugin.includes property in /conf/nutch-site.xml.

-----Original Message-----
From: Tom Chiverton [mailto:t...@extravision.com]
Sent: 18 November 2016 15:38
To: user@nutch.apache.org
Subject: Re: Nutch2 - What are exactly the steps to execute?

Please post the output of each step. You might want to use something like a GitHub Gist for that as it could be fairly long over email.

Tom

On 18/11/16 14:28, Daniele Cremonini wrote:
> Hello,
>
> I installed and configured Nutch2 with MongoDB and Elasticsearch.
>
> I'm pretty convinced that the configuration is correct but I don't see how to invoke Nutch.
>
> In this page : https://wiki.apache.org/nutch/NutchTutorial there are I think enough details to call Nutch 1.x but in this page : https://wiki.apache.org/nutch/Nutch2Tutorial the Invoke chapter is pretty poor.
>
> What I did :
>
> bin/nutch inject /apps/nutch-urls/
> bin/nutch generate -topN 40
> bin/nutch fetch -all
> bin/nutch parse -all
> bin/nutch updatedb -all
> bin/nutch index -all
>
> but Nutch never tries to index data I know because I enriched the logging activity of ElasticIndexWriter a little bit.
>
> May anybody give me some ideas?
>
> Thanks
> Daniele
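One way to check the suggestion above from the command line is sketched here. It assumes the usual Hadoop-style `<property>` / `<name>` / `<value>` layout in nutch-site.xml, with the value within three lines of the name. `NUTCH_SITE` and `has_elastic_indexer` are hypothetical names chosen for this example.

```shell
#!/usr/bin/env bash
# Sketch: verify that plugin.includes in a Nutch config file mentions indexer-elastic.
# Assumptions: Hadoop-style <property>/<name>/<value> layout, with <value> within
# three lines of <name>; NUTCH_SITE is a hypothetical override for the path.
NUTCH_SITE="${NUTCH_SITE:-conf/nutch-site.xml}"

has_elastic_indexer() {
  local file="$1"
  # Take the plugin.includes <name> line plus the next three lines, then look
  # for the indexer plugin id in that window.
  grep -A 3 '<name>plugin.includes</name>' "$file" 2>/dev/null | grep -q 'indexer-elastic'
}

if has_elastic_indexer "$NUTCH_SITE"; then
  echo "indexer-elastic is listed in plugin.includes"
else
  echo "indexer-elastic not found in plugin.includes (or $NUTCH_SITE is missing)"
fi
```

This is only a text search, so it cannot tell whether the property is overridden elsewhere or whether the plugin actually loads; the Nutch logs remain the authoritative place to confirm that.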
Re: Nutch2 - What are exactly the steps to execute?
Please post the output of each step. You might want to use something like a GitHub Gist for that as it could be fairly long over email.

Tom

On 18/11/16 14:28, Daniele Cremonini wrote:
> Hello,
>
> I installed and configured Nutch2 with MongoDB and Elasticsearch.
>
> I’m pretty convinced that the configuration is correct but I don’t see how to invoke Nutch.
>
> In this page : https://wiki.apache.org/nutch/NutchTutorial there are I think enough details to call Nutch 1.x but in this page : https://wiki.apache.org/nutch/Nutch2Tutorial the Invoke chapter is pretty poor.
>
> What I did :
>
> bin/nutch inject /apps/nutch-urls/
> bin/nutch generate -topN 40
> bin/nutch fetch -all
> bin/nutch parse -all
> bin/nutch updatedb -all
> bin/nutch index -all
>
> but Nutch never tries to index data I know because I enriched the logging activity of ElasticIndexWriter a little bit.
>
> May anybody give me some ideas?
>
> Thanks
> Daniele
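To collect the output of each step as requested above, something like the following sketch could be used. It runs the six commands exactly as listed in the original post and writes one log file per step, ready to paste into a Gist. `NUTCH_BIN`, `LOG_DIR`, `DRY_RUN` and `run_steps` are illustrative names, and whether these six steps are the complete Nutch 2.x cycle is precisely the open question in this thread, so the list is not meant as a recommendation.

```shell
#!/usr/bin/env bash
# Sketch: run the six steps from the original post and keep one log per step.
# NUTCH_BIN, LOG_DIR and DRY_RUN are illustrative assumptions.
NUTCH_BIN="${NUTCH_BIN:-bin/nutch}"
LOG_DIR="${LOG_DIR:-nutch-step-logs}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = execute them

run_steps() {
  local n=0 name args log
  [ "$DRY_RUN" = "1" ] || mkdir -p "$LOG_DIR"
  # Each input line is "<subcommand> <arguments>", as in the original post.
  while read -r name args; do
    n=$((n + 1))
    log="$LOG_DIR/$n-$name.log"
    if [ "$DRY_RUN" = "1" ]; then
      echo "$NUTCH_BIN $name $args > $log"
    else
      # $args is left unquoted on purpose so it splits into separate arguments.
      # stdin comes from /dev/null so the command cannot consume the step list.
      "$NUTCH_BIN" "$name" $args > "$log" 2>&1 < /dev/null || {
        echo "step $name failed, see $log" >&2
        return 1
      }
    fi
  done <<'EOF'
inject /apps/nutch-urls/
generate -topN 40
fetch -all
parse -all
updatedb -all
index -all
EOF
}

run_steps
```

With `DRY_RUN=0` the script stops at the first failing step, which narrows down where the crawl cycle breaks before indexing is reached.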