RE: Nutch2 - What are exactly the steps to execute?

2016-11-21 Thread lewis john mcgibbney
Hi Daniele,
In short, if I were you I would look into using the readdb command:
https://wiki.apache.org/nutch/bin/nutch%20readdb
This will let you take a peek into your MongoDB table and find out
which documents are present. By the looks of your Gist, nothing is
being fetched and therefore no outlinks are being parsed out; however, I
may be wrong. You can check using readdb as described above.
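
A minimal sketch of that check, assuming Nutch 2.x is run from its runtime directory and that readdb accepts -stats and -dump <out_dir> (running "bin/nutch readdb" with no arguments should print the options for your version). NUTCH and inspect_webtable are illustrative names, not part of Nutch:

```shell
#!/bin/sh
# Illustrative helper, not part of Nutch. NUTCH is an assumed variable that
# points at the nutch launcher script; override it if yours lives elsewhere.
NUTCH="${NUTCH:-bin/nutch}"

# Print overall webtable statistics, then dump the rows into the directory
# given as $1 so individual documents can be inspected.
# Assumption: readdb supports -stats and -dump <out_dir> in your version.
inspect_webtable() {
  "$NUTCH" readdb -stats &&
  "$NUTCH" readdb -dump "$1"
}
```

Calling inspect_webtable with an output directory of your choice after the fetch step should show whether any rows were actually fetched; if none were, the problem is likely upstream of the indexer.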
hth

On Sat, Nov 19, 2016 at 8:09 AM,  wrote:

> From: Daniele Cremonini 
> To: 
> Cc:
> Date: Fri, 18 Nov 2016 15:28:49 +0100 (CET)
> Subject: Nutch2 - What are exactly the steps to execute?
> Hello,
>
> I installed and configured Nutch2 with MongoDB and Elasticsearch.
>
> I'm fairly sure the configuration is correct, but I don't see how
> to invoke Nutch.
>
> This page: https://wiki.apache.org/nutch/NutchTutorial has, I think,
> enough detail to run Nutch 1.x, but on this page:
> https://wiki.apache.org/nutch/Nutch2Tutorial the Invoke chapter is
> rather thin.
>
> What I did:
>
> bin/nutch inject /apps/nutch-urls/
> bin/nutch generate -topN 40
> bin/nutch fetch -all
> bin/nutch parse -all
> bin/nutch updatedb -all
> bin/nutch index -all
>
> but Nutch never tries to index any data. I know this because I added
> some extra logging to ElasticIndexWriter.
>
> Can anybody give me some ideas?
>
>


RE: Nutch2 - What are exactly the steps to execute?

2016-11-18 Thread Daniele Cremonini
Thank you Tom and Marty,

Here is the snippet for configuring the plugin:



<property>
<name>plugin.includes</name>
<value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-elastic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
<description>Regular expression naming plugin directory names to
include.  Any plugin not matching this expression is excluded.
In any case you need at least include the nutch-extensionpoints plugin. By
default Nutch includes crawling just HTML and plain text via HTTP,
and basic indexing and search plugins. In order to use HTTPS please enable
protocol-httpclient, but be aware of possible intermittent problems with the
underlying commons-httpclient library.
</description>
</property>
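
Since the description says any plugin not matching the expression is excluded, a quick way to sanity-check the value is to test plugin directory names against it. The following is only a sketch: plugin_included is an illustrative helper, and it assumes that whole-name matching with grep -E -x behaves like Nutch's own regex matching for a pattern this simple.

```shell
#!/bin/sh
# The plugin.includes value from the snippet above, joined onto one line.
PLUGIN_INCLUDES='protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-elastic|scoring-opic|urlnormalizer-(pass|regex|basic)'

# Succeeds if the plugin directory name in $1 matches the whole expression.
# Assumption: grep -E -x (extended regex, whole-line match) approximates how
# Nutch applies the expression to each plugin name.
plugin_included() {
  printf '%s\n' "$1" | grep -E -x -q -- "$PLUGIN_INCLUDES"
}

# Report a few names of interest.
for plugin in indexer-elastic index-basic protocol-httpclient; do
  if plugin_included "$plugin"; then
    echo "$plugin: included"
  else
    echo "$plugin: excluded"
  fi
done
```

With this value, indexer-elastic is matched while protocol-httpclient is not, which is consistent with the note about HTTPS in the description.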



And here is the Gist:
https://gist.github.com/dcremonini/563e612e9d5c7051ea31c3a7fd9f5966

One thing, among others, that I may be missing is the invertlinks step.
Cheers
Daniele Cremonini


-Original Message-
From: Marty-Scott Sainty (NWIS - Software Development)
[mailto:marty-scott.sai...@wales.nhs.uk]
Sent: Friday 18 November 2016 16:44
To: user@nutch.apache.org
Subject: RE: Nutch2 - What are exactly the steps to execute?

Hi Tom,

Make sure you have specified the Elasticsearch indexer plugin in
/conf/nutch-site.xml:

  <property>
    <name>plugin.includes</name>
    <value>indexer-elastic</value>
  </property>


-Original Message-
From: Tom Chiverton [mailto:t...@extravision.com]
Sent: 18 November 2016 15:38
To: user@nutch.apache.org
Subject: Re: Nutch2 - What are exactly the steps to execute?

Please post the output of each step.

You might want to use something like a GitHub Gist for that as it could be
fairly long over email.

Tom


On 18/11/16 14:28, Daniele Cremonini wrote:
> Hello,
>
> I installed and configured Nutch2 with MongoDB and Elasticsearch.
>
> I'm fairly sure the configuration is correct, but I don't see how
> to invoke Nutch.
>
> This page: https://wiki.apache.org/nutch/NutchTutorial has, I think,
> enough detail to run Nutch 1.x, but on this page:
> https://wiki.apache.org/nutch/Nutch2Tutorial the Invoke chapter is
> rather thin.
>
> What I did:
>
> bin/nutch inject /apps/nutch-urls/
> bin/nutch generate -topN 40
> bin/nutch fetch -all
> bin/nutch parse -all
> bin/nutch updatedb -all
> bin/nutch index -all
>
> but Nutch never tries to index any data. I know this because I added
> some extra logging to ElasticIndexWriter.
>
> Can anybody give me some ideas?
>
> Thanks
> Daniele
>
> __
> This email has been scanned by the Symantec Email Security.cloud service.
> For more information please visit http://www.symanteccloud.com
> __
>



RE: Nutch2 - What are exactly the steps to execute?

2016-11-18 Thread Marty-Scott Sainty (NWIS - Software Development)
Hi Tom,

Make sure you have specified the Elasticsearch indexer plugin in
/conf/nutch-site.xml:

  <property>
    <name>plugin.includes</name>
    <value>indexer-elastic</value>
  </property>


-Original Message-
From: Tom Chiverton [mailto:t...@extravision.com] 
Sent: 18 November 2016 15:38
To: user@nutch.apache.org
Subject: Re: Nutch2 - What are exactly the steps to execute?

Please post the output of each step.

You might want to use something like a GitHub Gist for that as it could be 
fairly long over email.

Tom


On 18/11/16 14:28, Daniele Cremonini wrote:
> Hello,
>
> I installed and configured Nutch2 with MongoDB and Elasticsearch.
>
> I'm fairly sure the configuration is correct, but I don't see how
> to invoke Nutch.
>
> This page: https://wiki.apache.org/nutch/NutchTutorial has, I think,
> enough detail to run Nutch 1.x, but on this page:
> https://wiki.apache.org/nutch/Nutch2Tutorial the Invoke chapter is
> rather thin.
>
> What I did:
>
> bin/nutch inject /apps/nutch-urls/
> bin/nutch generate -topN 40
> bin/nutch fetch -all
> bin/nutch parse -all
> bin/nutch updatedb -all
> bin/nutch index -all
>
> but Nutch never tries to index any data. I know this because I added
> some extra logging to ElasticIndexWriter.
>
> Can anybody give me some ideas?
>
> Thanks
> Daniele
>



Re: Nutch2 - What are exactly the steps to execute?

2016-11-18 Thread Tom Chiverton

Please post the output of each step.

You might want to use something like a GitHub Gist for that as it could 
be fairly long over email.
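
One way to collect that output is to wrap the six commands from the original message so each step writes to its own log file. This is only a sketch: NUTCH, LOG_DIR, run_step and crawl_round are illustrative names, and it assumes the commands are run from the Nutch runtime directory.

```shell
#!/bin/sh
# Illustrative wrapper, not part of Nutch.
# NUTCH: assumed path to the nutch launcher; LOG_DIR: where step logs are kept.
NUTCH="${NUTCH:-bin/nutch}"
LOG_DIR="${LOG_DIR:-nutch-step-logs}"

# Run one nutch command and save its stdout and stderr to
# <LOG_DIR>/<command>.log. Returns the exit status of the nutch command.
run_step() {
  mkdir -p "$LOG_DIR" || return 1
  echo "running: $NUTCH $*"
  "$NUTCH" "$@" > "$LOG_DIR/$1.log" 2>&1
}

# Run the steps from the original message in order, stopping at the first
# step that fails. $1 = seed URL directory, $2 = value for -topN.
crawl_round() {
  run_step inject "$1" &&
  run_step generate -topN "$2" &&
  run_step fetch -all &&
  run_step parse -all &&
  run_step updatedb -all &&
  run_step index -all
}
```

For the commands in the original message this would be called as crawl_round /apps/nutch-urls/ 40, after which the per-step files in LOG_DIR can be pasted into a Gist.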


Tom


On 18/11/16 14:28, Daniele Cremonini wrote:

Hello,

I installed and configured Nutch2 with MongoDB and Elasticsearch.

I'm fairly sure the configuration is correct, but I don't see how
to invoke Nutch.

This page: https://wiki.apache.org/nutch/NutchTutorial has, I think,
enough detail to run Nutch 1.x, but on this page:
https://wiki.apache.org/nutch/Nutch2Tutorial the Invoke chapter is
rather thin.

What I did:

bin/nutch inject /apps/nutch-urls/
bin/nutch generate -topN 40
bin/nutch fetch -all
bin/nutch parse -all
bin/nutch updatedb -all
bin/nutch index -all

but Nutch never tries to index any data. I know this because I added
some extra logging to ElasticIndexWriter.

Can anybody give me some ideas?

Thanks
Daniele
