Hi Puneet,
Responses inline.
On Wed, Aug 15, 2018 at 7:20 AM wrote:
>
> From: Puneet Dhanda
> To: user@nutch.apache.org
> Cc:
> Bcc:
> Date: Wed, 15 Aug 2018 10:02:12 -0400
> Subject: Nutch 2.3.1 with Mongo datastore - No Document is getting indexed.
> Hi,
>
> I am using Nutch 2.3.1 with MongoDB as the datastore.
Are you using a build from SCM or the release? If I were you, I would build
from SCM; we have fixed a few bugs there.
> While crawling the sites, I am getting the following error. Please advise
> on what could be wrong here.
>
> Hadoop.log exception
> 2018-08-15 09:56:42,139 INFO httpclient.HttpMethodDirector - Retrying request
> 2018-08-15 09:56:42,139 INFO httpclient.HttpMethodDirector - I/O exception (java.net.ConnectException) caught when processing request: Connection refused (Connection refused)
> 2018-08-15 09:56:42,139 INFO httpclient.HttpMethodDirector - Retrying request
> 2018-08-15 09:56:42,242 ERROR httpclient.Http - Failed with the following error:
> java.net.ConnectException: Connection refused (Connection refused)
> at java.net.PlainSocketImpl.socketConnect(Native Method)
> at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
> at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
> at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
> at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
> 2018-08-15 09:56:46,409 INFO fetcher.FetcherJob - 0/0 spinwaiting/active, 2 pages, 2 errors, 0.4 0 pages/s, 0 0 kb/s, 0 URLs in 0 queues
>
Both fetches failed with "Connection refused" (2 pages, 2 errors), which
would likely explain why nothing is being indexed. You may wish to use the
parsechecker tool to confirm that you can reach the 2 failed URLs without
running a full crawl:
https://wiki.apache.org/nutch/bin/nutch%20parsechecker
You can also set DEBUG or TRACE logging for this tool; see
https://github.com/apache/nutch/blob/2.x/conf/log4j.properties#L40
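
Since the log shows "Connection refused" at the socket level, it may also be
worth checking plain TCP reachability from the crawl machine before involving
Nutch at all. Below is a minimal sketch using only the Python standard
library. The function name probe and the script name probe.py are my own
illustrative choices, not part of Nutch, and the default ports (80 for http,
443 for https) are an assumption about your seed URLs.

```python
import socket
import sys
from urllib.parse import urlsplit


def probe(url, timeout=5.0):
    """Try a plain TCP connect to the host and port of `url`.

    Returns (True, None) on success or (False, reason) on failure.
    """
    parts = urlsplit(url)
    if parts.hostname is None:
        return False, "no host in URL"
    # Use the explicit port if the URL has one, else the scheme default.
    port = parts.port or (443 if parts.scheme == "https" else 80)
    try:
        with socket.create_connection((parts.hostname, port), timeout=timeout):
            return True, None
    except OSError as exc:
        # Refused connections, timeouts and DNS failures are all OSError subclasses.
        return False, f"{type(exc).__name__}: {exc}"


if __name__ == "__main__":
    # Pass the URLs that failed as command-line arguments.
    for url in sys.argv[1:]:
        ok, reason = probe(url)
        print(url, "reachable" if ok else f"NOT reachable ({reason})")
```

Run it as python probe.py followed by your two seed URLs. If this also
reports a refused connection, the problem is probably with the target host,
port, or network path rather than with Nutch or the Mongo datastore.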
Lewis