The logs say that NTLM has been selected for proxy authentication, but 
authentication still fails.
The relevant log section is below. http.proxy.username and http.proxy.password 
are set in conf/nutch-site.xml, but they don't show up in the log.

2013-06-03 16:43:48,799 INFO  httpclient.Http - http.proxy.host = 10.x.y.z
2013-06-03 16:43:48,799 INFO  httpclient.Http - http.proxy.port = 8080
2013-06-03 16:43:48,799 INFO  httpclient.Http - http.timeout = 10000
2013-06-03 16:43:48,799 INFO  httpclient.Http - http.content.limit = 65536
2013-06-03 16:43:48,799 INFO  httpclient.Http - http.agent = myspider/Nutch-1.6
2013-06-03 16:43:48,799 INFO  httpclient.Http - http.accept.language = en-us,en-gb,en;q=0.7,*;q=0.3
2013-06-03 16:43:48,799 INFO  httpclient.Http - http.accept = text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
2013-06-03 16:43:48,851 INFO  auth.AuthChallengeProcessor - ntlm authentication scheme selected
2013-06-03 16:43:49,054 INFO  httpclient.HttpMethodDirector - Failure authenticating with NTLM <any realm>@10.212.2.66:8080
2013-06-03 16:43:49,244 INFO  crawl.SignatureFactory - Using Signature impl: org.apache.nutch.crawl.MD5Signature
2013-06-03 16:43:49,245 INFO  parse.ParserChecker - parsing: http://www.google.com
2013-06-03 16:43:49,246 INFO  parse.ParserChecker - contentType: text/html
2013-06-03 16:43:49,246 INFO  parse.ParserChecker - signature: 0d50f5f66ddb69b21f21ab0ad5b3d034
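
For reference, here is a rough sketch of how the proxy-related properties 
might look in conf/nutch-site.xml. All values are placeholders. The 
http.proxy.realm and http.agent.host entries are an assumption based on the 
property descriptions in nutch-default.xml, which suggest that for NTLM the 
realm property carries the Windows domain and the username should be given 
without a domain prefix. Whether this resolves the 407 here is untested.

```xml
<!-- Sketch only: all values below are placeholders. -->
<property>
  <name>http.proxy.host</name>
  <value>proxy.example.com</value>
</property>
<property>
  <name>http.proxy.port</name>
  <value>8080</value>
</property>
<property>
  <name>http.proxy.username</name>
  <!-- Assumption: plain account name, without a DOMAIN\ prefix -->
  <value>myuser</value>
</property>
<property>
  <name>http.proxy.password</name>
  <value>mypassword</value>
</property>
<property>
  <name>http.proxy.realm</name>
  <!-- Assumption: for NTLM this holds the Windows domain name -->
  <value>MYDOMAIN</value>
</property>
<property>
  <name>http.agent.host</name>
  <!-- Assumption: host name of the machine running the crawler -->
  <value>crawler-host</value>
</property>
```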

Suresh.

-----Original Message-----
From: Suresh V S [mailto:[email protected]] 
Sent: Monday, June 03, 2013 1:37 PM
To: [email protected]
Subject: RE: Nutch not crawling fully

Thanks for pointing that out, Kiran. My bad, I overlooked it.

I'm trying hard to authenticate with our proxy, but I always end up with HTTP 
407.

My conf/nutch-site.xml has the http.proxy.host, http.proxy.port, 
http.proxy.username, and http.proxy.password values set correctly.
plugin.includes is set as follows:
<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins. In order to use HTTPS please enable
  protocol-httpclient, but be aware of possible intermittent problems with the
  underlying commons-httpclient library.
  </description>
</property>

Still, even google.com returns 407. Any ideas?

Thank you
Suresh.


-----Original Message-----
From: kiran chitturi [mailto:[email protected]]
Sent: Monday, June 03, 2013 10:44 AM
To: [email protected]
Subject: Re: Nutch not crawling fully

> fetch of http://www.igate.com/ failed with: Http code=407, 
> url=http://www.igate.com


Hi Suresh,

The URL was never fetched successfully: the fetch failed with HTTP error code 
407 (Proxy Authentication Required). That is why it is in unfetched status.

>
>
>
>
> dwbilab01@dwbilab01-OptiPlex-990:~/apache-nutch-1.6$ bin/nutch readdb mondaycrawl/crawldb/ -stats
> CrawlDb statistics start: mondaycrawl/crawldb/
> Statistics for CrawlDb: mondaycrawl/crawldb/
> TOTAL urls:     1
> retry 1:        1
> min score:      1.0
> avg score:      1.0
> max score:      1.0
> status 1 (db_unfetched):        1
> CrawlDb statistics: done
>
>
>
>
>
>
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~Disclaimer~~
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> Information contained and transmitted by this e-mail is confidential 
> and proprietary to iGATE and its affiliates and is intended for use 
> only by the recipient. If you are not the intended recipient, you are 
> hereby notified that any dissemination, distribution, copying or use 
> of this e-mail is strictly prohibited and you are requested to delete 
> this e-mail immediately and notify the originator or [email protected] 
> <mailto:
> [email protected]>. iGATE does not enter into any agreement with any 
> party by e-mail. Any views expressed by an individual do not 
> necessarily reflect the view of iGATE. iGATE is not responsible for 
> the consequences of any actions taken on the basis of information provided, 
> through this email.
> The contents of an attachment to this e-mail may contain software 
> viruses, which could damage your own computer system. While iGATE has 
> taken every reasonable precaution to minimise this risk, we cannot 
> accept liability for any damage which you sustain as a result of 
> software viruses. You should carry out your own virus checks before 
> opening an attachment. To know more about iGATE please visit www.igate.com 
> <http://www.igate.com>.
>
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>



--
Kiran Chitturi

<http://www.linkedin.com/in/kiranchitturi>

