The logs say that NTLM has been selected for proxy authentication, but authentication still fails. The relevant log section is below. http.proxy.username and http.proxy.password are set in conf/nutch-site.xml, but they don't show up in the log.
2013-06-03 16:43:48,799 INFO httpclient.Http - http.proxy.host = 10.x.y.z
2013-06-03 16:43:48,799 INFO httpclient.Http - http.proxy.port = 8080
2013-06-03 16:43:48,799 INFO httpclient.Http - http.timeout = 10000
2013-06-03 16:43:48,799 INFO httpclient.Http - http.content.limit = 65536
2013-06-03 16:43:48,799 INFO httpclient.Http - http.agent = myspider/Nutch-1.6
2013-06-03 16:43:48,799 INFO httpclient.Http - http.accept.language = en-us,en-gb,en;q=0.7,*;q=0.3
2013-06-03 16:43:48,799 INFO httpclient.Http - http.accept = text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
2013-06-03 16:43:48,851 INFO auth.AuthChallengeProcessor - ntlm authentication scheme selected
2013-06-03 16:43:49,054 INFO httpclient.HttpMethodDirector - Failure authenticating with NTLM <any realm>@10.212.2.66:8080
2013-06-03 16:43:49,244 INFO crawl.SignatureFactory - Using Signature impl: org.apache.nutch.crawl.MD5Signature
2013-06-03 16:43:49,245 INFO parse.ParserChecker - parsing: http://www.google.com
2013-06-03 16:43:49,246 INFO parse.ParserChecker - contentType: text/html
2013-06-03 16:43:49,246 INFO parse.ParserChecker - signature: 0d50f5f66ddb69b21f21ab0ad5b3d034

Suresh.

-----Original Message-----
From: Suresh V S [mailto:[email protected]]
Sent: Monday, June 03, 2013 1:37 PM
To: [email protected]
Subject: RE: Nutch not crawling fully

Thanks for pointing out, Kiran. My bad I overlooked it.

I'm trying hard to authenticate with our proxy but always ending up with HTTP 407. My conf/nutch-site.xml has the http.proxy.host, http.proxy.port, http.proxy.username, http.proxy.password values set correctly. The plugin.includes has the following:

<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>Regular expression naming plugin directory names to include. Any plugin not matching this expression is excluded.
In any case you need at least include the nutch-extensionpoints plugin. By default Nutch includes crawling just HTML and plain text via HTTP, and basic indexing and search plugins. In order to use HTTPS please enable protocol-httpclient, but be aware of possible intermittent problems with the underlying commons-httpclient library.
  </description>
</property>

Still, even google.com returns 407.. Any ideas?

Thank you
Suresh.

-----Original Message-----
From: kiran chitturi [mailto:[email protected]]
Sent: Monday, June 03, 2013 10:44 AM
To: [email protected]
Subject: Re: Nutch not crawling fully

> fetch of http://www.igate.com/ failed with: Http code=407, url=
> http://www.igate.com <http://www.igate.com/ -finishing>

Hi Suresh,

The url is never successfully fetched. The http error code 407 is thrown away. That is the reason it is in unfetched status.

>
> dwbilab01@dwbilab01-OptiPlex-990:~/apache-nutch-1.6$ bin/nutch readdb mondaycrawl/crawldb/ -stats
> CrawlDb statistics start: mondaycrawl/crawldb/
> Statistics for CrawlDb: mondaycrawl/crawldb/
> TOTAL urls: 1
> retry 1: 1
> min score: 1.0
> avg score: 1.0
> max score: 1.0
> status 1 (db_unfetched): 1
> CrawlDb statistics: done
>
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~Disclaimer~~
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> Information contained and transmitted by this e-mail is confidential
> and proprietary to iGATE and its affiliates and is intended for use
> only by the recipient. If you are not the intended recipient, you are
> hereby notified that any dissemination, distribution, copying or use
> of this e-mail is strictly prohibited and you are requested to delete
> this e-mail immediately and notify the originator or [email protected]
> <mailto:[email protected]>. iGATE does not enter into any agreement with any
> party by e-mail. Any views expressed by an individual do not
> necessarily reflect the view of iGATE.
> iGATE is not responsible for
> the consequences of any actions taken on the basis of information provided,
> through this email.
> The contents of an attachment to this e-mail may contain software
> viruses, which could damage your own computer system. While iGATE has
> taken every reasonable precaution to minimise this risk, we cannot
> accept liability for any damage which you sustain as a result of
> software viruses. You should carry out your own virus checks before
> opening an attachment. To know more about iGATE please visit www.igate.com
> <http://www.igate.com>.
>
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>

--
Kiran Chitturi

<http://www.linkedin.com/in/kiranchitturi>
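
For reference, here is a minimal sketch of how the proxy settings named in this thread might look in conf/nutch-site.xml. The first four property names come from the messages above. http.proxy.realm and http.agent.host are assumptions that nobody in the thread mentions: as far as I know they are defined in nutch-default.xml and read by protocol-httpclient, with the realm carrying the NT domain for NTLM, but please check them against the nutch-default.xml shipped with your Nutch version. All values are placeholders.

```xml
<!-- Sketch only: every value below is a placeholder, not taken from a working setup. -->
<property>
  <name>http.proxy.host</name>
  <value>10.x.y.z</value>
</property>
<property>
  <name>http.proxy.port</name>
  <value>8080</value>
</property>
<property>
  <name>http.proxy.username</name>
  <!-- Hypothetical account name. For NTLM, the plain user name without a DOMAIN\ prefix. -->
  <value>proxyuser</value>
</property>
<property>
  <name>http.proxy.password</name>
  <!-- Hypothetical password. -->
  <value>secret</value>
</property>
<property>
  <!-- Assumption: not mentioned in the thread. For NTLM this is expected to hold the NT domain. -->
  <name>http.proxy.realm</name>
  <value>EXAMPLEDOMAIN</value>
</property>
<property>
  <!-- Assumption: not mentioned in the thread. Host name of the machine running the crawler. -->
  <name>http.agent.host</name>
  <value>crawler-host</value>
</property>
```

This is only a starting point for checking the configuration, not a confirmed fix for the 407.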

