I turned on more logging, as Tejas Patil suggested. The log has no lines showing the proxy username or password being used. Is that normal?
The log section is below:

2013-06-05 10:38:21,156 INFO plugin.PluginRepository - Anchor Indexing Filter (index-anchor)
2013-06-05 10:38:21,156 INFO plugin.PluginRepository - Regex URL Filter Framework (lib-regex-filter)
2013-06-05 10:38:21,156 INFO plugin.PluginRepository - Registered Extension-Points:
2013-06-05 10:38:21,156 INFO plugin.PluginRepository - Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
2013-06-05 10:38:21,156 INFO plugin.PluginRepository - Nutch Protocol (org.apache.nutch.protocol.Protocol)
2013-06-05 10:38:21,156 INFO plugin.PluginRepository - Nutch Segment Merge Filter (org.apache.nutch.segment.SegmentMergeFilter)
2013-06-05 10:38:21,156 INFO plugin.PluginRepository - Nutch URL Filter (org.apache.nutch.net.URLFilter)
2013-06-05 10:38:21,156 INFO plugin.PluginRepository - Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2013-06-05 10:38:21,156 INFO plugin.PluginRepository - HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2013-06-05 10:38:21,156 INFO plugin.PluginRepository - Nutch Content Parser (org.apache.nutch.parse.Parser)
2013-06-05 10:38:21,156 INFO plugin.PluginRepository - Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
2013-06-05 10:38:21,175 INFO httpclient.Http - http.proxy.host = blrproxy.igate.com
2013-06-05 10:38:21,175 INFO httpclient.Http - http.proxy.port = 8080
2013-06-05 10:38:21,175 INFO httpclient.Http - http.timeout = 10000
2013-06-05 10:38:21,175 INFO httpclient.Http - http.content.limit = 65536
2013-06-05 10:38:21,175 INFO httpclient.Http - http.agent = myspider/Nutch-1.6
2013-06-05 10:38:21,175 INFO httpclient.Http - http.accept.language = en-us,en-gb,en;q=0.7,*;q=0.3
2013-06-05 10:38:21,175 INFO httpclient.Http - http.accept = text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
2013-06-05 10:38:21,239 DEBUG auth.AuthChallengeProcessor - Supported authentication schemes in the order of preference: [ntlm, digest, basic]
2013-06-05 10:38:21,239 INFO auth.AuthChallengeProcessor - ntlm authentication scheme selected
2013-06-05 10:38:21,239 DEBUG auth.AuthChallengeProcessor - Using authentication scheme: ntlm
2013-06-05 10:38:21,239 DEBUG auth.AuthChallengeProcessor - Authorization challenge processed
2013-06-05 10:38:21,258 DEBUG auth.AuthChallengeProcessor - Using authentication scheme: ntlm
2013-06-05 10:38:21,259 DEBUG auth.AuthChallengeProcessor - Authorization challenge processed
2013-06-05 10:38:21,507 DEBUG auth.AuthChallengeProcessor - Using authentication scheme: ntlm
2013-06-05 10:38:21,508 DEBUG auth.AuthChallengeProcessor - Authorization challenge processed
2013-06-05 10:38:21,508 INFO httpclient.HttpMethodDirector - Failure authenticating with NTLM <any realm>@blrproxy.igate.com:8080
2013-06-05 10:38:21,701 INFO crawl.SignatureFactory - Using Signature impl: org.apache.nutch.crawl.MD5Signature
2013-06-05 10:38:21,702 INFO parse.ParserChecker - parsing: http://www.apache.org
2013-06-05 10:38:21,702 INFO parse.ParserChecker - contentType: text/html
2013-06-05 10:38:21,702 INFO parse.ParserChecker - signature: 49db438e222f3ad7b689d55059dc249e

-----Original Message-----
From: Tejas Patil [mailto:[email protected]]
Sent: Monday, June 03, 2013 8:00 PM
To: [email protected]
Subject: Re: Nutch not crawling fully

You can turn on more logging. Add this to conf/log4j.properties:

log4j.logger.org.apache.nutch.protocol.httpclient=DEBUG,cmdstdout
log4j.logger.org.apache.commons.httpclient.auth=DEBUG,cmdstdout

Although I have never used this before, go through [0] and httpclient-auth.xml.

[0] : https://wiki.apache.org/nutch/HttpAuthenticationSchemes

On Mon, Jun 3, 2013 at 4:33 AM, Suresh V S <[email protected]> wrote:

> The logs say that ntlm has been selected for the proxy authentication,
> but the authentication continues to fail.
> Below is the log section. http.proxy.username and http.proxy.password
> are provided in conf/nutch-site.xml but they don't show up in the log.
>
> 2013-06-03 16:43:48,799 INFO httpclient.Http - http.proxy.host = 10.x.y.z
> 2013-06-03 16:43:48,799 INFO httpclient.Http - http.proxy.port = 8080
> 2013-06-03 16:43:48,799 INFO httpclient.Http - http.timeout = 10000
> 2013-06-03 16:43:48,799 INFO httpclient.Http - http.content.limit = 65536
> 2013-06-03 16:43:48,799 INFO httpclient.Http - http.agent = myspider/Nutch-1.6
> 2013-06-03 16:43:48,799 INFO httpclient.Http - http.accept.language = en-us,en-gb,en;q=0.7,*;q=0.3
> 2013-06-03 16:43:48,799 INFO httpclient.Http - http.accept = text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
> 2013-06-03 16:43:48,851 INFO auth.AuthChallengeProcessor - ntlm authentication scheme selected
> 2013-06-03 16:43:49,054 INFO httpclient.HttpMethodDirector - Failure authenticating with NTLM <any realm>@10.212.2.66:8080
> 2013-06-03 16:43:49,244 INFO crawl.SignatureFactory - Using Signature impl: org.apache.nutch.crawl.MD5Signature
> 2013-06-03 16:43:49,245 INFO parse.ParserChecker - parsing: http://www.google.com
> 2013-06-03 16:43:49,246 INFO parse.ParserChecker - contentType: text/html
> 2013-06-03 16:43:49,246 INFO parse.ParserChecker - signature: 0d50f5f66ddb69b21f21ab0ad5b3d034
>
> Suresh.
>
> -----Original Message-----
> From: Suresh V S [mailto:[email protected]]
> Sent: Monday, June 03, 2013 1:37 PM
> To: [email protected]
> Subject: RE: Nutch not crawling fully
>
> Thanks for pointing out, Kiran. My bad I overlooked it.
>
> I'm trying hard to authenticate with our proxy but always ending up
> with HTTP 407.
>
> My conf/nutch-site.xml has the http.proxy.host, http.proxy.port,
> http.proxy.username, http.proxy.password values set correctly.
> The plugin.includes has the following:
> <property>
>   <name>plugin.includes</name>
>   <value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>   <description>Regular expression naming plugin directory names to
>   include. Any plugin not matching this expression is excluded.
>   In any case you need at least include the nutch-extensionpoints plugin. By
>   default Nutch includes crawling just HTML and plain text via HTTP,
>   and basic indexing and search plugins. In order to use HTTPS please enable
>   protocol-httpclient, but be aware of possible intermittent problems with the
>   underlying commons-httpclient library.
>   </description>
> </property>
>
> Still, even google.com returns 407.. Any ideas?
>
> Thank you
> Suresh.
>
>
> -----Original Message-----
> From: kiran chitturi [mailto:[email protected]]
> Sent: Monday, June 03, 2013 10:44 AM
> To: [email protected]
> Subject: Re: Nutch not crawling fully
>
> > fetch of http://www.igate.com/ failed with: Http code=407, url=
> > http://www.igate.com <http://www.igate.com/ -finishing>
>
> Hi Suresh,
>
> The url is never successfully fetched. The http error code 407 is
> thrown away. That is the reason it is in unfetched status.
>
> > dwbilab01@dwbilab01-OptiPlex-990:~/apache-nutch-1.6$ bin/nutch readdb mondaycrawl/crawldb/ -stats
> > CrawlDb statistics start: mondaycrawl/crawldb/
> > Statistics for CrawlDb: mondaycrawl/crawldb/
> > TOTAL urls: 1
> > retry 1: 1
> > min score: 1.0
> > avg score: 1.0
> > max score: 1.0
> > status 1 (db_unfetched): 1
> > CrawlDb statistics: done
> >
> > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~Disclaimer~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > Information contained and transmitted by this e-mail is confidential
> > and proprietary to iGATE and its affiliates and is intended for use
> > only by the recipient. If you are not the intended recipient, you
> > are hereby notified that any dissemination, distribution, copying or
> > use of this e-mail is strictly prohibited and you are requested to
> > delete this e-mail immediately and notify the originator or
> > [email protected]<mailto:
> > [email protected]>.
> > iGATE does not enter into any agreement with
> > any party by e-mail. Any views expressed by an individual do not
> > necessarily reflect the view of iGATE. iGATE is not responsible for
> > the consequences of any actions taken on the basis of information
> > provided, through this email.
> > The contents of an attachment to this e-mail may contain software
> > viruses, which could damage your own computer system. While iGATE
> > has taken every reasonable precaution to minimise this risk, we
> > cannot accept liability for any damage which you sustain as a result
> > of software viruses. You should carry out your own virus checks
> > before opening an attachment. To know more about iGATE please visit
> > www.igate.com <http://www.igate.com>.
> > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> --
> Kiran Chitturi
>
> <http://www.linkedin.com/in/kiranchitturi>
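
A possible direction on the NTLM failure, offered as a sketch under stated assumptions rather than a confirmed fix. The httpclient.Http startup lines in both logs list only http.proxy.host, http.proxy.port, http.timeout, http.content.limit, http.agent and the accept headers, so the absence of the username and password there does not by itself show that they were not read. For NTLM, my understanding (an assumption to verify against conf/nutch-default.xml in the Nutch 1.6 install) is that protocol-httpclient also reads http.proxy.realm, whose value is used as the NTLM domain, and http.agent.host, the name of the machine running the crawler. In the fragment below, the host and port are the ones from the log; proxyuser, proxypass, EXAMPLEDOMAIN and crawler-host are hypothetical placeholders.

```xml
<!-- conf/nutch-site.xml (fragment; these go inside the <configuration> element) -->

<!-- Proxy host and port, as reported in the log above -->
<property>
  <name>http.proxy.host</name>
  <value>blrproxy.igate.com</value>
</property>
<property>
  <name>http.proxy.port</name>
  <value>8080</value>
</property>

<!-- Hypothetical placeholders: replace with the real proxy account -->
<property>
  <name>http.proxy.username</name>
  <value>proxyuser</value>
</property>
<property>
  <name>http.proxy.password</name>
  <value>proxypass</value>
</property>

<!-- Assumption: for NTLM this carries the Windows domain, not an HTTP realm.
     EXAMPLEDOMAIN is a hypothetical placeholder. -->
<property>
  <name>http.proxy.realm</name>
  <value>EXAMPLEDOMAIN</value>
</property>

<!-- Assumption: name or IP of the machine running Nutch, used by
     protocol-httpclient. crawler-host is a hypothetical placeholder. -->
<property>
  <name>http.agent.host</name>
  <value>crawler-host</value>
</property>
```

If the 407 persists with these set, the DEBUG output from org.apache.commons.httpclient.auth that Tejas suggested enabling is the next place to look.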

