You can turn on more logging. Add this to conf/log4j.properties:

log4j.logger.org.apache.nutch.protocol.httpclient=DEBUG,cmdstdout
log4j.logger.org.apache.commons.httpclient.auth=DEBUG,cmdstdout
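
Since the log reports that the ntlm scheme was selected, one setting that may be worth checking is http.proxy.realm. As far as I recall from conf/nutch-default.xml, protocol-httpclient reads it alongside http.proxy.username and http.proxy.password, and for NTLM it expects the Windows domain name there, with the username given without a DOMAIN\ prefix. A rough sketch for conf/nutch-site.xml follows. Every value is a placeholder, and the property names should be checked against the nutch-default.xml shipped with your version:

```xml
<!-- Sketch only: all values below are hypothetical placeholders. -->
<property>
  <name>http.proxy.username</name>
  <!-- Assumption: plain account name, no DOMAIN\ prefix -->
  <value>myuser</value>
</property>
<property>
  <name>http.proxy.password</name>
  <value>mypassword</value>
</property>
<property>
  <name>http.proxy.realm</name>
  <!-- Assumption: for NTLM this holds the Windows domain name -->
  <value>MYDOMAIN</value>
</property>
```

With the two DEBUG loggers enabled, rerunning the same check should show more of the NTLM exchange in the log, which may help confirm whether the credentials are being picked up.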
Although I have never used this before, go through [0] and httpclient-auth.xml.

[0] : https://wiki.apache.org/nutch/HttpAuthenticationSchemes


On Mon, Jun 3, 2013 at 4:33 AM, Suresh V S <[email protected]> wrote:

> The logs say that ntlm has been selected for the proxy authentication, but
> the authentication continues to fail.
> Below is the log section. http.proxy.username and http.proxy.password are
> provided in conf/nutch-site.xml but they don't show up in the log.
>
> 2013-06-03 16:43:48,799 INFO httpclient.Http - http.proxy.host = 10.x.y.z
> 2013-06-03 16:43:48,799 INFO httpclient.Http - http.proxy.port = 8080
> 2013-06-03 16:43:48,799 INFO httpclient.Http - http.timeout = 10000
> 2013-06-03 16:43:48,799 INFO httpclient.Http - http.content.limit = 65536
> 2013-06-03 16:43:48,799 INFO httpclient.Http - http.agent = myspider/Nutch-1.6
> 2013-06-03 16:43:48,799 INFO httpclient.Http - http.accept.language = en-us,en-gb,en;q=0.7,*;q=0.3
> 2013-06-03 16:43:48,799 INFO httpclient.Http - http.accept = text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
> 2013-06-03 16:43:48,851 INFO auth.AuthChallengeProcessor - ntlm authentication scheme selected
> 2013-06-03 16:43:49,054 INFO httpclient.HttpMethodDirector - Failure authenticating with NTLM <any realm>@10.212.2.66:8080
> 2013-06-03 16:43:49,244 INFO crawl.SignatureFactory - Using Signature impl: org.apache.nutch.crawl.MD5Signature
> 2013-06-03 16:43:49,245 INFO parse.ParserChecker - parsing: http://www.google.com
> 2013-06-03 16:43:49,246 INFO parse.ParserChecker - contentType: text/html
> 2013-06-03 16:43:49,246 INFO parse.ParserChecker - signature: 0d50f5f66ddb69b21f21ab0ad5b3d034
>
> Suresh.
>
> -----Original Message-----
> From: Suresh V S [mailto:[email protected]]
> Sent: Monday, June 03, 2013 1:37 PM
> To: [email protected]
> Subject: RE: Nutch not crawling fully
>
> Thanks for pointing out, Kiran. My bad I overlooked it.
>
> I'm trying hard to authenticate with our proxy but always ending up with
> HTTP 407.
>
> My conf/nutch-site.xml has the http.proxy.host, http.proxy.port,
> http.proxy.username, http.proxy.password values set correctly.
> The plugin.includes has the following:
> <property>
>   <name>plugin.includes</name>
>   <value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>   <description>Regular expression naming plugin directory names to
>   include. Any plugin not matching this expression is excluded.
>   In any case you need at least include the nutch-extensionpoints plugin. By
>   default Nutch includes crawling just HTML and plain text via HTTP,
>   and basic indexing and search plugins. In order to use HTTPS please enable
>   protocol-httpclient, but be aware of possible intermittent problems with the
>   underlying commons-httpclient library.
>   </description>
> </property>
>
> Still, even google.com returns 407.. Any ideas?
>
> Thank you
> Suresh.
>
>
> -----Original Message-----
> From: kiran chitturi [mailto:[email protected]]
> Sent: Monday, June 03, 2013 10:44 AM
> To: [email protected]
> Subject: Re: Nutch not crawling fully
>
> > fetch of http://www.igate.com/ failed with: Http code=407, url=
> > http://www.igate.com <http://www.igate.com/ -finishing>
>
>
> Hi Suresh,
>
> The url is never successfully fetched. The http error code 407 is thrown
> away. That is the reason it is in unfetched status.
>
> > dwbilab01@dwbilab01-OptiPlex-990:~/apache-nutch-1.6$ bin/nutch readdb mondaycrawl/crawldb/ -stats
> > CrawlDb statistics start: mondaycrawl/crawldb/
> > Statistics for CrawlDb: mondaycrawl/crawldb/
> > TOTAL urls: 1
> > retry 1: 1
> > min score: 1.0
> > avg score: 1.0
> > max score: 1.0
> > status 1 (db_unfetched): 1
> > CrawlDb statistics: done
> >
> >
> > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~Disclaimer~~
> > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > Information contained and transmitted by this e-mail is confidential
> > and proprietary to iGATE and its affiliates and is intended for use
> > only by the recipient. If you are not the intended recipient, you are
> > hereby notified that any dissemination, distribution, copying or use
> > of this e-mail is strictly prohibited and you are requested to delete
> > this e-mail immediately and notify the originator or
> > [email protected]<mailto:[email protected]>. iGATE does not
> > enter into any agreement with any party by e-mail. Any views expressed
> > by an individual do not necessarily reflect the view of iGATE. iGATE
> > is not responsible for the consequences of any actions taken on the
> > basis of information provided, through this email.
> > The contents of an attachment to this e-mail may contain software
> > viruses, which could damage your own computer system. While iGATE has
> > taken every reasonable precaution to minimise this risk, we cannot
> > accept liability for any damage which you sustain as a result of
> > software viruses. You should carry out your own virus checks before
> > opening an attachment. To know more about iGATE please visit
> > www.igate.com <http://www.igate.com>.
> >
> > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
>
> --
> Kiran Chitturi
>
> <http://www.linkedin.com/in/kiranchitturi>

