You can turn on more logging. Add this to conf/log4j.properties:

log4j.logger.org.apache.nutch.protocol.httpclient=DEBUG,cmdstdout
log4j.logger.org.apache.commons.httpclient.auth=DEBUG,cmdstdout

I have never used this myself, but go through [0]
and httpclient-auth.xml.

[0] : https://wiki.apache.org/nutch/HttpAuthenticationSchemes
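
For reference, here is a rough sketch of the proxy properties as they might
look in conf/nutch-site.xml. The host, user name, password and domain values
are placeholders, not values from this thread. The http.proxy.realm entry is
an assumption on my part: as far as I recall from nutch-default.xml, that is
where the NTLM domain name is supposed to go, so it may be worth checking
whether it is set.

```xml
<!-- conf/nutch-site.xml: proxy settings sketch; every value is a placeholder -->
<property>
  <name>http.proxy.host</name>
  <value>proxy.example.com</value>
</property>
<property>
  <name>http.proxy.port</name>
  <value>8080</value>
</property>
<property>
  <name>http.proxy.username</name>
  <value>proxyuser</value>
</property>
<property>
  <name>http.proxy.password</name>
  <value>secret</value>
</property>
<property>
  <!-- Assumption: for NTLM the domain name goes here; verify in nutch-default.xml -->
  <name>http.proxy.realm</name>
  <value>EXAMPLEDOMAIN</value>
</property>
```

With the DEBUG loggers enabled, the auth output should give a better hint
about which credentials, if any, are actually sent to the proxy.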

On Mon, Jun 3, 2013 at 4:33 AM, Suresh V S <[email protected]> wrote:

> The logs say that ntlm has been selected for the proxy authentication, but
> the authentication continues to fail.
> Below is the log section. http.proxy.username and http.proxy.password are
> provided in conf/nutch-site.xml but they don't show up in the log.
>
> 2013-06-03 16:43:48,799 INFO  httpclient.Http - http.proxy.host = 10.x.y.z
> 2013-06-03 16:43:48,799 INFO  httpclient.Http - http.proxy.port = 8080
> 2013-06-03 16:43:48,799 INFO  httpclient.Http - http.timeout = 10000
> 2013-06-03 16:43:48,799 INFO  httpclient.Http - http.content.limit = 65536
> 2013-06-03 16:43:48,799 INFO  httpclient.Http - http.agent = myspider/Nutch-1.6
> 2013-06-03 16:43:48,799 INFO  httpclient.Http - http.accept.language = en-us,en-gb,en;q=0.7,*;q=0.3
> 2013-06-03 16:43:48,799 INFO  httpclient.Http - http.accept = text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
> 2013-06-03 16:43:48,851 INFO  auth.AuthChallengeProcessor - ntlm authentication scheme selected
> 2013-06-03 16:43:49,054 INFO  httpclient.HttpMethodDirector - Failure authenticating with NTLM <any realm>@10.212.2.66:8080
> 2013-06-03 16:43:49,244 INFO  crawl.SignatureFactory - Using Signature impl: org.apache.nutch.crawl.MD5Signature
> 2013-06-03 16:43:49,245 INFO  parse.ParserChecker - parsing: http://www.google.com
> 2013-06-03 16:43:49,246 INFO  parse.ParserChecker - contentType: text/html
> 2013-06-03 16:43:49,246 INFO  parse.ParserChecker - signature: 0d50f5f66ddb69b21f21ab0ad5b3d034
>
> Suresh.
>
> -----Original Message-----
> From: Suresh V S [mailto:[email protected]]
> Sent: Monday, June 03, 2013 1:37 PM
> To: [email protected]
> Subject: RE: Nutch not crawling fully
>
> Thanks for pointing that out, Kiran. My bad, I overlooked it.
>
> I'm trying hard to authenticate with our proxy but always end up with
> HTTP 407.
>
> My conf/nutch-site.xml has the http.proxy.host, http.proxy.port,
> http.proxy.username, http.proxy.password values set correctly.
> The plugin.includes has the following:
> <property>
>   <name>plugin.includes</name>
>   <value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>   <description>Regular expression naming plugin directory names to
>   include.  Any plugin not matching this expression is excluded.
>   In any case you need at least include the nutch-extensionpoints plugin. By
>   default Nutch includes crawling just HTML and plain text via HTTP,
>   and basic indexing and search plugins. In order to use HTTPS please enable
>   protocol-httpclient, but be aware of possible intermittent problems with the
>   underlying commons-httpclient library.
>   </description>
> </property>
>
> Still, even google.com returns 407. Any ideas?
>
> Thank you
> Suresh.
>
>
> -----Original Message-----
> From: kiran chitturi [mailto:[email protected]]
> Sent: Monday, June 03, 2013 10:44 AM
> To: [email protected]
> Subject: Re: Nutch not crawling fully
>
> > fetch of http://www.igate.com/ failed with: Http code=407, url=
> > http://www.igate.com <http://www.igate.com/ -finishing>
>
>
> Hi Suresh,
>
> The URL was never fetched successfully: the fetch failed with HTTP error
> code 407 (Proxy Authentication Required). That is why it is still in
> unfetched status.
>
> >
> >
> >
> >
> > dwbilab01@dwbilab01-OptiPlex-990:~/apache-nutch-1.6$ bin/nutch readdb mondaycrawl/crawldb/ -stats
> > CrawlDb statistics start: mondaycrawl/crawldb/
> > Statistics for CrawlDb: mondaycrawl/crawldb/
> > TOTAL urls:     1
> > retry 1:        1
> > min score:      1.0
> > avg score:      1.0
> > max score:      1.0
> > status 1 (db_unfetched):        1
> > CrawlDb statistics: done
> >
> >
> >
> >
> >
> >
> > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~Disclaimer~~
> > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > Information contained and transmitted by this e-mail is confidential
> > and proprietary to iGATE and its affiliates and is intended for use
> > only by the recipient. If you are not the intended recipient, you are
> > hereby notified that any dissemination, distribution, copying or use
> > of this e-mail is strictly prohibited and you are requested to delete
> > this e-mail immediately and notify the originator or 
> > [email protected]<mailto:
> > [email protected]>. iGATE does not enter into any agreement with any
> > party by e-mail. Any views expressed by an individual do not
> > necessarily reflect the view of iGATE. iGATE is not responsible for
> > the consequences of any actions taken on the basis of information
> > provided, through this email.
> > The contents of an attachment to this e-mail may contain software
> > viruses, which could damage your own computer system. While iGATE has
> > taken every reasonable precaution to minimise this risk, we cannot
> > accept liability for any damage which you sustain as a result of
> > software viruses. You should carry out your own virus checks before
> > opening an attachment. To know more about iGATE please visit
> > www.igate.com <http://www.igate.com>.
> >
> > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> >
>
>
>
> --
> Kiran Chitturi
>
> <http://www.linkedin.com/in/kiranchitturi>
>
>
>
>
>
>
>
