I have been trying for more than a year to get NTLM to work with IIS 7.5,
without success. I was happy to see the recent Nutch 1.12 release and thought
I would give it another shot. I am almost at the point where I believe that
either it does not work with NTLM, or it does not know how to handle the
multiple 401s that are returned, or I have some fundamental problem somewhere.
I have tried everything I could think of and am at a loss as to how to solve
this mystery. My Nutch server is CentOS 7 running in VirtualBox. I am using
protocol-httpclient as indicated in the docs, but with no luck. I can fetch
anonymously, but I need NTLM to work.
I have protocol-httpclient in plugin.includes.
nutch-site.xml:
<property>
<name>http.auth.file</name>
<value>httpclient-auth.xml</value>
<description>Authentication configuration file for 'protocol-httpclient' plugin.
</description>
</property>
httpclient-auth.xml for local user:
<auth-configuration>
<credentials username="nutch" password="<somepassword>">
<default scheme="basic" port="80"/>
</credentials>
</auth-configuration>
Here is the output with a local user account on the server. One thing I notice
is that I cannot force the authentication scheme to be anything other than
NTLM, even though NTLM, Basic, and Digest are all enabled. Notice that the
configured scheme was basic, but it goes through NTLM regardless.
[root@localhost nutch]# nutch parsechecker http://<iis75.intranet>
fetching: http://<iis75.intranet>
Whitelisted hosts: [<iis75.intranet>]
http.proxy.host = null
http.proxy.port = 8080
http.proxy.exception.list = false
http.timeout = 36000
http.content.limit = 65536
http.agent = APL-Nutch-Spider/Nutch-1.12
http.accept.language = en-us,en-gb,en;q=0.7,*;q=0.3
http.accept = text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Credentials - username: nutch; set as default for realm: ; scheme: basic
Pre-configured credentials with scope - host: <iis75.intranet>; port: 80; not found for url: http://<iis75.intranet>
Authorization required
Supported authentication schemes in the order of preference: [ntlm, digest, basic]
ntlm authentication scheme selected
Using authentication scheme: ntlm
Authorization challenge processed
Authentication scope: NTLM <any realm>@<iis75.intranet>:80
Credentials required
Credentials provider not available
No credentials available for NTLM <any realm>@<iis75.intranet>:80
url: http://<iis75.intranet>; status code: 401; bytes received: 0; Content-Length: 0
401 Authentication Required
Fetch failed with protocol status: access_denied(17), lastModified=0: Authentication required: http://<iis75.intranet>
[root@localhost nutch]#
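
The output above shows the client picking ntlm from the preference list
[ntlm, digest, basic] even though the credentials were configured with scheme
basic. As a rough illustration of that kind of preference-ordered selection,
here is a minimal Python sketch. The function select_scheme is a hypothetical
helper written for this post, not Nutch or HttpClient code, and it assumes the
server offers all three schemes.

```python
# Hypothetical illustration only: this is NOT the Nutch/HttpClient source.
# It mimics "pick the first preferred scheme that the server offers".

def select_scheme(offered, preference=("ntlm", "digest", "basic")):
    """Return the first scheme in `preference` that appears in `offered`."""
    offered_lower = {scheme.lower() for scheme in offered}
    for scheme in preference:
        if scheme in offered_lower:
            return scheme
    return None  # the server offered nothing the client knows about


if __name__ == "__main__":
    # Server offers all three schemes: the preference order decides.
    print(select_scheme(["Basic", "Digest", "NTLM"]))
    # With a preference list that only contains basic, basic is chosen.
    print(select_scheme(["Basic", "Digest", "NTLM"], preference=("basic",)))
```

If the selection really works this way, the scheme attribute in
httpclient-auth.xml would only decide which stored credentials match, not
which scheme gets negotiated. That is an assumption drawn from the log, not
something I have confirmed in the code.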
httpclient-auth.xml for domain user:
<auth-configuration>
<credentials username="<domainuser>" password="<domainpassword>">
<default host="<iis75.intranet>" scheme="ntlm" port="80" realm="<domain>"/>
</credentials>
</auth-configuration>
Note: it doesn’t matter what I put in the host attribute; it doesn’t seem to change anything.
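
Since a single missing quote in httpclient-auth.xml is easy to overlook, one
sanity check is to parse the file outside Nutch and list what each credentials
element contains. The following is a minimal Python sketch using only the
standard library. summarize_auth_config is a hypothetical helper, and the
sample values (DOMAINUSER, iis75.example, and so on) are made-up stand-ins for
the redacted values above.

```python
import xml.etree.ElementTree as ET


def summarize_auth_config(xml_text):
    """Parse an auth configuration and return (username, tag, attributes) rows.

    Raises xml.etree.ElementTree.ParseError if the XML is not well-formed.
    """
    root = ET.fromstring(xml_text)
    rows = []
    for cred in root.findall("credentials"):
        for scope in cred:  # child elements such as <default .../>
            rows.append((cred.get("username"), scope.tag, dict(scope.attrib)))
    return rows


# Made-up sample values standing in for the redacted ones in this post.
SAMPLE = """<auth-configuration>
  <credentials username="DOMAINUSER" password="DOMAINPASSWORD">
    <default host="iis75.example" scheme="ntlm" port="80" realm="DOMAIN"/>
  </credentials>
</auth-configuration>"""

if __name__ == "__main__":
    for username, tag, attrs in summarize_auth_config(SAMPLE):
        print(username, tag, attrs)
```

A file with an unclosed attribute quote raises ParseError here, which at least
separates XML typos from genuine authentication problems.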
[root@localhost nutch]# nutch parsechecker http://<iis75.intranet>
fetching: http://<iis75.intranet>
Whitelisted hosts: [<iis75.intranet>]
http.proxy.host = null
http.proxy.port = 8080
http.proxy.exception.list = false
http.timeout = 36000
http.content.limit = 65536
http.agent = APL-Nutch-Spider/Nutch-1.12
http.accept.language = en-us,en-gb,en;q=0.7,*;q=0.3
http.accept = text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Credentials - username: <domainuser>"; set as default for realm: =<domain>; scheme: ntlm
Pre-configured credentials with scope - host: <iis75.intranet>; port: 80; not found for url: http://<iis75.intranet>
Authorization required
Supported authentication schemes in the order of preference: [ntlm, digest, basic]
ntlm authentication scheme selected
Using authentication scheme: ntlm
Authorization challenge processed
Authentication scope: NTLM <any realm>@<iis75.intranet>:80
Retry authentication
Authenticating with NTLM <any realm>@<iis75.intranet>:80
enter NTLMScheme.authenticate(Credentials, HttpMethod)
Authorization required
Using authentication scheme: ntlm
Authorization challenge processed
Authentication scope: NTLM <any realm>@<iis75.intranet>:80
Retry authentication
Authenticating with NTLM <any realm>@<iis75.intranet>:80
enter NTLMScheme.authenticate(Credentials, HttpMethod)
Authorization required
Using authentication scheme: ntlm
Authorization challenge processed
Authentication scope: NTLM <any realm>@<iis75.intranet>:80
Credentials required
Credentials provider not available
Failure authenticating with NTLM <any realm>@<iis75.intranet>:80
url: http://<iis75.intranet>; status code: 401; bytes received: 0; Content-Length: 0
401 Authentication Required
Fetch failed with protocol status: access_denied(17), lastModified=0: Authentication required: http://<iis75.intranet>
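
For context on the multiple 401s: NTLM over HTTP normally involves two 401
responses before success. The first advertises NTLM, and the second carries
the server's challenge. A third 401, as in the output above, usually suggests
that the final authenticate message was rejected. The Python sketch below
counts the rounds in a console excerpt. count_auth_rounds is a hypothetical
helper, and treating each "Authorization required" line as one 401 round is my
assumption about how to read this log.

```python
def count_auth_rounds(log_lines):
    """Return (number of 'Authorization required' rounds, whether auth failed)."""
    rounds = sum(1 for line in log_lines if "Authorization required" in line)
    failed = any("Failure authenticating" in line for line in log_lines)
    return rounds, failed


# Excerpt copied from the domain-user run above.
EXCERPT = """Authorization required
ntlm authentication scheme selected
Retry authentication
enter NTLMScheme.authenticate(Credentials, HttpMethod)
Authorization required
Retry authentication
enter NTLMScheme.authenticate(Credentials, HttpMethod)
Authorization required
Credentials required
Credentials provider not available
Failure authenticating with NTLM <any realm>@<iis75.intranet>:80""".splitlines()

if __name__ == "__main__":
    rounds, failed = count_auth_rounds(EXCERPT)
    print(rounds, failed)  # prints: 3 True
```

Three rounds followed by a failure would be consistent with the handshake
itself completing and the credentials (or the domain/host values sent with
them) being refused, although the log alone does not prove that.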
Last entries in hadoop.log:
2016-11-02 12:08:49,568 INFO parse.ParserChecker - fetching: http://<iis75.intranet>
2016-11-02 12:08:50,040 DEBUG util.ObjectCache - No object cache found for conf=Configuration: core-default.xml, core-site.xml, nutch-default.xml, nutch-site.xml, instantiating a new object cache
2016-11-02 12:08:50,119 INFO protocol.RobotRulesParser - Whitelisted hosts: [<iis75.intranet>]
2016-11-02 12:08:50,119 INFO httpclient.Http - http.proxy.host = null
2016-11-02 12:08:50,119 INFO httpclient.Http - http.proxy.port = 8080
2016-11-02 12:08:50,119 INFO httpclient.Http - http.proxy.exception.list = false
2016-11-02 12:08:50,119 INFO httpclient.Http - http.timeout = 36000
2016-11-02 12:08:50,119 INFO httpclient.Http - http.content.limit = 65536
2016-11-02 12:08:50,119 INFO httpclient.Http - http.agent = APL-Nutch-Spider/Nutch-1.12 ([email protected])
2016-11-02 12:08:50,120 INFO httpclient.Http - http.accept.language = en-us,en-gb,en;q=0.7,*;q=0.3
2016-11-02 12:08:50,120 INFO httpclient.Http - http.accept = text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
2016-11-02 12:08:50,133 TRACE httpclient.Http - Credentials - username: <domainuser>; set as default for realm: <domain>; scheme: ntlm
2016-11-02 12:08:50,134 TRACE httpclient.Http - Pre-configured credentials with scope - host: <iis75.intranet>; port: 80; not found for url: http://<iis75.intranet>
2016-11-02 12:08:50,313 DEBUG httpclient.HttpMethodDirector - Authorization required
2016-11-02 12:08:50,320 DEBUG auth.AuthChallengeProcessor - Supported authentication schemes in the order of preference: [ntlm, digest, basic]
2016-11-02 12:08:50,320 INFO auth.AuthChallengeProcessor - ntlm authentication scheme selected
2016-11-02 12:08:50,320 DEBUG auth.AuthChallengeProcessor - Using authentication scheme: ntlm
2016-11-02 12:08:50,320 DEBUG auth.AuthChallengeProcessor - Authorization challenge processed
2016-11-02 12:08:50,320 DEBUG httpclient.HttpMethodDirector - Authentication scope: NTLM <any realm>@<iis75.intranet>:80
2016-11-02 12:08:50,320 DEBUG httpclient.HttpMethodDirector - Retry authentication
2016-11-02 12:08:50,321 DEBUG httpclient.HttpMethodDirector - Authenticating with NTLM <any realm>@<iis75.intranet>:80
2016-11-02 12:08:50,321 TRACE auth.NTLMScheme - enter NTLMScheme.authenticate(Credentials, HttpMethod)
2016-11-02 12:08:50,351 DEBUG httpclient.HttpMethodDirector - Authorization required
2016-11-02 12:08:50,352 DEBUG auth.AuthChallengeProcessor - Using authentication scheme: ntlm
2016-11-02 12:08:50,352 DEBUG auth.AuthChallengeProcessor - Authorization challenge processed
2016-11-02 12:08:50,352 DEBUG httpclient.HttpMethodDirector - Authentication scope: NTLM <any realm>@<iis75.intranet>:80
2016-11-02 12:08:50,352 DEBUG httpclient.HttpMethodDirector - Retry authentication
2016-11-02 12:08:50,352 DEBUG httpclient.HttpMethodDirector - Authenticating with NTLM <any realm>@<iis75.intranet>:80
2016-11-02 12:08:50,352 TRACE auth.NTLMScheme - enter NTLMScheme.authenticate(Credentials, HttpMethod)
2016-11-02 12:08:50,393 DEBUG httpclient.HttpMethodDirector - Authorization required
2016-11-02 12:08:50,393 DEBUG auth.AuthChallengeProcessor - Using authentication scheme: ntlm
2016-11-02 12:08:50,393 DEBUG auth.AuthChallengeProcessor - Authorization challenge processed
2016-11-02 12:08:50,393 DEBUG httpclient.HttpMethodDirector - Authentication scope: NTLM <any realm>@<iis75.intranet>:80
2016-11-02 12:08:50,393 DEBUG httpclient.HttpMethodDirector - Credentials required
2016-11-02 12:08:50,393 DEBUG httpclient.HttpMethodDirector - Credentials provider not available
2016-11-02 12:08:50,393 INFO httpclient.HttpMethodDirector - Failure authenticating with NTLM <any realm>@<iis75.intranet>:80
2016-11-02 12:08:50,395 TRACE httpclient.Http - url: http://<iis75.intranet>; status code: 401; bytes received: 0; Content-Length: 0
2016-11-02 12:08:50,681 DEBUG util.ObjectCache - No object cache found for conf=Configuration: core-default.xml, core-site.xml, nutch-default.xml, nutch-site.xml, instantiating a new object cache
2016-11-02 12:08:50,804 TRACE httpclient.Http - 401 Authentication Required
Any help is appreciated, as I am about to move on to another spider for Solr.
Thanks,
Bob