<iis75.intranet> is just a string replacement for our actual intranet
name, something like blah.intranet.org. I use the <> convention when I
am obscuring actual data.

What would the log4j.properties entry for httpclient.Http be? I see
it is only logging at INFO level, but I do not know the proper
object path to set it up.
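
For reference, a possible sketch of such entries. It assumes that the
short logger names in hadoop.log (httpclient.Http, auth.NTLMScheme)
abbreviate the packages org.apache.nutch.protocol.httpclient (the Nutch
plugin) and org.apache.commons.httpclient (the Commons HttpClient 3.x
library the plugin uses), and that the file lives in conf/. It is
untested against this setup:

    # conf/log4j.properties (assumed location in a Nutch 1.x install)
    # Nutch protocol-httpclient plugin classes, e.g. httpclient.Http
    log4j.logger.org.apache.nutch.protocol.httpclient=TRACE
    # Commons HttpClient classes, e.g. HttpMethodDirector, auth.NTLMScheme
    log4j.logger.org.apache.commons.httpclient=DEBUG

Both loggers inherit whatever appender the root logger already uses, so
the extra output should land in the same log file as the INFO lines.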

Thanks,
Bob

>Hi Bob,
>
>Do you write the host as <iis75.intranet> or as iis75.intranet?
>
>Kind Regards,
>Furkan KAMACI

-----Original Message-----
From: Bell, Bob 
Sent: Wednesday, November 02, 2016 12:17 PM
To: '[email protected]' <[email protected]>
Cc: Bell, Bob <[email protected]>
Subject: Nutch 1.12 NTLM authentication IIS 7.5 Intranet

I have been trying for more than a year to get NTLM to work with IIS 7.5,
without success. I was happy to see the recent 1.12 release and thought I
would give it a shot again. I am almost at the point where I believe that
either it does not work with NTLM, or it does not know how to handle the
multiple 401s that are returned, or I have some fundamental problem
somewhere. I have tried everything I could think of and am at a loss on how
to solve this mystery. My Nutch server is CentOS 7 in a VirtualBox VM. I am
using httpclient as indicated in the docs, but with no luck. I can fetch
anonymously, but I need NTLM to work.

My plugin.includes setting includes protocol-httpclient.

nutch-site.xml:
<property>
<name>http.auth.file</name>
<value>httpclient-auth.xml</value>
<description>Authentication configuration file for 'protocol-httpclient' plugin.
</description>
</property>

httpclient-auth.xml for local user:
<auth-configuration>
    <credentials username="nutch" password="<somepassword>">
        <default  scheme="basic" port="80"/>
    </credentials>
</auth-configuration>

Here is the output with a local user account on the server. One thing I
notice is that I cannot force authentication to be anything other than ntlm,
even though I support ntlm, basic, and digest. Notice that the scheme was
basic, but it goes through ntlm regardless.

[root@localhost nutch]# nutch parsechecker http://<iis75.intranet>
fetching: http://<iis75.intranet>
Whitelisted hosts: [<iis75.intranet>]
http.proxy.host = null
http.proxy.port = 8080
http.proxy.exception.list = false
http.timeout = 36000
http.content.limit = 65536
http.agent = APL-Nutch-Spider/Nutch-1.12
http.accept.language = en-us,en-gb,en;q=0.7,*;q=0.3
http.accept = text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Credentials - username: nutch; set as default for realm: ; scheme: basic
Pre-configured credentials with scope -  host: <iis75.intranet>; port: 80; not found for url: http://<iis75.intranet>
Authorization required
Supported authentication schemes in the order of preference: [ntlm, digest, basic]
ntlm authentication scheme selected
Using authentication scheme: ntlm
Authorization challenge processed
Authentication scope: NTLM <any realm>@<iis75.intranet>:80
Credentials required
Credentials provider not available
No credentials available for NTLM <any realm>@<iis75.intranet>:80
url: http://<iis75.intranet>; status code: 401; bytes received: 0; Content-Length: 0
401 Authentication Required
Fetch failed with protocol status: access_denied(17), lastModified=0: Authentication required: http://<iis75.intranet>
[root@localhost nutch]#


httpclient-auth.xml for domain  user:
<auth-configuration>
    <credentials username="<domainuser>" password="<domainpassword>">
        <default host="<iis75.intranet>" scheme="ntlm" port="80" realm="<domain>"/>
    </credentials>
</auth-configuration>

Note: it doesn’t matter what I put in the host attribute; it doesn’t seem to
change anything.
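
For comparison, a sketch of a host-scoped variant of the same file. It
assumes, going by the comments in the httpclient-auth.xml template that
ships with Nutch, that host and port are only read from an <authscope>
element and that <default> takes only realm and scheme. The placeholder
values are the same obscured ones used above, and this has not been
tested against IIS 7.5:

    <auth-configuration>
        <credentials username="<domainuser>" password="<domainpassword>">
            <!-- assumption: host/port scoping belongs on authscope, not default -->
            <authscope host="<iis75.intranet>" port="80" scheme="ntlm" realm="<domain>"/>
        </credentials>
    </auth-configuration>

If that assumption holds, it would explain why changing the host value on
<default> has no visible effect.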

[root@localhost nutch]# nutch parsechecker http://<iis75.intranet>
fetching: http://<iis75.intranet>
Whitelisted hosts: [<iis75.intranet>]
http.proxy.host = null
http.proxy.port = 8080
http.proxy.exception.list = false
http.timeout = 36000
http.content.limit = 65536
http.agent = APL-Nutch-Spider/Nutch-1.12
http.accept.language = en-us,en-gb,en;q=0.7,*;q=0.3
http.accept = text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Credentials - username: <domainuser>"; set as default for realm: =<domain>; scheme: ntlm
Pre-configured credentials with scope -  host: <iis75.intranet>; port: 80; not found for url: http://<iis75.intranet>
Authorization required
Supported authentication schemes in the order of preference: [ntlm, digest, basic]
ntlm authentication scheme selected
Using authentication scheme: ntlm
Authorization challenge processed
Authentication scope: NTLM <any realm>@<iis75.intranet>:80
Retry authentication
Authenticating with NTLM <any realm>@<iis75.intranet>:80
enter NTLMScheme.authenticate(Credentials, HttpMethod)
Authorization required
Using authentication scheme: ntlm
Authorization challenge processed
Authentication scope: NTLM <any realm>@<iis75.intranet>:80
Retry authentication
Authenticating with NTLM <any realm>@<iis75.intranet>:80
enter NTLMScheme.authenticate(Credentials, HttpMethod)
Authorization required
Using authentication scheme: ntlm
Authorization challenge processed
Authentication scope: NTLM <any realm>@<iis75.intranet>:80
Credentials required
Credentials provider not available
Failure authenticating with NTLM <any realm>@<iis75.intranet>:80
url: http://<iis75.intranet>; status code: 401; bytes received: 0; Content-Length: 0
401 Authentication Required
Fetch failed with protocol status: access_denied(17), lastModified=0: Authentication required: http://<iis75.intranet>

Last entries in hadoop.log:

2016-11-02 12:08:49,568 INFO  parse.ParserChecker - fetching: http://<iis75.intranet>
2016-11-02 12:08:50,040 DEBUG util.ObjectCache - No object cache found for conf=Configuration: core-default.xml, core-site.xml, nutch-default.xml, nutch-site.xml, instantiating a new object cache
2016-11-02 12:08:50,119 INFO  protocol.RobotRulesParser - Whitelisted hosts: [<iis75.intranet>]
2016-11-02 12:08:50,119 INFO  httpclient.Http - http.proxy.host = null
2016-11-02 12:08:50,119 INFO  httpclient.Http - http.proxy.port = 8080
2016-11-02 12:08:50,119 INFO  httpclient.Http - http.proxy.exception.list = false
2016-11-02 12:08:50,119 INFO  httpclient.Http - http.timeout = 36000
2016-11-02 12:08:50,119 INFO  httpclient.Http - http.content.limit = 65536
2016-11-02 12:08:50,119 INFO  httpclient.Http - http.agent = APL-Nutch-Spider/Nutch-1.12 ([email protected])
2016-11-02 12:08:50,120 INFO  httpclient.Http - http.accept.language = en-us,en-gb,en;q=0.7,*;q=0.3
2016-11-02 12:08:50,120 INFO  httpclient.Http - http.accept = text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
2016-11-02 12:08:50,133 TRACE httpclient.Http - Credentials - username: <domainuser>; set as default for realm: <domain>; scheme: ntlm
2016-11-02 12:08:50,134 TRACE httpclient.Http - Pre-configured credentials with scope -  host: <iis75.intranet>; port: 80; not found for url: http://<iis75.intranet>
2016-11-02 12:08:50,313 DEBUG httpclient.HttpMethodDirector - Authorization required
2016-11-02 12:08:50,320 DEBUG auth.AuthChallengeProcessor - Supported authentication schemes in the order of preference: [ntlm, digest, basic]
2016-11-02 12:08:50,320 INFO  auth.AuthChallengeProcessor - ntlm authentication scheme selected
2016-11-02 12:08:50,320 DEBUG auth.AuthChallengeProcessor - Using authentication scheme: ntlm
2016-11-02 12:08:50,320 DEBUG auth.AuthChallengeProcessor - Authorization challenge processed
2016-11-02 12:08:50,320 DEBUG httpclient.HttpMethodDirector - Authentication scope: NTLM <any realm>@<iis75.intranet>:80
2016-11-02 12:08:50,320 DEBUG httpclient.HttpMethodDirector - Retry authentication
2016-11-02 12:08:50,321 DEBUG httpclient.HttpMethodDirector - Authenticating with NTLM <any realm>@<iis75.intranet>:80
2016-11-02 12:08:50,321 TRACE auth.NTLMScheme - enter NTLMScheme.authenticate(Credentials, HttpMethod)
2016-11-02 12:08:50,351 DEBUG httpclient.HttpMethodDirector - Authorization required
2016-11-02 12:08:50,352 DEBUG auth.AuthChallengeProcessor - Using authentication scheme: ntlm
2016-11-02 12:08:50,352 DEBUG auth.AuthChallengeProcessor - Authorization challenge processed
2016-11-02 12:08:50,352 DEBUG httpclient.HttpMethodDirector - Authentication scope: NTLM <any realm>@<iis75.intranet>:80
2016-11-02 12:08:50,352 DEBUG httpclient.HttpMethodDirector - Retry authentication
2016-11-02 12:08:50,352 DEBUG httpclient.HttpMethodDirector - Authenticating with NTLM <any realm>@<iis75.intranet>:80
2016-11-02 12:08:50,352 TRACE auth.NTLMScheme - enter NTLMScheme.authenticate(Credentials, HttpMethod)
2016-11-02 12:08:50,393 DEBUG httpclient.HttpMethodDirector - Authorization required
2016-11-02 12:08:50,393 DEBUG auth.AuthChallengeProcessor - Using authentication scheme: ntlm
2016-11-02 12:08:50,393 DEBUG auth.AuthChallengeProcessor - Authorization challenge processed
2016-11-02 12:08:50,393 DEBUG httpclient.HttpMethodDirector - Authentication scope: NTLM <any realm>@<iis75.intranet>:80
2016-11-02 12:08:50,393 DEBUG httpclient.HttpMethodDirector - Credentials required
2016-11-02 12:08:50,393 DEBUG httpclient.HttpMethodDirector - Credentials provider not available
2016-11-02 12:08:50,393 INFO  httpclient.HttpMethodDirector - Failure authenticating with NTLM <any realm>@<iis75.intranet>:80
2016-11-02 12:08:50,395 TRACE httpclient.Http - url: http://<iis75.intranet>; status code: 401; bytes received: 0; Content-Length: 0
2016-11-02 12:08:50,681 DEBUG util.ObjectCache - No object cache found for conf=Configuration: core-default.xml, core-site.xml, nutch-default.xml, nutch-site.xml, instantiating a new object cache
2016-11-02 12:08:50,804 TRACE httpclient.Http - 401 Authentication Required

Any help is appreciated, as I am about to move on to another spider for Solr.

Thanks,
Bob
