All -

I've tried several 1.x versions of Nutch and a variety of configurations and
simply can NOT get NTLM authentication working with Nutch. I need help
desperately!

Here are the relevant configuration points.
Note: "user", "password", and "ntdomain" are, of course, placeholders for the
real values.

httpclient-auth.xml:
<credentials username="user" password="password">
        <default realm="ntdomain"/>
</credentials>
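
For context, here is how that fragment would sit in a complete file, as a sketch only. It assumes the `<auth-configuration>` root element and the optional `<authscope>` element described in the Nutch HTTP authentication documentation; the host, port, and scheme values on `<authscope>` are illustrative and are not part of my actual configuration.

```xml
<?xml version="1.0"?>
<!-- Sketch only: the root element and the authscope entry are assumptions,
     not a copy of my real file -->
<auth-configuration>
  <credentials username="user" password="password">
    <!-- default scope; realm is set to the NT domain, as in the fragment above -->
    <default realm="ntdomain"/>
    <!-- hypothetical explicit scope pinning host, port, and scheme -->
    <authscope host="url.com" port="80" realm="ntdomain" scheme="NTLM"/>
  </credentials>
</auth-configuration>
```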

nutch-site.xml:
<property>
  <name>plugin.includes</name>
  <value>protocol-(http|httpclient)|urlfilter-(regex|validator)|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description> </description>
</property>

Logged output (note that, yes, this is from 1.5.1, but 1.15 produces
similar results):
2019-04-25 07:38:47,641 INFO  parse.ParserChecker - fetching:
http://url.com/crawltest.html
2019-04-25 07:38:47,650 INFO  plugin.PluginRepository - Plugins: looking in:
C:\nutch\apache-nutch-1.5.1\plugins
2019-04-25 07:38:47,728 INFO  plugin.PluginRepository - Plugin
Auto-activation mode: [true]
2019-04-25 07:38:47,729 INFO  plugin.PluginRepository - Registered Plugins:
2019-04-25 07:38:47,729 INFO  plugin.PluginRepository -         Html Parse
Plug-in (parse-html)
2019-04-25 07:38:47,729 INFO  plugin.PluginRepository -         HTTP Framework
(lib-http)
2019-04-25 07:38:47,729 INFO  plugin.PluginRepository -         Http / Https
Protocol Plug-in (protocol-httpclient)
2019-04-25 07:38:47,729 INFO  plugin.PluginRepository -         Regex URL Filter
(urlfilter-regex)
2019-04-25 07:38:47,733 INFO  plugin.PluginRepository -         the nutch core
extension points (nutch-extensionpoints)
2019-04-25 07:38:47,733 INFO  plugin.PluginRepository -         Basic Indexing
Filter (index-basic)
2019-04-25 07:38:47,733 INFO  plugin.PluginRepository -         Anchor Indexing
Filter (index-anchor)
2019-04-25 07:38:47,733 INFO  plugin.PluginRepository -         Tika Parser
Plug-in (parse-tika)
2019-04-25 07:38:47,733 INFO  plugin.PluginRepository -         Basic URL
Normalizer (urlnormalizer-basic)
2019-04-25 07:38:47,733 INFO  plugin.PluginRepository -         Regex URL Filter
Framework (lib-regex-filter)
2019-04-25 07:38:47,733 INFO  plugin.PluginRepository -         Regex URL
Normalizer (urlnormalizer-regex)
2019-04-25 07:38:47,733 INFO  plugin.PluginRepository -         URL Validator
(urlfilter-validator)
2019-04-25 07:38:47,733 INFO  plugin.PluginRepository -         CyberNeko HTML
Parser (lib-nekohtml)
2019-04-25 07:38:47,733 INFO  plugin.PluginRepository -         Pass-through URL
Normalizer (urlnormalizer-pass)
2019-04-25 07:38:47,733 INFO  plugin.PluginRepository -         OPIC Scoring
Plug-in (scoring-opic)
2019-04-25 07:38:47,733 INFO  plugin.PluginRepository -         Http Protocol
Plug-in (protocol-http)
2019-04-25 07:38:47,733 INFO  plugin.PluginRepository - Registered
Extension-Points:
2019-04-25 07:38:47,733 INFO  plugin.PluginRepository -         Nutch Content
Parser (org.apache.nutch.parse.Parser)
2019-04-25 07:38:47,733 INFO  plugin.PluginRepository -         Nutch URL Filter
(org.apache.nutch.net.URLFilter)
2019-04-25 07:38:47,733 INFO  plugin.PluginRepository -         HTML Parse
Filter (org.apache.nutch.parse.HtmlParseFilter)
2019-04-25 07:38:47,733 INFO  plugin.PluginRepository -         Nutch Scoring
(org.apache.nutch.scoring.ScoringFilter)
2019-04-25 07:38:47,733 INFO  plugin.PluginRepository -         Nutch URL
Normalizer (org.apache.nutch.net.URLNormalizer)
2019-04-25 07:38:47,733 INFO  plugin.PluginRepository -         Nutch Protocol
(org.apache.nutch.protocol.Protocol)
2019-04-25 07:38:47,733 INFO  plugin.PluginRepository -         Nutch Segment
Merge Filter (org.apache.nutch.segment.SegmentMergeFilter)
2019-04-25 07:38:47,733 INFO  plugin.PluginRepository -         Nutch Indexing
Filter (org.apache.nutch.indexer.IndexingFilter)
2019-04-25 07:38:47,761 INFO  httpclient.Http - http.proxy.host = null
2019-04-25 07:38:47,762 INFO  httpclient.Http - http.proxy.port = 8080
2019-04-25 07:38:47,763 INFO  httpclient.Http - http.timeout = 10000
2019-04-25 07:38:47,763 INFO  httpclient.Http - http.content.limit = -1
2019-04-25 07:38:47,763 INFO  httpclient.Http - http.agent = Ulinenet
Spider/Nutch-1.5.1
2019-04-25 07:38:47,764 INFO  httpclient.Http - http.accept.language =
en-us,en-gb,en;q=0.7,*;q=0.3
2019-04-25 07:38:47,764 INFO  httpclient.Http - http.accept =
text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
2019-04-25 07:38:47,835 DEBUG auth.AuthChallengeProcessor - Supported
authentication schemes in the order of preference: [ntlm, digest, basic]
2019-04-25 07:38:47,836 INFO  auth.AuthChallengeProcessor - ntlm
authentication scheme selected
2019-04-25 07:38:47,837 DEBUG auth.AuthChallengeProcessor - Using
authentication scheme: ntlm
2019-04-25 07:38:47,837 DEBUG auth.AuthChallengeProcessor - Authorization
challenge processed
2019-04-25 07:38:47,847 DEBUG auth.AuthChallengeProcessor - Using
authentication scheme: ntlm
2019-04-25 07:38:47,847 DEBUG auth.AuthChallengeProcessor - Authorization
challenge processed
2019-04-25 07:38:48,335 DEBUG auth.AuthChallengeProcessor - Using
authentication scheme: ntlm
2019-04-25 07:38:48,336 DEBUG auth.AuthChallengeProcessor - Authorization
challenge processed
2019-04-25 07:38:48,337 INFO  httpclient.HttpMethodDirector - Failure
authenticating with NTLM <any realm>@url.com:80
2019-04-25 07:38:48,507 INFO  crawl.SignatureFactory - Using Signature impl:
org.apache.nutch.crawl.MD5Signature
2019-04-25 07:38:48,509 INFO  parse.ParserChecker - parsing:
http://url.com/crawltest.html
2019-04-25 07:38:48,509 INFO  parse.ParserChecker - contentType:
application/xhtml+xml
2019-04-25 07:38:48,510 INFO  parse.ParserChecker - signature:
495abb7f991fb4dd6a056f748908a2d9

How I'm testing:
bin/nutch parsechecker http://url.com/crawltest.html

Finally, I should note that the following curl command DOES work:
curl --ntlm --user user:password http://url.com/crawltest.html






--
Sent from: http://lucene.472066.n3.nabble.com/Nutch-User-f603147.html
