Hi, we are using Nutch 1.2 to crawl our intranet pages, which require
authentication.

We followed the steps listed on the Nutch wiki:

http://wiki.apache.org/nutch/HttpAuthenticationSchemes


We have overridden the 'plugin.includes' property of
'conf/nutch-default.xml' in 'conf/nutch-site.xml' and replaced
'protocol-http' with 'protocol-httpclient'.

Content of our nutch-site.xml:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>http.robots.agents</name>
    <value>nutch-solr-integration-test,*</value>
    <description></description>
  </property>
  <property>
    <name>http.agent.name</name>
    <value>nutch-solr-integration-test</value>
    <description>CPD AS Robots Name</description>
  </property>
  <property>
    <name>http.agent.description</name>
    <value>CPD Web Crawler using Nutch 1.2</value>
    <description></description>
  </property>
  <property>
    <name>http.agent.url</name>
    <value>http://devcpd1.lexus.com/</value>
    <description></description>
  </property>
  <property>
    <name>http.agent.email</name>
    <value>[email protected]</value>
    <description></description>
  </property>
  <property>
    <name>http.agent.version</name>
    <value></value>
    <description></description>
  </property>
  <property>
    <name>http.agent.host</name>
    <value>10.63.48.2</value>
    <description>Name or IP address of the host on which the Nutch crawler
    would be running. Currently this is used by 'protocol-httpclient'
    plugin.
    </description>
  </property>
  <property>
    <name>generate.max.per.host</name>
    <value>100</value>
  </property>
  <property>
    <name>plugin.includes</name>

    <value>protocol-httpclient|urlfilter-regex|parse-html|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>
  <property>
    <name>http.useHttp11</name>
    <value>true</value>
    <description>NOTE: at the moment this works only for protocol-httpclient.
    If true, use HTTP 1.1, if false use HTTP 1.0 .
    </description>
  </property>
</configuration>

Content of our httpclient-auth.xml:

<auth-configuration>
        <credentials username="148606" password="d1e9n7i6s">
          <default/>
          <authscope host="10.52.112.12" port="80" scheme="NTLM"/>
          <authscope host="10.52.21.83" port="80" scheme="NTLM"/>
       </credentials>
</auth-configuration>
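
For comparison, a host-name-based variant of this file might look like the
sketch below. It is untested and rests on assumptions: the host name is the
one that appears in the fetch log, the 'realm' attribute is used as I
understand it from the wiki page (for NTLM it is supposed to carry the
Windows domain name), and EXAMPLE_USER, EXAMPLE_PASSWORD and EXAMPLEDOMAIN
are hypothetical placeholders, not real values.

```xml
<?xml version="1.0"?>
<!-- Sketch only: placeholder credentials and domain, not a verified fix. -->
<auth-configuration>
  <credentials username="EXAMPLE_USER" password="EXAMPLE_PASSWORD">
    <!-- Scope keyed on the host name the fetcher actually requests,
         rather than on an IP address. -->
    <!-- Assumption: for NTLM, 'realm' holds the Windows domain name. -->
    <authscope host="tv.tms.toyota.com" port="80"
               realm="EXAMPLEDOMAIN" scheme="NTLM"/>
  </credentials>
</auth-configuration>
```

Whether the plugin matches auth scopes by host name or by resolved IP
address is something we have not confirmed.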

Content of regex-urlfilter.txt:

# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(swf|SWF|doc|DOC|mp3|MP3|WMV|wmv|txt|TXT|rtf|RTF|avi|AVI|m3u|M3U|flv|FLV|WAV|wav|mp4|MP4|avi|AVI|rss|RSS|xml|XML|pdf|PDF|js|JS|gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

# skip URLs containing certain characters as probable queries, etc.
-[*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept anything else
+.



The crawler works fine against sites that require no authentication. But
when crawling the intranet pages that require authentication, it fails with
the following messages:

2011-02-23 15:37:00,214 INFO  fetcher.Fetcher - fetching http://tv.tms.toyota.com/toyotavision/tv_links.asp
2011-02-23 15:37:00,214 INFO  fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
2011-02-23 15:37:00,215 INFO  fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
2011-02-23 15:37:00,214 INFO  fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
2011-02-23 15:37:00,215 INFO  fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
2011-02-23 15:37:00,216 INFO  fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
2011-02-23 15:37:00,216 INFO  fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
2011-02-23 15:37:00,216 INFO  fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
2011-02-23 15:37:00,270 INFO  httpclient.Http - http.proxy.host = null
2011-02-23 15:37:00,271 INFO  httpclient.Http - http.proxy.port = 8080
2011-02-23 15:37:00,271 INFO  httpclient.Http - http.timeout = 10000
2011-02-23 15:37:00,271 INFO  httpclient.Http - http.content.limit = 65536
2011-02-23 15:37:00,271 INFO  httpclient.Http - http.agent = nutch-solr-integration-test/Nutch-1.2 (CPD Web Crawler using Nutch 1.2; http://devcpd1.lexus.com/; [email protected])
2011-02-23 15:37:00,271 INFO  httpclient.Http - http.accept.language = en-us,en-gb,en;q=0.7,*;q=0.3
2011-02-23 15:37:00,271 INFO  httpclient.Http - protocol.plugin.check.blocking = false
2011-02-23 15:37:00,272 INFO  httpclient.Http - protocol.plugin.check.robots = false
2011-02-23 15:37:00,470 DEBUG auth.AuthChallengeProcessor - Supported authentication schemes in the order of preference: [ntlm, digest, basic]
2011-02-23 15:37:00,471 INFO  auth.AuthChallengeProcessor - ntlm authentication scheme selected
2011-02-23 15:37:00,471 DEBUG auth.AuthChallengeProcessor - Using authentication scheme: ntlm
2011-02-23 15:37:00,471 DEBUG auth.AuthChallengeProcessor - Authorization challenge processed
2011-02-23 15:37:00,509 DEBUG auth.AuthChallengeProcessor - Using authentication scheme: ntlm
2011-02-23 15:37:00,509 DEBUG auth.AuthChallengeProcessor - Authorization challenge processed
2011-02-23 15:37:01,226 INFO  fetcher.Fetcher - -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
2011-02-23 15:37:01,278 DEBUG auth.AuthChallengeProcessor - Using authentication scheme: ntlm
2011-02-23 15:37:01,278 DEBUG auth.AuthChallengeProcessor - Authorization challenge processed
2011-02-23 15:37:01,278 INFO  httpclient.HttpMethodDirector - Failure authenticating with NTLM <any realm>@tv.tms.toyota.com:80
2011-02-23 15:37:01,590 INFO  fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=0
2011-02-23 15:37:02,235 INFO  fetcher.Fetcher - -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
2011-02-23 15:37:02,236 INFO  fetcher.Fetcher - -activeThreads=0
