Hi, we are using Nutch 1.2 to crawl our intranet pages, which require authentication.
We followed the steps on the Nutch wiki (http://wiki.apache.org/nutch/HttpAuthenticationSchemes): we overrode the 'plugin.includes' property from 'conf/nutch-default.xml' in 'conf/nutch-site.xml', replacing 'protocol-http' with 'protocol-httpclient'.

Content of our nutch-site.xml:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>http.robots.agents</name>
    <value>nutch-solr-integration-test,*</value>
    <description></description>
  </property>
  <property>
    <name>http.agent.name</name>
    <value>nutch-solr-integration-test</value>
    <description>CPD AS Robots Name</description>
  </property>
  <property>
    <name>http.agent.description</name>
    <value>CPD Web Crawler using Nutch 1.2</value>
    <description></description>
  </property>
  <property>
    <name>http.agent.url</name>
    <value>http://devcpd1.lexus.com/</value>
    <description></description>
  </property>
  <property>
    <name>http.agent.email</name>
    <value>[email protected]</value>
    <description></description>
  </property>
  <property>
    <name>http.agent.version</name>
    <value></value>
    <description></description>
  </property>
  <property>
    <name>http.agent.host</name>
    <value>10.63.48.2</value>
    <description>Name or IP address of the host on which the Nutch crawler would be running. Currently this is used by 'protocol-httpclient' plugin.</description>
  </property>
  <property>
    <name>generate.max.per.host</name>
    <value>100</value>
  </property>
  <property>
    <name>plugin.includes</name>
    <value>protocol-httpclient|urlfilter-regex|parse-html|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>
  <property>
    <name>http.useHttp11</name>
    <value>true</value>
    <description>NOTE: at the moment this works only for protocol-httpclient. If true, use HTTP 1.1, if false use HTTP 1.0.</description>
  </property>
</configuration>

Content of our httpclient-auth.xml:

<auth-configuration>
  <credentials username="148606" password="d1e9n7i6s">
    <default/>
    <authscope host="10.52.112.12" port="80" scheme="NTLM"/>
    <authscope host="10.52.21.83" port="80" scheme="NTLM"/>
  </credentials>
</auth-configuration>

Content of regex-urlfilter.txt:

# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(swf|SWF|doc|DOC|mp3|MP3|WMV|wmv|txt|TXT|rtf|RTF|avi|AVI|m3u|M3U|flv|FLV|WAV|wav|mp4|MP4|avi|AVI|rss|RSS|xml|XML|pdf|PDF|js|JS|gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

# skip URLs containing certain characters as probable queries, etc.
-[*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept anything else
+.

The crawler works fine against sites that need no authentication. But when crawling the intranet pages that require authentication, it fails with the following messages:

2011-02-23 15:37:00,214 INFO fetcher.Fetcher - fetching http://tv.tms.toyota.com/toyotavision/tv_links.asp
2011-02-23 15:37:00,214 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
2011-02-23 15:37:00,215 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
2011-02-23 15:37:00,214 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
2011-02-23 15:37:00,215 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
2011-02-23 15:37:00,216 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
2011-02-23 15:37:00,216 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
2011-02-23 15:37:00,216 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
2011-02-23 15:37:00,270 INFO httpclient.Http - http.proxy.host = null
2011-02-23 15:37:00,271 INFO httpclient.Http - http.proxy.port = 8080
2011-02-23 15:37:00,271 INFO httpclient.Http - http.timeout = 10000
2011-02-23 15:37:00,271 INFO httpclient.Http - http.content.limit = 65536
2011-02-23 15:37:00,271 INFO httpclient.Http - http.agent = nutch-solr-integration-test/Nutch-1.2 (CPD Web Crawler using Nutch 1.2; http://devcpd1.lexus.com/; [email protected])
2011-02-23 15:37:00,271 INFO httpclient.Http - http.accept.language = en-us,en-gb,en;q=0.7,*;q=0.3
2011-02-23 15:37:00,271 INFO httpclient.Http - protocol.plugin.check.blocking = false
2011-02-23 15:37:00,272 INFO httpclient.Http - protocol.plugin.check.robots = false
2011-02-23 15:37:00,470 DEBUG auth.AuthChallengeProcessor - Supported authentication schemes in the order of preference: [ntlm, digest, basic]
2011-02-23 15:37:00,471 INFO auth.AuthChallengeProcessor - ntlm authentication scheme selected
2011-02-23 15:37:00,471 DEBUG auth.AuthChallengeProcessor - Using authentication scheme: ntlm
2011-02-23 15:37:00,471 DEBUG auth.AuthChallengeProcessor - Authorization challenge processed
2011-02-23 15:37:00,509 DEBUG auth.AuthChallengeProcessor - Using authentication scheme: ntlm
2011-02-23 15:37:00,509 DEBUG auth.AuthChallengeProcessor - Authorization challenge processed
2011-02-23 15:37:01,226 INFO fetcher.Fetcher - -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
2011-02-23 15:37:01,278 DEBUG auth.AuthChallengeProcessor - Using authentication scheme: ntlm
2011-02-23 15:37:01,278 DEBUG auth.AuthChallengeProcessor - Authorization challenge processed
2011-02-23 15:37:01,278 INFO httpclient.HttpMethodDirector - Failure authenticating with NTLM <any realm>@tv.tms.toyota.com:80
2011-02-23 15:37:01,590 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=0
2011-02-23 15:37:02,235 INFO fetcher.Fetcher - -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
2011-02-23 15:37:02,236 INFO fetcher.Fetcher - -activeThreads=0
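One thing that may be worth checking: the log reports the failure for tv.tms.toyota.com:80, while the authscope entries in httpclient-auth.xml list IP addresses. The rough Python sketch below shows whether a given host and port is listed in an explicit authscope or would fall back to the <default/> entry. The helper name describe_scope is made up for illustration, the credentials are placeholders, and the matching is a plain string comparison, which is only an assumption and not necessarily how protocol-httpclient resolves scopes.

```python
import xml.etree.ElementTree as ET

# Same structure as the httpclient-auth.xml above; credentials are placeholders.
AUTH_XML = """
<auth-configuration>
  <credentials username="USER" password="PASS">
    <default/>
    <authscope host="10.52.112.12" port="80" scheme="NTLM"/>
    <authscope host="10.52.21.83" port="80" scheme="NTLM"/>
  </credentials>
</auth-configuration>
"""


def describe_scope(auth_xml, host, port):
    """Return 'explicit', 'default', or 'none' for the given host and port.

    Assumption: host and port are compared as literal strings. This is a
    diagnostic aid only, not a reimplementation of the plugin's matching.
    """
    root = ET.fromstring(auth_xml)
    has_default = False
    for cred in root.findall("credentials"):
        for scope in cred.findall("authscope"):
            if scope.get("host") == host and scope.get("port") == str(port):
                return "explicit"
        if cred.find("default") is not None:
            has_default = True
    return "default" if has_default else "none"


if __name__ == "__main__":
    # Host name taken from the failing log line.
    print(describe_scope(AUTH_XML, "tv.tms.toyota.com", 80))  # default
    # One of the IP addresses listed in an authscope element.
    print(describe_scope(AUTH_XML, "10.52.112.12", 80))       # explicit
```

With the configuration above, the failing host name is not listed in any authscope, so under this assumption only the <default/> entry would apply to it. That entry carries no scheme or realm attribute, so that may be a place to look.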

