On Thu, Feb 24, 2011 at 5:23 AM, Carl Zha <[email protected]> wrote:
>
> Hi, We are using nutch 1.2 to crawl our intranet pages that require
> authentication.
>
> We followed the steps listed on nutch Wiki
>
> http://wiki.apache.org/nutch/HttpAuthenticationSchemes
>
>
> we have overridden the 'plugin.includes' property of
> 'conf/nutch-default.xml' with 'conf/nutch-site.xml' and replaced
> 'protocol-http' with 'protocol-httpclient'.
>
> content of our nutch-site.xml:
>
> <?xml version="1.0"?>
> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>
> <!-- Put site-specific property overrides in this file. -->
>
> <configuration>
> <property>
>    <name>http.robots.agents</name>
>    <value>nutch-solr-integration-test,*</value>
>    <description></description>
>  </property>
>  <property>
>    <name>http.agent.name</name>
>    <value>nutch-solr-integration-test</value>
>    <description>CPD AS Robots Name</description>
>  </property>
>  <property>
>    <name>http.agent.description</name>
>    <value>CPD Web Crawler using Nutch 1.2</value>
>    <description></description>
>  </property>
>  <property>
>    <name>http.agent.url</name>
>    <value>http://devcpd1.lexus.com/</value>
>    <description></description>
>  </property>
>  <property>
>    <name>http.agent.email</name>
>    <value>[email protected]</value>
>    <description></description>
>  </property>
>  <property>
>    <name>http.agent.version</name>
>    <value></value>
>    <description></description>
>  </property>
> <property>
>  <name>http.agent.host</name>
>  <value>10.63.48.2</value>
>  <description>Name or IP address of the host on which the Nutch crawler
>  would be running. Currently this is used by 'protocol-httpclient'
>  plugin.
>  </description>
> </property>
>  <property>
>    <name>generate.max.per.host</name>
>    <value>100</value>
>  </property>
>  <property>
>    <name>plugin.includes</name>
>
> <value>protocol-httpclient|urlfilter-regex|parse-html|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>  </property>
> <property>
>  <name>http.useHttp11</name>
>  <value>true</value>
>  <description>NOTE: at the moment this works only for protocol-httpclient.
>  If true, use HTTP 1.1, if false use HTTP 1.0 .
>  </description>
> </property>
> </configuration>
>
> content of our httpclient-auth.xml
>
> <auth-configuration>
>        <credentials username="148606" password="d1e9n7i6s">
>          <default/>
>          <authscope host="10.52.112.12" port="80" scheme="NTLM"/>
>          <authscope host="10.52.21.83" port="80" scheme="NTLM"/>
>       </credentials>
> </auth-configuration>

Since the crawler is trying to fetch pages from "tv.tms.toyota.com",
could you please try specifying "tv.tms.toyota.com" as the host
in the authscope instead of its IP address?
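
The failure line in your log names the scope as
"tv.tms.toyota.com:80", so an authscope keyed on the IP address may
simply not match it. A minimal sketch of the httpclient-auth.xml I
have in mind is below. USERNAME and PASSWORD are placeholders for the
credentials you already have; the port and scheme are copied from your
file, and I am assuming the rest of your setup stays unchanged:

```xml
<!-- Sketch only: USERNAME and PASSWORD are placeholders, not real values.
     Port 80 and scheme NTLM are taken from the original configuration. -->
<auth-configuration>
  <credentials username="USERNAME" password="PASSWORD">
    <default/>
    <!-- Host name as it appears in the crawled URL, not the IP address -->
    <authscope host="tv.tms.toyota.com" port="80" scheme="NTLM"/>
  </credentials>
</auth-configuration>
```

If other intranet hosts are crawled by name as well, each would
presumably need its own authscope line written the same way.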

>
> content of regex-urlfilter.txt
>
> # skip file: ftp: and mailto: urls
> -^(file|ftp|mailto):
>
> # skip image and other suffixes we can't yet parse
> -\.(swf|SWF|doc|DOC|mp3|MP3|WMV|wmv|txt|TXT|rtf|RTF|avi|AVI|m3u|M3U|flv|FLV|WAV|wav|mp4|MP4|avi|AVI|rss|RSS|xml|XML|pdf|PDF|js|JS|gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
>
> # skip URLs containing certain characters as probable queries, etc.
> -[*!@=]
>
> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> -.*(/[^/]+)/[^/]+\1/[^/]+\1/
>
> # accept anything else
> +.
>
>
>
> The crawler works fine against sites that require no authentication. But
> when crawling the intranet pages with authentication, it fails with the
> following message:
>
> 2011-02-23 15:37:00,214 INFO  fetcher.Fetcher - fetching
> http://tv.tms.toyota.com/toyotavision/tv_links.asp
> 2011-02-23 15:37:00,214 INFO  fetcher.Fetcher - -finishing thread
> FetcherThread, activeThreads=1
> 2011-02-23 15:37:00,215 INFO  fetcher.Fetcher - -finishing thread
> FetcherThread, activeThreads=1
> 2011-02-23 15:37:00,214 INFO  fetcher.Fetcher - -finishing thread
> FetcherThread, activeThreads=1
> 2011-02-23 15:37:00,215 INFO  fetcher.Fetcher - -finishing thread
> FetcherThread, activeThreads=1
> 2011-02-23 15:37:00,216 INFO  fetcher.Fetcher - -finishing thread
> FetcherThread, activeThreads=1
> 2011-02-23 15:37:00,216 INFO  fetcher.Fetcher - -finishing thread
> FetcherThread, activeThreads=1
> 2011-02-23 15:37:00,216 INFO  fetcher.Fetcher - -finishing thread
> FetcherThread, activeThreads=1
> 2011-02-23 15:37:00,270 INFO  httpclient.Http - http.proxy.host = null
> 2011-02-23 15:37:00,271 INFO  httpclient.Http - http.proxy.port = 8080
> 2011-02-23 15:37:00,271 INFO  httpclient.Http - http.timeout = 10000
> 2011-02-23 15:37:00,271 INFO  httpclient.Http - http.content.limit = 65536
> 2011-02-23 15:37:00,271 INFO  httpclient.Http - http.agent =
> nutch-solr-integration-test/Nutch-1.2 (CPD Web Crawler using Nutch 1.2;
> http://devcpd1.lexus.com/; [email protected])
> 2011-02-23 15:37:00,271 INFO  httpclient.Http - http.accept.language =
> en-us,en-gb,en;q=0.7,*;q=0.3
> 2011-02-23 15:37:00,271 INFO  httpclient.Http -
> protocol.plugin.check.blocking = false
> 2011-02-23 15:37:00,272 INFO  httpclient.Http - protocol.plugin.check.robots
> = false
> 2011-02-23 15:37:00,470 DEBUG auth.AuthChallengeProcessor - Supported
> authentication schemes in the order of preference: [ntlm, digest, basic]
> 2011-02-23 15:37:00,471 INFO  auth.AuthChallengeProcessor - ntlm
> authentication scheme selected
> 2011-02-23 15:37:00,471 DEBUG auth.AuthChallengeProcessor - Using
> authentication scheme: ntlm
> 2011-02-23 15:37:00,471 DEBUG auth.AuthChallengeProcessor - Authorization
> challenge processed
> 2011-02-23 15:37:00,509 DEBUG auth.AuthChallengeProcessor - Using
> authentication scheme: ntlm
> 2011-02-23 15:37:00,509 DEBUG auth.AuthChallengeProcessor - Authorization
> challenge processed
> 2011-02-23 15:37:01,226 INFO  fetcher.Fetcher - -activeThreads=1,
> spinWaiting=0, fetchQueues.totalSize=0
> 2011-02-23 15:37:01,278 DEBUG auth.AuthChallengeProcessor - Using
> authentication scheme: ntlm
> 2011-02-23 15:37:01,278 DEBUG auth.AuthChallengeProcessor - Authorization
> challenge processed
> 2011-02-23 15:37:01,278 INFO  httpclient.HttpMethodDirector - Failure
> authenticating with NTLM <any realm>@tv.tms.toyota.com:80
> 2011-02-23 15:37:01,590 INFO  fetcher.Fetcher - -finishing thread
> FetcherThread, activeThreads=0
> 2011-02-23 15:37:02,235 INFO  fetcher.Fetcher - -activeThreads=0,
> spinWaiting=0, fetchQueues.totalSize=0
> 2011-02-23 15:37:02,236 INFO  fetcher.Fetcher - -activeThreads=0
