On Thu, Feb 24, 2011 at 5:23 AM, Carl Zha <[email protected]> wrote:
>
> Hi, We are using nutch 1.2 to crawl our intranet pages that require
> authentication.
>
> We followed the steps listed on nutch Wiki
>
> http://wiki.apache.org/nutch/HttpAuthenticationSchemes
>
>
> we have overridden the 'plugin.includes' property of
> 'conf/nutch-default.xml' with 'conf/nutch-site.xml' and replaced
> 'protocol-http' with 'protocol-httpclient'.
>
> content of our nutch-site.xml:
>
> <?xml version="1.0"?>
> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>
> <!-- Put site-specific property overrides in this file. -->
>
> <configuration>
>   <property>
>     <name>http.robots.agents</name>
>     <value>nutch-solr-integration-test,*</value>
>     <description></description>
>   </property>
>   <property>
>     <name>http.agent.name</name>
>     <value>nutch-solr-integration-test</value>
>     <description>CPD AS Robots Name</description>
>   </property>
>   <property>
>     <name>http.agent.description</name>
>     <value>CPD Web Crawler using Nutch 1.2</value>
>     <description></description>
>   </property>
>   <property>
>     <name>http.agent.url</name>
>     <value>http://devcpd1.lexus.com/</value>
>     <description></description>
>   </property>
>   <property>
>     <name>http.agent.email</name>
>     <value>[email protected]</value>
>     <description></description>
>   </property>
>   <property>
>     <name>http.agent.version</name>
>     <value></value>
>     <description></description>
>   </property>
>   <property>
>     <name>http.agent.host</name>
>     <value>10.63.48.2</value>
>     <description>Name or IP address of the host on which the Nutch crawler
>     would be running. Currently this is used by 'protocol-httpclient'
>     plugin.
>     </description>
>   </property>
>   <property>
>     <name>generate.max.per.host</name>
>     <value>100</value>
>   </property>
>   <property>
>     <name>plugin.includes</name>
>     <value>protocol-httpclient|urlfilter-regex|parse-html|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>   </property>
>   <property>
>     <name>http.useHttp11</name>
>     <value>true</value>
>     <description>NOTE: at the moment this works only for protocol-httpclient.
>     If true, use HTTP 1.1, if false use HTTP 1.0 .
>     </description>
>   </property>
> </configuration>
>
> content of our httpclient-auth.xml
>
> <auth-configuration>
>   <credentials username="148606" password="d1e9n7i6s">
>     <default/>
>     <authscope host="10.52.112.12" port="80" scheme="NTLM"/>
>     <authscope host="10.52.21.83" port="80" scheme="NTLM"/>
>   </credentials>
> </auth-configuration>

Since the crawler is trying to fetch pages from "tv.tms.toyota.com",
could you please try specifying "tv.tms.toyota.com" as the host instead
of its IP address?

>
> content of regex-urlfilter.txt
>
> # skip file: ftp: and mailto: urls
> -^(file|ftp|mailto):
>
> # skip image and other suffixes we can't yet parse
> -\.(swf|SWF|doc|DOC|mp3|MP3|WMV|wmv|txt|TXT|rtf|RTF|avi|AVI|m3u|M3U|flv|FLV|WAV|wav|mp4|MP4|avi|AVI|rss|RSS|xml|XML|pdf|PDF|js|JS|gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
>
> # skip URLs containing certain characters as probable queries, etc.
> -[*!@=]
>
> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> -.*(/[^/]+)/[^/]+\1/[^/]+\1/
>
> # accept anything else
> +.
>
>
>
> The crawler works against sites that with no authentication just fine. But
> when crawling the intranet pages with authentication, it fails with
> following message:
>
> 2011-02-23 15:37:00,214 INFO fetcher.Fetcher - fetching http://tv.tms.toyota.com/toyotavision/tv_links.asp
> 2011-02-23 15:37:00,214 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
> 2011-02-23 15:37:00,215 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
> 2011-02-23 15:37:00,214 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
> 2011-02-23 15:37:00,215 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
> 2011-02-23 15:37:00,216 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
> 2011-02-23 15:37:00,216 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
> 2011-02-23 15:37:00,216 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=1
> 2011-02-23 15:37:00,270 INFO httpclient.Http - http.proxy.host = null
> 2011-02-23 15:37:00,271 INFO httpclient.Http - http.proxy.port = 8080
> 2011-02-23 15:37:00,271 INFO httpclient.Http - http.timeout = 10000
> 2011-02-23 15:37:00,271 INFO httpclient.Http - http.content.limit = 65536
> 2011-02-23 15:37:00,271 INFO httpclient.Http - http.agent = nutch-solr-integration-test/Nutch-1.2 (CPD Web Crawler using Nutch 1.2; http://devcpd1.lexus.com/; [email protected])
> 2011-02-23 15:37:00,271 INFO httpclient.Http - http.accept.language = en-us,en-gb,en;q=0.7,*;q=0.3
> 2011-02-23 15:37:00,271 INFO httpclient.Http - protocol.plugin.check.blocking = false
> 2011-02-23 15:37:00,272 INFO httpclient.Http - protocol.plugin.check.robots = false
> 2011-02-23 15:37:00,470 DEBUG auth.AuthChallengeProcessor - Supported authentication schemes in the order of preference: [ntlm, digest, basic]
> 2011-02-23 15:37:00,471 INFO auth.AuthChallengeProcessor - ntlm authentication scheme selected
> 2011-02-23 15:37:00,471 DEBUG auth.AuthChallengeProcessor - Using authentication scheme: ntlm
> 2011-02-23 15:37:00,471 DEBUG auth.AuthChallengeProcessor - Authorization challenge processed
> 2011-02-23 15:37:00,509 DEBUG auth.AuthChallengeProcessor - Using authentication scheme: ntlm
> 2011-02-23 15:37:00,509 DEBUG auth.AuthChallengeProcessor - Authorization challenge processed
> 2011-02-23 15:37:01,226 INFO fetcher.Fetcher - -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> 2011-02-23 15:37:01,278 DEBUG auth.AuthChallengeProcessor - Using authentication scheme: ntlm
> 2011-02-23 15:37:01,278 DEBUG auth.AuthChallengeProcessor - Authorization challenge processed
> 2011-02-23 15:37:01,278 INFO httpclient.HttpMethodDirector - Failure authenticating with NTLM <any realm>@tv.tms.toyota.com:80
> 2011-02-23 15:37:01,590 INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=0
> 2011-02-23 15:37:02,235 INFO fetcher.Fetcher - -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> 2011-02-23 15:37:02,236 INFO fetcher.Fetcher - -activeThreads=0
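
To illustrate the hostname suggestion, the httpclient-auth.xml might look
roughly like the sketch below. It assumes the server issues its NTLM
challenge as tv.tms.toyota.com on port 80, which is what the "Failure
authenticating with NTLM <any realm>@tv.tms.toyota.com:80" log line
suggests. USERNAME and PASSWORD are placeholders for your own credentials,
and this is untested against your setup.

```xml
<?xml version="1.0"?>
<!-- Sketch only: host and port are taken from the fetch log above;
     USERNAME and PASSWORD are placeholders, not real values. -->
<auth-configuration>
  <credentials username="USERNAME" password="PASSWORD">
    <!-- Match the scope by the hostname the crawler actually requests,
         rather than by the server's IP address. -->
    <authscope host="tv.tms.toyota.com" port="80" scheme="NTLM"/>
  </credentials>
</auth-configuration>
```

The idea is that the scope is compared against the host name in the URL
being fetched, so an entry keyed on the IP address may never match a
request made to tv.tms.toyota.com.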

