So it is fetched. You can also check the parse output with the ParserChecker tool:

bin/nutch org.apache.nutch.parse.ParserChecker <url>

This also shows the outlinks.
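
For reference, the 401-then-retry sequence in the quoted log is the normal Basic authentication handshake: the second request carries an "Authorization: Basic ..." header, which is simply "username:password" in base64. Below is a minimal sketch of how such a header value is formed and read back. The class and method names are hypothetical and are not part of Nutch or commons-httpclient. The charset is an assumption, and java.util.Base64 needs Java 8 or later.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

/** Hypothetical helper for illustration only; not part of Nutch or commons-httpclient. */
public class BasicAuthHeader {

    /** Builds the value of an HTTP Basic "Authorization" header from a username and password. */
    public static String build(String username, String password) {
        String pair = username + ":" + password;
        // Assumption: ISO-8859-1 bytes; for plain ASCII credentials the charset makes no difference.
        String encoded = Base64.getEncoder()
                .encodeToString(pair.getBytes(StandardCharsets.ISO_8859_1));
        return "Basic " + encoded;
    }

    /** Recovers "username:password" from a header value, e.g. one copied from a wire log. */
    public static String decode(String headerValue) {
        String encoded = headerValue.substring("Basic ".length()).trim();
        return new String(Base64.getDecoder().decode(encoded), StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Same placeholder credentials as in the quoted httpclient-auth.xml.
        String header = build("user", "password");
        System.out.println("Authorization: " + header); // prints "Authorization: Basic dXNlcjpwYXNzd29yZA=="
        System.out.println(decode(header));             // prints "user:password"
    }
}
```

Since base64 is an encoding and not encryption, the decode step also shows why it is worth masking that header before posting wire logs to a list.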

> Hi Markus,
> Many thanks for your response. I'm sure protocol-httpclient is working,
> judging from hadoop.log. As you can see in the log below, Nutch tried
> twice to crawl the protected page: the first time the crawler got a
> "401" error, and the second time it got the right result:
>
> ----- the 1st time / 401 returned ----
> 2011-09-06 16:55:38,563 org.apache.commons.httpclient.HttpMethodDirector.executeMethod:194 : DEBUG httpclient.HttpMethodDirector - Retry authentication
> 2011-09-06 16:55:38,563 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.content - << "<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">[\n]"
> 2011-09-06 16:55:38,564 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.content - << "<html><head>[\n]"
> 2011-09-06 16:55:38,564 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.content - << "<title>401 Authorization Required</title>[\n]"
> 2011-09-06 16:55:38,564 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.content - << "</head><body>[\n]"
> 2011-09-06 16:55:38,564 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.content - << "<h1>Authorization Required</h1>[\n]"
> 2011-09-06 16:55:38,564 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.content - << "<p>This server could not verify that you[\n]"
> 2011-09-06 16:55:38,564 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.content - << "are authorized to access the document[\n]"
> 2011-09-06 16:55:38,564 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.content - << "requested. Either you supplied the wrong[\n]"
> 2011-09-06 16:55:38,564 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.content - << "credentials (e.g., bad password), or your[\n]"
> 2011-09-06 16:55:38,565 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.content - << "browser doesn't understand how to supply[\n]"
> 2011-09-06 16:55:38,565 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.content - << "the credentials required.</p>[\n]"
> 2011-09-06 16:55:38,565 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.content - << "<hr>[\n]"
> 2011-09-06 16:55:38,565 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.content - << "<address>Apache/2.2.17 (Fedora) Server at xxxx.com Port 80</address>[\n]"
> 2011-09-06 16:55:38,565 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.content - << "</body></html>[\n]"
>
> ---- try the 2nd time ----
> 2011-09-06 16:55:38,565 org.apache.commons.httpclient.HttpMethodBase.shouldCloseConnection:1008 : DEBUG httpclient.HttpMethodBase - Should close connection in response to directive: close
> 2011-09-06 16:55:38,566 org.apache.commons.httpclient.HttpMethodDirector.authenticateHost:278 : DEBUG httpclient.HttpMethodDirector - Authenticating with BASIC 'xxxx SVN repository'@xxxx.com:80
> 2011-09-06 16:55:38,566 org.apache.commons.httpclient.params.HttpMethodParams.getCredentialCharset:384 : DEBUG params.HttpMethodParams - Credential charset not configured, using HTTP element charset
>
> ---- got the right page source ----
> 2011-09-06 16:55:38,815 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.header - >> "GET http://xxxx.com/dev/xxxx/ HTTP/1.0[\r][\n]"
> 2011-09-06 16:55:38,815 org.apache.commons.httpclient.HttpMethodBase.addHostRequestHeader:1352 : DEBUG httpclient.HttpMethodBase - Adding Host request header
> 2011-09-06 16:55:38,815 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.header - >> "User-Agent: nutch-1.3/Nutch-1.3[\r][\n]"
> 2011-09-06 16:55:38,815 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.header - >> "Accept-Language: en-us,en-gb,en;q=0.7,*;q=0.3[\r][\n]"
> 2011-09-06 16:55:38,816 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.header - >> "Accept-Charset: utf-8,ISO-8859-1;q=0.7,*;q=0.7[\r][\n]"
> 2011-09-06 16:55:38,816 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.header - >> "Accept: text/html,application/xml;q=0.9,application/xhtml+xml,text/xml;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5[\r][\n]"
> 2011-09-06 16:55:38,816 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.header - >> "Accept-Encoding: x-gzip, gzip, deflate[\r][\n]"
> 2011-09-06 16:55:38,816 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.header - >> "Proxy-Connection: Keep-Alive[\r][\n]"
> 2011-09-06 16:55:38,816 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.header - >> "Authorization: Basic ZW5pYXlpbjpjaGFuZ2VtZQ==[\r][\n]"
> 2011-09-06 16:55:38,816 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.header - >> "Host: xxxx.com[\r][\n]"
> 2011-09-06 16:55:38,817 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.header - >> "[\r][\n]"
> 2011-09-06 16:55:38,848 org.apache.nutch.fetcher.Fetcher.run:1038 : INFO fetcher.Fetcher - -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> 2011-09-06 16:55:39,118 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.header - << "HTTP/1.0 200 OK[\r][\n]"
> 2011-09-06 16:55:39,118 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.header - << "HTTP/1.0 200 OK[\r][\n]"
> 2011-09-06 16:55:39,118 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.header - << "Date: Tue, 06 Sep 2011 08:55:39 GMT[\r][\n]"
> 2011-09-06 16:55:39,118 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.header - << "Server: Apache/2.2.17 (Fedora)[\r][\n]"
> 2011-09-06 16:55:39,119 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.header - << "Last-Modified: Thu, 28 Jul 2011 06:05:39 GMT[\r][\n]"
> 2011-09-06 16:55:39,119 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.header - << "ETag: W/"277655//xxxx/src"[\r][\n]"
> 2011-09-06 16:55:39,119 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.header - << "Accept-Ranges: bytes[\r][\n]"
> 2011-09-06 16:55:39,119 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.header - << "Content-Length: 528[\r][\n]"
> 2011-09-06 16:55:39,119 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.header - << "Content-Type: text/html; charset=UTF-8[\r][\n]"
> 2011-09-06 16:55:39,120 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.header - << "X-Cache: MISS from xxxx.com[\r][\n]"
> 2011-09-06 16:55:39,120 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.header - << "X-Cache-Lookup: MISS from xxxx.com:3128[\r][\n]"
> 2011-09-06 16:55:39,120 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.header - << "Via: 1.0 xxxx.com:3128 (squid/2.6.STABLE21)[\r][\n]"
> 2011-09-06 16:55:39,120 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.header - << "Proxy-Connection: keep-alive[\r][\n]"
> 2011-09-06 16:55:39,120 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.header - << "[\r][\n]"
> 2011-09-06 16:55:39,121 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.content - << "<html><head><title>dev - Revision 280006: /xxxx/src</title></head>[\n]"
> 2011-09-06 16:55:39,121 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.content - << "<body>[\n]"
> 2011-09-06 16:55:39,121 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.content - << " <h2>dev - Revision 280006: /xxxx/src</h2>[\n]"
> 2011-09-06 16:55:39,121 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.content - << " <ul>[\n]"
> 2011-09-06 16:55:39,121 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.content - << " <li><a href="../">..</a></li>[\n]"
> 2011-09-06 16:55:39,121 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.content - << " <li><a href="com/">com/</a></li>[\n]"
> 2011-09-06 16:55:39,121 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.content - << " <li><a href="commons-logging.properties">commons-logging.properties</a></li>[\n]"
> 2011-09-06 16:55:39,121 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.content - << " <li><a href="simplelog.properties">simplelog.properties</a></li>[\n]"
> 2011-09-06 16:55:39,122 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.content - << " </ul>[\n]"
> 2011-09-06 16:55:39,122 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.content - << " <hr noshade><em>Powered by <a href="http://subversion.tigris.org/">Subversion</a> version 1.6.15 (r1038135).</em>[\n]"
> 2011-09-06 16:55:39,122 org.apache.commons.httpclient.Wire.wire:84 : DEBUG wire.content - << "</body></html>"
>
> At 2011-09-07 19:46:28, "Markus Jelsma" wrote:
> >I don't know if protocol-httpclient is still working at all. To narrow
> >down the problem check the HTTP logs of the protected server and your
> >Nutch logs.
> >
> >On Wednesday 07 September 2011 11:21:07 aceyin wrote:
> >> Hi:
> >> I ran into a strange problem when I tried to use Nutch-1.3. I list
> >> what I did below; I hope someone can help me:
> >>
> >> 1. Operations
> >> A. I tried to use Nutch-1.3 to crawl a web site which is protected by
> >> "Basic HTTP authorize", but found that Nutch had not crawled anything
> >> after it finished running. After checking hadoop.log, I found the
> >> following:
> >> 2011-09-07 04:11:37,539 WARN crawl.Generator - Generator: 0 records selected for fetching, exiting ...
> >> 2011-09-07 04:11:37,541 INFO crawl.Crawl - Stopping at depth=1 - no more URLs to fetch.
> >> I tried to find an answer with Google, but found no useful information.
> >> B. So I changed the URL to a public site (such as www.yahoo.com) and
> >> ran the Nutch crawl again. This time Nutch worked well: all pages were
> >> crawled and indexed into Solr.
> >>
> >> 2. Configurations
> >> The only difference between the configuration files for the two
> >> operations is the plugin.includes value.
> >> For operation A it is:
> >> protocol-httpclient|urlfilter-regex|parse-(html|tika|text)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)
> >> For operation B it is:
> >> protocol-http|urlfilter-regex|parse-(html|tika|text)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)
> >>
> >> A. nutch-site.xml
> >> <property>
> >>   <name>plugin.includes</name>
> >>   <value>protocol-httpclient|urlfilter-regex|parse-(html|tika|text)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
> >>   <description></description>
> >> </property>
> >> B. httpclient-auth.xml
> >> <auth-configuration>
> >>   <credentials username="user" password="password">
> >>     <default/>
> >>   </credentials>
> >> </auth-configuration>
> >> C. regex-urlfilter.txt
> >> -^(file|ftp|mailto):
> >> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
> >> -[?*!@=]
> >> +.
> >> That's all the configuration and operations I used, but for the site
> >> protected by "Basic HTTP authorize" I always got the error message.
> >> Could someone help me with this?
> >>
> >> Thanks a lot ~
> >>
> >> //BR

