So it is fetched. You can also check the parse output with the ParserChecker
tool:

bin/nutch org.apache.nutch.parse.ParserChecker <url>

This also shows the outlinks.
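
As a side note on the quoted log: the "Authorization: Basic ..." header in the second request is how an HTTP client answers a 401 challenge under Basic authentication. Below is a minimal sketch in Python of how such a header value is built. It is not Nutch or httpclient code; the function name is made up for illustration, and the credentials are the same "user" / "password" placeholders as in the quoted httpclient-auth.xml.

```python
import base64


def basic_auth_header(username: str, password: str) -> str:
    """Build the value of an HTTP Basic 'Authorization' header.

    Hypothetical helper for illustration only: the value is the word
    'Basic' followed by the Base64 encoding of 'username:password'.
    """
    raw = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    return f"Basic {token}"


if __name__ == "__main__":
    # Placeholder credentials, not taken from any real server.
    print("Authorization:", basic_auth_header("user", "password"))
```

Base64 is an encoding, not encryption, so the header value can be decoded back to the user name and password by anyone who sees it.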


> Hi Markus,
>     Many thanks for your response. I'm sure protocol-httpclient is
> working. From hadoop.log, shown below, you can see that Nutch tried twice
> to crawl the protected page: the first time the crawler got a "401" error,
> then it tried a second time and got the right result:
> 
> 
>     ----- the 1st time / 401 returned ----
> 2011-09-06 16:55:38,563 org.apache.commons.httpclient.HttpMethodDirector.executeMethod:194 : DEBUG httpclient.HttpMethodDirector - Retry authentication
> 2011-09-06 16:55:38,563 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.content - << "<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">[\n]"
> 2011-09-06 16:55:38,564 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.content - << "<html><head>[\n]"
> 2011-09-06 16:55:38,564 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.content - << "<title>401 Authorization Required</title>[\n]"
> 2011-09-06 16:55:38,564 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.content - << "</head><body>[\n]"
> 2011-09-06 16:55:38,564 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.content - << "<h1>Authorization Required</h1>[\n]"
> 2011-09-06 16:55:38,564 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.content - << "<p>This server could not verify that you[\n]"
> 2011-09-06 16:55:38,564 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.content - << "are authorized to access the document[\n]"
> 2011-09-06 16:55:38,564 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.content - << "requested.  Either you supplied the wrong[\n]"
> 2011-09-06 16:55:38,564 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.content - << "credentials (e.g., bad password), or your[\n]"
> 2011-09-06 16:55:38,565 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.content - << "browser doesn't understand how to supply[\n]"
> 2011-09-06 16:55:38,565 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.content - << "the credentials required.</p>[\n]"
> 2011-09-06 16:55:38,565 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.content - << "<hr>[\n]"
> 2011-09-06 16:55:38,565 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.content - << "<address>Apache/2.2.17 (Fedora) Server at xxxx.com Port 80</address>[\n]"
> 2011-09-06 16:55:38,565 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.content - << "</body></html>[\n]"
> 
> 
>     ---- try the 2nd time ----
> 2011-09-06 16:55:38,565 org.apache.commons.httpclient.HttpMethodBase.shouldCloseConnection:1008 : DEBUG httpclient.HttpMethodBase - Should close connection in response to directive: close
> 2011-09-06 16:55:38,566 org.apache.commons.httpclient.HttpMethodDirector.authenticateHost:278 : DEBUG httpclient.HttpMethodDirector - Authenticating with BASIC 'xxxx SVN repository'@xxxx.com:80
> 2011-09-06 16:55:38,566 org.apache.commons.httpclient.params.HttpMethodParams.getCredentialCharset:384 : DEBUG params.HttpMethodParams - Credential charset not configured, using HTTP element charset
> 
>     ---- got the right page source ----
> 2011-09-06 16:55:38,815 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.header - >> "GET http://xxxx.com/dev/xxxx/ HTTP/1.0[\r][\n]"
> 2011-09-06 16:55:38,815 org.apache.commons.httpclient.HttpMethodBase.addHostRequestHeader:1352 : DEBUG httpclient.HttpMethodBase - Adding Host request header
> 2011-09-06 16:55:38,815 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.header - >> "User-Agent: nutch-1.3/Nutch-1.3[\r][\n]"
> 2011-09-06 16:55:38,815 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.header - >> "Accept-Language: en-us,en-gb,en;q=0.7,*;q=0.3[\r][\n]"
> 2011-09-06 16:55:38,816 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.header - >> "Accept-Charset: utf-8,ISO-8859-1;q=0.7,*;q=0.7[\r][\n]"
> 2011-09-06 16:55:38,816 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.header - >> "Accept: text/html,application/xml;q=0.9,application/xhtml+xml,text/xml;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5[\r][\n]"
> 2011-09-06 16:55:38,816 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.header - >> "Accept-Encoding: x-gzip, gzip, deflate[\r][\n]"
> 2011-09-06 16:55:38,816 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.header - >> "Proxy-Connection: Keep-Alive[\r][\n]"
> 2011-09-06 16:55:38,816 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.header - >> "Authorization: Basic ZW5pYXlpbjpjaGFuZ2VtZQ==[\r][\n]"
> 2011-09-06 16:55:38,816 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.header - >> "Host: xxxx.com[\r][\n]"
> 2011-09-06 16:55:38,817 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.header - >> "[\r][\n]"
> 2011-09-06 16:55:38,848 org.apache.nutch.fetcher.Fetcher.run:1038 : INFO  fetcher.Fetcher - -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> 2011-09-06 16:55:39,118 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.header - << "HTTP/1.0 200 OK[\r][\n]"
> 2011-09-06 16:55:39,118 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.header - << "HTTP/1.0 200 OK[\r][\n]"
> 2011-09-06 16:55:39,118 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.header - << "Date: Tue, 06 Sep 2011 08:55:39 GMT[\r][\n]"
> 2011-09-06 16:55:39,118 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.header - << "Server: Apache/2.2.17 (Fedora)[\r][\n]"
> 2011-09-06 16:55:39,119 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.header - << "Last-Modified: Thu, 28 Jul 2011 06:05:39 GMT[\r][\n]"
> 2011-09-06 16:55:39,119 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.header - << "ETag: W/"277655//xxxx/src"[\r][\n]"
> 2011-09-06 16:55:39,119 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.header - << "Accept-Ranges: bytes[\r][\n]"
> 2011-09-06 16:55:39,119 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.header - << "Content-Length: 528[\r][\n]"
> 2011-09-06 16:55:39,119 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.header - << "Content-Type: text/html; charset=UTF-8[\r][\n]"
> 2011-09-06 16:55:39,120 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.header - << "X-Cache: MISS from xxxx.com[\r][\n]"
> 2011-09-06 16:55:39,120 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.header - << "X-Cache-Lookup: MISS from xxxx.com:3128[\r][\n]"
> 2011-09-06 16:55:39,120 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.header - << "Via: 1.0 xxxx.com:3128 (squid/2.6.STABLE21)[\r][\n]"
> 2011-09-06 16:55:39,120 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.header - << "Proxy-Connection: keep-alive[\r][\n]"
> 2011-09-06 16:55:39,120 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.header - << "[\r][\n]"
> 2011-09-06 16:55:39,121 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.content - << "<html><head><title>dev - Revision 280006: /xxxx/src</title></head>[\n]"
> 2011-09-06 16:55:39,121 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.content - << "<body>[\n]"
> 2011-09-06 16:55:39,121 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.content - << " <h2>dev - Revision 280006: /xxxx/src</h2>[\n]"
> 2011-09-06 16:55:39,121 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.content - << " <ul>[\n]"
> 2011-09-06 16:55:39,121 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.content - << "  <li><a href="../">..</a></li>[\n]"
> 2011-09-06 16:55:39,121 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.content - << "  <li><a href="com/">com/</a></li>[\n]"
> 2011-09-06 16:55:39,121 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.content - << "  <li><a href="commons-logging.properties">commons-logging.properties</a></li>[\n]"
> 2011-09-06 16:55:39,121 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.content - << "  <li><a href="simplelog.properties">simplelog.properties</a></li>[\n]"
> 2011-09-06 16:55:39,122 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.content - << " </ul>[\n]"
> 2011-09-06 16:55:39,122 org.apache.commons.httpclient.Wire.wire:70 : DEBUG wire.content - << " <hr noshade><em>Powered by <a href="http://subversion.tigris.org/">Subversion</a> version 1.6.15 (r1038135).</em>[\n]"
> 2011-09-06 16:55:39,122 org.apache.commons.httpclient.Wire.wire:84 : DEBUG wire.content - << "</body></html>"
> 
> At 2011-09-07 19:46:28,"Markus Jelsma" <[email protected]> wrote:
> >I don't know if protocol-httpclient is still working at all. To narrow
> >down the problem check the HTTP logs of the protected server and your
> >Nutch logs.
> >
> >On Wednesday 07 September 2011 11:21:07 aceyin wrote:
> >>   Hi,
> >>     I ran into a strange problem when trying to use Nutch-1.3. I list
> >> what I did below; I hope someone can help me.
> >> 
> >> 1. Operations
> >> A. I tried to use Nutch-1.3 to crawl a web site protected by HTTP Basic
> >> authentication, but found that Nutch had not crawled anything when it
> >> finished running. After checking hadoop.log, I found the following:
> >> 2011-09-07 04:11:37,539 WARN  crawl.Generator - Generator: 0 records selected for fetching, exiting ...
> >> 2011-09-07 04:11:37,541 INFO crawl.Crawl - Stopping at depth=1 - no more URLs to fetch.
> >> I tried to find an answer with Google, but found nothing useful.
> >> B. So I changed the URL to a public site (such as www.yahoo.com) and ran
> >> the Nutch crawl again. This time Nutch worked well: all pages were
> >> crawled and indexed into Solr.
> >> 
> >> 2. Configurations
> >> The only difference between the configuration files for the two
> >> operations is the value of plugin.includes.
> >> For operation A it is:
> >> protocol-httpclient|urlfilter-regex|parse-(html|tika|text)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)
> >> For operation B it is:
> >> protocol-http|urlfilter-regex|parse-(html|tika|text)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)
> >> 
> >> A. nutch-site.xml
> >> <property>
> >>   <name>plugin.includes</name>
> >>   <value>protocol-httpclient|urlfilter-regex|parse-(html|tika|text)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
> >>   <description></description>
> >> </property>
> >> B. httpclient-auth.xml
> >> <auth-configuration>
> >> <credentials username="user" password="password">
> >> 
> >>       <default/>
> >> 
> >> </credentials>
> >> </auth-configuration>
> >> C. regex-urlfilter.txt
> >> -^(file|ftp|mailto):
> >> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
> >> -[?*!@=]
> >> +.
> >> Those are all the configurations and operations I used, but for the site
> >> protected by HTTP Basic authentication I always get that error message.
> >> Could someone help me with this?
> >> 
> >> Thanks a lot ~
> >> 
> >> //BR
