Hi Markus,
    Many thanks for your response. I'm sure protocol-httpclient is working, 
judging from hadoop.log.
    As you can see in the log below, Nutch tried twice to crawl the 
protected page: 
    the first time the crawler got a "401" error, then it tried a second time 
and got the right result:


    ----- the 1st time / 401 returned ----
    2011-09-06 16:55:38,563 
org.apache.commons.httpclient.HttpMethodDirector.executeMethod:194 : DEBUG 
httpclient.HttpMethodDirector - Retry authentication
    2011-09-06 16:55:38,563 org.apache.commons.httpclient.Wire.wire:70 : DEBUG 
wire.content - << "<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">[\n]"
    2011-09-06 16:55:38,564 org.apache.commons.httpclient.Wire.wire:70 : DEBUG 
wire.content - << "<html><head>[\n]"
    2011-09-06 16:55:38,564 org.apache.commons.httpclient.Wire.wire:70 : DEBUG 
wire.content - << "<title>401 Authorization Required</title>[\n]"
    2011-09-06 16:55:38,564 org.apache.commons.httpclient.Wire.wire:70 : DEBUG 
wire.content - << "</head><body>[\n]"
    2011-09-06 16:55:38,564 org.apache.commons.httpclient.Wire.wire:70 : DEBUG 
wire.content - << "<h1>Authorization Required</h1>[\n]"
    2011-09-06 16:55:38,564 org.apache.commons.httpclient.Wire.wire:70 : DEBUG 
wire.content - << "<p>This server could not verify that you[\n]"
    2011-09-06 16:55:38,564 org.apache.commons.httpclient.Wire.wire:70 : DEBUG 
wire.content - << "are authorized to access the document[\n]"
    2011-09-06 16:55:38,564 org.apache.commons.httpclient.Wire.wire:70 : DEBUG 
wire.content - << "requested.  Either you supplied the wrong[\n]"
    2011-09-06 16:55:38,564 org.apache.commons.httpclient.Wire.wire:70 : DEBUG 
wire.content - << "credentials (e.g., bad password), or your[\n]"
    2011-09-06 16:55:38,565 org.apache.commons.httpclient.Wire.wire:70 : DEBUG 
wire.content - << "browser doesn't understand how to supply[\n]"
    2011-09-06 16:55:38,565 org.apache.commons.httpclient.Wire.wire:70 : DEBUG 
wire.content - << "the credentials required.</p>[\n]"
    2011-09-06 16:55:38,565 org.apache.commons.httpclient.Wire.wire:70 : DEBUG 
wire.content - << "<hr>[\n]"
    2011-09-06 16:55:38,565 org.apache.commons.httpclient.Wire.wire:70 : DEBUG 
wire.content - << "<address>Apache/2.2.17 (Fedora) Server at xxxx.com Port 
80</address>        [\n]"
    2011-09-06 16:55:38,565 org.apache.commons.httpclient.Wire.wire:70 : DEBUG 
wire.content - << "</body></html>[\n]"


    ---- try the 2nd time ----
2011-09-06 16:55:38,565 
org.apache.commons.httpclient.HttpMethodBase.shouldCloseConnection:1008 : DEBUG 
httpclient.HttpMethodBase - Should close connection in response to directive: 
close
2011-09-06 16:55:38,566 
org.apache.commons.httpclient.HttpMethodDirector.authenticateHost:278 : DEBUG 
httpclient.HttpMethodDirector - Authenticating with BASIC 'xxxx SVN 
repository'@xxxx.com:80
2011-09-06 16:55:38,566 
org.apache.commons.httpclient.params.HttpMethodParams.getCredentialCharset:384 
: DEBUG params.HttpMethodParams - Credential charset not configured, using HTTP 
element charset
    ---- got the right page source ----
2011-09-06 16:55:38,815 org.apache.commons.httpclient.Wire.wire:70 : DEBUG 
wire.header - >> "GET http://xxxx.com/dev/xxxx/ HTTP/1.0[\r][\n]"
2011-09-06 16:55:38,815 
org.apache.commons.httpclient.HttpMethodBase.addHostRequestHeader:1352 : DEBUG 
httpclient.HttpMethodBase - Adding Host request header
2011-09-06 16:55:38,815 org.apache.commons.httpclient.Wire.wire:70 : DEBUG 
wire.header - >> "User-Agent: nutch-1.3/Nutch-1.3[\r][\n]"
2011-09-06 16:55:38,815 org.apache.commons.httpclient.Wire.wire:70 : DEBUG 
wire.header - >> "Accept-Language: en-us,en-gb,en;q=0.7,*;q=0.3[\r][\n]"
2011-09-06 16:55:38,816 org.apache.commons.httpclient.Wire.wire:70 : DEBUG 
wire.header - >> "Accept-Charset: utf-8,ISO-8859-1;q=0.7,*;q=0.7[\r][\n]"
2011-09-06 16:55:38,816 org.apache.commons.httpclient.Wire.wire:70 : DEBUG 
wire.header - >> "Accept: 
text/html,application/xml;q=0.9,application/xhtml+xml,text/xml;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5[\r][\n]"
2011-09-06 16:55:38,816 org.apache.commons.httpclient.Wire.wire:70 : DEBUG 
wire.header - >> "Accept-Encoding: x-gzip, gzip, deflate[\r][\n]"
2011-09-06 16:55:38,816 org.apache.commons.httpclient.Wire.wire:70 : DEBUG 
wire.header - >> "Proxy-Connection: Keep-Alive[\r][\n]"
2011-09-06 16:55:38,816 org.apache.commons.httpclient.Wire.wire:70 : DEBUG 
wire.header - >> "Authorization: Basic ZW5pYXlpbjpjaGFuZ2VtZQ==[\r][\n]"
2011-09-06 16:55:38,816 org.apache.commons.httpclient.Wire.wire:70 : DEBUG 
wire.header - >> "Host: xxxx.com[\r][\n]"
2011-09-06 16:55:38,817 org.apache.commons.httpclient.Wire.wire:70 : DEBUG 
wire.header - >> "[\r][\n]"
2011-09-06 16:55:38,848 org.apache.nutch.fetcher.Fetcher.run:1038 : INFO  
fetcher.Fetcher - -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
2011-09-06 16:55:39,118 org.apache.commons.httpclient.Wire.wire:70 : DEBUG 
wire.header - << "HTTP/1.0 200 OK[\r][\n]"
2011-09-06 16:55:39,118 org.apache.commons.httpclient.Wire.wire:70 : DEBUG 
wire.header - << "HTTP/1.0 200 OK[\r][\n]"
2011-09-06 16:55:39,118 org.apache.commons.httpclient.Wire.wire:70 : DEBUG 
wire.header - << "Date: Tue, 06 Sep 2011 08:55:39 GMT[\r][\n]"
2011-09-06 16:55:39,118 org.apache.commons.httpclient.Wire.wire:70 : DEBUG 
wire.header - << "Server: Apache/2.2.17 (Fedora)[\r][\n]"
2011-09-06 16:55:39,119 org.apache.commons.httpclient.Wire.wire:70 : DEBUG 
wire.header - << "Last-Modified: Thu, 28 Jul 2011 06:05:39 GMT[\r][\n]"
2011-09-06 16:55:39,119 org.apache.commons.httpclient.Wire.wire:70 : DEBUG 
wire.header - << "ETag: W/"277655//xxxx/src"[\r][\n]"
2011-09-06 16:55:39,119 org.apache.commons.httpclient.Wire.wire:70 : DEBUG 
wire.header - << "Accept-Ranges: bytes[\r][\n]"
2011-09-06 16:55:39,119 org.apache.commons.httpclient.Wire.wire:70 : DEBUG 
wire.header - << "Content-Length: 528[\r][\n]"
2011-09-06 16:55:39,119 org.apache.commons.httpclient.Wire.wire:70 : DEBUG 
wire.header - << "Content-Type: text/html; charset=UTF-8[\r][\n]"
2011-09-06 16:55:39,120 org.apache.commons.httpclient.Wire.wire:70 : DEBUG 
wire.header - << "X-Cache: MISS from xxxx.com[\r][\n]"
2011-09-06 16:55:39,120 org.apache.commons.httpclient.Wire.wire:70 : DEBUG 
wire.header - << "X-Cache-Lookup: MISS from xxxx.com:3128[\r][\n]"
2011-09-06 16:55:39,120 org.apache.commons.httpclient.Wire.wire:70 : DEBUG 
wire.header - << "Via: 1.0 xxxx.com:3128 (squid/2.6.STABLE21)[\r][\n]"
2011-09-06 16:55:39,120 org.apache.commons.httpclient.Wire.wire:70 : DEBUG 
wire.header - << "Proxy-Connection: keep-alive[\r][\n]"
2011-09-06 16:55:39,120 org.apache.commons.httpclient.Wire.wire:70 : DEBUG 
wire.header - << "[\r][\n]"
2011-09-06 16:55:39,121 org.apache.commons.httpclient.Wire.wire:70 : DEBUG 
wire.content - << "<html><head><title>dev - Revision 280006: 
/xxxx/src</title></head>[\n]"
2011-09-06 16:55:39,121 org.apache.commons.httpclient.Wire.wire:70 : DEBUG 
wire.content - << "<body>[\n]"
2011-09-06 16:55:39,121 org.apache.commons.httpclient.Wire.wire:70 : DEBUG 
wire.content - << " <h2>dev - Revision 280006: /xxxx/src</h2>[\n]"
2011-09-06 16:55:39,121 org.apache.commons.httpclient.Wire.wire:70 : DEBUG 
wire.content - << " <ul>[\n]"
2011-09-06 16:55:39,121 org.apache.commons.httpclient.Wire.wire:70 : DEBUG 
wire.content - << "  <li><a href="../">..</a></li>[\n]"
2011-09-06 16:55:39,121 org.apache.commons.httpclient.Wire.wire:70 : DEBUG 
wire.content - << "  <li><a href="com/">com/</a></li>[\n]"
2011-09-06 16:55:39,121 org.apache.commons.httpclient.Wire.wire:70 : DEBUG 
wire.content - << "  <li><a 
href="commons-logging.properties">commons-logging.properties</a></li>[\n]"
2011-09-06 16:55:39,121 org.apache.commons.httpclient.Wire.wire:70 : DEBUG 
wire.content - << "  <li><a 
href="simplelog.properties">simplelog.properties</a></li>[\n]"
2011-09-06 16:55:39,122 org.apache.commons.httpclient.Wire.wire:70 : DEBUG 
wire.content - << " </ul>[\n]"
2011-09-06 16:55:39,122 org.apache.commons.httpclient.Wire.wire:70 : DEBUG 
wire.content - << " <hr noshade><em>Powered by <a 
href="http://subversion.tigris.org/">Subversion</a> version 1.6.15 
(r1038135).</em>[\n]"
2011-09-06 16:55:39,122 org.apache.commons.httpclient.Wire.wire:84 : DEBUG 
wire.content - << "</body></html>"
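
For reference, the "Authorization: Basic ..." header in the second request is 
just the Base64 encoding of "username:password", so it can be checked by hand. 
Below is a minimal Python sketch of that check. It assumes the placeholder 
credentials "user" / "password" from httpclient-auth.xml, and the function 
names are my own (they are not part of Nutch or HttpClient):

```python
import base64
from typing import Tuple


def basic_auth_header(username: str, password: str) -> str:
    """Build the value of an HTTP Basic 'Authorization' header."""
    raw = f"{username}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


def parse_basic_auth_header(value: str) -> Tuple[str, str]:
    """Split a Basic 'Authorization' header value back into (user, password)."""
    scheme, _, token = value.partition(" ")
    if scheme.lower() != "basic":
        raise ValueError("not a Basic Authorization header")
    decoded = base64.b64decode(token).decode("utf-8")
    # Only the first ':' separates the user name from the password.
    user, _, password = decoded.partition(":")
    return user, password


if __name__ == "__main__":
    # Placeholder credentials, as in httpclient-auth.xml.
    print(basic_auth_header("user", "password"))
```

Decoding the header value from the wire log with parse_basic_auth_header() 
should give back the user name and password configured in 
httpclient-auth.xml if the credentials were picked up correctly.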





At 2011-09-07 19:46:28,"Markus Jelsma" <[email protected]> wrote:
>I don't know if protocol-httpclient is still working at all. To narrow down 
>the problem check the HTTP logs of the protected server and your Nutch logs.
>
>On Wednesday 07 September 2011 11:21:07 aceyin wrote:
>>   Hi,
>>     I ran into a strange problem when trying to use Nutch-1.3. I list what
>> I did below and hope someone can help me.
>> 
>> 1. Operations
>> A. I tried to use Nutch-1.3 to crawl a web site which is protected by HTTP
>> Basic authentication, but found that Nutch had not crawled anything when
>> it finished running. After checking hadoop.log, I found the following:
>>   2011-09-07 04:11:37,539 WARN  crawl.Generator - Generator: 0 records selected for fetching, exiting ...
>>   2011-09-07 04:11:37,541 INFO  crawl.Crawl - Stopping at depth=1 - no more URLs to fetch.
>> I tried to find an answer with Google, but got no useful information.
>> B. So I changed the URL to a public site (such as www.yahoo.com) and ran
>> the Nutch crawl again. This time Nutch worked well: all pages were
>> crawled and indexed into Solr.
>> 
>> 2. Configurations
>> The only difference between the configuration files for the two
>> operations is the value of plugin.includes.
>> For operation A it is:
>>   protocol-httpclient|urlfilter-regex|parse-(html|tika|text)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)
>> For operation B it is:
>>   protocol-http|urlfilter-regex|parse-(html|tika|text)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)
>> A. nutch-site.xml
>> <property>
>>   <name>plugin.includes</name>
>>   <value>protocol-httpclient|urlfilter-regex|parse-(html|tika|text)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>>   <description></description>
>> </property>
>> B. httpclient-auth.xml
>> <auth-configuration>
>> <credentials username="user" password="password">
>>       <default/>
>> </credentials>
>> </auth-configuration>
>> C. regex-urlfilter.txt
>> -^(file|ftp|mailto):
>> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
>> -[?*!@=]
>> +.
>> That is all the configuration and all the operations I used, but for the
>> site protected by HTTP Basic authentication I always get the messages
>> above. Could someone help me with this?
>> 
>> Thanks a lot ~
>> 
>> //BR
>
>-- 
>Markus Jelsma - CTO - Openindex
>http://www.linkedin.com/in/markus17
>050-8536620 / 06-50258350
