I don't know if protocol-httpclient is still working at all. To narrow down
the problem, check the HTTP logs of the protected server and your Nutch logs.
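
A minimal sketch for the Nutch side, assuming the log is at logs/hadoop.log
and uses the line layout visible in the quoted messages below. The function
names and the keyword list are my own choices for illustration, not part of
Nutch. It keeps WARN-or-worse lines plus anything that mentions
authentication, so fetch or credential problems stand out:

```python
import re

# Matches lines such as:
# 2011-09-07 04:11:37,539 WARN  crawl.Generator - Generator: 0 records ...
LOG_LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s+"
    r"(?P<level>[A-Z]+)\s+(?P<logger>\S+)\s+-\s+(?P<msg>.*)$"
)


def interesting_lines(lines,
                      levels=("WARN", "ERROR", "FATAL"),
                      keywords=("httpclient", "auth", "401")):
    """Return log lines that are WARN or worse, or that mention a keyword.

    The keyword list is an assumption about what is worth looking at.
    """
    hits = []
    for raw in lines:
        line = raw.rstrip("\n")
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip stack traces and other continuation lines
        text = (match.group("logger") + " " + match.group("msg")).lower()
        if match.group("level") in levels or any(k in text for k in keywords):
            hits.append(line)
    return hits


def scan_log(path="logs/hadoop.log"):
    """Read a log file (path is an assumed default) and return the hits."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return interesting_lines(handle)
```

Calling scan_log() from the directory you run the crawl in and printing the
result should be enough for a first look. On the server side, check whether
the access log shows any request from the crawler at all, and with which
status code.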

On Wednesday 07 September 2011 11:21:07 aceyin wrote:
> Hi,
>
> I ran into a strange problem while trying to use Nutch 1.3. I have listed
> what I did below and hope someone can help me.
> 
> 1. Operations
> A. I tried to use Nutch 1.3 to crawl a web site that is protected by HTTP
> Basic authentication, but Nutch had not crawled anything when it finished
> running. After checking hadoop.log, I found the following:
> 2011-09-07 04:11:37,539 WARN  crawl.Generator - Generator: 0 records selected for fetching, exiting ...
> 2011-09-07 04:11:37,541 INFO  crawl.Crawl - Stopping at depth=1 - no more URLs to fetch.
> I searched Google for an answer but found nothing useful.
> B. So I changed the URL to a public site (such as www.yahoo.com) and ran
> the Nutch crawl again. This time Nutch worked well: all pages were
> crawled and indexed into Solr.
>
> 2. Configurations
> The only difference between the configuration files for the two operations
> is the value of plugin.includes.
> For operation A it is:
> protocol-httpclient|urlfilter-regex|parse-(html|tika|text)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)
> For operation B it is:
> protocol-http|urlfilter-regex|parse-(html|tika|text)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)
>
> A. nutch-site.xml
> <property>
>   <name>plugin.includes</name>
>   <value>protocol-httpclient|urlfilter-regex|parse-(html|tika|text)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>   <description></description>
> </property>
> B. httpclient-auth.xml
> <auth-configuration>
> <credentials username="user" password="password">
>       <default/>
> </credentials>
> </auth-configuration>
> C. regex-urlfilter.txt
> -^(file|ftp|mailto):
> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
> -[?*!@=]
> +.
> Those are all the configurations and operations I used, but for the site
> protected by HTTP Basic authentication I always get the log messages shown
> above. Could someone help me with this?
> 
> Thanks a lot ~
> 
> //BR

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
