I don't know if protocol-httpclient is still working at all. To narrow down the problem, check the HTTP logs of the protected server and your Nutch logs.
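
One more thing that can be ruled out offline: "Generator: 0 records selected for fetching" may also show up when the seed URL never gets past the URL filters. Below is a minimal Python sketch that runs a URL through the rules quoted from regex-urlfilter.txt in the original message. It assumes urlfilter-regex applies the first rule whose pattern is found anywhere in the URL, and it uses Python's re module as a stand-in for Java's regex engine. The function name accepts and the host protected.example.com are hypothetical, not part of Nutch.

```python
import re

# Rules copied from the regex-urlfilter.txt quoted in the original message.
RULES = [
    r"-^(file|ftp|mailto):",
    r"-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|"
    r"tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$",
    r"-[?*!@=]",
    r"+.",
]


def accepts(url, rules=RULES):
    """Return True if the first rule whose pattern is found in the URL is a '+' rule.

    Assumption: first-match-wins with unanchored matching; a URL that
    matches no rule at all is treated as rejected.
    """
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if re.search(pattern, url):
            return sign == "+"
    return False


if __name__ == "__main__":
    # Hypothetical seed URLs; replace them with the ones in your seed list.
    for url in (
        "http://www.yahoo.com/",
        "http://protected.example.com/",
        "http://user:password@protected.example.com/",
        "http://protected.example.com/index.jsp?page=1",
    ):
        print("accepted" if accepts(url) else "rejected", url)
```

If the seed URL for the protected site carries credentials (user:password@host) or a query string, the rule -[?*!@=] would reject it under these assumptions, and nothing would be left to generate.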

On Wednesday 07 September 2011 11:21:07 aceyin wrote:
> Hi,
>
> I ran into a strange problem when trying to use Nutch-1.3. I list what I
> did below; I hope someone can help me.
>
> 1. Operations
>
> A. I tried to use Nutch-1.3 to crawl a web site which is protected by
> "Basic HTTP authorize", but found that Nutch had not crawled anything
> after it finished running. After checking hadoop.log, I found the
> following:
>
> 2011-09-07 04:11:37,539 WARN crawl.Generator - Generator: 0 records selected for fetching, exiting ...
> 2011-09-07 04:11:37,541 INFO crawl.Crawl - Stopping at depth=1 - no more URLs to fetch.
>
> I tried to find an answer with Google, but got no useful information.
>
> B. So I changed the URL to a public site (such as www.yahoo.com) and ran
> the Nutch crawl again. This time Nutch worked well: all pages were
> crawled and indexed into Solr.
>
> 2. Configurations
>
> The only difference between the configuration files for the two
> operations is:
>
> for operation A the plugin.includes value is:
> protocol-httpclient|urlfilter-regex|parse-(html|tika|text)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)
>
> for operation B the plugin.includes value is:
> protocol-http|urlfilter-regex|parse-(html|tika|text)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)
>
> A. nutch-site.xml
>
> <property>
>   <name>plugin.includes</name>
>   <value>protocol-httpclient|urlfilter-regex|parse-(html|tika|text)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>   <description></description>
> </property>
>
> B. httpclient-auth.xml
>
> <auth-configuration>
>   <credentials username="user" password="password">
>     <default/>
>   </credentials>
> </auth-configuration>
>
> C. regex-urlfilter.txt
>
> -^(file|ftp|mailto):
> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
> -[?*!@=]
> +.
>
> That is all the configuration and all the operations I used, but for the
> site protected by "Basic HTTP authorize" I always get the error message.
> Could someone help me with this?
>
> Thanks a lot ~
>
> //BR

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350

