Hi,
I ran into a strange problem while trying to use Nutch 1.3. I have listed what
I did below and hope someone can help me:
1. Operations
A. I tried to use Nutch 1.3 to crawl a web site protected by HTTP Basic
authentication, but found that Nutch had not crawled anything when it finished
running. After checking hadoop.log, I found the following:
2011-09-07 04:11:37,539 WARN crawl.Generator - Generator: 0 records selected for fetching, exiting ...
2011-09-07 04:11:37,541 INFO crawl.Crawl - Stopping at depth=1 - no more URLs to fetch.
I searched Google for an answer but found nothing useful.
B. So I changed the URL to a public site (such as www.yahoo.com) and ran the
Nutch crawl again. This time Nutch worked well: all pages were crawled and
indexed into Solr.
2. Configurations - the only difference between the configuration files for the
two operations is the value of plugin.includes:
For operation A, the value of plugin.includes is:
protocol-httpclient|urlfilter-regex|parse-(html|tika|text)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)
For operation B, the value of plugin.includes is:
protocol-http|urlfilter-regex|parse-(html|tika|text)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)
A. nutch-site.xml
<property>
<name>plugin.includes</name>
<value>protocol-httpclient|urlfilter-regex|parse-(html|tika|text)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
<description></description>
</property>
B. httpclient-auth.xml
<auth-configuration>
<credentials username="user" password="password">
<default/>
</credentials>
</auth-configuration>
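For context, HTTP Basic authentication sends such credentials as a base64-encoded Authorization header. Below is a minimal Python sketch, independent of Nutch, that could be used to check a username and password against the server directly. The function names and the URL passed to probe() are hypothetical, and "user" / "password" are just the placeholder values from the file above.

```python
import base64
import urllib.error
import urllib.request


def basic_auth_header(username, password):
    """Build the value of an HTTP Basic 'Authorization' header."""
    raw = f"{username}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


def probe(url, username, password):
    """Return the HTTP status code the server gives for these credentials.

    Hypothetical helper: 'url' would be the protected seed URL.
    """
    request = urllib.request.Request(url)
    request.add_header("Authorization", basic_auth_header(username, password))
    try:
        with urllib.request.urlopen(request) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # For example, 401 means the credentials were rejected.
        return error.code


if __name__ == "__main__":
    print(basic_auth_header("user", "password"))
```

If probe() returned 200 for the protected URL, the credentials themselves would probably be fine and the problem would more likely lie in the crawl configuration.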
C. regex-urlfilter.txt
-^(file|ftp|mailto):
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
-[?*!@=]
+.
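As a sanity check on these rules, here is a minimal Python sketch of how a URL could be tested against them. It assumes the rules are tried from top to bottom, that the first pattern found anywhere in the URL decides, and that a URL matching no rule is rejected. The protected host names are hypothetical stand-ins, not my real seed URL.

```python
import re

# Rules copied from regex-urlfilter.txt above: (accept, pattern).
RULES = [
    (False, r"^(file|ftp|mailto):"),
    (False, r"\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg"
            r"|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$"),
    (False, r"[?*!@=]"),
    (True, r"."),
]


def accepts(url, rules=RULES):
    """Return True if the first rule that matches 'url' is a '+' rule."""
    for accept, pattern in rules:
        if re.search(pattern, url):
            return accept
    # Assumption: a URL that matches no rule is dropped.
    return False


if __name__ == "__main__":
    samples = [
        "http://www.yahoo.com/",
        "http://user:password@protected.example.com/",  # hypothetical
        "http://protected.example.com/page?id=1",       # hypothetical
    ]
    for url in samples:
        print(url, "->", "accepted" if accepts(url) else "rejected")
```

Under these assumptions, any seed URL that contains one of the characters ? * ! @ = would be rejected by the third rule before the Generator sees it, which would also produce "0 records selected for fetching".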
Those are all the configurations and operations I used, but for the site
protected by HTTP Basic authentication I always get the log messages shown above.
Could someone help me with this?
Thanks a lot.
//BR