Hi,
    I ran into a strange problem while trying to use Nutch-1.3. I have listed
what I did below and hope someone can help me:

1. Operations
A. I tried to use Nutch-1.3 to crawl a web site that is protected by HTTP
Basic authentication, but Nutch had not crawled anything when it finished
running. After checking hadoop.log, I found the following messages:
2011-09-07 04:11:37,539 WARN  crawl.Generator - Generator: 0 records selected for fetching, exiting ...
2011-09-07 04:11:37,541 INFO  crawl.Crawl - Stopping at depth=1 - no more URLs to fetch.
I searched Google for an answer but found nothing useful.
B. So I changed the URL to a public site (such as www.yahoo.com) and ran the
Nutch crawl again. This time Nutch worked well: all pages were crawled and
indexed into Solr.
2. Configurations - the only difference between the configuration files for
the two operations is the value of plugin.includes:
for operation A the value of plugin.includes is:
protocol-httpclient|urlfilter-regex|parse-(html|tika|text)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)
for operation B the value of plugin.includes is:
protocol-http|urlfilter-regex|parse-(html|tika|text)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)
A. nutch-site.xml
<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|urlfilter-regex|parse-(html|tika|text)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description></description>
</property>
B. httpclient-auth.xml
<auth-configuration>
<credentials username="user" password="password">
      <default/>
</credentials>
</auth-configuration>
C. regex-urlfilter.txt
-^(file|ftp|mailto):
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
-[?*!@=]
+.
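
To rule out the URL filter as the cause, the rules above can be checked
offline. Below is a minimal Python sketch (not Nutch code). It assumes, as I
understand the regex filter, that rules are tried from top to bottom, that the
first pattern found anywhere in the URL decides, and that a URL matching no
rule is dropped. The function name accepts and the sample URLs are
hypothetical; they are not my real seed URL.

```python
import re

# Rules copied from the regex-urlfilter.txt above, as (sign, pattern) pairs.
RULES = [
    ("-", r"^(file|ftp|mailto):"),
    ("-", r"\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg"
          r"|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$"),
    ("-", r"[?*!@=]"),
    ("+", r"."),
]


def accepts(url):
    """Return True if the first rule whose pattern is found in url is a '+' rule.

    Assumption: first match wins, and a URL that matches no rule is rejected.
    """
    for sign, pattern in RULES:
        if re.search(pattern, url):
            return sign == "+"
    return False


if __name__ == "__main__":
    # Hypothetical sample URLs, for illustration only.
    samples = [
        "http://www.yahoo.com/",
        "http://user:password@example.com/",
        "http://example.com/page?id=1",
    ]
    for url in samples:
        print(url, "->", "accepted" if accepts(url) else "rejected")
```

Under these assumptions, a URL with embedded credentials (user:password@host)
or a query string would be dropped by the -[?*!@=] rule, so that is one thing
worth checking on the seed URL of the protected site.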
Those are all the configurations and operations I used, but for the site
protected by HTTP Basic authentication I always get the log messages shown
above.
Could someone help me with this?

Thanks a lot.

//BR
