Hello

I use Nutch 1.2 on Fedora 14 and am trying to index about 4000 domains. I run
bin/nutch crawl urls -dir crawl -depth 3 -topN -1
and have this in crawl-urlfilter.txt:
# accept hosts in MY.DOMAIN.NAME
 +^http://([a-z0-9]*\.)* 
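
To rule out the URL filter as the cause, here is a minimal Python sketch that applies the accept pattern above to both forms of the seed URL. It assumes the filter looks for the pattern anywhere in the URL (search semantics) and that Python's regex syntax behaves the same as Java's for this pattern. The helper name accepts is hypothetical and only for illustration.

```python
import re

# Pattern copied from the "+" rule in crawl-urlfilter.txt (leading "+" dropped).
ACCEPT = re.compile(r"^http://([a-z0-9]*\.)*")


def accepts(url: str) -> bool:
    """Return True if the accept pattern is found in the URL.

    Assumption: the filter uses search semantics, not a full match.
    """
    return ACCEPT.search(url) is not None


if __name__ == "__main__":
    for url in ("http://mydomain.com", "http://www.mydomain.com"):
        print(url, accepts(url))
```

Because the group in parentheses may repeat zero times, the pattern accepts any URL that starts with http://, so the filter should not treat the two forms differently.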

I noticed that if a domain is entered as http://mydomain.com in the seed 
file, Nutch reports the following error for some domains:
failed with: java.net.UnknownHostException

If, however, I enter the same domain with www, as in http://www.mydomain.com, 
Nutch does not give any errors.

Since entering http://mydomain.com in a browser redirects to 
http://www.mydomain.com, I thought this might be a bug in Nutch.
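
One way to check whether the failures come from DNS rather than from Nutch would be to test whether the bare hostname resolves at all on the crawl machine. The following is a minimal Python sketch under that assumption. The helper names host_variants and resolves are hypothetical, and mydomain.com is the placeholder from above, not a real seed.

```python
import socket
from urllib.parse import urlparse


def host_variants(url: str) -> list[str]:
    """Return the bare and www forms of the hostname in a seed URL."""
    host = urlparse(url).hostname or ""
    bare = host[4:] if host.startswith("www.") else host
    return [bare, "www." + bare]


def resolves(host: str) -> bool:
    """Return True if the system resolver finds an address for host."""
    try:
        socket.getaddrinfo(host, 80)
        return True
    except socket.gaierror:
        # Name lookup failed: no DNS record, or the resolver is unreachable.
        return False


if __name__ == "__main__":
    # Placeholder seed; substitute one of the failing domains.
    for host in host_variants("http://mydomain.com"):
        print(host, "resolves" if resolves(host) else "does NOT resolve")
```

If only the www form resolves for the failing domains, the exception would point to missing DNS records for the bare names rather than to the crawler.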

Any thoughts on how to fix this issue?

Thanks.
Alex.
