Hi Nutch people,

I am a newbie to Nutch, working on a project to crawl all the web pages of
several universities.

I have run into a problem: my crawler is stuck in a loop, repeatedly
downloading junk pages ("page not found"). I checked the URLs in the
crawl_generate directory and found entries like this:

http://grad.uwo.ca/js/prospective_students/prospective_students/current_students/postdoctoral/current_students/current_students/prospective_students/prospective_students/prospective_students/current_students/current_students/international.htm

I have not changed anything in regex-urlfilter.txt. The stock rule meant to
break such loops is -.*(/[^/]+)/[^/]+\1/[^/]+\1/, yet Nutch still fetches
these ugly URLs. Why doesn't the filter catch them? Please help!
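
As a sanity check, here is a minimal standalone sketch that tests the rule
against the offending URL with plain java.util.regex (the class name
LoopRuleTest is just mine, and I am assuming Nutch applies these rules with
find()-style substring matching; please correct me if that is wrong):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LoopRuleTest {
    public static void main(String[] args) {
        // The stock loop-breaking rule from regex-urlfilter.txt, minus the
        // leading '-' (which only marks it as an exclude rule). The '.*'
        // prefix is redundant under find() and is omitted here.
        Pattern loopRule = Pattern.compile("(/[^/]+)/[^/]+\\1/[^/]+\\1/");

        // The offending URL from crawl_generate.
        String url = "http://grad.uwo.ca/js/prospective_students/"
                + "prospective_students/current_students/postdoctoral/"
                + "current_students/current_students/prospective_students/"
                + "prospective_students/prospective_students/current_students/"
                + "current_students/international.htm";

        // Check whether the rule matches anywhere in the URL.
        Matcher m = loopRule.matcher(url);
        System.out.println(m.find()
                ? "rule matches: URL would be excluded"
                : "rule does not match: URL slips through");
    }
}

This prints "rule does not match: URL slips through". If I am reading the
pattern right, it only fires when the same segment appears at every other
position (/A/x/A/y/A/), while the repeated segments in my URL sit back to
back (/A/A/A/), so the rule never triggers. Is the stock rule simply too
narrow for this kind of loop, or am I missing something?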

cheers
Xiao
