Hi Nutch people,
I am a newbie to Nutch, working on a project to crawl all of the web pages of several universities. I have run into a problem: my crawler is stuck in an infinite loop, repeatedly downloading junk "page not found" pages. When I checked the URLs in the crawl_generate directory, I found dirty stuff like this:

http://grad.uwo.ca/js/prospective_students/prospective_students/current_students/postdoctoral/current_students/current_students/prospective_students/prospective_students/prospective_students/current_students/current_students/international.htm

I have not changed anything in regex-urlfilter.txt, so the default rule that is meant to avoid such loops is still in place:

-.*(/[^/]+)/[^/]+\1/[^/]+\1/

Why does Nutch still fetch such ugly URLs? Please help!

cheers
Xiao
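P.S. In case it helps, here is a quick standalone test I sketched with plain java.util.regex (just my guess at how the rule is evaluated; Nutch's RegexURLFilter may apply it with slightly different semantics):

    import java.util.regex.Pattern;

    public class LoopRuleTest {
        public static void main(String[] args) {
            // The default "avoid loops" rule from regex-urlfilter.txt,
            // minus the leading '-' (which only marks it as a deny rule).
            Pattern rule = Pattern.compile(".*(/[^/]+)/[^/]+\\1/[^/]+\\1/");

            // The offending URL from crawl_generate, split for readability.
            String url = "http://grad.uwo.ca/js/prospective_students/"
                    + "prospective_students/current_students/postdoctoral/"
                    + "current_students/current_students/prospective_students/"
                    + "prospective_students/prospective_students/"
                    + "current_students/current_students/international.htm";

            // As far as I can tell, the rule only rejects a URL when the same
            // path segment occurs three times with exactly one other segment
            // between each occurrence. For me this prints "false" on the URL
            // above, which would explain why the page is still fetched.
            System.out.println(rule.matcher(url).find());
        }
    }

If my reading is right, would adding an extra deny rule such as -.*(/[^/]+)\1/ (to reject a path segment that is immediately repeated) be a reasonable workaround, or is there a standard fix for this?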

