Hi Nutch people,
I am a newbie to Nutch, working on a project to crawl all of the web pages of several universities. I have run into a problem: my crawler is stuck in an infinite loop, repeatedly downloading junk "page not found" pages. When I checked the URLs in the crawl_generate directory, I found dirty stuff like this:

http://grad.uwo.ca/js/prospective_students/prospective_students/current_students/postdoctoral/current_students/current_students/prospective_students/prospective_students/prospective_students/current_students/current_students/international.htm

I have not changed anything in regex-urlfilter.txt, so the default rule that is meant to avoid such loops is still in place:

-.*(/[^/]+)/[^/]+\1/[^/]+\1/

Why does Nutch still fetch such ugly URLs? Please help!

cheers
Xiao
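P.S. In case it helps, here is a quick standalone test I sketched with plain java.util.regex (just my guess at how the rule is evaluated; Nutch's RegexURLFilter may apply it with slightly different semantics):

    import java.util.regex.Pattern;

    public class LoopRuleTest {
        public static void main(String[] args) {
            // The default "avoid loops" rule from regex-urlfilter.txt,
            // minus the leading '-' (which only marks it as a deny rule).
            Pattern rule = Pattern.compile(".*(/[^/]+)/[^/]+\\1/[^/]+\\1/");

            // The offending URL from crawl_generate, split for readability.
            String url = "http://grad.uwo.ca/js/prospective_students/"
                    + "prospective_students/current_students/postdoctoral/"
                    + "current_students/current_students/prospective_students/"
                    + "prospective_students/prospective_students/"
                    + "current_students/current_students/international.htm";

            // As far as I can tell, the rule only rejects a URL when the same
            // path segment occurs three times with exactly one other segment
            // between each occurrence. For me this prints "false" on the URL
            // above, which would explain why the page is still fetched.
            System.out.println(rule.matcher(url).find());
        }
    }

If my reading is right, would adding an extra deny rule such as -.*(/[^/]+)\1/ (to reject a path segment that is immediately repeated) be a reasonable workaround, or is there a standard fix for this?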

