That is because the (ugly) URL you posted does not match the regular expression. You can check it yourself here: http://regexpal.com/.
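To see why the default rule never fires on that URL, here is a quick check (shown in Python for convenience; Nutch evaluates these rules in Java, but the pattern semantics are the same). The rule only rejects a URL when one path segment X recurs with exactly one other segment between each occurrence (X/…/X/…/X); the junk URL repeats segments back-to-back instead, so it slips through:

```python
import re

# The default loop-avoidance pattern from regex-urlfilter.txt,
# without the leading '-' (which just marks it as an exclusion rule).
pattern = re.compile(r'.*(/[^/]+)/[^/]+\1/[^/]+\1/')

url = ("http://grad.uwo.ca/js/prospective_students/prospective_students/"
       "current_students/postdoctoral/current_students/current_students/"
       "prospective_students/prospective_students/prospective_students/"
       "current_students/current_students/international.htm")

# The junk URL repeats segments consecutively, not in an X/?/X/?/X
# pattern, so the rule does not match it.
print(pattern.search(url) is None)   # -> True: the URL is NOT filtered

# By contrast, a URL of the shape the rule was written for IS caught:
print(bool(pattern.search("http://x/a/b/a/c/a/d")))  # -> True: filtered
```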
--Sudip.

On Sun, Nov 13, 2011 at 4:16 AM, Xiao Li <[email protected]> wrote:
> Hi Nutch people,
>
> I am a newbie to Nutch. I am working on a project to crawl all web pages
> for several universities.
>
> I have run into a problem: my crawler is stuck in an infinite loop,
> repeatedly downloading junk pages (page not found). I checked the URLs in
> the crawl_generate directory and found this dirty entry:
>
> http://grad.uwo.ca/js/prospective_students/prospective_students/current_students/postdoctoral/current_students/current_students/prospective_students/prospective_students/prospective_students/current_students/current_students/international.htm
>
> I did not change anything in regex-urlfilter.txt. The original regex to
> avoid loops is -.*(/[^/]+)/[^/]+\1/[^/]+\1/. So why does Nutch still
> fetch such ugly URLs? Please help!
>
> cheers
> Xiao

