It's because the (ugly) URL you pasted does not actually match that RE: the pattern only rejects a path segment repeated three times with exactly one other segment between each occurrence (i.e. /A/B/A/C/A/), and your URL repeats segments in a different shape.
You can check it yourself here http://regexpal.com/.
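A quick way to see this without a web tool is to run the regex locally; Python's re backreference semantics behave the same as Java's for this pattern (the leading "-" in regex-urlfilter.txt is Nutch's exclude marker, not part of the regex). This is just a sketch to demonstrate the mismatch:

```python
import re

# Loop-avoidance pattern from Nutch's default regex-urlfilter.txt,
# without the leading "-" exclude marker.
pattern = re.compile(r".*(/[^/]+)/[^/]+\1/[^/]+\1/")

url = ("http://grad.uwo.ca/js/prospective_students/prospective_students/"
       "current_students/postdoctoral/current_students/current_students/"
       "prospective_students/prospective_students/prospective_students/"
       "current_students/current_students/international.htm")

# The regex only matches a segment repeated three times with exactly one
# other segment between occurrences (/A/B/A/C/A/). This URL repeats
# segments, but never in that exact shape, so the filter lets it through.
print(pattern.match(url))

# A URL shaped /A/B/A/C/A/ does match, so the default rule would drop it:
print(pattern.match("http://example.com/a/b/a/c/a/page.htm"))
```

To catch URLs like yours you would need an extra exclude rule of your own (for example one that matches adjacent duplicate segments); the default rule alone is not enough.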

--Sudip.

On Sun, Nov 13, 2011 at 4:16 AM, Xiao Li <[email protected]> wrote:
> Hi Nutch people,
>
>
> I am a newbie to Nutch. I am doing a project to crawl all webpages for
> several universities.
>
> I have a problem right now. My crawler is stuck in an infinite loop,
> repeatedly downloading junk pages (page not found). I checked the URLs in
> the crawl_generate directory and found this dirty stuff:
>
> http://grad.uwo.ca/js/prospective_students/prospective_students/current_students/postdoctoral/current_students/current_students/prospective_students/prospective_students/prospective_students/current_students/current_students/international.htm
>
> I did not change anything inside regex-urlfilter.txt. The original regex
> to avoid loops is -.*(/[^/]+)/[^/]+\1/[^/]+\1/. So why does Nutch still
> fetch such ugly URLs? Please help!
>
> cheers
> Xiao
>
