I am seeing an issue when crawling HTML pages that have relative URLs embedded in them. I know there is an ongoing issue with relative URLs that begin with a ?, but this seems to be a different one.
In regex-normalize.xml there is the following pattern:

  <regex>
    <pattern>#.*?(\?|&|$)</pattern>
    <substitution>$1</substitution>
  </regex>

Here is my URL:

  http://myhost.com/my_page.php?id=23141

The source of this page contains the following href:

  href="#R&D_in_Research_Books"

Nutch tries to fetch this URL:

  http://myhost.com/my_page.php?id=23141&D_in_Research_Books&D&D_in_Research_Books&D&D&D_in_Research_Books&D_in_Research_Books

WTH??? Commenting out that pattern stops the madness. Otherwise the crawl runs in a continual loop that never ends and just keeps generating more and more URLs with "&D_in_Research_Books" tacked onto the end. I have over 1 MILLION of these in my crawldb (it has been running for over a week).
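To show what that rule appears to do with this href, here is a minimal Python sketch. It is only an approximation: Nutch applies the pattern with Java's regex engine, although for this particular pattern Python's `re` should behave the same way. The helper names `normalize` and `simulate` are made up for illustration, and `SAFER_RULE` is just one possible variant, not something taken from Nutch.

```python
import re
from urllib.parse import urljoin

# Pattern copied from the regex-normalize.xml rule quoted above.
# The lazy ".*?" stops at the first '?' or '&' after the '#', and the
# substitution ($1 in the XML, \1 here) puts that character back.
ANCHOR_RULE = re.compile(r"#.*?(\?|&|$)")

# Hypothetical alternative (an assumption, not Nutch's shipped rule):
# drop everything from '#' to the end of the URL.
SAFER_RULE = re.compile(r"#.*$")


def normalize(url, rule=ANCHOR_RULE, repl=r"\1"):
    """Apply one normalization rule as a replace-all."""
    return rule.sub(repl, url)


def simulate(page, href, rounds):
    """Resolve the same in-page href against each newly 'fetched' URL."""
    urls = []
    url = page
    for _ in range(rounds):
        outlink = urljoin(url, href)  # e.g. ...?id=23141#R&D_in_Research_Books
        url = normalize(outlink)      # only "#R" is removed, "&D_in..." survives
        urls.append(url)
    return urls


if __name__ == "__main__":
    page = "http://myhost.com/my_page.php?id=23141"
    href = "#R&D_in_Research_Books"
    for u in simulate(page, href, 3):
        print(u)
    # The variant strips the whole fragment, so the URL stays stable:
    print(normalize(urljoin(page, href), SAFER_RULE, ""))
```

Each round, the leftover "&D_in_Research_Books" looks like a new query parameter, so the result is a URL the crawldb has not seen before. The same href on that page then produces yet another one, which would explain the endless growth. This sketch does not account for the bare "&D" pieces in the URL above. Those may come from other anchors on the page.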