I am seeing an issue with crawling HTML pages that have relative URLs
embedded in them.  I know there is an ongoing issue related to relative URLs
that begin with a ?, but this seems to be a different issue.

In regex-normalize.xml there is the following pattern:

<regex>
  <pattern>#.*?(\?|&amp;|$)</pattern>
  <substitution>$1</substitution>
</regex>

Here is my URL:
http://myhost.com/my_page.php?id=23141

The source of this page contains the following href:
href="#R&amp;D_in_Research_Books"

Nutch then tries to fetch this URL:
http://myhost.com/my_page.php?id=23141&D_in_Research_Books&D&D_in_Research_Books&D&D&D_in_Research_Books&D_in_Research_Books
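
To show where the first "&D_in_Research_Books" comes from, here is a minimal
sketch of that rule applied once. It is Python rather than the Java regex
engine Nutch uses, on the assumption that both treat this particular pattern
the same way. The names FRAGMENT_RULE and normalize are made up for
illustration, and the &amp; in the XML is written as a plain & because that is
what the regex sees after XML decoding.

```python
import re

# The rule from regex-normalize.xml, with the XML entity &amp; decoded to &.
# FRAGMENT_RULE and normalize are illustrative names, not Nutch identifiers.
FRAGMENT_RULE = re.compile(r"#.*?(\?|&|$)")


def normalize(url: str) -> str:
    """Apply the rule once, keeping capture group 1 (the $1 in the config)."""
    return FRAGMENT_RULE.sub(r"\1", url)


# The href after HTML entity decoding, resolved against the page URL.
outlink = "http://myhost.com/my_page.php?id=23141#R&D_in_Research_Books"

# The lazy match stops at the first & inside the fragment, so only "#R" is
# dropped and the rest of the anchor name is left behind as a query parameter.
print(normalize(outlink))
# prints http://myhost.com/my_page.php?id=23141&D_in_Research_Books
```

So the rule does not strip the whole anchor. It only removes "#R" and keeps
everything from the & onward.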

WTH??? Commenting out that pattern stops the madness. Otherwise the crawl runs
in a continual loop that never ends and just keeps generating more and more
URLs with "&D_in_Research_Books" tacked onto the end.
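
The loop itself can be sketched roughly as follows. This assumes the server
returns the same page (with the same href) no matter what extra query
parameters are appended, which I have not verified, and it uses Python's
urljoin as a stand-in for however Nutch resolves relative links. HREF and
next_url are hypothetical names. It does not reproduce the stray "&D" segments
in the URL above, only the steady growth.

```python
import re
from urllib.parse import urljoin

# Same rule as in regex-normalize.xml, with &amp; decoded to &.
FRAGMENT_RULE = re.compile(r"#.*?(\?|&|$)")

# The href from the page source after HTML entity decoding.
HREF = "#R&D_in_Research_Books"


def next_url(page_url: str) -> str:
    """Resolve the anchor href against the page URL, then apply the rule."""
    resolved = urljoin(page_url, HREF)
    return FRAGMENT_RULE.sub(r"\1", resolved)


# Assumption: every generated URL serves the same HTML, so each fetch
# yields one more "new" outlink that is one parameter longer.
url = "http://myhost.com/my_page.php?id=23141"
seen = [url]
for _ in range(3):
    url = next_url(url)
    seen.append(url)

for u in seen:
    print(u)
```

Each round produces a URL the crawldb has not seen before, so under that
assumption there is nothing to stop it.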

I have over 1 MILLION of these in my crawldb (the crawl has been running for
over a week).

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Relative-urls-interpage-href-anchors-tp3861215p3861215.html
Sent from the Nutch - User mailing list archive at Nabble.com.
