[
https://issues.apache.org/jira/browse/NUTCH-620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12579321#action_12579321
]
Mark DeSpain commented on NUTCH-620:
Hi Andrzej,
Though I'm very interested in using and learning more about Nutch, I'm still
very much new to it. Please let me know if I have a flawed understanding of the
behavior I describe below.
Yesterday, when I had Nutch crawl a site within our intranet, it
appeared that Nutch re-visited pages multiple times. Also, log entries of the
fetched URLs for the repeatedly visited pages would have more and more
slashes in them. Using the URL I posted earlier as an example, I would first
see the clean URL http://lucene.apache.org/nutch/about.html logged.
Then later I would see a progression similar to the following:
http://lucene.apache.org//nutch/about.html
http://lucene.apache.org///nutch/about.html
http://lucene.apache.org////nutch/about.html
http://lucene.apache.org/////nutch/about.html
I have not debugged this to be sure, but my guess is that there is a web page
with a relative URL back to a parent page which has an extra slash in it,
something like ..//../index.html. Aside from the fact that the web
developer really should clean up the URL, one would hope that a dirty URL like
the one described would not cause the crawl to re-visit a graph of pages.
If this issue is actually the cause of the behavior I observed, avoiding the
re-visitation of a graph of pages is probably worth this extra step in the
normalization process.
Again, please let me know if I have a misunderstanding of how Nutch is supposed
to perform its crawl. I would also be happy to provide more debugging
information if needed.
BasicURLNormalizer should collapse runs of slashes with a single slash
--
Key: NUTCH-620
URL: https://issues.apache.org/jira/browse/NUTCH-620
Project: Nutch
Issue Type: Bug
Components: fetcher
Affects Versions: 0.9.0
Environment: JDK 1.6 update 5, Tomcat 6, Windows Server 2003,
Reporter: Mark DeSpain
Priority: Minor
Fix For: 1.0.0
Original Estimate: 0.5h
Remaining Estimate: 0.5h
The BasicURLNormalizer should collapse each run of slash characters '/' in a
URL's path into a single slash.
For example, the following URLs should be normalized to
http://lucene.apache.org/nutch/about.html
* http://lucene.apache.org/nutch//about.html
* http://lucene.apache.org//nutch/about.html
* http://lucene.apache.org/nutch/////about.html (an exaggerated example)
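
The requested behavior can be illustrated with a minimal Java sketch. This is
not the BasicURLNormalizer code and not a proposed patch; the class name
SlashCollapseSketch and the method collapseSlashes are hypothetical names
chosen for illustration. It assumes that only the path component should be
touched, so the "//" after the scheme, the query string, and the fragment are
left alone.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical illustration only; not part of Nutch.
final class SlashCollapseSketch {

    // Collapses every run of two or more '/' in the path into one '/'.
    // Assumption: input that does not parse as a URI is returned unchanged.
    static String collapseSlashes(String urlString) {
        URI uri;
        try {
            uri = new URI(urlString);
        } catch (URISyntaxException e) {
            return urlString;
        }
        String rawPath = uri.getRawPath();
        if (rawPath == null) {
            return urlString; // opaque URI such as mailto:, no path to clean
        }
        String collapsed = rawPath.replaceAll("/{2,}", "/");

        // Reassemble from the raw components so escapes are not re-encoded.
        StringBuilder sb = new StringBuilder();
        if (uri.getScheme() != null) {
            sb.append(uri.getScheme()).append(':');
        }
        if (uri.getRawAuthority() != null) {
            sb.append("//").append(uri.getRawAuthority());
        }
        sb.append(collapsed);
        if (uri.getRawQuery() != null) {
            sb.append('?').append(uri.getRawQuery());
        }
        if (uri.getRawFragment() != null) {
            sb.append('#').append(uri.getRawFragment());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] dirty = {
            "http://lucene.apache.org/nutch//about.html",
            "http://lucene.apache.org//nutch/about.html",
            "http://lucene.apache.org/nutch/////about.html"
        };
        for (String url : dirty) {
            // Each line prints http://lucene.apache.org/nutch/about.html
            System.out.println(collapseSlashes(url));
        }
    }
}
```

With a step like this in the normalization chain, all of the variants above
would map to the same URL string, so a crawl that compares normalized URLs
would treat them as one page rather than as distinct pages.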