[jira] Commented: (NUTCH-620) BasicURLNormalizer should collapse runs of slashes with a single slash
    [ https://issues.apache.org/jira/browse/NUTCH-620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12580677#action_12580677 ]

Hudson commented on NUTCH-620:

Integrated in Nutch-trunk #395 (See [http://hudson.zones.apache.org/hudson/job/Nutch-trunk/395/])

BasicURLNormalizer should collapse runs of slashes with a single slash

Key: NUTCH-620
URL: https://issues.apache.org/jira/browse/NUTCH-620
Project: Nutch
Issue Type: Bug
Components: fetcher
Affects Versions: 0.9.0
Environment: JDK 1.6 update 5, Tomcat 6, Windows Server 2003
Reporter: Mark DeSpain
Assignee: Andrzej Bialecki
Priority: Minor
Fix For: 1.0.0
Attachments: patch.txt
Original Estimate: 0.5h
Remaining Estimate: 0.5h

The BasicURLNormalizer should collapse runs of slash characters '/' with a single slash. For example, the following URLs should be normalized to http://lucene.apache.org/nutch/about.html

* http://lucene.apache.org/nutch//about.html
* http://lucene.apache.org//nutch/about.html
* http://lucene.apache.org/nutchabout.html (an exaggerated example)

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-620) BasicURLNormalizer should collapse runs of slashes with a single slash
    [ https://issues.apache.org/jira/browse/NUTCH-620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12579321#action_12579321 ]

Mark DeSpain commented on NUTCH-620:

Hi Andrzej,

Though I'm very interested in using and learning more about Nutch, I'm still very new to it. Please let me know if I have a flawed understanding of the behavior I describe below.

Yesterday, when I had Nutch crawl a site within our intranet, it appeared that Nutch re-visited pages multiple times. Also, the log entries for the repeatedly fetched pages showed URLs with more and more slashes in them. Using the URL I posted earlier as an example, I would first see the clean URL http://lucene.apache.org/nutch/about.html logged. Later I would see a progression similar to the following:

http://lucene.apache.org//nutch/about.html
http://lucene.apache.org///nutch/about.html
http://lucene.apache.orgnutch/about.html
http://lucene.apache.org/nutch/about.html

I have not debugged this to be sure, but my guess is that there is a web page with a relative URL back to a parent page which has an extra slash in it, something like ..//../index.html. Aside from the fact that the web developer really should clean up the URL, one would hope that a dirty URL like the one described would not cause the crawl to re-visit a graph of pages. If this issue is actually the cause of the behavior I observed, avoiding the re-visitation of a graph of pages is probably worth this extra step in the normalization process.

Again, please let me know if I have misunderstood how Nutch is supposed to perform its crawl. I would also be happy to provide more debugging information if needed.
[jira] Commented: (NUTCH-620) BasicURLNormalizer should collapse runs of slashes with a single slash
    [ https://issues.apache.org/jira/browse/NUTCH-620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12579438#action_12579438 ]

Andrzej Bialecki commented on NUTCH-620:

It would be interesting to see the source HTML which causes these links to appear ... I think your point is valid: Nutch should collapse such adjacent slashes. Could you provide a patch to BasicURLNormalizer that implements this rule?
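For readers following the thread, the rule under discussion can be sketched in a few lines of Java. This is only an illustration and not the patch attached to the issue: the class name SlashCollapseSketch and the method collapsePathSlashes are hypothetical, and the sketch uses java.net.URI and a regular expression rather than any Nutch API. It assumes that only the path should be touched, so the "//" after the scheme and any slashes in the query or fragment are left alone.

```java
import java.net.URI;

// Hypothetical illustration of the rule discussed in NUTCH-620.
// Not the attached patch and not part of the Nutch code base.
class SlashCollapseSketch {

    // Collapses every run of two or more '/' in the path into a single '/'.
    // Scheme, authority, query and fragment are copied through unchanged.
    static String collapsePathSlashes(String url) {
        URI uri = URI.create(url); // throws IllegalArgumentException on malformed input
        String path = uri.getRawPath();

        // Leave opaque URIs (e.g. mailto:) and URIs without scheme or authority alone.
        if (uri.isOpaque() || path == null
                || uri.getScheme() == null || uri.getRawAuthority() == null) {
            return url;
        }

        String collapsed = path.replaceAll("/{2,}", "/");

        // Rebuild from the raw components so existing percent-escapes are not re-encoded.
        StringBuilder out = new StringBuilder();
        out.append(uri.getScheme()).append("://").append(uri.getRawAuthority()).append(collapsed);
        if (uri.getRawQuery() != null) {
            out.append('?').append(uri.getRawQuery());
        }
        if (uri.getRawFragment() != null) {
            out.append('#').append(uri.getRawFragment());
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Two of the example URLs from the issue description.
        System.out.println(collapsePathSlashes("http://lucene.apache.org/nutch//about.html"));
        System.out.println(collapsePathSlashes("http://lucene.apache.org//nutch/about.html"));
    }
}
```

With this sketch, both example URLs come out as http://lucene.apache.org/nutch/about.html, so a crawler that keys its visited set on the normalized string would treat them as the same page. Restricting the rewrite to the path is a design choice of the sketch; a query parameter that itself carries a URL keeps its slashes.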
[jira] Commented: (NUTCH-620) BasicURLNormalizer should collapse runs of slashes with a single slash
    [ https://issues.apache.org/jira/browse/NUTCH-620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12579707#action_12579707 ]

Mark DeSpain commented on NUTCH-620:

Sure :) I'm a bit swamped at the moment, but I'll try to get a patch attached this coming weekend. I'll see if I can drum up some relevant HTML source, too.