[jira] [Commented] (NUTCH-1011) Normalize duplicate slashes in URL's
[ https://issues.apache.org/jira/browse/NUTCH-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13060637#comment-13060637 ]

Julien Nioche commented on NUTCH-1011:

great. +1 to commit

Normalize duplicate slashes in URL's

Key: NUTCH-1011
URL: https://issues.apache.org/jira/browse/NUTCH-1011
Project: Nutch
Issue Type: Improvement
Affects Versions: 1.4, 2.0
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Minor
Fix For: 1.4, 2.0
Attachments: NUTCH-1011-1.4-2.patch, NUTCH-1011-all-3.patch

Many websites produce faulty URLs with multiple slashes, e.g. http://cocoon.apache.org///1.x/dynamic.html
This is especially troublesome when the number of slashes varies: many distinct URLs then point to the same page, and each of them can generate further new (unique) URLs to the same or other duplicate pages.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1011) Normalize duplicate slashes in URL's
[ https://issues.apache.org/jira/browse/NUTCH-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13061041#comment-13061041 ]

Hudson commented on NUTCH-1011:

Integrated in Nutch-trunk #1538 (See [https://builds.apache.org/job/Nutch-trunk/1538/])
NUTCH-1011 Remove duplicate slashes from URLs

markus : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1143468
Files :
* /nutch/trunk/src/test/org/apache/nutch/net/TestURLNormalizers.java
* /nutch/trunk/conf/regex-normalize.xml.template
* /nutch/trunk/CHANGES.txt
[jira] [Commented] (NUTCH-1011) Normalize duplicate slashes in URL's
[ https://issues.apache.org/jira/browse/NUTCH-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13060073#comment-13060073 ]

Julien Nioche commented on NUTCH-1011:

Is this case covered by the tests in org.apache.nutch.net.urlnormalizer.regex.TestRegexURLNormalizer?
[jira] [Commented] (NUTCH-1011) Normalize duplicate slashes in URL's
[ https://issues.apache.org/jira/browse/NUTCH-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13054438#comment-13054438 ]

Markus Jelsma commented on NUTCH-1011:

This normalizer works with NUTCH-1013.

{code}
<!-- removes duplicate slashes -->
<regex>
  <pattern>(?&lt;!:)/{2,}</pattern>
  <substitution>/</substitution>
</regex>
{code}
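To illustrate what the rule in the comment above does, here is a minimal sketch that applies the same expression with java.util.regex. The choice of engine is an assumption made for the example (the thread only says the rule works with NUTCH-1013), and the class and method names are hypothetical, not part of Nutch.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of Nutch.
final class DuplicateSlashSketch {

    // Same expression as in the comment above, with the XML escape &lt; decoded to '<'.
    // The negative lookbehind (?<!:) skips a slash run that directly follows a colon,
    // so the "//" after the scheme is left alone.
    private static final Pattern DUPLICATE_SLASHES = Pattern.compile("(?<!:)/{2,}");

    // Collapses every matched run of two or more slashes into a single slash.
    static String normalize(String url) {
        return DUPLICATE_SLASHES.matcher(url).replaceAll("/");
    }

    public static void main(String[] args) {
        // Example URL taken from the issue description.
        // prints "http://cocoon.apache.org/1.x/dynamic.html"
        System.out.println(normalize("http://cocoon.apache.org///1.x/dynamic.html"));
    }
}
```

URLs that differ only in the number of slashes in the path collapse to the same string, which is the deduplication effect the issue asks for.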
[jira] [Commented] (NUTCH-1011) Normalize duplicate slashes in URL's
[ https://issues.apache.org/jira/browse/NUTCH-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13053953#comment-13053953 ]

Markus Jelsma commented on NUTCH-1011:

Oh, it gets better. It seems the regex engine in use cannot deal with my regex:

regex.RegexURLNormalizer - error parsing conf file: org.apache.oro.text.regex.MalformedPatternException: Sequence (?...) not recognized
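The error above indicates that the engine rejects the `(?...)` lookbehind sequence. One possible way to avoid that sequence, shown here only as a sketch, is to capture the character before the slash run and write it back in the substitution. This alternative is not from the thread, the names are hypothetical, and it is exercised here with java.util.regex only; it has not been verified against the ORO engine.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of Nutch and not the fix adopted in the issue.
final class NoLookbehindSketch {

    // Captures one character that is neither ':' nor '/', followed by two or more slashes.
    // No (?...) construct is needed.
    private static final Pattern RUN_AFTER_NON_COLON = Pattern.compile("([^:/])/{2,}");

    // Writes the captured character back ($1) followed by a single slash.
    static String normalize(String url) {
        return RUN_AFTER_NON_COLON.matcher(url).replaceAll("$1/");
    }

    public static void main(String[] args) {
        // prints "http://cocoon.apache.org/1.x/dynamic.html"
        System.out.println(normalize("http://cocoon.apache.org///1.x/dynamic.html"));
    }
}
```

Unlike the lookbehind version, this variant needs a preceding character to match, so it leaves a slash run at the very start of the string, or directly after a colon, unchanged.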