[jira] [Commented] (NUTCH-1011) Normalize duplicate slashes in URL's

2011-07-06 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13060637#comment-13060637
 ] 

Julien Nioche commented on NUTCH-1011:
--

great. +1 to commit

 Normalize duplicate slashes in URL's
 

 Key: NUTCH-1011
 URL: https://issues.apache.org/jira/browse/NUTCH-1011
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.4, 2.0
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Minor
 Fix For: 1.4, 2.0

 Attachments: NUTCH-1011-1.4-2.patch, NUTCH-1011-all-3.patch


 Many websites produce faulty URLs with multiple slashes, e.g.
 http://cocoon.apache.org///1.x/dynamic.html
 This can be really nasty if the number of slashes varies, resulting in many 
 URLs actually pointing to the same page and generating new (unique) URLs to 
 the same or other duplicate pages.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (NUTCH-1011) Normalize duplicate slashes in URL's

2011-07-06 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13061041#comment-13061041
 ] 

Hudson commented on NUTCH-1011:
---

Integrated in Nutch-trunk #1538 (See 
[https://builds.apache.org/job/Nutch-trunk/1538/])
NUTCH-1011 Remove duplicate slashes from URLs

markus : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1143468
Files : 
* /nutch/trunk/src/test/org/apache/nutch/net/TestURLNormalizers.java
* /nutch/trunk/conf/regex-normalize.xml.template
* /nutch/trunk/CHANGES.txt






[jira] [Commented] (NUTCH-1011) Normalize duplicate slashes in URL's

2011-07-05 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13060073#comment-13060073
 ] 

Julien Nioche commented on NUTCH-1011:
--

Is this case covered by the tests in 
org.apache.nutch.net.urlnormalizer.regex.TestRegexURLNormalizer?






[jira] [Commented] (NUTCH-1011) Normalize duplicate slashes in URL's

2011-06-24 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13054438#comment-13054438
 ] 

Markus Jelsma commented on NUTCH-1011:
--

This normalizer works with NUTCH-1013.
 
{code}
<!-- removes duplicate slashes -->
<regex>
  <pattern>(?&lt;!:)/{2,}</pattern>
  <substitution>/</substitution>
</regex>
{code}
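
To illustrate what the rule above does, here is a minimal sketch that applies the same pattern and substitution with java.util.regex, which accepts the negative lookbehind `(?<!:)`. The class and method names are hypothetical and this is not the Nutch normalizer code; it only shows the effect on the example URL from this issue.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of Nutch.
class DuplicateSlashDemo {
    // A run of two or more slashes whose first slash is not directly
    // preceded by ':' (so the "//" after the scheme is left alone).
    private static final Pattern DUPLICATE_SLASHES = Pattern.compile("(?<!:)/{2,}");

    // Replace every such run with a single slash.
    static String normalize(String url) {
        return DUPLICATE_SLASHES.matcher(url).replaceAll("/");
    }

    public static void main(String[] args) {
        System.out.println(normalize("http://cocoon.apache.org///1.x/dynamic.html"));
    }
}
```

Note that the pattern is applied to the whole URL string, so a run of slashes inside a query string would be collapsed as well.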





[jira] [Commented] (NUTCH-1011) Normalize duplicate slashes in URL's

2011-06-23 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13053953#comment-13053953
 ] 

Markus Jelsma commented on NUTCH-1011:
--

Oh, it gets better. It seems the regex engine in use cannot deal with my regex:

regex.RegexURLNormalizer - error parsing conf file: 
org.apache.oro.text.regex.MalformedPatternException: Sequence (?...) not 
recognized
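
The exception says the ORO engine does not recognize the `(?...)` sequence used for the lookbehind. One possible way around that, which is not the approach taken in this issue, is a lookbehind-free pattern that captures the character before the run of slashes and puts it back in the substitution. The sketch below uses java.util.regex only; the class and method names are hypothetical, and whether ORO accepts this exact pattern is an untested assumption.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of Nutch; not tested against ORO.
class LookbehindFreeDemo {
    // Group 1 is either the start of the string or one character that is
    // not ':'; it is followed by a run of two or more slashes.
    private static final Pattern DUPLICATE_SLASHES = Pattern.compile("(^|[^:])/{2,}");

    // Keep the captured character and replace the run with a single slash.
    static String normalize(String url) {
        return DUPLICATE_SLASHES.matcher(url).replaceAll("$1/");
    }

    public static void main(String[] args) {
        System.out.println(normalize("http://cocoon.apache.org///1.x/dynamic.html"));
    }
}
```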
