[jira] Commented: (NUTCH-620) BasicURLNormalizer should collapse runs of slashes with a single slash

2008-03-20 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12580677#action_12580677
 ] 

Hudson commented on NUTCH-620:
--

Integrated in Nutch-trunk #395 (See 
[http://hudson.zones.apache.org/hudson/job/Nutch-trunk/395/])

 BasicURLNormalizer should collapse runs of slashes with a single slash
 --

 Key: NUTCH-620
 URL: https://issues.apache.org/jira/browse/NUTCH-620
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 0.9.0
 Environment: JDK 1.6 update 5, Tomcat 6, Windows Server 2003, 
Reporter: Mark DeSpain
Assignee: Andrzej Bialecki 
Priority: Minor
 Fix For: 1.0.0

 Attachments: patch.txt

   Original Estimate: 0.5h
  Remaining Estimate: 0.5h

 The BasicURLNormalizer should collapse runs of slash characters '/' with a 
 single slash.  
 For example,  the following URLs should be normalized to 
 http://lucene.apache.org/nutch/about.html
 * http://lucene.apache.org/nutch//about.html 
 * http://lucene.apache.org//nutch/about.html 
 * http://lucene.apache.org/nutch////about.html (an exaggerated example)
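
A minimal sketch of the requested rule, for illustration only. It is not the attached patch and does not use Nutch's BasicURLNormalizer API; the class name SlashCollapseSketch and the method collapseSlashes are hypothetical. It assumes that only the path component should be touched, so the "//" after the scheme and any slashes in the query string are left alone.

```java
import java.net.URI;

// Hypothetical helper, not part of Nutch: collapses runs of '/' in the path only.
final class SlashCollapseSketch {

    static String collapseSlashes(String url) {
        URI uri = URI.create(url); // throws IllegalArgumentException on malformed input
        String path = uri.getRawPath();
        // Leave opaque or authority-less URIs (e.g. mailto:, file:/...) untouched.
        if (uri.getScheme() == null || uri.getRawAuthority() == null
                || path == null || path.isEmpty()) {
            return url;
        }
        // Replace every run of two or more slashes in the path with a single slash.
        String collapsed = path.replaceAll("/{2,}", "/");
        StringBuilder sb = new StringBuilder();
        sb.append(uri.getScheme()).append("://").append(uri.getRawAuthority()).append(collapsed);
        if (uri.getRawQuery() != null) {
            sb.append('?').append(uri.getRawQuery());
        }
        if (uri.getRawFragment() != null) {
            sb.append('#').append(uri.getRawFragment());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(collapseSlashes("http://lucene.apache.org/nutch//about.html"));
        System.out.println(collapseSlashes("http://lucene.apache.org//nutch/about.html"));
    }
}
```

Restricting the rewrite to the path is a design choice of this sketch: a plain string replacement over the whole URL would also damage the scheme separator and any URL embedded in a query parameter.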

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-620) BasicURLNormalizer should collapse runs of slashes with a single slash

2008-03-17 Thread Mark DeSpain (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12579321#action_12579321
 ] 

Mark DeSpain commented on NUTCH-620:


Hi Andrzej,

Though I'm very interested in using and learning more about Nutch, I'm still 
very much new to it.  Please let me know if I have a flawed understanding of 
the behavior I describe below.

Yesterday when I had Nutch crawl a site within our intranet, it appeared that 
Nutch re-visited pages multiple times.  Also, the log entries of the fetched 
URLs for the repeatedly visited pages had more and more slashes in them.  
Using the URL I posted earlier as an example, I would first see the clean URL 
http://lucene.apache.org/nutch/about.html logged.

Then later I would see a progression similar to the following

http://lucene.apache.org//nutch/about.html
http://lucene.apache.org///nutch/about.html
http://lucene.apache.org////nutch/about.html
http://lucene.apache.org/////nutch/about.html

I have not debugged this to be sure, but my guess is that there is a web page 
with a relative URL back to a parent page which has an extra slash in it, 
something like ..//../index.html.  Aside from the fact that the web developer 
really should clean up the URL, one would hope that a dirty URL like the one 
described would not cause the crawl to re-visit a graph of pages.

If this issue is actually the cause of the behavior I observed, avoiding the 
re-visitation of a graph of pages is probably worth this extra step in the 
normalization process.
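
To illustrate why the extra step matters, here is a rough sketch of how a set of already-seen URLs treats the variants listed above. It does not reproduce Nutch's crawl database; VisitedSetSketch, normalize, and distinctCount are hypothetical names, and the normalization is a simplified stand-in that collapses slash runs in everything after the host, query string included.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration, not Nutch code: counts how many URLs a crawler
// would treat as distinct, with and without collapsing slash runs first.
final class VisitedSetSketch {

    // Simplified stand-in: collapse slash runs in everything after "scheme://host".
    static String normalize(String url) {
        int schemeEnd = url.indexOf("://");
        if (schemeEnd < 0) {
            return url;
        }
        int pathStart = url.indexOf('/', schemeEnd + 3);
        if (pathStart < 0) {
            return url;
        }
        return url.substring(0, pathStart)
                + url.substring(pathStart).replaceAll("/{2,}", "/");
    }

    static int distinctCount(List<String> urls, boolean normalizeFirst) {
        Set<String> seen = new HashSet<>();
        for (String url : urls) {
            seen.add(normalizeFirst ? normalize(url) : url);
        }
        return seen.size();
    }

    public static void main(String[] args) {
        List<String> logged = List.of(
                "http://lucene.apache.org/nutch/about.html",
                "http://lucene.apache.org//nutch/about.html",
                "http://lucene.apache.org///nutch/about.html");
        System.out.println("distinct without normalization: " + distinctCount(logged, false));
        System.out.println("distinct with normalization: " + distinctCount(logged, true));
    }
}
```

Without normalization each variant is a new string, so the page and everything it links to gets queued again; with it, all three map to the same entry.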

Again, please let me know if I have a misunderstanding of how Nutch is supposed 
to perform its crawl.  I would also be happy to provide more debugging 
information if needed.





[jira] Commented: (NUTCH-620) BasicURLNormalizer should collapse runs of slashes with a single slash

2008-03-17 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12579438#action_12579438
 ] 

Andrzej Bialecki  commented on NUTCH-620:
-

It would be interesting to see the source HTML that causes these links to 
appear ... I think your point is valid: Nutch should collapse such adjacent 
slashes. Could you provide a patch to BasicURLNormalizer that implements this 
rule?




[jira] Commented: (NUTCH-620) BasicURLNormalizer should collapse runs of slashes with a single slash

2008-03-17 Thread Mark DeSpain (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12579707#action_12579707
 ] 

Mark DeSpain commented on NUTCH-620:


Sure :)  I'm a bit swamped at the moment, but I'll try to get a patch attached 
this coming weekend.  I'll see if I can drum up some relevant HTML source, too.

