Author: ab
Date: Fri Sep 22 14:43:01 2006
New Revision: 449100

URL: http://svn.apache.org/viewvc?view=rev&rev=449100
Log:
NUTCH-332: fix the problem of doubling scores caused by links pointing
to the current page (e.g. anchors).

Modified:
    lucene/nutch/branches/branch-0.8/CHANGES.txt
    lucene/nutch/branches/branch-0.8/src/java/org/apache/nutch/parse/ParseOutputFormat.java

Modified: lucene/nutch/branches/branch-0.8/CHANGES.txt
URL: http://svn.apache.org/viewvc/lucene/nutch/branches/branch-0.8/CHANGES.txt?view=diff&rev=449100&r1=449099&r2=449100
==============================================================================
--- lucene/nutch/branches/branch-0.8/CHANGES.txt (original)
+++ lucene/nutch/branches/branch-0.8/CHANGES.txt Fri Sep 22 14:43:01 2006
@@ -28,6 +28,9 @@
  9. Use a CombiningCollector when calculating readdb -stats. This
     drastically reduces the size of intermediate data, resulting in
     significant speed-ups for large databases (ab)
+
+10. NUTCH-332 - Fix doubling score caused by links to self (Stefan
+    Groschupf via ab)
     
 Release 0.8 - 2006-07-25
 

Modified: lucene/nutch/branches/branch-0.8/src/java/org/apache/nutch/parse/ParseOutputFormat.java
URL: http://svn.apache.org/viewvc/lucene/nutch/branches/branch-0.8/src/java/org/apache/nutch/parse/ParseOutputFormat.java?view=diff&rev=449100&r1=449099&r2=449100
==============================================================================
--- lucene/nutch/branches/branch-0.8/src/java/org/apache/nutch/parse/ParseOutputFormat.java (original)
+++ lucene/nutch/branches/branch-0.8/src/java/org/apache/nutch/parse/ParseOutputFormat.java Fri Sep 22 14:43:01 2006
@@ -121,6 +121,8 @@
             } catch (Exception e) {
               toUrl = null;
             }
+            // ignore links to self (or anchors within the page)
+            if (fromUrl.equals(toUrl)) toUrl = null;
             if (toUrl != null) validCount++;
             toUrls[i] = toUrl;
           }
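
For illustration only, the self-link check in the hunk above can be sketched as a standalone filter. This is not the Nutch code: the class name SelfLinkFilter and the method filterOutlinks are hypothetical, URLs are plain strings, and the sketch assumes that in-page anchors have already been reduced to the page URL by earlier normalization, which is presumably why a plain equals() comparison is enough in the patch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Nutch: mirrors the idea of the patch
// by dropping outlinks that point back at the page they came from.
class SelfLinkFilter {

    /** Returns the usable outlinks of fromUrl, skipping nulls and links to self. */
    static List<String> filterOutlinks(String fromUrl, String[] toUrls) {
        List<String> valid = new ArrayList<>();
        for (String toUrl : toUrls) {
            // A null entry stands for a link that was rejected earlier
            // (in the patch, toUrl is set to null when processing throws).
            if (toUrl == null) continue;
            // Ignore links to self. Assumption: fragments such as "#top"
            // were stripped upstream, so anchors compare equal to fromUrl.
            if (fromUrl.equals(toUrl)) continue;
            valid.add(toUrl);
        }
        return valid;
    }

    public static void main(String[] args) {
        String from = "http://example.com/page.html";
        String[] links = {
            "http://example.com/page.html",   // link to self, dropped
            "http://example.com/other.html",  // kept
            null                              // rejected earlier, dropped
        };
        System.out.println(filterOutlinks(from, links));
    }
}
```

With self-links removed before counting, the page no longer passes a share of its score back to itself, which is the doubling the log message describes.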

