Author: ab
Date: Fri Sep 22 14:43:01 2006
New Revision: 449100

URL: http://svn.apache.org/viewvc?view=rev&rev=449100
Log:
NUTCH-332: fix the problem of doubling scores caused by links pointing to the current page (e.g. anchors).

Modified:
    lucene/nutch/branches/branch-0.8/CHANGES.txt
    lucene/nutch/branches/branch-0.8/src/java/org/apache/nutch/parse/ParseOutputFormat.java

Modified: lucene/nutch/branches/branch-0.8/CHANGES.txt
URL: http://svn.apache.org/viewvc/lucene/nutch/branches/branch-0.8/CHANGES.txt?view=diff&rev=449100&r1=449099&r2=449100
==============================================================================
--- lucene/nutch/branches/branch-0.8/CHANGES.txt (original)
+++ lucene/nutch/branches/branch-0.8/CHANGES.txt Fri Sep 22 14:43:01 2006
@@ -28,6 +28,9 @@
  9. Use a CombiningCollector when calculating readdb -stats. This drastically
     reduces the size of intermediate data, resulting in significant speed-ups
     for large databases (ab)
+
+10. NUTCH-332 - Fix doubling score caused by links to self (Stefan
+    Groschupf via ab)
 
 Release 0.8 - 2006-07-25
 

Modified: lucene/nutch/branches/branch-0.8/src/java/org/apache/nutch/parse/ParseOutputFormat.java
URL: http://svn.apache.org/viewvc/lucene/nutch/branches/branch-0.8/src/java/org/apache/nutch/parse/ParseOutputFormat.java?view=diff&rev=449100&r1=449099&r2=449100
==============================================================================
--- lucene/nutch/branches/branch-0.8/src/java/org/apache/nutch/parse/ParseOutputFormat.java (original)
+++ lucene/nutch/branches/branch-0.8/src/java/org/apache/nutch/parse/ParseOutputFormat.java Fri Sep 22 14:43:01 2006
@@ -121,6 +121,8 @@
           } catch (Exception e) {
             toUrl = null;
           }
+          // ignore links to self (or anchors within the page)
+          if (fromUrl.equals(toUrl)) toUrl = null;
           if (toUrl != null) validCount++;
           toUrls[i] = toUrl;
         }
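
For readers who want to try the idea outside Nutch, the self-link check in the patch can be sketched as a standalone filter. This is only an illustration, not the ParseOutputFormat code: the class SelfLinkFilter and its methods stripFragment and filterOutlinks are hypothetical names, and the fragment stripping stands in for whatever URL normalization runs before the comparison in the real code path, which this diff does not show.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical standalone sketch; not part of Nutch.
class SelfLinkFilter {

    // Assumption: dropping the "#fragment" is enough to make an in-page
    // anchor compare equal to the page itself.
    static String stripFragment(String url) {
        int hash = url.indexOf('#');
        return hash < 0 ? url : url.substring(0, hash);
    }

    // Returns the outlinks that do not point back at fromUrl.
    static List<String> filterOutlinks(String fromUrl, List<String> outlinks) {
        String from = stripFragment(fromUrl);
        List<String> kept = new ArrayList<>();
        for (String to : outlinks) {
            if (to == null) continue;                      // already rejected upstream
            if (from.equals(stripFragment(to))) continue;  // link to self or anchor
            kept.add(to);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> links = Arrays.asList(
            "http://example.com/a.html",       // link to self
            "http://example.com/a.html#top",   // anchor within the page
            "http://example.com/b.html");      // genuine outlink
        System.out.println(filterOutlinks("http://example.com/a.html", links));
    }
}
```

Only the genuine outlink survives, so a page no longer passes score to itself through its own anchors, which is the doubling described in NUTCH-332.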