[jira] Closed: (NUTCH-230) OPIC score for outlinks should be based on # of valid links, not total # of links.

2006-04-03 Thread Andrzej Bialecki (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-230?page=all ]
 
Andrzej Bialecki  closed NUTCH-230:
---

Resolution: Fixed

Patch applied.

 OPIC score for outlinks should be based on # of valid links, not total # of links.
 ----------------------------------------------------------------------------------

  Key: NUTCH-230
  URL: http://issues.apache.org/jira/browse/NUTCH-230
  Project: Nutch
 Type: Improvement

 Versions: 0.8-dev
 Reporter: Ken Krugler
 Priority: Minor
  Attachments: patch.txt

 In ParseOutputFormat.java, the write() method currently divides the page 
 score by the # of outlinks:
   score /= links.length;
 It then loops over the links, and only those that pass the normalize/filter 
 gauntlet get added to the crawl output.
 But this means that each filtered-out link takes its share of the page's 
 OPIC score with it, so that portion of the score is lost.
 For Nutch 0.7, I built a list of valid (post-filter) links, and then used 
 that to determine the per-link OPIC score, after which I iterated over the 
 list, adding entries to the crawl output.
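
A minimal sketch of the approach described above, assuming a simplified setting: the class `ValidOutlinkScores`, its `distribute` method, and the `normalizeAndFilter` callback are hypothetical names for illustration, not the code from patch.txt or from ParseOutputFormat.java. It collects the links that survive filtering first, then divides the page score by the size of that list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical helper, not Nutch code: splits a page score across the
// outlinks that survive normalization/filtering, not across all outlinks.
final class ValidOutlinkScores {

    /** Surviving URLs plus the score each of them receives. */
    static final class Result {
        final List<String> urls;
        final float perLinkScore;

        Result(List<String> urls, float perLinkScore) {
            this.urls = urls;
            this.perLinkScore = perLinkScore;
        }
    }

    /**
     * normalizeAndFilter is an assumed stand-in for the normalize/filter
     * step: it returns the cleaned URL, or null to reject the link.
     */
    static Result distribute(float pageScore, List<String> links,
                             UnaryOperator<String> normalizeAndFilter) {
        // Pass 1: build the list of valid (post-filter) links.
        List<String> valid = new ArrayList<>();
        for (String link : links) {
            String cleaned = normalizeAndFilter.apply(link);
            if (cleaned != null) {
                valid.add(cleaned);
            }
        }
        // Pass 2: divide by the number of valid links, guarding against zero.
        float perLink = valid.isEmpty() ? 0.0f : pageScore / valid.size();
        return new Result(valid, perLink);
    }
}
```

With a page score of 1.0 and four outlinks of which two are rejected, each surviving link receives 0.5 in this sketch. Dividing by `links.length` would give each 0.25 and drop the other half of the score.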

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Commented: (NUTCH-240) Scoring API: extension point, scoring filters and an OPIC plugin

2006-04-03 Thread Jerome Charron (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-240?page=comments#action_12372948 ]

Jerome Charron commented on NUTCH-240:
--

+1

 Scoring API: extension point, scoring filters and an OPIC plugin
 

  Key: NUTCH-240
  URL: http://issues.apache.org/jira/browse/NUTCH-240
  Project: Nutch
 Type: Improvement

 Versions: 0.8-dev
 Reporter: Andrzej Bialecki 
  Attachments: Generator.patch.txt, patch.txt

 This patch refactors every place where Nutch manipulates page scores into a 
 plugin-based API. With this API it is possible to implement different scoring 
 algorithms, and it is much easier to understand how scoring works.
 Multiple scoring plugins can be run in sequence, in a manner similar to 
 URLFilters.
 Also included is an OPICScoringFilter plugin, which contains the current 
 implementation of the scoring algorithm. Together with the scoring API, it 
 provides fully backward-compatible scoring.
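
To illustrate what running scoring plugins "in sequence" could look like, here is a rough sketch. The `ScoreStep` interface and `ScoreChain` class are hypothetical simplifications invented for this example; the extension point defined in the patch may have a different shape and different method signatures.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified plugin interface; NOT the interface from the patch.
interface ScoreStep {
    /** Returns the (possibly adjusted) score for the given URL. */
    float adjust(String url, float score);
}

// Runs several steps in order, feeding each step's output into the next,
// in the same spirit as a chain of URL filters.
final class ScoreChain implements ScoreStep {
    private final List<ScoreStep> steps;

    ScoreChain(ScoreStep... steps) {
        this.steps = Arrays.asList(steps);
    }

    @Override
    public float adjust(String url, float score) {
        float current = score;
        for (ScoreStep step : steps) {
            current = step.adjust(url, current);
        }
        return current;
    }
}
```

Because the chain itself implements the same interface, callers in this sketch do not need to know whether one plugin or several are configured. An empty chain leaves the score unchanged.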




[jira] Commented: (NUTCH-240) Scoring API: extension point, scoring filters and an OPIC plugin

2006-04-03 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-240?page=comments#action_12372979 ]

Doug Cutting commented on NUTCH-240:


+1 for committing Generator.patch.txt now.

0 for committing the rest until I've had more time to think about it.  I'm not 
against it, but, at a glance, I'm still hopeful we can do better.

 Scoring API: extension point, scoring filters and an OPIC plugin
 

  Key: NUTCH-240
  URL: http://issues.apache.org/jira/browse/NUTCH-240
  Project: Nutch
 Type: Improvement

 Versions: 0.8-dev
 Reporter: Andrzej Bialecki 
 Assignee: Andrzej Bialecki 
  Attachments: Generator.patch.txt, patch.txt, patch1.txt





[jira] Commented: (NUTCH-240) Scoring API: extension point, scoring filters and an OPIC plugin

2006-04-03 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-240?page=comments#action_12372981 ]

Doug Cutting commented on NUTCH-240:


Also, note that we can now extend Hadoop's new MapReduceBase to implement 
configure() and close() for many Mappers and Reducers, including the ones in 
this patch.
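
The pattern behind this suggestion can be sketched as follows. To keep the example self-contained, `LifecycleBase` is a hypothetical stand-in with no-op `configure()` and `close()` methods; it is not Hadoop's MapReduceBase, and the `Map<String, String>` configuration, the `score.factor` key, and `ScoringMapperSketch` are assumptions made purely for illustration.

```java
import java.util.Map;

// Stand-in for the base-class pattern; NOT Hadoop's MapReduceBase.
// It supplies empty lifecycle methods so subclasses override only what they need.
abstract class LifecycleBase {
    public void configure(Map<String, String> conf) {
        // no-op by default
    }

    public void close() {
        // no-op by default
    }
}

// Hypothetical mapper-like class: it overrides configure() to read one
// setting and simply inherits the empty close().
final class ScoringMapperSketch extends LifecycleBase {
    private float scoreFactor = 1.0f;
    private int processed = 0;

    @Override
    public void configure(Map<String, String> conf) {
        // "score.factor" is an invented key for this example.
        scoreFactor = Float.parseFloat(conf.getOrDefault("score.factor", "1.0"));
    }

    float map(float score) {
        processed++;
        return score * scoreFactor;
    }

    int processed() {
        return processed;
    }
}
```

The benefit in this sketch is only less boilerplate: classes that need no setup or teardown do not have to declare empty methods of their own.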

