[jira] Closed: (NUTCH-230) OPIC score for outlinks should be based on # of valid links, not total # of links.
[ http://issues.apache.org/jira/browse/NUTCH-230?page=all ]

Andrzej Bialecki closed NUTCH-230:
---
Resolution: Fixed

Patch applied.

OPIC score for outlinks should be based on # of valid links, not total # of links.
--
Key: NUTCH-230
URL: http://issues.apache.org/jira/browse/NUTCH-230
Project: Nutch
Type: Improvement
Versions: 0.8-dev
Reporter: Ken Krugler
Priority: Minor
Attachments: patch.txt

In ParseOutputFormat.java, the write() method currently divides the page score by the # of outlinks:

    score /= links.length;

It then loops over the links, and any that pass the normalize/filter gauntlet get added to the crawl output. But this means that any filtered links result in some amount of the page's OPIC score being lost.

For Nutch 0.7, I built a list of valid (post-filter) links, and then used that to determine the per-link OPIC score, after which I iterated over the list, adding entries to the crawl output.

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira
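The approach described in NUTCH-230 (filter first, then divide the score by the number of surviving links) can be illustrated with a minimal sketch. This is not the attached patch and not the actual ParseOutputFormat code: the class `OutlinkScoreSketch`, the `ScoredLink` type, and the `normalizeAndFilter` callback are hypothetical names used only for illustration, with a `null` return standing in for "link rejected".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.UnaryOperator;

class OutlinkScoreSketch {

    /** Hypothetical holder: a target URL and the share of the page score it receives. */
    static final class ScoredLink {
        final String url;
        final float score;

        ScoredLink(String url, float score) {
            this.url = url;
            this.score = score;
        }
    }

    /**
     * Splits pageScore evenly over the links that survive normalizeAndFilter.
     * The callback is an assumption of this sketch: it returns the normalized
     * URL, or null when the link is rejected.
     */
    static List<ScoredLink> distribute(float pageScore,
                                       List<String> links,
                                       UnaryOperator<String> normalizeAndFilter) {
        // First pass: collect only the valid (post-filter) links.
        List<String> valid = new ArrayList<>();
        for (String link : links) {
            String kept = normalizeAndFilter.apply(link);
            if (kept != null) {
                valid.add(kept);
            }
        }

        List<ScoredLink> out = new ArrayList<>();
        if (valid.isEmpty()) {
            return out; // nothing to distribute to, and no division by zero
        }

        // Divide by the number of valid links, not by links.size().
        float perLink = pageScore / valid.size();
        for (String url : valid) {
            out.add(new ScoredLink(url, perLink));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> links = Arrays.asList(
                "http://a.example/", "mailto:x@example.com",
                "http://b.example/", "javascript:void(0)");
        // Toy filter: keep only http:// links.
        for (ScoredLink l : distribute(1.0f, links, u -> u.startsWith("http://") ? u : null)) {
            System.out.println(l.url + " " + l.score);
        }
    }
}
```

With four outlinks of which two are filtered out, dividing by `links.length` would hand each surviving link a quarter of the page score and lose the other half; dividing by the valid count passes the whole score on.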
[jira] Commented: (NUTCH-240) Scoring API: extension point, scoring filters and an OPIC plugin
[ http://issues.apache.org/jira/browse/NUTCH-240?page=comments#action_12372948 ]

Jerome Charron commented on NUTCH-240:
--
+1

Scoring API: extension point, scoring filters and an OPIC plugin

Key: NUTCH-240
URL: http://issues.apache.org/jira/browse/NUTCH-240
Project: Nutch
Type: Improvement
Versions: 0.8-dev
Reporter: Andrzej Bialecki
Attachments: Generator.patch.txt, patch.txt

This patch refactors all places where Nutch manipulates page scores into a plugin-based API. Using this API it's possible to implement different scoring algorithms. It is also much easier to understand how scoring works.

Multiple scoring plugins can be run in sequence, in a manner similar to URLFilters.

Also included is an OPICScoringFilter plugin, which contains the current implementation of the scoring algorithm. Together with the scoring API it provides fully backward-compatible scoring.
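The idea of running several scoring plugins in sequence, as URLFilters are chained, might look roughly like the sketch below. The `ScoreStep` interface and `Chain` class are hypothetical stand-ins invented for this illustration; they are not the extension point defined in the patch, whose methods and signatures are not shown in this thread.

```java
import java.util.Arrays;
import java.util.List;

class ScoringChainSketch {

    /** Hypothetical stand-in for a scoring plugin; not the interface from the patch. */
    interface ScoreStep {
        float apply(String url, float score);
    }

    /** Runs each step in order, feeding the output score of one into the next. */
    static final class Chain implements ScoreStep {
        private final List<ScoreStep> steps;

        Chain(ScoreStep... steps) {
            this.steps = Arrays.asList(steps);
        }

        @Override
        public float apply(String url, float score) {
            float current = score;
            for (ScoreStep step : steps) {
                current = step.apply(url, current);
            }
            return current;
        }
    }

    public static void main(String[] args) {
        // Two toy steps: halve the score, then boost URLs that end in "/".
        ScoreStep halve = (url, s) -> s * 0.5f;
        ScoreStep boostDirs = (url, s) -> url.endsWith("/") ? s + 1.0f : s;

        Chain chain = new Chain(halve, boostDirs);
        System.out.println(chain.apply("http://a.example/", 2.0f));
    }
}
```

Because each step sees the score produced by the previous one, the order in which plugins are configured changes the result, just as it can with a chain of URL filters.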
[jira] Commented: (NUTCH-240) Scoring API: extension point, scoring filters and an OPIC plugin
[ http://issues.apache.org/jira/browse/NUTCH-240?page=comments#action_12372979 ]

Doug Cutting commented on NUTCH-240:

+1 for committing Generator.patch.txt now.

0 for committing the rest until I've had more time to think about it. I'm not against it, but, at a glance, I'm still hopeful we can do better.

Scoring API: extension point, scoring filters and an OPIC plugin

Key: NUTCH-240
URL: http://issues.apache.org/jira/browse/NUTCH-240
Project: Nutch
Type: Improvement
Versions: 0.8-dev
Reporter: Andrzej Bialecki
Assignee: Andrzej Bialecki
Attachments: Generator.patch.txt, patch.txt, patch1.txt
[jira] Commented: (NUTCH-240) Scoring API: extension point, scoring filters and an OPIC plugin
[ http://issues.apache.org/jira/browse/NUTCH-240?page=comments#action_12372981 ]

Doug Cutting commented on NUTCH-240:

Also, note that we can now extend Hadoop's new MapReduceBase to implement configure() and close() for many Mappers and Reducers, including the ones in this patch.

Scoring API: extension point, scoring filters and an OPIC plugin

Key: NUTCH-240
URL: http://issues.apache.org/jira/browse/NUTCH-240
Project: Nutch
Type: Improvement
Versions: 0.8-dev
Reporter: Andrzej Bialecki
Assignee: Andrzej Bialecki
Attachments: Generator.patch.txt, patch.txt, patch1.txt