[jira] Commented: (NUTCH-240) Scoring API: extension point, scoring filters and an OPIC plugin
[ http://issues.apache.org/jira/browse/NUTCH-240?page=comments#action_12377200 ]

Andrzej Bialecki commented on NUTCH-240:

If there are no further suggestions or objections, I'd like to move forward on this patch. I know the passScore* methods are a bit awkward, but that's what we do anyway, we just do it under the carpet :) If folks have ideas how to improve this part, please speak up.

Scoring API: extension point, scoring filters and an OPIC plugin
    Key: NUTCH-240
    URL: http://issues.apache.org/jira/browse/NUTCH-240
    Project: Nutch
    Type: Improvement
    Versions: 0.8-dev
    Reporter: Andrzej Bialecki
    Assignee: Andrzej Bialecki
    Attachments: Generator.patch.txt, patch.txt, patch1.txt, patch2.txt

This patch refactors all places where Nutch manipulates page scores, into a plugin-based API. Using this API it's possible to implement different scoring algorithms. It is also much easier to understand how scoring works. Multiple scoring plugins can be run in sequence, in a manner similar to URLFilters.

Included is also an OPICScoringFilter plugin, which contains the current implementation of the scoring algorithm. Together with the scoring API it provides a fully backward-compatible scoring.

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
    http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
    http://www.atlassian.com/software/jira
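The issue description says that multiple scoring plugins can be run in sequence, in a manner similar to URLFilters. A minimal sketch of that chaining idea follows. The names ScoreFilterSketch, ScoringChainSketch, adjust and apply are hypothetical stand-ins invented for this illustration; they are not the interface from the patch, and the two example rules are arbitrary.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a scoring plugin; not the interface from the patch.
interface ScoreFilterSketch {
    float adjust(String url, float score);
}

class ScoringChainSketch {
    private final List<ScoreFilterSketch> filters = new ArrayList<>();

    // Register a filter; filters run in the order they were added.
    ScoringChainSketch add(ScoreFilterSketch filter) {
        filters.add(filter);
        return this;
    }

    // Feed each filter the result of the previous one and return the final score.
    float apply(String url, float initialScore) {
        float score = initialScore;
        for (ScoreFilterSketch filter : filters) {
            score = filter.adjust(url, score);
        }
        return score;
    }

    public static void main(String[] args) {
        ScoringChainSketch chain = new ScoringChainSketch()
            .add((url, s) -> s * 0.5f)                           // arbitrary rule: damp the score
            .add((url, s) -> url.endsWith("/") ? s + 1.0f : s);  // arbitrary rule: boost URLs ending in "/"
        System.out.println(chain.apply("http://example.com/", 2.0f));
    }
}
```

Because each filter only sees the running score, the order of registration matters, which is the same trade-off a chain of URL filters has.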
[jira] Commented: (NUTCH-240) Scoring API: extension point, scoring filters and an OPIC plugin
[ http://issues.apache.org/jira/browse/NUTCH-240?page=comments#action_12373256 ]

Shawn Gervais commented on NUTCH-240:

This change seems to have caused an error to be thrown:

    060405 034711 Generator: Partitioning selected urls by host, for politeness.
    Exception in thread main java.lang.RuntimeException: class org.apache.nutch.crawl.Generator$SelectorInverseMapper not org.apache.hadoop.mapred.Mapper
        at org.apache.hadoop.conf.Configuration.setClass(Configuration.java:262)
        at org.apache.hadoop.mapred.JobConf.setMapperClass(JobConf.java:249)
        at org.apache.nutch.crawl.Generator.generate(Generator.java:263)
        at org.apache.nutch.crawl.Generator.main(Generator.java:317)

Just FYI.

Scoring API: extension point, scoring filters and an OPIC plugin
    Key: NUTCH-240
    URL: http://issues.apache.org/jira/browse/NUTCH-240
    Project: Nutch
    Type: Improvement
    Versions: 0.8-dev
    Reporter: Andrzej Bialecki
    Assignee: Andrzej Bialecki
    Attachments: Generator.patch.txt, patch.txt, patch1.txt
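The exception text reported above ("class ... not org.apache.hadoop.mapred.Mapper") has the shape one would expect from a check that a class implements a required interface. The sketch below reproduces that kind of check in isolation, on the assumption that this is what happened here. All types in it (MapperSketch, GoodMapper, ForgotInterface, SetClassSketch) are hypothetical stand-ins, not Hadoop or Nutch classes.

```java
// Hypothetical stand-ins; none of these are Hadoop or Nutch classes.
interface MapperSketch {}

class GoodMapper implements MapperSketch {}

class ForgotInterface {}   // compiles fine, but does not implement MapperSketch

class SetClassSketch {
    // Reject a class that does not implement the required interface.
    static void requireImplements(Class<?> theClass, Class<?> xface) {
        if (!xface.isAssignableFrom(theClass)) {
            throw new RuntimeException(theClass + " not " + xface.getName());
        }
    }

    public static void main(String[] args) {
        requireImplements(GoodMapper.class, MapperSketch.class); // passes silently
        try {
            requireImplements(ForgotInterface.class, MapperSketch.class);
        } catch (RuntimeException e) {
            // The message names the offending class and the interface it lacks.
            System.out.println(e.getMessage());
        }
    }
}
```

Under this reading, the fix is simply to make the offending class implement the expected interface, which fits the "last moment change" explanation in the reply.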
[jira] Commented: (NUTCH-240) Scoring API: extension point, scoring filters and an OPIC plugin
[ http://issues.apache.org/jira/browse/NUTCH-240?page=comments#action_12373264 ]

Andrzej Bialecki commented on NUTCH-240:

Oops, sorry, that was a last moment change ... I fixed it now, thanks for spotting this.

Scoring API: extension point, scoring filters and an OPIC plugin
    Key: NUTCH-240
    URL: http://issues.apache.org/jira/browse/NUTCH-240
    Project: Nutch
    Type: Improvement
    Versions: 0.8-dev
    Reporter: Andrzej Bialecki
    Assignee: Andrzej Bialecki
    Attachments: Generator.patch.txt, patch.txt, patch1.txt
[jira] Commented: (NUTCH-240) Scoring API: extension point, scoring filters and an OPIC plugin
[ http://issues.apache.org/jira/browse/NUTCH-240?page=comments#action_12372948 ]

Jerome Charron commented on NUTCH-240:

+1

Scoring API: extension point, scoring filters and an OPIC plugin
    Key: NUTCH-240
    URL: http://issues.apache.org/jira/browse/NUTCH-240
    Project: Nutch
    Type: Improvement
    Versions: 0.8-dev
    Reporter: Andrzej Bialecki
    Attachments: Generator.patch.txt, patch.txt
[jira] Commented: (NUTCH-240) Scoring API: extension point, scoring filters and an OPIC plugin
[ http://issues.apache.org/jira/browse/NUTCH-240?page=comments#action_12372979 ]

Doug Cutting commented on NUTCH-240:

+1 for committing Generator.patch.txt now.

0 for committing the rest until I've had more time to think about it. I'm not against it, but, at a glance, I'm still hopeful we can do better.

Scoring API: extension point, scoring filters and an OPIC plugin
    Key: NUTCH-240
    URL: http://issues.apache.org/jira/browse/NUTCH-240
    Project: Nutch
    Type: Improvement
    Versions: 0.8-dev
    Reporter: Andrzej Bialecki
    Assignee: Andrzej Bialecki
    Attachments: Generator.patch.txt, patch.txt, patch1.txt
[jira] Commented: (NUTCH-240) Scoring API: extension point, scoring filters and an OPIC plugin
[ http://issues.apache.org/jira/browse/NUTCH-240?page=comments#action_12372981 ]

Doug Cutting commented on NUTCH-240:

Also, note that we can now extend Hadoop's new MapReduceBase to implement configure() and close() for many Mappers and Reducers, including the ones in this patch.

Scoring API: extension point, scoring filters and an OPIC plugin
    Key: NUTCH-240
    URL: http://issues.apache.org/jira/browse/NUTCH-240
    Project: Nutch
    Type: Improvement
    Versions: 0.8-dev
    Reporter: Andrzej Bialecki
    Assignee: Andrzej Bialecki
    Attachments: Generator.patch.txt, patch.txt, patch1.txt
[jira] Commented: (NUTCH-240) Scoring API: extension point, scoring filters and an OPIC plugin
[ http://issues.apache.org/jira/browse/NUTCH-240?page=comments#action_12372574 ]

Doug Cutting commented on NUTCH-240:

First, I hope my critical remarks were not taken personally. I am thankful for this and all of your contributions.

> Initially, I did as you suggest, i.e. I created a method to calculate one float value for the purpose of selecting topN. However, I wanted to avoid changing CrawlDatum.compareTo - if we put ScoringFilters there, it would be a big performance hit. OTOH, if we overwrite the primitive float in CrawlDatum.score it seemed to me we should store its earlier value, and then possibly restore - as the value for selecting topN may have nothing to do with the real score.

In Generate.java, can't we just change the key type in the first pass to be a FloatWritable holding the score, and the value to be CrawlDatum,Url? Then we'd never alter the CrawlDatum and there'd be no need to restore it.

> passScoreBeforeParsing/passScoreAfterParsing: again, I agree it looks strange, but that's what we do at the moment, I just extracted it into an interface. I'd love to skip this altogether, if there is a way.

I think we should spend a little more time thinking about how to make this a nice API before we start having folks implement it. Once an interface is added, it's much harder to change. I don't have much time to spend on this today, but might next week.

Scoring API: extension point, scoring filters and an OPIC plugin
    Key: NUTCH-240
    URL: http://issues.apache.org/jira/browse/NUTCH-240
    Project: Nutch
    Type: Improvement
    Versions: 0.8-dev
    Reporter: Andrzej Bialecki
    Attachments: patch.txt
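The suggestion in the comment above, keying the first generator pass by a float sort value and carrying the CrawlDatum and URL as the value, can be sketched outside of MapReduce as follows. This is only an illustration under stated assumptions: DatumSketch, SelectorEntrySketch and TopNSketch are hypothetical names, a plain in-memory sort stands in for the framework's sort by key, and the sort keys are supplied by the caller rather than computed by any real scoring plugin.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for CrawlDatum: the stored score is final and never touched.
final class DatumSketch {
    final float score;
    DatumSketch(float score) { this.score = score; }
}

// Hypothetical (key, value) pair: sortKey plays the role of the float key,
// url and datum together play the role of the value.
final class SelectorEntrySketch {
    final float sortKey;
    final String url;
    final DatumSketch datum;
    SelectorEntrySketch(float sortKey, String url, DatumSketch datum) {
        this.sortKey = sortKey;
        this.url = url;
        this.datum = datum;
    }
}

class TopNSketch {
    // Order entries by descending sort key and keep the first topN URLs.
    // The datum objects are only carried along, never modified.
    static List<String> selectTopN(List<SelectorEntrySketch> entries, int topN) {
        List<SelectorEntrySketch> sorted = new ArrayList<>(entries);
        sorted.sort(Comparator.comparingDouble((SelectorEntrySketch e) -> e.sortKey).reversed());
        List<String> urls = new ArrayList<>();
        for (SelectorEntrySketch e : sorted.subList(0, Math.min(topN, sorted.size()))) {
            urls.add(e.url);
        }
        return urls;
    }

    public static void main(String[] args) {
        List<SelectorEntrySketch> entries = new ArrayList<>();
        // The sort key may differ from the stored score; here the values are arbitrary.
        entries.add(new SelectorEntrySketch(0.2f, "http://example.com/a", new DatumSketch(1.0f)));
        entries.add(new SelectorEntrySketch(0.9f, "http://example.com/b", new DatumSketch(1.0f)));
        entries.add(new SelectorEntrySketch(0.5f, "http://example.com/c", new DatumSketch(1.0f)));
        System.out.println(selectTopN(entries, 2));
    }
}
```

Since the selection value lives in the key, there is nothing to store and restore afterwards, which is the point of the suggestion.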
[jira] Commented: (NUTCH-240) Scoring API: extension point, scoring filters and an OPIC plugin
[ http://issues.apache.org/jira/browse/NUTCH-240?page=comments#action_12372341 ]

Doug Cutting commented on NUTCH-240:

The generator store/restore score stuff seems ugly. And it is not used by OPIC. Could we instead have a method that computes and returns a score to be used by the generator? Then it is up to the generator to use this w/o modifying the CrawlDatum.

The passScoreBeforeParsing/passScoreAfterParsing/distributeScoreToOutlink protocol also seems awkward, although I don't yet have a suggestion for how to improve it.

Scoring API: extension point, scoring filters and an OPIC plugin
    Key: NUTCH-240
    URL: http://issues.apache.org/jira/browse/NUTCH-240
    Project: Nutch
    Type: Improvement
    Versions: 0.8-dev
    Reporter: Andrzej Bialecki
    Attachments: patch.txt
[jira] Commented: (NUTCH-240) Scoring API: extension point, scoring filters and an OPIC plugin
[ http://issues.apache.org/jira/browse/NUTCH-240?page=comments#action_12372379 ]

Andrzej Bialecki commented on NUTCH-240:

Yes, one of the reasons I wanted to discuss these patches is that they uncovered some of the underlying ugliness... ;)

The reason for generator store/restore is that scoring plugins could take into account many more variables than just the score recorded in CrawlDatum.score. They could also have different strategies for prioritizing pages to be included in topN. So, it's true this is not currently used by OPIC but I think without this it's not possible for plugins to affect the choice of topN.

Initially, I did as you suggest, i.e. I created a method to calculate one float value for the purpose of selecting topN. However, I wanted to avoid changing CrawlDatum.compareTo - if we put ScoringFilters there, it would be a big performance hit. OTOH, if we overwrite the primitive float in CrawlDatum.score it seemed to me we should store its earlier value, and then possibly restore - as the value for selecting topN may have nothing to do with the real score.

passScoreBeforeParsing/passScoreAfterParsing: again, I agree it looks strange, but that's what we do at the moment, I just extracted it into an interface. I'd love to skip this altogether, if there is a way.

Scoring API: extension point, scoring filters and an OPIC plugin
    Key: NUTCH-240
    URL: http://issues.apache.org/jira/browse/NUTCH-240
    Project: Nutch
    Type: Improvement
    Versions: 0.8-dev
    Reporter: Andrzej Bialecki
    Attachments: patch.txt