[jira] Commented: (NUTCH-240) Scoring API: extension point, scoring filters and an OPIC plugin

2006-04-30 Thread Andrzej Bialecki (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-240?page=comments#action_12377200 ] 

Andrzej Bialecki  commented on NUTCH-240:
-

If there are no further suggestions or objections, I'd like to move forward on 
this patch. I know the passScore* methods are a bit awkward, but that's what we 
already do anyway; it is just hidden under the carpet :)

If folks have ideas on how to improve this part, please speak up.

 Scoring API: extension point, scoring filters and an OPIC plugin
 

  Key: NUTCH-240
  URL: http://issues.apache.org/jira/browse/NUTCH-240
  Project: Nutch
 Type: Improvement

 Versions: 0.8-dev
 Reporter: Andrzej Bialecki 
 Assignee: Andrzej Bialecki 
  Attachments: Generator.patch.txt, patch.txt, patch1.txt, patch2.txt

 This patch refactors all places where Nutch manipulates page scores, into a 
 plugin-based API. Using this API it's possible to implement different scoring 
 algorithms. It is also much easier to understand how scoring works.
 Multiple scoring plugins can be run in sequence, in a manner similar to 
 URLFilters.
 Also included is an OPICScoringFilter plugin, which contains the current 
 implementation of the scoring algorithm. Together with the scoring API it 
 provides fully backward-compatible scoring.
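
 To make "run in sequence, in a manner similar to URLFilters" concrete, here
 is a minimal Java sketch of chained scoring steps. The names ScoringChain and
 ScoreStep, and the two example steps, are hypothetical and invented for this
 illustration; they are not the interfaces defined in the attached patches.

```java
import java.util.Arrays;
import java.util.List;

/** Sketch only: hypothetical types, not the API from the NUTCH-240 patches. */
final class ScoringChain {

    /** One scoring step: receives the URL and the score so far, returns the adjusted score. */
    interface ScoreStep {
        float adjust(String url, float score);
    }

    private final List<ScoreStep> steps;

    ScoringChain(List<ScoreStep> steps) {
        this.steps = steps;
    }

    /** Runs every step in order, feeding each step the previous step's result. */
    float run(String url, float initialScore) {
        float score = initialScore;
        for (ScoreStep step : steps) {
            score = step.adjust(url, score);
        }
        return score;
    }

    public static void main(String[] args) {
        // Two invented steps: halve every score, then boost URLs ending in "/".
        ScoreStep halve = (url, score) -> score * 0.5f;
        ScoreStep boostDirs = (url, score) -> url.endsWith("/") ? score + 1.0f : score;
        ScoringChain chain = new ScoringChain(Arrays.asList(halve, boostDirs));
        System.out.println(chain.run("http://example.org/", 2.0f));
        System.out.println(chain.run("http://example.org/a.html", 2.0f));
    }
}
```

 The order of the steps matters here, since each one sees the output of the
 one before it; a real plugin chain would presumably take that order from
 configuration.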

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Commented: (NUTCH-240) Scoring API: extension point, scoring filters and an OPIC plugin

2006-04-05 Thread Shawn Gervais (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-240?page=comments#action_12373256 ] 

Shawn Gervais commented on NUTCH-240:
-

This change seems to have caused an error to be thrown:

060405 034711 Generator: Partitioning selected urls by host, for politeness.
Exception in thread "main" java.lang.RuntimeException: class org.apache.nutch.crawl.Generator$SelectorInverseMapper not org.apache.hadoop.mapred.Mapper
        at org.apache.hadoop.conf.Configuration.setClass(Configuration.java:262)
        at org.apache.hadoop.mapred.JobConf.setMapperClass(JobConf.java:249)
        at org.apache.nutch.crawl.Generator.generate(Generator.java:263)
        at org.apache.nutch.crawl.Generator.main(Generator.java:317)

Just FYI.
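
The message has the shape "class X not Y", which suggests a type check made
when the mapper class is registered with the job. The sketch below shows that
kind of check in plain Java. It is not Hadoop's code: Mapper, GoodMapper,
BadMapper and checkedClassName are hypothetical stand-ins invented for this
illustration, and it does not claim to show what was actually wrong in the
patch.

```java
/** Sketch only: hypothetical stand-ins, not Hadoop's Configuration or Mapper. */
final class MapperCheckSketch {

    /** Stand-in for the interface that a job's mapper class must implement. */
    interface Mapper { }

    /** Implements the required interface, so it passes the check. */
    static class GoodMapper implements Mapper { }

    /** Does not implement Mapper, so it is rejected. */
    static class BadMapper { }

    /** Returns the class name if theClass implements xface; otherwise fails fast. */
    static String checkedClassName(Class<?> theClass, Class<?> xface) {
        if (!xface.isAssignableFrom(theClass)) {
            throw new RuntimeException(theClass + " not " + xface.getName());
        }
        return theClass.getName();
    }

    public static void main(String[] args) {
        System.out.println(checkedClassName(GoodMapper.class, Mapper.class));
        try {
            checkedClassName(BadMapper.class, Mapper.class);
        } catch (RuntimeException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
    }
}
```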

 Scoring API: extension point, scoring filters and an OPIC plugin
 

  Key: NUTCH-240
  URL: http://issues.apache.org/jira/browse/NUTCH-240
  Project: Nutch
 Type: Improvement

 Versions: 0.8-dev
 Reporter: Andrzej Bialecki 
 Assignee: Andrzej Bialecki 
  Attachments: Generator.patch.txt, patch.txt, patch1.txt




[jira] Commented: (NUTCH-240) Scoring API: extension point, scoring filters and an OPIC plugin

2006-04-05 Thread Andrzej Bialecki (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-240?page=comments#action_12373264 ] 

Andrzej Bialecki  commented on NUTCH-240:
-

Oops, sorry, that was a last-minute change ... I've fixed it now, thanks for 
spotting this.

 Scoring API: extension point, scoring filters and an OPIC plugin
 

  Key: NUTCH-240
  URL: http://issues.apache.org/jira/browse/NUTCH-240
  Project: Nutch
 Type: Improvement

 Versions: 0.8-dev
 Reporter: Andrzej Bialecki 
 Assignee: Andrzej Bialecki 
  Attachments: Generator.patch.txt, patch.txt, patch1.txt




[jira] Commented: (NUTCH-240) Scoring API: extension point, scoring filters and an OPIC plugin

2006-04-03 Thread Jerome Charron (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-240?page=comments#action_12372948 ] 

Jerome Charron commented on NUTCH-240:
--

+1

 Scoring API: extension point, scoring filters and an OPIC plugin
 

  Key: NUTCH-240
  URL: http://issues.apache.org/jira/browse/NUTCH-240
  Project: Nutch
 Type: Improvement

 Versions: 0.8-dev
 Reporter: Andrzej Bialecki 
  Attachments: Generator.patch.txt, patch.txt




[jira] Commented: (NUTCH-240) Scoring API: extension point, scoring filters and an OPIC plugin

2006-04-03 Thread Doug Cutting (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-240?page=comments#action_12372979 ] 

Doug Cutting commented on NUTCH-240:


+1 for committing Generator.patch.txt now.

0 for committing the rest until I've had more time to think about it.  I'm not 
against it, but, at a glance, I'm still hopeful we can do better.

 Scoring API: extension point, scoring filters and an OPIC plugin
 

  Key: NUTCH-240
  URL: http://issues.apache.org/jira/browse/NUTCH-240
  Project: Nutch
 Type: Improvement

 Versions: 0.8-dev
 Reporter: Andrzej Bialecki 
 Assignee: Andrzej Bialecki 
  Attachments: Generator.patch.txt, patch.txt, patch1.txt




[jira] Commented: (NUTCH-240) Scoring API: extension point, scoring filters and an OPIC plugin

2006-04-03 Thread Doug Cutting (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-240?page=comments#action_12372981 ] 

Doug Cutting commented on NUTCH-240:


Also, note that we can now extend Hadoop's new MapReduceBase to implement 
configure() and close() for many Mappers and Reducers, including the ones in 
this patch.
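
The pattern being referred to is a base class that supplies empty lifecycle
methods, so that each Mapper or Reducer overrides only the ones it needs. A
minimal sketch of that pattern follows, using local stand-ins rather than
Hadoop's own classes: LifecycleBase, ScalingMapper and the "scale.factor"
property are hypothetical names invented for this illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch only: local stand-ins that mimic the pattern, not Hadoop's classes. */
final class LifecycleBaseSketch {

    /** Hypothetical base class: empty configure() and close(), overridden as needed. */
    abstract static class LifecycleBase {
        public void configure(Map<String, String> conf) { /* nothing to set up by default */ }
        public void close() { /* nothing to release by default */ }
    }

    /** Needs one setting but no cleanup, so it overrides configure() only. */
    static class ScalingMapper extends LifecycleBase {
        private float factor = 1.0f;

        @Override
        public void configure(Map<String, String> conf) {
            // "scale.factor" is an invented property name.
            factor = Float.parseFloat(conf.getOrDefault("scale.factor", "1.0"));
        }

        float map(float score) {
            return score * factor;
        }
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put("scale.factor", "2.0");
        ScalingMapper mapper = new ScalingMapper();
        mapper.configure(conf);
        System.out.println(mapper.map(1.5f));
        mapper.close(); // inherited no-op
    }
}
```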

 Scoring API: extension point, scoring filters and an OPIC plugin
 

  Key: NUTCH-240
  URL: http://issues.apache.org/jira/browse/NUTCH-240
  Project: Nutch
 Type: Improvement

 Versions: 0.8-dev
 Reporter: Andrzej Bialecki 
 Assignee: Andrzej Bialecki 
  Attachments: Generator.patch.txt, patch.txt, patch1.txt




[jira] Commented: (NUTCH-240) Scoring API: extension point, scoring filters and an OPIC plugin

2006-03-30 Thread Doug Cutting (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-240?page=comments#action_12372574 ] 

Doug Cutting commented on NUTCH-240:


First, I hope my critical remarks were not taken personally.  I am thankful for 
this and all of your contributions.

 Initially, I did as you suggest, i.e. I created a method to calculate one 
 float value for the purpose of selecting topN. However, I wanted to avoid 
 changing CrawlDatum.compareTo - if we put ScoringFilters there, it would be a 
 big performance hit. OTOH, if we overwrite the primitive float in 
 CrawlDatum.score it seemed to me we should store its earlier value, and then 
 possibly restore it - as the value for selecting topN may have nothing to do with 
 the real score. 

In Generator.java, can't we just change the key type in the first pass to be a 
FloatWritable holding the score, and the value to be a (CrawlDatum, Url) pair?  
Then we'd never alter the CrawlDatum and there'd be no need to restore it.
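
A plain-Java sketch of this idea follows: compute a sort key once per entry, 
carry it next to the unmodified entry, and select the top N by that key. It 
uses no Hadoop or Nutch classes; Entry, Keyed, SortKeyFunction and selectTopN 
are hypothetical names invented for this illustration, and an in-memory sort 
stands in for the MapReduce pass.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Sketch only: plain-Java stand-ins, not Nutch's Generator or Hadoop's FloatWritable. */
final class TopNByKeySketch {

    /** Stand-in for a crawl entry; immutable, so selection cannot alter its score. */
    static final class Entry {
        final String url;
        final float storedScore;

        Entry(String url, float storedScore) {
            this.url = url;
            this.storedScore = storedScore;
        }
    }

    /** A (sort key, value) pair, as a first pass might emit it. */
    static final class Keyed {
        final float sortKey;
        final Entry value;

        Keyed(float sortKey, Entry value) {
            this.sortKey = sortKey;
            this.value = value;
        }
    }

    /** Hypothetical policy that derives a sort key without touching the entry. */
    interface SortKeyFunction {
        float sortKey(Entry entry);
    }

    /** Computes each key once, sorts by key descending, and returns the first topN entries. */
    static List<Entry> selectTopN(List<Entry> entries, SortKeyFunction keys, int topN) {
        List<Keyed> keyed = new ArrayList<>();
        for (Entry entry : entries) {
            keyed.add(new Keyed(keys.sortKey(entry), entry));
        }
        keyed.sort((a, b) -> Float.compare(b.sortKey, a.sortKey));
        List<Entry> selected = new ArrayList<>();
        for (int i = 0; i < Math.min(topN, keyed.size()); i++) {
            selected.add(keyed.get(i).value);
        }
        return selected;
    }

    public static void main(String[] args) {
        List<Entry> entries = Arrays.asList(
                new Entry("http://a.example/", 0.2f),
                new Entry("http://b.example/", 0.9f),
                new Entry("http://c.example/", 0.5f));
        // The sort key here is simply the stored score; a plugin could use anything else.
        for (Entry entry : selectTopN(entries, e -> e.storedScore, 2)) {
            System.out.println(entry.url + " " + entry.storedScore);
        }
    }
}
```

Because the key is computed once per entry before sorting, the comparison 
itself stays a cheap float compare, and the stored score never has to be 
overwritten or restored.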

 passScoreBeforeParsing/passScoreAfterParsing: again, I agree it looks 
 strange, but that's what we do at the moment, I just extracted it into an 
 interface. I'd love to skip this altogether, if there is a way.

I think we should spend a little more time thinking about how to make this a 
nice API before we start having folks implement it.  Once an interface is 
added, it's much harder to change.  I don't have much time to spend on this 
today, but might next week.

 Scoring API: extension point, scoring filters and an OPIC plugin
 

  Key: NUTCH-240
  URL: http://issues.apache.org/jira/browse/NUTCH-240
  Project: Nutch
 Type: Improvement
 Versions: 0.8-dev
 Reporter: Andrzej Bialecki 
  Attachments: patch.txt




[jira] Commented: (NUTCH-240) Scoring API: extension point, scoring filters and an OPIC plugin

2006-03-29 Thread Doug Cutting (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-240?page=comments#action_12372341 ] 

Doug Cutting commented on NUTCH-240:


The generator store/restore score stuff seems ugly.  And it is not used by 
OPIC.  Could we instead have a method that computes and returns a score to be 
used by the generator?  Then it is up to the generator to use this without 
modifying the CrawlDatum.

The passScoreBeforeParsing/passScoreAfterParsing/distributeScoreToOutlink 
protocol also seems awkward, although I don't yet have a suggestion for how to 
improve it.

 Scoring API: extension point, scoring filters and an OPIC plugin
 

  Key: NUTCH-240
  URL: http://issues.apache.org/jira/browse/NUTCH-240
  Project: Nutch
 Type: Improvement
 Versions: 0.8-dev
 Reporter: Andrzej Bialecki 
  Attachments: patch.txt




[jira] Commented: (NUTCH-240) Scoring API: extension point, scoring filters and an OPIC plugin

2006-03-29 Thread Andrzej Bialecki (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-240?page=comments#action_12372379 ] 

Andrzej Bialecki  commented on NUTCH-240:
-

Yes, one of the reasons I wanted to discuss these patches is that they 
uncovered some of the underlying ugliness... ;)

The reason for generator store/restore is that scoring plugins could take into 
account many more variables than just the score recorded in CrawlDatum.score. 
They could also have different strategies for prioritizing pages to be included 
in topN.

So, it's true this is not currently used by OPIC but I think without this it's 
not possible for plugins to affect the choice of topN.

Initially, I did as you suggest, i.e. I created a method to calculate one float 
value for the purpose of selecting topN. However, I wanted to avoid changing 
CrawlDatum.compareTo - if we put ScoringFilters there, it would be a big 
performance hit. OTOH, if we overwrite the primitive float in CrawlDatum.score 
it seemed to me we should store its earlier value, and then possibly restore it 
- as the value for selecting topN may have nothing to do with the real score.

passScoreBeforeParsing/passScoreAfterParsing: again, I agree it looks strange, 
but that's what we do at the moment, I just extracted it into an interface. I'd 
love to skip this altogether, if there is a way.

 Scoring API: extension point, scoring filters and an OPIC plugin
 

  Key: NUTCH-240
  URL: http://issues.apache.org/jira/browse/NUTCH-240
  Project: Nutch
 Type: Improvement
 Versions: 0.8-dev
 Reporter: Andrzej Bialecki 
  Attachments: patch.txt
