[Nutch-dev] [jira] Commented: (NUTCH-518) Fix OpicScoringFilter to respect scoring filter chaining

2007-07-23 Thread JIRA

[ 
https://issues.apache.org/jira/browse/NUTCH-518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12514592
 ] 

Doğacan Güney commented on NUTCH-518:
-

> My objection to the original patch was specifically what the 
> OPICScoringFilter should do in this case, and not what any 
> ScoringFilter should be able to do.

Besides opic, two common use cases seem to be:

1) Boosters: A booster checks for a specific pattern and, if the pattern exists, 
adds a small boost to the original score. 

2) Sinks: These are commonly used to restrict crawls to specific domains. They 
"pull" a page's score to zero so that the page is never fetched as 
long as there are other pages to fetch.

If opic adds the other filter's score: 

1st use case) the booster returns 0 if it doesn't want to boost, or a positive 
value otherwise.

2nd use case) the sink returns a negative number to sink, or 0 otherwise.

If opic multiplies:

1st use case) the booster returns 1 if it doesn't want to boost, or a value >1 
otherwise.

2nd use case) the sink returns 0 to sink, or 1 otherwise.

As far as I'm concerned, it doesn't matter whether opic multiplies or adds. Other 
scoring filters will have to change their behaviour to work with opic in either 
case anyway. 

(FWIW, I think multiplication is a tiny bit more elegant. I like the idea of 
returning 0 to sink better than returning negative numbers.)
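
To make the two conventions concrete, here is a minimal Java sketch. It is a 
toy model and not the Nutch ScoringFilter API: the class ChainingSketch, the 
methods chainBySum and chainByProduct, and all numeric values are made up for 
illustration.

```java
/** Toy model of score chaining; names and values are hypothetical. */
final class ChainingSketch {

    /** Additive chaining: 0 is the neutral value, a sink must return a negative number. */
    static float chainBySum(float opicScore, float... contributions) {
        float score = opicScore;
        for (float c : contributions) {
            score += c;
        }
        return score;
    }

    /** Multiplicative chaining: 1 is the neutral value, a sink returns 0. */
    static float chainByProduct(float opicScore, float... contributions) {
        float score = opicScore;
        for (float c : contributions) {
            score *= c;
        }
        return score;
    }

    public static void main(String[] args) {
        float opic = 2.0f; // made-up opic score

        // Booster fires, sink stays neutral.
        System.out.println(chainBySum(opic, 0.5f, 0.0f));      // 2.5
        System.out.println(chainByProduct(opic, 1.25f, 1.0f)); // 2.5

        // Booster stays neutral, sink fires.
        System.out.println(chainBySum(opic, 0.0f, -2.0f));     // 0.0
        System.out.println(chainByProduct(opic, 1.0f, 0.0f));  // 0.0
    }
}
```

In this sketch the additive sink has to pick a penalty at least as large as 
the score it wants to cancel, while the multiplicative sink can return 0 
regardless of the incoming score.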

> Fix OpicScoringFilter to respect scoring filter chaining
> 
>
> Key: NUTCH-518
> URL: https://issues.apache.org/jira/browse/NUTCH-518
> Project: Nutch
>  Issue Type: Bug
>  Components: indexer
>Affects Versions: 1.0.0
>Reporter: Enis Soztutar
>Assignee: Doğacan Güney
> Fix For: 1.0.0
>
> Attachments: opicScoring.chain.patch
>
>
> Opic Scoring returns the score that it calculates, rather than returning 
> previous_score * calculated_score. This prevents using another scoring filter 
> along with Opic scoring. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
This SF.net email is sponsored by: Splunk Inc.
Still grepping through log files to find problems?  Stop.
Now Search log events and configuration files using AJAX and a browser.
Download your FREE copy of Splunk now >>  http://get.splunk.com/
___
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers


[Nutch-dev] [jira] Commented: (NUTCH-518) Fix OpicScoringFilter to respect scoring filter chaining

2007-07-19 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12513853
 ] 

Andrzej Bialecki  commented on NUTCH-518:
-

IMHO this change is not helpful. It takes away too much control from 
ScoringFilter implementations, and replaces it with a fixed set of options for 
score calculation, which sort of defeats the purpose of this API.

My objection to the original patch was specifically what the OPICScoringFilter 
should do in this case, and not what any ScoringFilter should be able to do.



[Nutch-dev] [jira] Commented: (NUTCH-518) Fix OpicScoringFilter to respect scoring filter chaining

2007-07-18 Thread Enis Soztutar (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12513826
 ] 

Enis Soztutar commented on NUTCH-518:
-

> I think removing initial score arguments and merging scores in 
> ScoringFilters.java is a good idea overall
+1 for this one. The final score should be calculated centrally. We could even 
implement more than one way to calculate the score. Roughly: 

ScoringFilters.getMultipliedScore()
ScoringFilters.getSummedScore() 
ScoringFilters.getGeometricMeanScore()
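
A rough sketch of what such central combiners might look like follows. Only 
the three method names come from the list above; the class ScoreCombiners, 
the varargs signatures, the assumption that every filter returns a 
non-negative score, and the handling of an empty chain are all hypothetical.

```java
/** Hypothetical central combiners; not part of the actual ScoringFilters class. */
final class ScoreCombiners {

    /** Product of all filter scores; an empty chain yields the neutral value 1. */
    static float getMultipliedScore(float... scores) {
        float product = 1.0f;
        for (float s : scores) {
            product *= s;
        }
        return product;
    }

    /** Sum of all filter scores; an empty chain yields the neutral value 0. */
    static float getSummedScore(float... scores) {
        float sum = 0.0f;
        for (float s : scores) {
            sum += s;
        }
        return sum;
    }

    /** n-th root of the product of n scores (scores assumed non-negative). */
    static float getGeometricMeanScore(float... scores) {
        if (scores.length == 0) {
            return 1.0f; // assumption: treat an empty chain as neutral
        }
        double product = 1.0;
        for (float s : scores) {
            product *= s;
        }
        return (float) Math.pow(product, 1.0 / scores.length);
    }

    public static void main(String[] args) {
        System.out.println(getMultipliedScore(4.0f, 1.0f));
        System.out.println(getSummedScore(4.0f, 1.0f));
        System.out.println(getGeometricMeanScore(4.0f, 1.0f));
    }
}
```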




[Nutch-dev] [jira] Commented: (NUTCH-518) Fix OpicScoringFilter to respect scoring filter chaining

2007-07-18 Thread JIRA

[ 
https://issues.apache.org/jira/browse/NUTCH-518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12513823
 ] 

Doğacan Güney commented on NUTCH-518:
-

Btw, I think removing initial score arguments and merging scores in 
ScoringFilters.java is a good idea overall. It removes some control from 
ScoringFilter-s, but it also eliminates cases where one ScoringFilter 
multiplies its score by the initial score, another adds to it, and a third 
ignores it.



[Nutch-dev] [jira] Commented: (NUTCH-518) Fix OpicScoringFilter to respect scoring filter chaining

2007-07-18 Thread JIRA

[ 
https://issues.apache.org/jira/browse/NUTCH-518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12513821
 ] 

Doğacan Güney commented on NUTCH-518:
-

Here is another alternative. I am not suggesting that we use it, just putting 
it on the table:

* Remove initial score argument from indexerScore and generatorSortValue.
* Change ScoringFilters.java to collect scores from different ScoringFilter-s.
* Calculate their geometric mean.

This approach is far more aggressive. It is like a logical AND. With a geometric 
mean, a page is 'important' pretty much only if *all* scoring filters decide 
that it is important. I really like this approach, but it won't work for people 
who want to give a high score to pages with certain content even if the page 
itself has no inlinks (for this case, addition would work very well).
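
As a worked illustration of the AND-like behaviour (the numbers are invented, 
and each filter is assumed to return a non-negative score), take two filters, 
a content filter and a link-based filter, scoring two pages:

```latex
% s_1, ..., s_n: scores returned by the n scoring filters (assumed non-negative)
\[
  G(s_1, \dots, s_n) = \Bigl( \prod_{i=1}^{n} s_i \Bigr)^{1/n}
\]

% Page A: strong content score (4), almost no inlinks (0.01)
% Page B: moderate score from both filters (1 and 1)
\begin{align*}
  \text{Page A:} \quad & G(4,\, 0.01) = \sqrt{0.04} = 0.2, & 4 + 0.01 &= 4.01 \\
  \text{Page B:} \quad & G(1,\, 1) = \sqrt{1} = 1,         & 1 + 1    &= 2
\end{align*}
```

Under the geometric mean, page B outranks page A because A fails one of the 
two filters; under addition the order is reversed and the content score alone 
carries page A.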





[Nutch-dev] [jira] Commented: (NUTCH-518) Fix OpicScoringFilter to respect scoring filter chaining

2007-07-18 Thread Enis Soztutar (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12513819
 ] 

Enis Soztutar commented on NUTCH-518:
-

Since there is no ordering among scoring filters, if we special-case zero 
in OpicScoring, it might lead to nondeterministic behaviour. Let's say, 
for example, that the code in OpicScoring is changed to: 

public float indexerScore(Text url, Document doc, CrawlDatum dbDatum, 
    CrawlDatum fetchDatum, Parse parse, Inlinks inlinks, float initScore) {
  if (initScore != 0) {
    return (float) Math.pow(dbDatum.getScore(), scorePower) * initScore;
  } else {
    // do something nasty
  }
}

Then there will be a big difference depending on whether scoring-opic is run 
before or after scoring-foo. 
As far as I can tell from the messages on the mailing lists, scoring filters are 
used for restricting the crawl to topics, so zero-handling might break 
topic-specific crawls. So my vote is to keep the current implementation. 
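
The order dependence can be reproduced with a small self-contained sketch. 
Everything here is a simplification and an assumption: the Filter interface 
stands in for the real ScoringFilter API, OPIC_SCORE is a made-up value, the 
"nasty" zero branch is assumed to fall back to the opic score, and FOO_SINK 
plays the role of a hypothetical scoring-foo that sinks an off-topic page.

```java
/** Toy reproduction of order-dependent chaining; not the Nutch API. */
final class OrderDependenceSketch {

    /** Simplified stand-in for one scoring filter in the chain. */
    interface Filter {
        float indexerScore(float initScore);
    }

    /** Made-up opic score of the page being indexed. */
    static final float OPIC_SCORE = 2.0f;

    /** Multiplies by the incoming score, but special-cases zero (assumed behaviour). */
    static final Filter OPIC_SPECIAL_CASING_ZERO =
        initScore -> initScore != 0 ? OPIC_SCORE * initScore : OPIC_SCORE;

    /** Hypothetical scoring-foo: the page is off-topic, so it sinks the score to 0. */
    static final Filter FOO_SINK = initScore -> 0.0f;

    /** Runs the filters in the given order, feeding each result to the next filter. */
    static float runChain(float initScore, Filter... chain) {
        float score = initScore;
        for (Filter f : chain) {
            score = f.indexerScore(score);
        }
        return score;
    }

    public static void main(String[] args) {
        // foo first: opic sees 0 and overrides the sink.
        System.out.println(runChain(1.0f, FOO_SINK, OPIC_SPECIAL_CASING_ZERO)); // 2.0
        // opic first: the sink has the last word.
        System.out.println(runChain(1.0f, OPIC_SPECIAL_CASING_ZERO, FOO_SINK)); // 0.0
    }
}
```

With plain multiplication and no special case for zero, both orders give 0 in 
this sketch, which is the property the current implementation preserves.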



[Nutch-dev] [jira] Commented: (NUTCH-518) Fix OpicScoringFilter to respect scoring filter chaining

2007-07-18 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12513807
 ] 

Hudson commented on NUTCH-518:
--

Integrated in Nutch-Nightly #154 (See 
[http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/154/])



[Nutch-dev] [jira] Commented: (NUTCH-518) Fix OpicScoringFilter to respect scoring filter chaining

2007-07-18 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12513704
 ] 

Andrzej Bialecki  commented on NUTCH-518:
-

Right, I was too quick too ... ;) Leave it in for now. Let's agree first on 
what is the right way to do this.



[Nutch-dev] [jira] Commented: (NUTCH-518) Fix OpicScoringFilter to respect scoring filter chaining

2007-07-18 Thread JIRA

[ 
https://issues.apache.org/jira/browse/NUTCH-518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12513679
 ] 

Doğacan Güney commented on NUTCH-518:
-

Sure. I thought you were OK with it since you mentioned you were going to commit 
it in NUTCH-439. Should I revert the commit or leave it in for now?
