[Nutch-dev] [jira] Commented: (NUTCH-518) Fix OpicScoringFilter to respect scoring filter chaining
[ https://issues.apache.org/jira/browse/NUTCH-518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12514592 ]

Doğacan Güney commented on NUTCH-518:
-------------------------------------

> My objection to the original patch was specifically what the
> OPICScoringFilter should do in this case, and not what any
> ScoringFilter should be able to do.

Two common use cases seem to be (besides opic):

1) Boosters: Boosters check for a specific pattern and, if the pattern exists, add a small boost to the original score.

2) Sinks: These are commonly used to restrict crawls to specific domains. They are used to "pull" the page's score to zero so that the page is never fetched as long as there are other pages to fetch.

If opic adds:
1st use case) the booster returns 0 if it doesn't want to boost, or a positive value.
2nd use case) the sink returns a negative number to sink, or 0.

If opic multiplies:
1st use case) the booster returns 1 if it doesn't want to boost, or >1.
2nd use case) the sink returns 0 to sink, or 1.

As far as I'm concerned, it doesn't matter if opic multiplies or adds. Other scoring filters will have to change their behaviour to work with opic in both cases anyway.

(FWIW, I think multiplication is a tiny bit more elegant. I like the idea of returning 0 to sink better than returning negative numbers.)

> Fix OpicScoringFilter to respect scoring filter chaining
>
>                 Key: NUTCH-518
>                 URL: https://issues.apache.org/jira/browse/NUTCH-518
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>    Affects Versions: 1.0.0
>            Reporter: Enis Soztutar
>            Assignee: Doğacan Güney
>             Fix For: 1.0.0
>
>         Attachments: opicScoring.chain.patch
>
> Opic Scoring returns the score that it calculates, rather than returning
> previous_score * calculated_score. This prevents using another scoring filter
> along with Opic scoring.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

___
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers
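The two conventions Doğacan describes above (adding versus multiplying) can be compared side by side with a small sketch. Everything here is hypothetical: the class `ChainConventions`, its methods, and the sample numbers are illustrations only and are not part of Nutch.

```java
import java.util.List;

/** Illustrative only: this class is not part of Nutch. */
public class ChainConventions {

    /** Additive chaining: each filter returns an offset; 0 means "no opinion". */
    static float addChain(float opicScore, List<Float> offsets) {
        float score = opicScore;
        for (float offset : offsets) {
            score += offset;
        }
        return score;
    }

    /** Multiplicative chaining: each filter returns a factor; 1 means "no opinion". */
    static float multiplyChain(float opicScore, List<Float> factors) {
        float score = opicScore;
        for (float factor : factors) {
            score *= factor;
        }
        return score;
    }

    public static void main(String[] args) {
        float opic = 0.5f; // made-up opic score

        // Booster: a positive offset when adding, a factor > 1 when multiplying.
        System.out.println("boost, add:      " + addChain(opic, List.of(0.25f)));
        System.out.println("boost, multiply: " + multiplyChain(opic, List.of(1.5f)));

        // Sink: a negative offset when adding, a factor of 0 when multiplying.
        System.out.println("sink, add:       " + addChain(opic, List.of(-0.5f)));
        System.out.println("sink, multiply:  " + multiplyChain(opic, List.of(0.0f)));
    }
}
```

One thing the sketch makes visible: in the additive case the sink has to pick an offset at least as large as the incoming score to pull it down to zero, while in the multiplicative case a factor of 0 works whatever the incoming score is.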
[Nutch-dev] [jira] Commented: (NUTCH-518) Fix OpicScoringFilter to respect scoring filter chaining
[ https://issues.apache.org/jira/browse/NUTCH-518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12513853 ]

Andrzej Bialecki commented on NUTCH-518:
----------------------------------------

IMHO this change is not helpful. It takes away too much control from ScoringFilter implementations, and replaces it with a fixed set of options for score calculation, which sort of defeats the purpose of this API.

My objection to the original patch was specifically what the OPICScoringFilter should do in this case, and not what any ScoringFilter should be able to do.
[Nutch-dev] [jira] Commented: (NUTCH-518) Fix OpicScoringFilter to respect scoring filter chaining
[ https://issues.apache.org/jira/browse/NUTCH-518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12513826 ]

Enis Soztutar commented on NUTCH-518:
-------------------------------------

> I think removing initial score arguments and merging scores in
> ScoringFilters.java is a good idea overall

+1 for this one. The final score should be calculated centrally. Maybe we could implement more than one way to calculate the score. Roughly:

ScoringFilters.getMultipliedScore()
ScoringFilters.getSummedScore()
ScoringFilters.getGeometricMeanScore()
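Enis's three proposed combiners could look roughly like the sketch below. Only the idea comes from the comment; the class name `ScoreCombiners`, the method signatures, and the choice to pass scores as a `float[]` are assumptions made for illustration and do not reflect the actual `ScoringFilters` class.

```java
/** Hypothetical sketch; not the actual Nutch ScoringFilters class. */
public class ScoreCombiners {

    /** Product of all filter scores; a single 0 zeroes the result. */
    static float multipliedScore(float[] scores) {
        float result = 1.0f;
        for (float s : scores) {
            result *= s;
        }
        return result;
    }

    /** Sum of all filter scores. */
    static float summedScore(float[] scores) {
        float result = 0.0f;
        for (float s : scores) {
            result += s;
        }
        return result;
    }

    /** n-th root of the product; requires at least one score, none negative. */
    static float geometricMeanScore(float[] scores) {
        if (scores.length == 0) {
            throw new IllegalArgumentException("at least one score is required");
        }
        double logSum = 0.0;
        for (float s : scores) {
            if (s < 0) {
                throw new IllegalArgumentException("scores must be non-negative");
            }
            if (s == 0) {
                return 0.0f; // one zero score forces the mean to zero
            }
            logSum += Math.log(s);
        }
        // exp(mean of logs) equals the n-th root of the product
        return (float) Math.exp(logSum / scores.length);
    }

    public static void main(String[] args) {
        float[] scores = {4.0f, 1.0f}; // made-up scores from two filters
        System.out.println("multiplied:     " + multipliedScore(scores));
        System.out.println("summed:         " + summedScore(scores));
        System.out.println("geometric mean: " + geometricMeanScore(scores));
    }
}
```

Summing logarithms instead of multiplying directly is just one way to compute the geometric mean; it avoids overflow or underflow when many scores are combined.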
[Nutch-dev] [jira] Commented: (NUTCH-518) Fix OpicScoringFilter to respect scoring filter chaining
[ https://issues.apache.org/jira/browse/NUTCH-518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12513823 ]

Doğacan Güney commented on NUTCH-518:
-------------------------------------

Btw, I think removing initial score arguments and merging scores in ScoringFilters.java is a good idea overall. It removes some control from ScoringFilter-s, but it also eliminates cases where one ScoringFilter multiplies its score with initial score, the other adds it and the last one ignores it.
[Nutch-dev] [jira] Commented: (NUTCH-518) Fix OpicScoringFilter to respect scoring filter chaining
[ https://issues.apache.org/jira/browse/NUTCH-518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12513821 ]

Doğacan Güney commented on NUTCH-518:
-------------------------------------

This is another alternative. I am not suggesting that we use it but just to put it on the table:

* Remove initial score argument from indexerScore and generatorSortValue.
* Change ScoringFilters.java to collect scores from different ScoringFilter-s.
* Calculate their geometric mean.

This approach is far more aggressive. It is like a logical AND. With geometric mean a page is 'important' pretty much only if *all* scoring filters decide that it is important. I really like this approach, but it won't work for people who want to give a high score to pages with certain content even if the page itself has no inlinks (for this case, addition would have worked very well).
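To make the "logical AND" behaviour concrete, here is a small worked example. The two scores are made-up numbers chosen for illustration (a link-based score near zero for a page without inlinks, and a high content-based score); they are not values produced by Nutch.

```latex
% Geometric mean of n filter scores s_1, ..., s_n (all assumed non-negative):
G = \Bigl(\prod_{i=1}^{n} s_i\Bigr)^{1/n}

% Hypothetical page: link-based score s_1 = 0.01 (no inlinks),
% content-based score s_2 = 0.9.
G = \sqrt{0.01 \times 0.9} = \sqrt{0.009} \approx 0.095
\qquad\text{whereas}\qquad
s_1 + s_2 = 0.91
```

With the geometric mean the low link-based score drags the combined score down to about 0.095 despite the strong content score, while addition leaves it at 0.91. If any single score is exactly 0, the geometric mean is 0 as well.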
[Nutch-dev] [jira] Commented: (NUTCH-518) Fix OpicScoringFilter to respect scoring filter chaining
[ https://issues.apache.org/jira/browse/NUTCH-518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12513819 ]

Enis Soztutar commented on NUTCH-518:
-------------------------------------

Since there is no ordering among scoring filters, if we do something specific to zero in OpicScoring, it might lead to nondeterministic behaviour. Let's say for example the code in OpicScoring is changed so that:

  public float indexerScore(Text url, Document doc, CrawlDatum dbDatum,
      CrawlDatum fetchDatum, Parse parse, Inlinks inlinks, float initScore) {
    if (initScore != 0) {
      return (float) Math.pow(dbDatum.getScore(), scorePower) * initScore;
    } else {
      // do something nasty
    }
  }

then there will be a big difference depending on whether scoring-opic is run before or after scoring-foo. As far as I can tell from the messages on the mailing lists, scoring filters are used for restricting the crawl to topics, so zero-handling might break topic-specific crawls. So my vote is to keep the current implementation.
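The ordering concern can be reproduced with a toy chain. This is a sketch under stated assumptions: the `Filter` interface, the constant opic score of 0.5, and the choice of "fall back to the plain opic score" as the zero-handling branch are all invented for illustration; the comment above leaves that branch unspecified.

```java
import java.util.List;

/** Illustrative only: none of these types exist in Nutch. */
public class OrderDependence {

    /** A filter maps the incoming (initial) score to a new score. */
    interface Filter {
        float apply(float initScore);
    }

    static final float OPIC = 0.5f; // made-up opic score for one page

    /** Plain multiplicative opic: always multiplies. */
    static final Filter PLAIN_OPIC = init -> OPIC * init;

    /** Opic with special zero handling (assumed fallback: ignore the zero). */
    static final Filter ZERO_AWARE_OPIC = init -> init != 0 ? OPIC * init : OPIC;

    /** A sink that pulls every score to zero. */
    static final Filter SINK = init -> init * 0.0f;

    /** Runs the filters in list order, starting from an initial score of 1. */
    static float run(List<Filter> chain) {
        float score = 1.0f;
        for (Filter f : chain) {
            score = f.apply(score);
        }
        return score;
    }

    public static void main(String[] args) {
        // Pure multiplication gives the same result in either order.
        System.out.println(run(List.of(PLAIN_OPIC, SINK)));
        System.out.println(run(List.of(SINK, PLAIN_OPIC)));

        // With zero handling, the result depends on which filter runs first.
        System.out.println(run(List.of(ZERO_AWARE_OPIC, SINK)));
        System.out.println(run(List.of(SINK, ZERO_AWARE_OPIC)));
    }
}
```

When the sink runs last the page stays sunk, but when the zero-aware opic filter runs after the sink it undoes the sink and the page gets a non-zero score again, which is the kind of order-dependent outcome the comment warns about.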
[Nutch-dev] [jira] Commented: (NUTCH-518) Fix OpicScoringFilter to respect scoring filter chaining
[ https://issues.apache.org/jira/browse/NUTCH-518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12513807 ]

Hudson commented on NUTCH-518:
------------------------------

Integrated in Nutch-Nightly #154 (See [http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/154/])
[Nutch-dev] [jira] Commented: (NUTCH-518) Fix OpicScoringFilter to respect scoring filter chaining
[ https://issues.apache.org/jira/browse/NUTCH-518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12513704 ]

Andrzej Bialecki commented on NUTCH-518:
----------------------------------------

Right, I was too quick too ... ;) Leave it in for now. Let's agree first on what is the right way to do this.
[Nutch-dev] [jira] Commented: (NUTCH-518) Fix OpicScoringFilter to respect scoring filter chaining
[ https://issues.apache.org/jira/browse/NUTCH-518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12513679 ]

Doğacan Güney commented on NUTCH-518:
-------------------------------------

Sure. I thought you were OK with it since you mentioned you were going to commit it in NUTCH-439. Should I revert the commit or leave it in for now?