Hi Yongyao,

I haven't tried to combine both scoring filter plugins and I don't know
whether they work well together. The ScoringFilter interface is designed
so that all methods have access to the score previously calculated by the
filters in the chain. If in doubt, check the implementations, try it, and
we are glad to hear how you make the scoring filters cooperate. A focused
crawler based on both content and link structure may be even better, but
I expect that adjusting everything can be subtle.

> Do I have to create two attributes for them and combine them at the end
> of indexing (indexerscore())?

Here I see no problem. You have to look especially at the methods
  passScoreAfterParsing(...)
  distributeScoreToOutlinks(...)
It looks like the plugin called second (cf. scoring.filter.order)
overwrites any values set before.

Best,
Sebastian
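For illustration, here is a minimal, untested sketch of an indexerScore(...)
implementation that combines the score handed down the chain (e.g. the OPIC
score) with a similarity score read from the CrawlDatum metadata. The
metadata key "similarityScore" and the weighting are assumptions made for
the example, not something either plugin guarantees:

  import org.apache.hadoop.io.Text;
  import org.apache.nutch.crawl.CrawlDatum;
  import org.apache.nutch.crawl.Inlinks;
  import org.apache.nutch.indexer.NutchDocument;
  import org.apache.nutch.parse.Parse;
  import org.apache.nutch.scoring.AbstractScoringFilter;
  import org.apache.nutch.scoring.ScoringFilterException;

  public class CombinedIndexerScoringFilter extends AbstractScoringFilter {

    // Hypothetical key under which a similarity score was stored earlier
    // in the chain (e.g. while passing scores after parsing).
    private static final Text SIMILARITY_KEY = new Text("similarityScore");

    @Override
    public float indexerScore(Text url, NutchDocument doc, CrawlDatum dbDatum,
        CrawlDatum fetchDatum, Parse parse, Inlinks inlinks, float initScore)
        throws ScoringFilterException {
      // initScore holds the value produced by the previous filter in the
      // chain (e.g. the OPIC score): combine it instead of replacing it.
      float similarity = 0.0f;
      if (dbDatum != null && dbDatum.getMetaData().containsKey(SIMILARITY_KEY)) {
        similarity = Float
            .parseFloat(dbDatum.getMetaData().get(SIMILARITY_KEY).toString());
      }
      return initScore * (1.0f + similarity);
    }
  }

Registered last in scoring.filter.order, such a filter would see the OPIC
value as initScore and fold the similarity in at indexing time.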
On 04/25/2017 09:41 PM, Yongyao Jiang wrote:
> Thanks, Sebastian. That makes sense. Just a follow-up question: if I want
> to combine the OPIC score and the similarity score, how shall I do it?
> Maybe I am wrong, but I don't think just putting
> scoring-opic|scoring-similarity can do the trick, as there is a chance
> they will be mixed up, or one gets overwritten by the other at various
> scoring steps. Do I have to create two attributes for them and combine
> them at the end of indexing (indexerscore())?
>
> Yongyao
>
> On Sat, Apr 22, 2017 at 1:15 PM, Sebastian Nagel <[email protected]> wrote:
>
>> Hi Yongyao,
>>
>> yes, that sounds reasonable. A simple
>>   return datum.getScore() * initSort;
>> would do the job. That should be enough, as the similarity score is
>> calculated after parsing and distributed to the outlinks. However,
>> updateDbScore(...) also needs to be implemented accordingly. Otherwise
>> the scores from outlinks are never aggregated in the CrawlDb, and the
>> similarity score is used only for newly found links. The question is
>> whether scoring-similarity was perhaps designed to be used in
>> combination with another scoring plugin (e.g., scoring-opic) which
>> really implements these methods.
>>
>> Please open an issue on Jira to discuss any questions and for
>> documentation and the release report; a PR is also welcome!
>>
>> Thanks,
>> Sebastian
>>
>> On 04/18/2017 09:05 PM, Yongyao Jiang wrote:
>>> Hi Sebastian,
>>>
>>> Yes, I understand. But when people use the similarity-scoring plugin,
>>> they intend to do domain-specific crawling in most cases. It also means
>>> that they want to control how the crawler works by adjusting
>>> generate.min.score.
>>>
>>> I just figured out that the reason adjusting the min value does not
>>> change the results is that the "sort" variable in the code below always
>>> equals 1.0 when using the scoring-similarity plugin, because this plugin
>>> doesn't implement the generatorSortValue() function.
>>> https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/crawl/Generator.java#L211
>>> https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/scoring/AbstractScoringFilter.java#L40
>>>
>>> I think this is a bug. Please correct me if I am wrong. I can also
>>> submit a PR if needed.
>>>
>>> Thanks,
>>> Yongyao
>>>
>>> On Tue, Apr 18, 2017 at 1:13 PM, Sebastian Nagel <[email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>> the scores in the index are not relevant for generating, only the
>>>> scores in the CrawlDb. The ScoringFilter interface defines a method
>>>> indexerScore(...), and some scoring filters return a modified
>>>> (normalized) indexer score (cf. indexer.score.power). Also, changes to
>>>> generate.min.score affect only which pages are fetched; pages fetched
>>>> before may have a lower score. The score may also change when a page
>>>> is processed (parsed, etc.) or even afterwards (by links pointing to
>>>> it).
>>>>
>>>> In short: generate.min.score determines what is crawled, not what is
>>>> indexed.
>>>>
>>>> Best,
>>>> Sebastian
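For reference, the generatorSortValue(...) override suggested above, together
with an updateDbScore(...) that aggregates the outlink contributions into the
CrawlDb, could look roughly like the untested sketch below. This is an
illustration of the idea, not the actual scoring-similarity code; the
summation loosely mirrors what the link-based scoring-opic filter does:

  import java.util.List;

  import org.apache.hadoop.io.Text;
  import org.apache.nutch.crawl.CrawlDatum;
  import org.apache.nutch.scoring.AbstractScoringFilter;
  import org.apache.nutch.scoring.ScoringFilterException;

  public class SimilaritySortScoringFilter extends AbstractScoringFilter {

    @Override
    public float generatorSortValue(Text url, CrawlDatum datum, float initSort)
        throws ScoringFilterException {
      // Let the generator sort by (and apply generate.min.score to) the
      // score stored in the CrawlDb instead of the constant default 1.0.
      return datum.getScore() * initSort;
    }

    @Override
    public void updateDbScore(Text url, CrawlDatum old, CrawlDatum datum,
        List<CrawlDatum> inlinked) throws ScoringFilterException {
      // Aggregate the score contributions passed in from inlinks so the
      // CrawlDb score keeps reflecting the similarity of linking pages.
      float adjust = 0.0f;
      for (CrawlDatum linked : inlinked) {
        adjust += linked.getScore();
      }
      if (old == null) {
        old = datum;
      }
      datum.setScore(old.getScore() + adjust);
    }
  }
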
>>>> On 04/18/2017 12:31 AM, Yongyao Jiang wrote:
>>>>> Hi,
>>>>>
>>>>> I am using the scoring-similarity plugin. After setting
>>>>> generate.min.score to 0.05 and indexing all the pages (with their
>>>>> scores) into Elasticsearch, I can still observe many web pages whose
>>>>> scores are below 0.05.
>>>>>
>>>>> <property>
>>>>>   <name>generate.min.score</name>
>>>>>   <value>0.05</value>
>>>>>   <description>Select only entries with a score larger than
>>>>>   generate.min.score.</description>
>>>>> </property>
>>>>>
>>>>> Below is the result of a simple aggregation of "score" in ES:
>>>>>
>>>>> {
>>>>>   "key": "20170417215917",
>>>>>   "doc_count": 200,
>>>>>   "Stats": {
>>>>>     "count": 200,
>>>>>     "min": 0,
>>>>>     "max": 0.019184709,
>>>>>     "avg": 0.0012828724450000002,
>>>>>     "sum": 0.256574489
>>>>>   }
>>>>> }
>>>>>
>>>>> Thanks,
>>>>> Yongyao
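For completeness: once both scoring-opic and scoring-similarity are listed in
plugin.includes, the order in which they are applied (and therefore which one
sees the other's score as input) can be pinned with scoring.filter.order in
nutch-site.xml. The property name comes straight from the thread above; the
exact filter class names below are assumptions and are best double-checked
against the plugins' sources:

  <property>
    <name>scoring.filter.order</name>
    <value>org.apache.nutch.scoring.opic.OPICScoringFilter
           org.apache.nutch.scoring.similarity.SimilarityScoringFilter</value>
    <description>Apply the OPIC filter first, then the similarity filter,
    so that the similarity filter receives the OPIC score as the value
    calculated earlier in the chain.</description>
  </property>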

