Hi Yongyao,

yes, that sounds reasonable. A simple
  return datum.getScore() * initSort;
would do the job. That should be enough, as the similarity score is
calculated after parsing and distributed to the outlinks. However,
also
  updateDbScore(...)
needs to be implemented accordingly. Otherwise the scores from outlinks
are never aggregated in the CrawlDb, and the similarity score is used only
for newly found links. The question is whether scoring-similarity was
designed to be used in combination with another scoring plugin
(e.g., scoring-opic) which actually implements these methods.
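
To make the idea concrete, here is a rough, self-contained sketch of the two
methods. It is not the plugin's actual code: Datum is a hypothetical stand-in
for CrawlDatum that models only the score, the URL is a plain String instead
of Text, and summing the inlink scores in updateDbScore(...) is an assumption
(another aggregation, e.g. a maximum or an average, might suit a similarity
score better).

```java
import java.util.List;

// Hypothetical stand-in for org.apache.nutch.crawl.CrawlDatum;
// only the score is modelled here.
class Datum {
    private float score;
    Datum(float score) { this.score = score; }
    float getScore() { return score; }
    void setScore(float score) { this.score = score; }
}

// Sketch of the two scoring methods discussed above (assumed shapes,
// simplified parameter types).
class SimilarityScoringSketch {

    // Scale the generator's initial sort value by the score stored in the
    // CrawlDb, so that a threshold such as generate.min.score compares
    // against something other than the constant initial value.
    static float generatorSortValue(String url, Datum datum, float initSort) {
        return datum.getScore() * initSort;
    }

    // Aggregate the scores contributed by inlinks into the datum that is
    // written back to the CrawlDb. Summation is an assumption of this sketch.
    static void updateDbScore(String url, Datum old, Datum datum,
                              List<Datum> inlinked) {
        float adjust = 0.0f;
        for (Datum linked : inlinked) {
            adjust += linked.getScore();
        }
        // For a newly found URL there is no old entry; start from the new datum.
        Datum base = (old != null) ? old : datum;
        datum.setScore(base.getScore() + adjust);
    }
}
```

With something along these lines, a page whose CrawlDb score is 0.02 would get
a sort value of 0.02 (for initSort = 1.0) and fall below a generate.min.score
of 0.05.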

Please open an issue on Jira to discuss any questions, and so that the change
is tracked for documentation and the release report. A PR is also welcome!

Thanks,
Sebastian

On 04/18/2017 09:05 PM, Yongyao Jiang wrote:
> Hi Sebastian,
> 
> Yes, I understand. But when people use the similarity-scoring plugin, they
> intend to do domain-specific crawling in most cases. It also means that
> they want to control how the crawler works by adjusting the
> generate.min.score.
> 
> I just figured out the reason that adjusting the min value does not change
> the results is that the "sort" variable in the code below always equals 1.0
> when using the scoring-similarity plugin, because this plugin doesn't
> implement the "generatorSortValue()" function.
> https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/crawl/Generator.java#L211
> https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/scoring/AbstractScoringFilter.java#L40
> 
> I think this is a bug. Please correct me if I am wrong. I can also submit
> a PR if needed.
> 
> Thanks,
> Yongyao
> 
> On Tue, Apr 18, 2017 at 1:13 PM, Sebastian Nagel <[email protected]> wrote:
> 
>> Hi,
>>
>> the scores in the index are not relevant for generating, only the scores
>> in the CrawlDb.
>> The ScoringFilter interface defines a method indexerScore(...); some
>> scoring filters return a modified (normalized) indexer score
>> (cf. indexer.score.power).
>> Also, changes to generate.min.score affect only which pages are fetched;
>> pages fetched before may have a lower score.
>> The score may also change when a page is processed (parsed, etc.) or even
>> afterwards (by links pointing to it).
>>
>> In short: generate.min.score determines what is crawled, not what is
>> indexed.
>>
>> Best,
>> Sebastian
>>
>> On 04/18/2017 12:31 AM, Yongyao Jiang wrote:
>>> Hi,
>>>
>>> I am using the scoring-similarity plugin. After setting
>>> generate.min.score to 0.05 and indexing all the pages (with their scores)
>>> into Elastic, I can still observe many web pages whose scores are below
>>> 0.05.
>>>
>>> <property>
>>>   <name>generate.min.score</name>
>>>   <value>0.05</value>
>>>   <description>Select only entries with a score larger than
>>>   generate.min.score.</description>
>>> </property>
>>>
>>> Below is the result of a simple aggregation of "score" in ES:
>>> {
>>>    "key": "20170417215917",
>>>    "doc_count": 200,
>>>    "Stats": {
>>>       "count": 200,
>>>       "min": 0,
>>>       "max": 0.019184709,
>>>       "avg": 0.0012828724450000002,
>>>       "sum": 0.256574489
>>>    }
>>> }
>>>
>>> Thanks,
>>> Yongyao
>>>
>>
>>
> 
> 
