Hi,
The scores in the index are not relevant for generating; only the scores
in the CrawlDb are.
The ScoringFilter interface defines a method indexerScore(...), and some
scoring filters return a modified (normalized) indexer score (cf.
indexer.score.power). Also, changes to generate.min.score affect only which
pages are fetched from then on; pages fetched before the change may have a
lower score.
The score may also change when a page is processed (parsed, etc.) or even
afterwards (by links pointing to it).
In short: generate.min.score determines what is crawled, not what is indexed.
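
To illustrate, here is a toy sketch in plain Python. It is not Nutch code:
the URLs, the score values, and the function names select_for_fetch and
indexed_score are made up, and indexed_score is only a hypothetical stand-in
for whatever a scoring filter's indexerScore(...) returns. It shows how a
page can pass the generate threshold and still end up in the index with a
lower score.

```python
GENERATE_MIN_SCORE = 0.05  # same value as in the quoted configuration


def select_for_fetch(crawldb, min_score):
    """Keep only URLs whose CrawlDb score at generate time is larger than min_score."""
    return [url for url, score in crawldb.items() if score > min_score]


def indexed_score(score_at_index_time, boost=1.0):
    """Hypothetical stand-in for the value a scoring filter writes to the index."""
    return score_at_index_time * boost


# CrawlDb scores at the moment the fetch list is generated (made-up values).
crawldb_at_generate = {"a": 0.20, "b": 0.06, "c": 0.01}
fetched = select_for_fetch(crawldb_at_generate, GENERATE_MIN_SCORE)

# Scores after parsing and updating the CrawlDb (made-up values):
# the score of "b" dropped after it had already been selected and fetched.
crawldb_after_update = {"a": 0.20, "b": 0.002, "c": 0.01}

# The index stores the score as it is at indexing time, not at generate time.
index = {url: indexed_score(crawldb_after_update[url]) for url in fetched}
below_threshold = [url for url, score in index.items() if score < GENERATE_MIN_SCORE]

print(fetched)          # URLs that passed the generate threshold
print(below_threshold)  # indexed URLs whose score is nevertheless below it
```

Here "c" is never fetched because its CrawlDb score is too low, while "b" is
fetched and later indexed with a score below generate.min.score.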
Best,
Sebastian
On 04/18/2017 12:31 AM, Yongyao Jiang wrote:
> Hi,
>
> I am using scoring-similarity plugin. After setting the generate.min.score
> to 0.05, and indexing all the pages (with its score) into Elastic, I can
> still observe many web pages whose scores are below 0.05.
>
> <property>
> <name>generate.min.score</name>
> <value>0.05</value>
> <description>Select only entries with a score larger than
> generate.min.score.</description>
> </property>
>
> Below is the result of a simple aggregation of "score" in ES,
> {
> "key": "20170417215917",
> "doc_count": 200,
> "Stats": {
> "count": 200,
> "min": 0,
> "max": 0.019184709,
> "avg": 0.0012828724450000002,
> "sum": 0.256574489
> }
> }
>
> Thanks,
> Yongyao
>